2026-02-21T08:05:14.2301907Z Current runner version: '2.331.0'
2026-02-21T08:05:14.2304727Z Runner name: 'linux.rocm.gpu.gfx942.2-n2gvb-runner-2vdnn'
2026-02-21T08:05:14.2305182Z Runner group name: 'default'
2026-02-21T08:05:14.2305581Z Machine name: 'linux'
2026-02-21T08:05:14.2306790Z ##[group]GITHUB_TOKEN Permissions
2026-02-21T08:05:14.2307857Z Contents: read
2026-02-21T08:05:14.2308110Z Metadata: read
2026-02-21T08:05:14.2308345Z ##[endgroup]
2026-02-21T08:05:14.2309389Z Secret source: Actions
2026-02-21T08:05:14.2309700Z Prepare workflow directory
2026-02-21T08:05:14.2576887Z Prepare all required actions
2026-02-21T08:05:14.2596514Z Getting action download info
2026-02-21T08:05:14.5733057Z Download action repository 'actions/checkout@v6' (SHA:de0fac2e4500dabe0009e67214ff5f5447ce83dd)
2026-02-21T08:05:14.9792738Z Download action repository 'actions/setup-python@v6' (SHA:a309ff8b426b58ec0e2a45f0f869d46889d02405)
2026-02-21T08:05:15.4150774Z Download action repository 'astral-sh/setup-uv@v7' (SHA:eac588ad8def6316056a12d4907a9d4d84ff7a3b)
2026-02-21T08:05:15.8568637Z Download action repository 'pytorch/test-infra@main' (SHA:bb8f04ff3961233c844fde6533c7c6c5f0857909)
2026-02-21T08:05:16.5667078Z Download action repository 'actions/upload-artifact@v6' (SHA:b7c566a772e6b6bfb58ed0dc250532a479d7789f)
2026-02-21T08:05:17.0747658Z Getting action download info
2026-02-21T08:05:17.2238014Z Uses: pytorch/helion/.github/workflows/benchmark.yml@refs/heads/main (874a7d0cadab18218a84ad3579d329dc95c51820)
2026-02-21T08:05:17.2240361Z ##[group] Inputs
2026-02-21T08:05:17.2240517Z   runner: linux.rocm.gpu.gfx942.2
2026-02-21T08:05:17.2240646Z   python-version: 3.12
2026-02-21T08:05:17.2240771Z   image: rocm/dev-ubuntu-24.04:6.4.4-complete
2026-02-21T08:05:17.2240904Z   runtime-version: rocm6.4
2026-02-21T08:05:17.2241062Z   container-options: --device=/dev/kfd --device=/dev/dri
2026-02-21T08:05:17.2241204Z   alias: mi325x
2026-02-21T08:05:17.2241309Z   kernels: int4_gemm,flash_attention
2026-02-21T08:05:17.2241439Z   env-vars: 
2026-02-21T08:05:17.2241551Z   custom-args: 
2026-02-21T08:05:17.2241834Z   run_h100: true
2026-02-21T08:05:17.2241927Z   run_b200: true
2026-02-21T08:05:17.2242019Z   run_mi325x: true
2026-02-21T08:05:17.2242119Z ##[endgroup]
2026-02-21T08:05:17.2242351Z Complete job name: run-mi325x (int4_gemm,flash_attention) / benchmark-rocm6.4-int4_gemm,flash_attention-py3.12-mi325x
2026-02-21T08:05:17.2450631Z ##[group]Checking docker version
2026-02-21T08:05:17.2457267Z ##[command]/usr/bin/docker version --format '{{.Server.APIVersion}}'
2026-02-21T08:05:17.2605496Z '1.51'
2026-02-21T08:05:17.2620210Z Docker daemon API version: '1.51'
2026-02-21T08:05:17.2620516Z ##[command]/usr/bin/docker version --format '{{.Client.APIVersion}}'
2026-02-21T08:05:17.2748376Z '1.51'
2026-02-21T08:05:17.2761616Z Docker client API version: '1.51'
2026-02-21T08:05:17.2764332Z ##[endgroup]
2026-02-21T08:05:17.2765591Z ##[group]Clean up resources from previous jobs
2026-02-21T08:05:17.2767747Z ##[command]/usr/bin/docker ps --all --quiet --no-trunc --filter "label=ec4e10"
2026-02-21T08:05:17.2882356Z ##[command]/usr/bin/docker network prune --force --filter "label=ec4e10"
2026-02-21T08:05:17.3009538Z ##[endgroup]
2026-02-21T08:05:17.3009668Z ##[group]Create local container network
2026-02-21T08:05:17.3014434Z ##[command]/usr/bin/docker network create --label ec4e10 github_network_2fc8484199d94410beae10a38ccd998e
2026-02-21T08:05:17.3252137Z 57e0e7788a2aef9c834902a24870ebc4a4e8aa9a49e24de76cd8ade1d6b56e3e
2026-02-21T08:05:17.3266011Z ##[endgroup]
2026-02-21T08:05:17.3286861Z ##[group]Starting job container
2026-02-21T08:05:17.3298277Z ##[command]/usr/bin/docker pull rocm/dev-ubuntu-24.04:6.4.4-complete
2026-02-21T08:05:19.4539921Z 6.4.4-complete: Pulling from rocm/dev-ubuntu-24.04
2026-02-21T08:05:19.4540231Z 953cdd413371: Pulling fs layer
2026-02-21T08:05:19.4540345Z 3a9c27801271: Pulling fs layer
2026-02-21T08:05:19.4540466Z 4c8a4cb43e3b: Pulling fs layer
2026-02-21T08:05:19.4540662Z 624e685c2697: Pulling fs layer
2026-02-21T08:05:19.4540767Z 624e685c2697: Waiting
2026-02-21T08:05:20.2574465Z 3a9c27801271: Verifying Checksum
2026-02-21T08:05:20.2574735Z 3a9c27801271: Download complete
2026-02-21T08:05:20.8122336Z 624e685c2697: Verifying Checksum
2026-02-21T08:05:20.8122726Z 624e685c2697: Download complete
2026-02-21T08:05:23.0098030Z 953cdd413371: Verifying Checksum
2026-02-21T08:05:23.0098569Z 953cdd413371: Download complete
2026-02-21T08:05:23.6783803Z 953cdd413371: Pull complete
2026-02-21T08:05:23.8039571Z 3a9c27801271: Pull complete
2026-02-21T08:05:46.1801431Z 4c8a4cb43e3b: Download complete
2026-02-21T08:06:25.9784575Z 4c8a4cb43e3b: Pull complete
2026-02-21T08:06:26.1113282Z 624e685c2697: Pull complete
2026-02-21T08:06:26.1133090Z Digest: sha256:31418ac10a3769a71eaef330c07280d1d999d7074621339b8f93c484c35f6078
2026-02-21T08:06:26.1136636Z Status: Downloaded newer image for rocm/dev-ubuntu-24.04:6.4.4-complete
2026-02-21T08:06:26.1145836Z docker.io/rocm/dev-ubuntu-24.04:6.4.4-complete
2026-02-21T08:06:26.1206265Z ##[command]/usr/bin/docker create --name c6efec98bce142318ef5a57c93d78ef5_rocmdevubuntu2404644complete_87ce82 --label ec4e10 --workdir /__w/helion/helion --network github_network_2fc8484199d94410beae10a38ccd998e --device=/dev/kfd --device=/dev/dri -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/_work":"/__w" -v "/home/runner/externals":"/__e":ro -v "/home/runner/_work/_temp":"/__w/_temp" -v "/home/runner/_work/_actions":"/__w/_actions" -v "/home/runner/_work/_tool":"/__w/_tool" -v "/home/runner/_work/_temp/_github_home":"/github/home" -v "/home/runner/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" rocm/dev-ubuntu-24.04:6.4.4-complete "-f" "/dev/null"
2026-02-21T08:06:26.3422630Z 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54
2026-02-21T08:06:26.3446048Z ##[command]/usr/bin/docker start 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54
2026-02-21T08:06:26.5085777Z 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54
2026-02-21T08:06:26.5100724Z ##[command]/usr/bin/docker ps --all --filter id=9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 --filter status=running --no-trunc --format "{{.ID}} {{.Status}}"
2026-02-21T08:06:26.5207144Z 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 Up Less than a second
2026-02-21T08:06:26.5219961Z ##[command]/usr/bin/docker inspect --format "{{range .Config.Env}}{{println .}}{{end}}" 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54
2026-02-21T08:06:26.5322961Z HOME=/github/home
2026-02-21T08:06:26.5323078Z GITHUB_ACTIONS=true
2026-02-21T08:06:26.5323252Z CI=true
2026-02-21T08:06:26.5323482Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:06:26.5341500Z ##[endgroup]
2026-02-21T08:06:26.5349558Z ##[group]Waiting for all services to be ready
2026-02-21T08:06:26.5350694Z ##[endgroup]
2026-02-21T08:06:26.5454800Z ##[group]Run echo "Detected ROCm image"
2026-02-21T08:06:26.5454988Z echo "Detected ROCm image"
2026-02-21T08:06:26.5455150Z rocminfo || echo "rocminfo not found"
2026-02-21T08:06:26.5456699Z shell: bash -l {0}
2026-02-21T08:06:26.5456871Z env:
2026-02-21T08:06:26.5456957Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:26.5457068Z ##[endgroup]
2026-02-21T08:06:26.5901830Z Detected ROCm image
2026-02-21T08:06:26.7771283Z ROCk module version 6.12.12 is loaded
2026-02-21T08:06:26.7772328Z =====================    
2026-02-21T08:06:26.7772633Z HSA System Attributes    
2026-02-21T08:06:26.7772932Z =====================    
2026-02-21T08:06:26.7773219Z Runtime Version:         1.15
2026-02-21T08:06:26.7773525Z Runtime Ext Version:     1.7
2026-02-21T08:06:26.7773881Z System Timestamp Freq.:  1000.000000MHz
2026-02-21T08:06:26.7774418Z Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
2026-02-21T08:06:26.7775109Z Machine Model:           LARGE                              
2026-02-21T08:06:26.7775573Z System Endianness:       LITTLE                             
2026-02-21T08:06:26.7776585Z Mwaitx:                  DISABLED
2026-02-21T08:06:26.7776889Z XNACK enabled:           NO
2026-02-21T08:06:26.7777213Z DMAbuf Support:          YES
2026-02-21T08:06:26.7777475Z VMM Support:             YES
2026-02-21T08:06:26.7777563Z 
2026-02-21T08:06:26.7777623Z ==========               
2026-02-21T08:06:26.7777753Z HSA Agents               
2026-02-21T08:06:26.7777889Z ==========               
2026-02-21T08:06:26.7778019Z *******                  
2026-02-21T08:06:26.7778142Z Agent 1                  
2026-02-21T08:06:26.7778261Z *******                  
2026-02-21T08:06:26.7778441Z   Name:                    AMD EPYC 9575F 64-Core Processor   
2026-02-21T08:06:26.7778660Z   Uuid:                    CPU-XX                             
2026-02-21T08:06:26.7778876Z   Marketing Name:          AMD EPYC 9575F 64-Core Processor   
2026-02-21T08:06:26.7779108Z   Vendor Name:             CPU                                
2026-02-21T08:06:26.7779318Z   Feature:                 None specified                     
2026-02-21T08:06:26.7779572Z   Profile:                 FULL_PROFILE                       
2026-02-21T08:06:26.7779787Z   Float Round Mode:        NEAR                               
2026-02-21T08:06:26.7780018Z   Max Queue Number:        0(0x0)                             
2026-02-21T08:06:26.7780235Z   Queue Min Size:          0(0x0)                             
2026-02-21T08:06:26.7780447Z   Queue Max Size:          0(0x0)                             
2026-02-21T08:06:26.7780653Z   Queue Type:              MULTI                              
2026-02-21T08:06:26.7780860Z   Node:                    0                                  
2026-02-21T08:06:26.7781063Z   Device Type:             CPU                                
2026-02-21T08:06:26.7781505Z   Cache Info:              
2026-02-21T08:06:26.7781705Z     L1:                      49152(0xc000) KB                   
2026-02-21T08:06:26.7781910Z   Chip ID:                 0(0x0)                             
2026-02-21T08:06:26.7782128Z   ASIC Revision:           0(0x0)                             
2026-02-21T08:06:26.7782342Z   Cacheline Size:          64(0x40)                           
2026-02-21T08:06:26.7782558Z   Max Clock Freq. (MHz):   3300                               
2026-02-21T08:06:26.7782763Z   BDFID:                   0                                  
2026-02-21T08:06:26.7782975Z   Internal Node ID:        0                                  
2026-02-21T08:06:26.7783199Z   Compute Unit:            128                                
2026-02-21T08:06:26.7783413Z   SIMDs per CU:            0                                  
2026-02-21T08:06:26.7783622Z   Shader Engines:          0                                  
2026-02-21T08:06:26.7783835Z   Shader Arrs. per Eng.:   0                                  
2026-02-21T08:06:26.7784292Z   WatchPts on Addr. Ranges:1                                  
2026-02-21T08:06:26.7784483Z   Memory Properties:       
2026-02-21T08:06:26.7784620Z   Features:                None
2026-02-21T08:06:26.7784776Z   Pool Info:               
2026-02-21T08:06:26.7784913Z     Pool 1                   
2026-02-21T08:06:26.7785102Z       Segment:                 GLOBAL; FLAGS: FINE GRAINED        
2026-02-21T08:06:26.7785319Z       Size:                    1584690840(0x5e747698) KB          
2026-02-21T08:06:26.7785538Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7785765Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7786004Z       Alloc Recommended Granule:4KB                                
2026-02-21T08:06:26.7786295Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7786526Z       Accessible by all:       TRUE                               
2026-02-21T08:06:26.7786705Z     Pool 2                   
2026-02-21T08:06:26.7786902Z       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
2026-02-21T08:06:26.7787129Z       Size:                    1584690840(0x5e747698) KB          
2026-02-21T08:06:26.7787435Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7787640Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7787819Z       Alloc Recommended Granule:4KB                                
2026-02-21T08:06:26.7788008Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7788187Z       Accessible by all:       TRUE                               
2026-02-21T08:06:26.7788363Z     Pool 3                   
2026-02-21T08:06:26.7788511Z       Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
2026-02-21T08:06:26.7788680Z       Size:                    1584690840(0x5e747698) KB          
2026-02-21T08:06:26.7788849Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7789026Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7789228Z       Alloc Recommended Granule:4KB                                
2026-02-21T08:06:26.7789425Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7789607Z       Accessible by all:       TRUE                               
2026-02-21T08:06:26.7789760Z     Pool 4                   
2026-02-21T08:06:26.7789918Z       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
2026-02-21T08:06:26.7790086Z       Size:                    1584690840(0x5e747698) KB          
2026-02-21T08:06:26.7790252Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7790432Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7790611Z       Alloc Recommended Granule:4KB                                
2026-02-21T08:06:26.7790799Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7790988Z       Accessible by all:       TRUE                               
2026-02-21T08:06:26.7791134Z   ISA Info:                
2026-02-21T08:06:26.7791249Z *******                  
2026-02-21T08:06:26.7791357Z Agent 2                  
2026-02-21T08:06:26.7791465Z *******                  
2026-02-21T08:06:26.7791602Z   Name:                    AMD EPYC 9575F 64-Core Processor   
2026-02-21T08:06:26.7791775Z   Uuid:                    CPU-XX                             
2026-02-21T08:06:26.7791950Z   Marketing Name:          AMD EPYC 9575F 64-Core Processor   
2026-02-21T08:06:26.7792133Z   Vendor Name:             CPU                                
2026-02-21T08:06:26.7792307Z   Feature:                 None specified                     
2026-02-21T08:06:26.7792481Z   Profile:                 FULL_PROFILE                       
2026-02-21T08:06:26.7792658Z   Float Round Mode:        NEAR                               
2026-02-21T08:06:26.7792834Z   Max Queue Number:        0(0x0)                             
2026-02-21T08:06:26.7793057Z   Queue Min Size:          0(0x0)                             
2026-02-21T08:06:26.7793227Z   Queue Max Size:          0(0x0)                             
2026-02-21T08:06:26.7793402Z   Queue Type:              MULTI                              
2026-02-21T08:06:26.7793566Z   Node:                    1                                  
2026-02-21T08:06:26.7793738Z   Device Type:             CPU                                
2026-02-21T08:06:26.7793885Z   Cache Info:              
2026-02-21T08:06:26.7794024Z     L1:                      49152(0xc000) KB                   
2026-02-21T08:06:26.7794185Z   Chip ID:                 0(0x0)                             
2026-02-21T08:06:26.7794353Z   ASIC Revision:           0(0x0)                             
2026-02-21T08:06:26.7794527Z   Cacheline Size:          64(0x40)                           
2026-02-21T08:06:26.7794703Z   Max Clock Freq. (MHz):   3300                               
2026-02-21T08:06:26.7794873Z   BDFID:                   0                                  
2026-02-21T08:06:26.7795045Z   Internal Node ID:        1                                  
2026-02-21T08:06:26.7795252Z   Compute Unit:            128                                
2026-02-21T08:06:26.7795420Z   SIMDs per CU:            0                                  
2026-02-21T08:06:26.7795593Z   Shader Engines:          0                                  
2026-02-21T08:06:26.7795769Z   Shader Arrs. per Eng.:   0                                  
2026-02-21T08:06:26.7795954Z   WatchPts on Addr. Ranges:1                                  
2026-02-21T08:06:26.7796107Z   Memory Properties:       
2026-02-21T08:06:26.7796223Z   Features:                None
2026-02-21T08:06:26.7796336Z   Pool Info:               
2026-02-21T08:06:26.7796448Z     Pool 1                   
2026-02-21T08:06:26.7796597Z       Segment:                 GLOBAL; FLAGS: FINE GRAINED        
2026-02-21T08:06:26.7796775Z       Size:                    1581097388(0x5e3da1ac) KB          
2026-02-21T08:06:26.7796958Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7797140Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7797370Z       Alloc Recommended Granule:4KB                                
2026-02-21T08:06:26.7797548Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7797724Z       Accessible by all:       TRUE                               
2026-02-21T08:06:26.7797871Z     Pool 2                   
2026-02-21T08:06:26.7798017Z       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
2026-02-21T08:06:26.7798191Z       Size:                    1581097388(0x5e3da1ac) KB          
2026-02-21T08:06:26.7798356Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7798544Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7798721Z       Alloc Recommended Granule:4KB                                
2026-02-21T08:06:26.7798922Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7799094Z       Accessible by all:       TRUE                               
2026-02-21T08:06:26.7799270Z     Pool 3                   
2026-02-21T08:06:26.7799436Z       Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
2026-02-21T08:06:26.7799604Z       Size:                    1581097388(0x5e3da1ac) KB          
2026-02-21T08:06:26.7799771Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7799940Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7800128Z       Alloc Recommended Granule:4KB                                
2026-02-21T08:06:26.7800306Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7800512Z       Accessible by all:       TRUE                               
2026-02-21T08:06:26.7800659Z     Pool 4                   
2026-02-21T08:06:26.7800849Z       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
2026-02-21T08:06:26.7801042Z       Size:                    1581097388(0x5e3da1ac) KB          
2026-02-21T08:06:26.7801209Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7801380Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7801557Z       Alloc Recommended Granule:4KB                                
2026-02-21T08:06:26.7801738Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7801916Z       Accessible by all:       TRUE                               
2026-02-21T08:06:26.7802058Z   ISA Info:                
2026-02-21T08:06:26.7802164Z *******                  
2026-02-21T08:06:26.7802267Z Agent 3                  
2026-02-21T08:06:26.7802369Z *******                  
2026-02-21T08:06:26.7802495Z   Name:                    gfx942                             
2026-02-21T08:06:26.7802771Z   Uuid:                    GPU-eba66cc485172b60               
2026-02-21T08:06:26.7802941Z   Marketing Name:          AMD Instinct MI325X                
2026-02-21T08:06:26.7803120Z   Vendor Name:             AMD                                
2026-02-21T08:06:26.7803355Z   Feature:                 KERNEL_DISPATCH                    
2026-02-21T08:06:26.7803524Z   Profile:                 BASE_PROFILE                       
2026-02-21T08:06:26.7803688Z   Float Round Mode:        NEAR                               
2026-02-21T08:06:26.7803856Z   Max Queue Number:        128(0x80)                          
2026-02-21T08:06:26.7804018Z   Queue Min Size:          64(0x40)                           
2026-02-21T08:06:26.7804180Z   Queue Max Size:          131072(0x20000)                    
2026-02-21T08:06:26.7804340Z   Queue Type:              MULTI                              
2026-02-21T08:06:26.7804498Z   Node:                    2                                  
2026-02-21T08:06:26.7804658Z   Device Type:             GPU                                
2026-02-21T08:06:26.7804796Z   Cache Info:              
2026-02-21T08:06:26.7804923Z     L1:                      32(0x20) KB                        
2026-02-21T08:06:26.7805072Z     L2:                      4096(0x1000) KB                    
2026-02-21T08:06:26.7805219Z     L3:                      262144(0x40000) KB                 
2026-02-21T08:06:26.7805365Z   Chip ID:                 29861(0x74a5)                      
2026-02-21T08:06:26.7805525Z   ASIC Revision:           1(0x1)                             
2026-02-21T08:06:26.7805687Z   Cacheline Size:          128(0x80)                          
2026-02-21T08:06:26.7805856Z   Max Clock Freq. (MHz):   2100                               
2026-02-21T08:06:26.7806016Z   BDFID:                   29952                              
2026-02-21T08:06:26.7806173Z   Internal Node ID:        2                                  
2026-02-21T08:06:26.7806338Z   Compute Unit:            304                                
2026-02-21T08:06:26.7806502Z   SIMDs per CU:            4                                  
2026-02-21T08:06:26.7806665Z   Shader Engines:          32                                 
2026-02-21T08:06:26.7806833Z   Shader Arrs. per Eng.:   1                                  
2026-02-21T08:06:26.7807006Z   WatchPts on Addr. Ranges:4                                  
2026-02-21T08:06:26.7807178Z   Coherent Host Access:    FALSE                              
2026-02-21T08:06:26.7807322Z   Memory Properties:       
2026-02-21T08:06:26.7807443Z   Features:                KERNEL_DISPATCH 
2026-02-21T08:06:26.7807594Z   Fast F16 Operation:      TRUE                               
2026-02-21T08:06:26.7807763Z   Wavefront Size:          64(0x40)                           
2026-02-21T08:06:26.7807929Z   Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7808079Z   Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7808216Z     x                        1024(0x400)                        
2026-02-21T08:06:26.7808411Z     y                        1024(0x400)                        
2026-02-21T08:06:26.7808565Z     z                        1024(0x400)                        
2026-02-21T08:06:26.7808756Z   Max Waves Per CU:        32(0x20)                           
2026-02-21T08:06:26.7808926Z   Max Work-item Per CU:    2048(0x800)                        
2026-02-21T08:06:26.7809114Z   Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7809272Z   Grid Max Size per Dimension:
2026-02-21T08:06:26.7809397Z     x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7809544Z     y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7809697Z     z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7809879Z   Max fbarriers/Workgrp:   32                                 
2026-02-21T08:06:26.7820148Z   Packet Processor uCode:: 185                                
2026-02-21T08:06:26.7820536Z   SDMA engine uCode::      24                                 
2026-02-21T08:06:26.7820720Z   IOMMU Support::          None                               
2026-02-21T08:06:26.7820866Z   Pool Info:               
2026-02-21T08:06:26.7821074Z     Pool 1                   
2026-02-21T08:06:26.7821222Z       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
2026-02-21T08:06:26.7821392Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7821560Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7821758Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7821938Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7822119Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7822293Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7822437Z     Pool 2                   
2026-02-21T08:06:26.7822582Z       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
2026-02-21T08:06:26.7822752Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7822930Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7823099Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7823272Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7823446Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7823619Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7823761Z     Pool 3                   
2026-02-21T08:06:26.7823899Z       Segment:                 GLOBAL; FLAGS: FINE GRAINED        
2026-02-21T08:06:26.7824059Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7824221Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7824388Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7824566Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7824747Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7824916Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7825060Z     Pool 4                   
2026-02-21T08:06:26.7825208Z       Segment:                 GROUP                              
2026-02-21T08:06:26.7825364Z       Size:                    64(0x40) KB                        
2026-02-21T08:06:26.7825525Z       Allocatable:             FALSE                              
2026-02-21T08:06:26.7825690Z       Alloc Granule:           0KB                                
2026-02-21T08:06:26.7825864Z       Alloc Recommended Granule:0KB                                
2026-02-21T08:06:26.7826042Z       Alloc Alignment:         0KB                                
2026-02-21T08:06:26.7826248Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7826393Z   ISA Info:                
2026-02-21T08:06:26.7826502Z     ISA 1                    
2026-02-21T08:06:26.7826653Z       Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
2026-02-21T08:06:26.7826835Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7827012Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7827185Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7827360Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7827530Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7827697Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7827850Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7827993Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7828146Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7828296Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7828494Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7828642Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7828775Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7828926Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7829073Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7829232Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7829373Z     ISA 2                    
2026-02-21T08:06:26.7829528Z       Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack-
2026-02-21T08:06:26.7829721Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7829896Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7830067Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7830245Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7830413Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7830592Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7830745Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7830889Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7831062Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7831210Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7831364Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7831512Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7831675Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7831845Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7832005Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7832163Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7832301Z *******                  
2026-02-21T08:06:26.7832401Z Agent 4                  
2026-02-21T08:06:26.7832510Z *******                  
2026-02-21T08:06:26.7832630Z   Name:                    gfx942                             
2026-02-21T08:06:26.7832796Z   Uuid:                    GPU-c8f81c2eb8adeb06               
2026-02-21T08:06:26.7832978Z   Marketing Name:          AMD Instinct MI325X                
2026-02-21T08:06:26.7833154Z   Vendor Name:             AMD                                
2026-02-21T08:06:26.7833319Z   Feature:                 KERNEL_DISPATCH                    
2026-02-21T08:06:26.7833548Z   Profile:                 BASE_PROFILE                       
2026-02-21T08:06:26.7833718Z   Float Round Mode:        NEAR                               
2026-02-21T08:06:26.7833883Z   Max Queue Number:        128(0x80)                          
2026-02-21T08:06:26.7834084Z   Queue Min Size:          64(0x40)                           
2026-02-21T08:06:26.7834269Z   Queue Max Size:          131072(0x20000)                    
2026-02-21T08:06:26.7834435Z   Queue Type:              MULTI                              
2026-02-21T08:06:26.7834596Z   Node:                    3                                  
2026-02-21T08:06:26.7834751Z   Device Type:             GPU                                
2026-02-21T08:06:26.7834891Z   Cache Info:              
2026-02-21T08:06:26.7835017Z     L1:                      32(0x20) KB                        
2026-02-21T08:06:26.7835168Z     L2:                      4096(0x1000) KB                    
2026-02-21T08:06:26.7835313Z     L3:                      262144(0x40000) KB                 
2026-02-21T08:06:26.7835467Z   Chip ID:                 29861(0x74a5)                      
2026-02-21T08:06:26.7835627Z   ASIC Revision:           1(0x1)                             
2026-02-21T08:06:26.7835841Z   Cacheline Size:          128(0x80)                          
2026-02-21T08:06:26.7836008Z   Max Clock Freq. (MHz):   2100                               
2026-02-21T08:06:26.7836164Z   BDFID:                   1280                               
2026-02-21T08:06:26.7836327Z   Internal Node ID:        3                                  
2026-02-21T08:06:26.7836487Z   Compute Unit:            304                                
2026-02-21T08:06:26.7836650Z   SIMDs per CU:            4                                  
2026-02-21T08:06:26.7836813Z   Shader Engines:          32                                 
2026-02-21T08:06:26.7836977Z   Shader Arrs. per Eng.:   1                                  
2026-02-21T08:06:26.7837150Z   WatchPts on Addr. Ranges:4                                  
2026-02-21T08:06:26.7837318Z   Coherent Host Access:    FALSE                              
2026-02-21T08:06:26.7837469Z   Memory Properties:       
2026-02-21T08:06:26.7837586Z   Features:                KERNEL_DISPATCH 
2026-02-21T08:06:26.7837740Z   Fast F16 Operation:      TRUE                               
2026-02-21T08:06:26.7837906Z   Wavefront Size:          64(0x40)                           
2026-02-21T08:06:26.7838070Z   Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7838222Z   Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7838353Z     x                        1024(0x400)                        
2026-02-21T08:06:26.7838497Z     y                        1024(0x400)                        
2026-02-21T08:06:26.7838635Z     z                        1024(0x400)                        
2026-02-21T08:06:26.7838787Z   Max Waves Per CU:        32(0x20)                           
2026-02-21T08:06:26.7838950Z   Max Work-item Per CU:    2048(0x800)                        
2026-02-21T08:06:26.7839119Z   Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7839261Z   Grid Max Size per Dimension:
2026-02-21T08:06:26.7839384Z     x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7839534Z     y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7839673Z     z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7839831Z   Max fbarriers/Workgrp:   32                                 
2026-02-21T08:06:26.7840006Z   Packet Processor uCode:: 185                                
2026-02-21T08:06:26.7840180Z   SDMA engine uCode::      24                                 
2026-02-21T08:06:26.7840347Z   IOMMU Support::          None                               
2026-02-21T08:06:26.7840485Z   Pool Info:               
2026-02-21T08:06:26.7840592Z     Pool 1                   
2026-02-21T08:06:26.7840731Z       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
2026-02-21T08:06:26.7853427Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7853616Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7853791Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7853965Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7854140Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7854357Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7854498Z     Pool 2                   
2026-02-21T08:06:26.7854635Z       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
2026-02-21T08:06:26.7854798Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7854968Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7855135Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7855326Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7855520Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7855763Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7855905Z     Pool 3                   
2026-02-21T08:06:26.7856039Z       Segment:                 GLOBAL; FLAGS: FINE GRAINED        
2026-02-21T08:06:26.7856238Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7856422Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7856603Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7856780Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7856978Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7857158Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7857299Z     Pool 4                   
2026-02-21T08:06:26.7857505Z       Segment:                 GROUP                              
2026-02-21T08:06:26.7857658Z       Size:                    64(0x40) KB                        
2026-02-21T08:06:26.7857864Z       Allocatable:             FALSE                              
2026-02-21T08:06:26.7858029Z       Alloc Granule:           0KB                                
2026-02-21T08:06:26.7858735Z       Alloc Recommended Granule:0KB                                
2026-02-21T08:06:26.7858915Z       Alloc Alignment:         0KB                                
2026-02-21T08:06:26.7859080Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7859222Z   ISA Info:                
2026-02-21T08:06:26.7859331Z     ISA 1                    
2026-02-21T08:06:26.7859476Z       Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
2026-02-21T08:06:26.7859655Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7859828Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7860001Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7860175Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7860340Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7860501Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7860653Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7860790Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7860933Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7861079Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7861231Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7861375Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7861551Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7861699Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7861846Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7862001Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7862141Z     ISA 2                    
2026-02-21T08:06:26.7862291Z       Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack-
2026-02-21T08:06:26.7862480Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7862650Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7862821Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7862994Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7863154Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7863319Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7863466Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7863632Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7863772Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7863916Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7864069Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7864210Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7864339Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7864483Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7864629Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7864783Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7864924Z *******                  
2026-02-21T08:06:26.7865026Z Agent 5                  
2026-02-21T08:06:26.7865125Z *******                  
2026-02-21T08:06:26.7865247Z   Name:                    gfx942                             
2026-02-21T08:06:26.7865406Z   Uuid:                    GPU-8b5a3495e4b669cf               
2026-02-21T08:06:26.7865570Z   Marketing Name:          AMD Instinct MI325X                
2026-02-21T08:06:26.7865733Z   Vendor Name:             AMD                                
2026-02-21T08:06:26.7865895Z   Feature:                 KERNEL_DISPATCH                    
2026-02-21T08:06:26.7866052Z   Profile:                 BASE_PROFILE                       
2026-02-21T08:06:26.7866216Z   Float Round Mode:        NEAR                               
2026-02-21T08:06:26.7866380Z   Max Queue Number:        128(0x80)                          
2026-02-21T08:06:26.7866539Z   Queue Min Size:          64(0x40)                           
2026-02-21T08:06:26.7866735Z   Queue Max Size:          131072(0x20000)                    
2026-02-21T08:06:26.7866900Z   Queue Type:              MULTI                              
2026-02-21T08:06:26.7867094Z   Node:                    4                                  
2026-02-21T08:06:26.7867250Z   Device Type:             GPU                                
2026-02-21T08:06:26.7867413Z   Cache Info:              
2026-02-21T08:06:26.7867549Z     L1:                      32(0x20) KB                        
2026-02-21T08:06:26.7867713Z     L2:                      4096(0x1000) KB                    
2026-02-21T08:06:26.7867871Z     L3:                      262144(0x40000) KB                 
2026-02-21T08:06:26.7868017Z   Chip ID:                 29861(0x74a5)                      
2026-02-21T08:06:26.7868178Z   ASIC Revision:           1(0x1)                             
2026-02-21T08:06:26.7868338Z   Cacheline Size:          128(0x80)                          
2026-02-21T08:06:26.7868541Z   Max Clock Freq. (MHz):   2100                               
2026-02-21T08:06:26.7868762Z   BDFID:                   25856                              
2026-02-21T08:06:26.7868918Z   Internal Node ID:        4                                  
2026-02-21T08:06:26.7869098Z   Compute Unit:            304                                
2026-02-21T08:06:26.7869258Z   SIMDs per CU:            4                                  
2026-02-21T08:06:26.7869443Z   Shader Engines:          32                                 
2026-02-21T08:06:26.7869607Z   Shader Arrs. per Eng.:   1                                  
2026-02-21T08:06:26.7869775Z   WatchPts on Addr. Ranges:4                                  
2026-02-21T08:06:26.7869942Z   Coherent Host Access:    FALSE                              
2026-02-21T08:06:26.7870090Z   Memory Properties:       
2026-02-21T08:06:26.7870209Z   Features:                KERNEL_DISPATCH 
2026-02-21T08:06:26.7870382Z   Fast F16 Operation:      TRUE                               
2026-02-21T08:06:26.7870551Z   Wavefront Size:          64(0x40)                           
2026-02-21T08:06:26.7870716Z   Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7870866Z   Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7871029Z     x                        1024(0x400)                        
2026-02-21T08:06:26.7871174Z     y                        1024(0x400)                        
2026-02-21T08:06:26.7871313Z     z                        1024(0x400)                        
2026-02-21T08:06:26.7871468Z   Max Waves Per CU:        32(0x20)                           
2026-02-21T08:06:26.7871634Z   Max Work-item Per CU:    2048(0x800)                        
2026-02-21T08:06:26.7871797Z   Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7871939Z   Grid Max Size per Dimension:
2026-02-21T08:06:26.7872060Z     x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7872206Z     y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7872351Z     z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7872505Z   Max fbarriers/Workgrp:   32                                 
2026-02-21T08:06:26.7872684Z   Packet Processor uCode:: 185                                
2026-02-21T08:06:26.7872853Z   SDMA engine uCode::      24                                 
2026-02-21T08:06:26.7873019Z   IOMMU Support::          None                               
2026-02-21T08:06:26.7873155Z   Pool Info:               
2026-02-21T08:06:26.7873261Z     Pool 1                   
2026-02-21T08:06:26.7873396Z       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
2026-02-21T08:06:26.7873563Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7873728Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7873895Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7874069Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7874242Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7874414Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7874557Z     Pool 2                   
2026-02-21T08:06:26.7874695Z       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
2026-02-21T08:06:26.7874859Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7875016Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7875183Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7875352Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7875525Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7875691Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7875831Z     Pool 3                   
2026-02-21T08:06:26.7876008Z       Segment:                 GLOBAL; FLAGS: FINE GRAINED        
2026-02-21T08:06:26.7876165Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7876325Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7876490Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7876661Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7876831Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7877001Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7877141Z     Pool 4                   
2026-02-21T08:06:26.7877270Z       Segment:                 GROUP                              
2026-02-21T08:06:26.7877426Z       Size:                    64(0x40) KB                        
2026-02-21T08:06:26.7877582Z       Allocatable:             FALSE                              
2026-02-21T08:06:26.7877751Z       Alloc Granule:           0KB                                
2026-02-21T08:06:26.7877920Z       Alloc Recommended Granule:0KB                                
2026-02-21T08:06:26.7878129Z       Alloc Alignment:         0KB                                
2026-02-21T08:06:26.7878298Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7878437Z   ISA Info:                
2026-02-21T08:06:26.7878542Z     ISA 1                    
2026-02-21T08:06:26.7878682Z       Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
2026-02-21T08:06:26.7878863Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7879032Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7879231Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7879404Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7879565Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7879731Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7879878Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7880018Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7880162Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7880310Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7880465Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7880608Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7880737Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7880883Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7881030Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7881186Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7881330Z     ISA 2                    
2026-02-21T08:06:26.7881479Z       Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack-
2026-02-21T08:06:26.7881671Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7881843Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7882012Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7882186Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7882349Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7882513Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7882708Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7882845Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7882987Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7883175Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7883331Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7883475Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7883604Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7883749Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7883896Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7884053Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7884188Z *******                  
2026-02-21T08:06:26.7884290Z Agent 6                  
2026-02-21T08:06:26.7884388Z *******                  
2026-02-21T08:06:26.7884511Z   Name:                    gfx942                             
2026-02-21T08:06:26.7884664Z   Uuid:                    GPU-a8cd01be4ce60285               
2026-02-21T08:06:26.7884836Z   Marketing Name:          AMD Instinct MI325X                
2026-02-21T08:06:26.7885002Z   Vendor Name:             AMD                                
2026-02-21T08:06:26.7885164Z   Feature:                 KERNEL_DISPATCH                    
2026-02-21T08:06:26.7885377Z   Profile:                 BASE_PROFILE                       
2026-02-21T08:06:26.7885538Z   Float Round Mode:        NEAR                               
2026-02-21T08:06:26.7885702Z   Max Queue Number:        128(0x80)                          
2026-02-21T08:06:26.7885860Z   Queue Min Size:          64(0x40)                           
2026-02-21T08:06:26.7886019Z   Queue Max Size:          131072(0x20000)                    
2026-02-21T08:06:26.7886177Z   Queue Type:              MULTI                              
2026-02-21T08:06:26.7886331Z   Node:                    5                                  
2026-02-21T08:06:26.7886485Z   Device Type:             GPU                                
2026-02-21T08:06:26.7886619Z   Cache Info:              
2026-02-21T08:06:26.7886747Z     L1:                      32(0x20) KB                        
2026-02-21T08:06:26.7886891Z     L2:                      4096(0x1000) KB                    
2026-02-21T08:06:26.7887038Z     L3:                      262144(0x40000) KB                 
2026-02-21T08:06:26.7887186Z   Chip ID:                 29861(0x74a5)                      
2026-02-21T08:06:26.7887345Z   ASIC Revision:           1(0x1)                             
2026-02-21T08:06:26.7887508Z   Cacheline Size:          128(0x80)                          
2026-02-21T08:06:26.7887668Z   Max Clock Freq. (MHz):   2100                               
2026-02-21T08:06:26.7887822Z   BDFID:                   5376                               
2026-02-21T08:06:26.7887977Z   Internal Node ID:        5                                  
2026-02-21T08:06:26.7888139Z   Compute Unit:            304                                
2026-02-21T08:06:26.7888295Z   SIMDs per CU:            4                                  
2026-02-21T08:06:26.7888457Z   Shader Engines:          32                                 
2026-02-21T08:06:26.7888621Z   Shader Arrs. per Eng.:   1                                  
2026-02-21T08:06:26.7888791Z   WatchPts on Addr. Ranges:4                                  
2026-02-21T08:06:26.7888960Z   Coherent Host Access:    FALSE                              
2026-02-21T08:06:26.7889103Z   Memory Properties:       
2026-02-21T08:06:26.7889222Z   Features:                KERNEL_DISPATCH 
2026-02-21T08:06:26.7889367Z   Fast F16 Operation:      TRUE                               
2026-02-21T08:06:26.7889532Z   Wavefront Size:          64(0x40)                           
2026-02-21T08:06:26.7889694Z   Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7889972Z   Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7890130Z     x                        1024(0x400)                        
2026-02-21T08:06:26.7890310Z     y                        1024(0x400)                        
2026-02-21T08:06:26.7890508Z     z                        1024(0x400)                        
2026-02-21T08:06:26.7890712Z   Max Waves Per CU:        32(0x20)                           
2026-02-21T08:06:26.7890906Z   Max Work-item Per CU:    2048(0x800)                        
2026-02-21T08:06:26.7891095Z   Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7891284Z   Grid Max Size per Dimension:
2026-02-21T08:06:26.7891433Z     x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7891602Z     y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7891774Z     z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7891973Z   Max fbarriers/Workgrp:   32                                 
2026-02-21T08:06:26.7892168Z   Packet Processor uCode:: 185                                
2026-02-21T08:06:26.7892381Z   SDMA engine uCode::      24                                 
2026-02-21T08:06:26.7892598Z   IOMMU Support::          None                               
2026-02-21T08:06:26.7892754Z   Pool Info:               
2026-02-21T08:06:26.7892896Z     Pool 1                   
2026-02-21T08:06:26.7893092Z       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
2026-02-21T08:06:26.7893290Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7893496Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7893684Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7893892Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7894086Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7894300Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7894461Z     Pool 2                   
2026-02-21T08:06:26.7894636Z       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
2026-02-21T08:06:26.7894842Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7895019Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7895224Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7895419Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7895620Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7895803Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7895988Z     Pool 3                   
2026-02-21T08:06:26.7896152Z       Segment:                 GLOBAL; FLAGS: FINE GRAINED        
2026-02-21T08:06:26.7896327Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7896527Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7896708Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7896914Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7897131Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7897359Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7897530Z     Pool 4                   
2026-02-21T08:06:26.7897687Z       Segment:                 GROUP                              
2026-02-21T08:06:26.7897933Z       Size:                    64(0x40) KB                        
2026-02-21T08:06:26.7915479Z       Allocatable:             FALSE                              
2026-02-21T08:06:26.7915661Z       Alloc Granule:           0KB                                
2026-02-21T08:06:26.7915842Z       Alloc Recommended Granule:0KB                                
2026-02-21T08:06:26.7916019Z       Alloc Alignment:         0KB                                
2026-02-21T08:06:26.7916190Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7916338Z   ISA Info:                
2026-02-21T08:06:26.7916528Z     ISA 1                    
2026-02-21T08:06:26.7916724Z       Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
2026-02-21T08:06:26.7916915Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7917096Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7917270Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7917442Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7917609Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7917770Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7917919Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7918058Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7918201Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7918345Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7918498Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7918680Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7918805Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7918949Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7919091Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7919243Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7919380Z     ISA 2                    
2026-02-21T08:06:26.7919528Z       Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack-
2026-02-21T08:06:26.7919715Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7919883Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7920055Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7920262Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7920455Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7920625Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7920772Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7920902Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7921044Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7921204Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7921359Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7921506Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7921650Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7921795Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7921938Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7922096Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7922264Z *******                  
2026-02-21T08:06:26.7922359Z Agent 7                  
2026-02-21T08:06:26.7922453Z *******                  
2026-02-21T08:06:26.7922631Z   Name:                    gfx942                             
2026-02-21T08:06:26.7922786Z   Uuid:                    GPU-9c66436f78af1ebe               
2026-02-21T08:06:26.7922948Z   Marketing Name:          AMD Instinct MI325X                
2026-02-21T08:06:26.7923111Z   Vendor Name:             AMD                                
2026-02-21T08:06:26.7923271Z   Feature:                 KERNEL_DISPATCH                    
2026-02-21T08:06:26.7923427Z   Profile:                 BASE_PROFILE                       
2026-02-21T08:06:26.7923649Z   Float Round Mode:        NEAR                               
2026-02-21T08:06:26.7923818Z   Max Queue Number:        128(0x80)                          
2026-02-21T08:06:26.7923978Z   Queue Min Size:          64(0x40)                           
2026-02-21T08:06:26.7924156Z   Queue Max Size:          131072(0x20000)                    
2026-02-21T08:06:26.7924313Z   Queue Type:              MULTI                              
2026-02-21T08:06:26.7924467Z   Node:                    6                                  
2026-02-21T08:06:26.7924618Z   Device Type:             GPU                                
2026-02-21T08:06:26.7924753Z   Cache Info:              
2026-02-21T08:06:26.7924877Z     L1:                      32(0x20) KB                        
2026-02-21T08:06:26.7925028Z     L2:                      4096(0x1000) KB                    
2026-02-21T08:06:26.7925178Z     L3:                      262144(0x40000) KB                 
2026-02-21T08:06:26.7925326Z   Chip ID:                 29861(0x74a5)                      
2026-02-21T08:06:26.7925490Z   ASIC Revision:           1(0x1)                             
2026-02-21T08:06:26.7925656Z   Cacheline Size:          128(0x80)                          
2026-02-21T08:06:26.7925860Z   Max Clock Freq. (MHz):   2100                               
2026-02-21T08:06:26.7926013Z   BDFID:                   62720                              
2026-02-21T08:06:26.7926172Z   Internal Node ID:        6                                  
2026-02-21T08:06:26.7926334Z   Compute Unit:            304                                
2026-02-21T08:06:26.7926490Z   SIMDs per CU:            4                                  
2026-02-21T08:06:26.7926650Z   Shader Engines:          32                                 
2026-02-21T08:06:26.7926812Z   Shader Arrs. per Eng.:   1                                  
2026-02-21T08:06:26.7926980Z   WatchPts on Addr. Ranges:4                                  
2026-02-21T08:06:26.7927149Z   Coherent Host Access:    FALSE                              
2026-02-21T08:06:26.7927301Z   Memory Properties:       
2026-02-21T08:06:26.7927421Z   Features:                KERNEL_DISPATCH 
2026-02-21T08:06:26.7927567Z   Fast F16 Operation:      TRUE                               
2026-02-21T08:06:26.7927736Z   Wavefront Size:          64(0x40)                           
2026-02-21T08:06:26.7927899Z   Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7928054Z   Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7928191Z     x                        1024(0x400)                        
2026-02-21T08:06:26.7928340Z     y                        1024(0x400)                        
2026-02-21T08:06:26.7928487Z     z                        1024(0x400)                        
2026-02-21T08:06:26.7928643Z   Max Waves Per CU:        32(0x20)                           
2026-02-21T08:06:26.7928816Z   Max Work-item Per CU:    2048(0x800)                        
2026-02-21T08:06:26.7928983Z   Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7929135Z   Grid Max Size per Dimension:
2026-02-21T08:06:26.7929267Z     x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7929423Z     y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7929570Z     z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7929735Z   Max fbarriers/Workgrp:   32                                 
2026-02-21T08:06:26.7929923Z   Packet Processor uCode:: 185                                
2026-02-21T08:06:26.7930092Z   SDMA engine uCode::      24                                 
2026-02-21T08:06:26.7930264Z   IOMMU Support::          None                               
2026-02-21T08:06:26.7930403Z   Pool Info:               
2026-02-21T08:06:26.7930517Z     Pool 1                   
2026-02-21T08:06:26.7930659Z       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
2026-02-21T08:06:26.7930834Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7931048Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7931228Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7931422Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7931608Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7931787Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7931962Z     Pool 2                   
2026-02-21T08:06:26.7932155Z       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
2026-02-21T08:06:26.7932330Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7932496Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7932684Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7932862Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7933096Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7933267Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7933447Z     Pool 3                   
2026-02-21T08:06:26.7933638Z       Segment:                 GLOBAL; FLAGS: FINE GRAINED        
2026-02-21T08:06:26.7933813Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7933973Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7934233Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7934445Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7934614Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7934846Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7935006Z     Pool 4                   
2026-02-21T08:06:26.7935170Z       Segment:                 GROUP                              
2026-02-21T08:06:26.7935336Z       Size:                    64(0x40) KB                        
2026-02-21T08:06:26.7935508Z       Allocatable:             FALSE                              
2026-02-21T08:06:26.7935676Z       Alloc Granule:           0KB                                
2026-02-21T08:06:26.7935844Z       Alloc Recommended Granule:0KB                                
2026-02-21T08:06:26.7936037Z       Alloc Alignment:         0KB                                
2026-02-21T08:06:26.7936204Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7936350Z   ISA Info:                
2026-02-21T08:06:26.7936459Z     ISA 1                    
2026-02-21T08:06:26.7936642Z       Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
2026-02-21T08:06:26.7936821Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7936997Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7937169Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7937340Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7937509Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7937673Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7937821Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7937959Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7938102Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7938247Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7938401Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7938547Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7938675Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7938857Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7939003Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7939159Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7939300Z     ISA 2                    
2026-02-21T08:06:26.7939449Z       Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack-
2026-02-21T08:06:26.7939639Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7939811Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7939980Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7940152Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7940313Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7940477Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7940626Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7940761Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7940940Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7941080Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7941234Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7941374Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7941504Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7941647Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7941792Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7941949Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7942083Z *******                  
2026-02-21T08:06:26.7942185Z Agent 8                  
2026-02-21T08:06:26.7942283Z *******                  
2026-02-21T08:06:26.7942403Z   Name:                    gfx942                             
2026-02-21T08:06:26.7942559Z   Uuid:                    GPU-489fccc039800b1a               
2026-02-21T08:06:26.7942722Z   Marketing Name:          AMD Instinct MI325X                
2026-02-21T08:06:26.7942886Z   Vendor Name:             AMD                                
2026-02-21T08:06:26.7943047Z   Feature:                 KERNEL_DISPATCH                    
2026-02-21T08:06:26.7943209Z   Profile:                 BASE_PROFILE                       
2026-02-21T08:06:26.7943369Z   Float Round Mode:        NEAR                               
2026-02-21T08:06:26.7943532Z   Max Queue Number:        128(0x80)                          
2026-02-21T08:06:26.7943691Z   Queue Min Size:          64(0x40)                           
2026-02-21T08:06:26.7943850Z   Queue Max Size:          131072(0x20000)                    
2026-02-21T08:06:26.7944008Z   Queue Type:              MULTI                              
2026-02-21T08:06:26.7944196Z   Node:                    7                                  
2026-02-21T08:06:26.7944349Z   Device Type:             GPU                                
2026-02-21T08:06:26.7944486Z   Cache Info:              
2026-02-21T08:06:26.7944636Z     L1:                      32(0x20) KB                        
2026-02-21T08:06:26.7944785Z     L2:                      4096(0x1000) KB                    
2026-02-21T08:06:26.7944928Z     L3:                      262144(0x40000) KB                 
2026-02-21T08:06:26.7945078Z   Chip ID:                 29861(0x74a5)                      
2026-02-21T08:06:26.7945255Z   ASIC Revision:           1(0x1)                             
2026-02-21T08:06:26.7945466Z   Cacheline Size:          128(0x80)                          
2026-02-21T08:06:26.7945673Z   Max Clock Freq. (MHz):   2100                               
2026-02-21T08:06:26.7945835Z   BDFID:                   34048                              
2026-02-21T08:06:26.7946026Z   Internal Node ID:        7                                  
2026-02-21T08:06:26.7946189Z   Compute Unit:            304                                
2026-02-21T08:06:26.7946348Z   SIMDs per CU:            4                                  
2026-02-21T08:06:26.7946511Z   Shader Engines:          32                                 
2026-02-21T08:06:26.7946671Z   Shader Arrs. per Eng.:   1                                  
2026-02-21T08:06:26.7946867Z   WatchPts on Addr. Ranges:4                                  
2026-02-21T08:06:26.7947076Z   Coherent Host Access:    FALSE                              
2026-02-21T08:06:26.7947218Z   Memory Properties:       
2026-02-21T08:06:26.7947338Z   Features:                KERNEL_DISPATCH 
2026-02-21T08:06:26.7947492Z   Fast F16 Operation:      TRUE                               
2026-02-21T08:06:26.7947677Z   Wavefront Size:          64(0x40)                           
2026-02-21T08:06:26.7947844Z   Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7948040Z   Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7948174Z     x                        1024(0x400)                        
2026-02-21T08:06:26.7948349Z     y                        1024(0x400)                        
2026-02-21T08:06:26.7948503Z     z                        1024(0x400)                        
2026-02-21T08:06:26.7948653Z   Max Waves Per CU:        32(0x20)                           
2026-02-21T08:06:26.7948816Z   Max Work-item Per CU:    2048(0x800)                        
2026-02-21T08:06:26.7948977Z   Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7949118Z   Grid Max Size per Dimension:
2026-02-21T08:06:26.7949240Z     x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7949382Z     y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7949526Z     z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7949683Z   Max fbarriers/Workgrp:   32                                 
2026-02-21T08:06:26.7949860Z   Packet Processor uCode:: 185                                
2026-02-21T08:06:26.7950031Z   SDMA engine uCode::      24                                 
2026-02-21T08:06:26.7950199Z   IOMMU Support::          None                               
2026-02-21T08:06:26.7950336Z   Pool Info:               
2026-02-21T08:06:26.7950438Z     Pool 1                   
2026-02-21T08:06:26.7950576Z       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
2026-02-21T08:06:26.7950742Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7950907Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7951072Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7951247Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7951420Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7951586Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7951732Z     Pool 2                   
2026-02-21T08:06:26.7951873Z       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
2026-02-21T08:06:26.7952040Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7952199Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7952366Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7952540Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7952711Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7952879Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7953017Z     Pool 3                   
2026-02-21T08:06:26.7953149Z       Segment:                 GLOBAL; FLAGS: FINE GRAINED        
2026-02-21T08:06:26.7953349Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7953512Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7953676Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7953848Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7954024Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7954187Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7954332Z     Pool 4                   
2026-02-21T08:06:26.7954462Z       Segment:                 GROUP                              
2026-02-21T08:06:26.7954621Z       Size:                    64(0x40) KB                        
2026-02-21T08:06:26.7954778Z       Allocatable:             FALSE                              
2026-02-21T08:06:26.7954945Z       Alloc Granule:           0KB                                
2026-02-21T08:06:26.7955122Z       Alloc Recommended Granule:0KB                                
2026-02-21T08:06:26.7955291Z       Alloc Alignment:         0KB                                
2026-02-21T08:06:26.7955487Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7955630Z   ISA Info:                
2026-02-21T08:06:26.7955739Z     ISA 1                    
2026-02-21T08:06:26.7955893Z       Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
2026-02-21T08:06:26.7956108Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7956286Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7956457Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7956642Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7956805Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7956998Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7957156Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7957295Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7957486Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7957637Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7957819Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7958028Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7958161Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7958314Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7958484Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7958690Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7958831Z     ISA 2                    
2026-02-21T08:06:26.7958984Z       Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack-
2026-02-21T08:06:26.7959170Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7959399Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7959583Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7959795Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7959967Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7960135Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7960286Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7960417Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7960560Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7960701Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7960894Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7961034Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7961164Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7961310Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7961453Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7961607Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7961742Z *******                  
2026-02-21T08:06:26.7961843Z Agent 9                  
2026-02-21T08:06:26.7961939Z *******                  
2026-02-21T08:06:26.7962059Z   Name:                    gfx942                             
2026-02-21T08:06:26.7962215Z   Uuid:                    GPU-fac84a106f8362ee               
2026-02-21T08:06:26.7962378Z   Marketing Name:          AMD Instinct MI325X                
2026-02-21T08:06:26.7962546Z   Vendor Name:             AMD                                
2026-02-21T08:06:26.7962753Z   Feature:                 KERNEL_DISPATCH                    
2026-02-21T08:06:26.7962948Z   Profile:                 BASE_PROFILE                       
2026-02-21T08:06:26.7963111Z   Float Round Mode:        NEAR                               
2026-02-21T08:06:26.7963273Z   Max Queue Number:        128(0x80)                          
2026-02-21T08:06:26.7963432Z   Queue Min Size:          64(0x40)                           
2026-02-21T08:06:26.7963591Z   Queue Max Size:          131072(0x20000)                    
2026-02-21T08:06:26.7963748Z   Queue Type:              MULTI                              
2026-02-21T08:06:26.7963897Z   Node:                    8                                  
2026-02-21T08:06:26.7964050Z   Device Type:             GPU                                
2026-02-21T08:06:26.7964184Z   Cache Info:              
2026-02-21T08:06:26.7964308Z     L1:                      32(0x20) KB                        
2026-02-21T08:06:26.7964456Z     L2:                      4096(0x1000) KB                    
2026-02-21T08:06:26.7964598Z     L3:                      262144(0x40000) KB                 
2026-02-21T08:06:26.7964748Z   Chip ID:                 29861(0x74a5)                      
2026-02-21T08:06:26.7964902Z   ASIC Revision:           1(0x1)                             
2026-02-21T08:06:26.7965066Z   Cacheline Size:          128(0x80)                          
2026-02-21T08:06:26.7965225Z   Max Clock Freq. (MHz):   2100                               
2026-02-21T08:06:26.7965383Z   BDFID:                   58624                              
2026-02-21T08:06:26.7965538Z   Internal Node ID:        8                                  
2026-02-21T08:06:26.7965699Z   Compute Unit:            304                                
2026-02-21T08:06:26.7965856Z   SIMDs per CU:            4                                  
2026-02-21T08:06:26.7966011Z   Shader Engines:          32                                 
2026-02-21T08:06:26.7966178Z   Shader Arrs. per Eng.:   1                                  
2026-02-21T08:06:26.7966346Z   WatchPts on Addr. Ranges:4                                  
2026-02-21T08:06:26.7966519Z   Coherent Host Access:    FALSE                              
2026-02-21T08:06:26.7966661Z   Memory Properties:       
2026-02-21T08:06:26.7966783Z   Features:                KERNEL_DISPATCH 
2026-02-21T08:06:26.7966936Z   Fast F16 Operation:      TRUE                               
2026-02-21T08:06:26.7967097Z   Wavefront Size:          64(0x40)                           
2026-02-21T08:06:26.7967262Z   Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7967407Z   Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7967588Z     x                        1024(0x400)                        
2026-02-21T08:06:26.7967727Z     y                        1024(0x400)                        
2026-02-21T08:06:26.7967887Z     z                        1024(0x400)                        
2026-02-21T08:06:26.7968090Z   Max Waves Per CU:        32(0x20)                           
2026-02-21T08:06:26.7968261Z   Max Work-item Per CU:    2048(0x800)                        
2026-02-21T08:06:26.7968429Z   Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7968595Z   Grid Max Size per Dimension:
2026-02-21T08:06:26.7968743Z     x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7968933Z     y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7969074Z     z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7969230Z   Max fbarriers/Workgrp:   32                                 
2026-02-21T08:06:26.7969405Z   Packet Processor uCode:: 185                                
2026-02-21T08:06:26.7969590Z   SDMA engine uCode::      24                                 
2026-02-21T08:06:26.7969753Z   IOMMU Support::          None                               
2026-02-21T08:06:26.7969891Z   Pool Info:               
2026-02-21T08:06:26.7969996Z     Pool 1                   
2026-02-21T08:06:26.7970131Z       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
2026-02-21T08:06:26.7970331Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7970526Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7970693Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7970874Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7971081Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7971248Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7971391Z     Pool 2                   
2026-02-21T08:06:26.7971526Z       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
2026-02-21T08:06:26.7971691Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7971850Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7972013Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7972187Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7972358Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7972525Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7972666Z     Pool 3                   
2026-02-21T08:06:26.7972801Z       Segment:                 GLOBAL; FLAGS: FINE GRAINED        
2026-02-21T08:06:26.7972957Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7973114Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7973279Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7973447Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7973624Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7973788Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7973933Z     Pool 4                   
2026-02-21T08:06:26.7974060Z       Segment:                 GROUP                              
2026-02-21T08:06:26.7974215Z       Size:                    64(0x40) KB                        
2026-02-21T08:06:26.7974373Z       Allocatable:             FALSE                              
2026-02-21T08:06:26.7974534Z       Alloc Granule:           0KB                                
2026-02-21T08:06:26.7974707Z       Alloc Recommended Granule:0KB                                
2026-02-21T08:06:26.7974877Z       Alloc Alignment:         0KB                                
2026-02-21T08:06:26.7975043Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7975183Z   ISA Info:                
2026-02-21T08:06:26.7975287Z     ISA 1                    
2026-02-21T08:06:26.7975460Z       Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
2026-02-21T08:06:26.7975637Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7975813Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7975980Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7976152Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7976314Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7976477Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7976628Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7976763Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7976908Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7977049Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7977205Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7977347Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7977504Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7977651Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7977794Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7977952Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7978136Z     ISA 2                    
2026-02-21T08:06:26.7978298Z       Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack-
2026-02-21T08:06:26.7978516Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7978725Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7978893Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7979095Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7979265Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7979429Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7979581Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7979712Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7979963Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7980103Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7980281Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7980423Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7980551Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7980697Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7980851Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7981008Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7981147Z *******                  
2026-02-21T08:06:26.7981247Z Agent 10                 
2026-02-21T08:06:26.7981342Z *******                  
2026-02-21T08:06:26.7981463Z   Name:                    gfx942                             
2026-02-21T08:06:26.7981621Z   Uuid:                    GPU-56bb44b4843f18b0               
2026-02-21T08:06:26.7981783Z   Marketing Name:          AMD Instinct MI325X                
2026-02-21T08:06:26.7981950Z   Vendor Name:             AMD                                
2026-02-21T08:06:26.7982109Z   Feature:                 KERNEL_DISPATCH                    
2026-02-21T08:06:26.7982270Z   Profile:                 BASE_PROFILE                       
2026-02-21T08:06:26.7982429Z   Float Round Mode:        NEAR                               
2026-02-21T08:06:26.7982625Z   Max Queue Number:        128(0x80)                          
2026-02-21T08:06:26.7982786Z   Queue Min Size:          64(0x40)                           
2026-02-21T08:06:26.7982942Z   Queue Max Size:          131072(0x20000)                    
2026-02-21T08:06:26.7983102Z   Queue Type:              MULTI                              
2026-02-21T08:06:26.7983252Z   Node:                    9                                  
2026-02-21T08:06:26.7983404Z   Device Type:             GPU                                
2026-02-21T08:06:26.7983540Z   Cache Info:              
2026-02-21T08:06:26.7983662Z     L1:                      32(0x20) KB                        
2026-02-21T08:06:26.7983806Z     L2:                      4096(0x1000) KB                    
2026-02-21T08:06:26.7983955Z     L3:                      262144(0x40000) KB                 
2026-02-21T08:06:26.7984102Z   Chip ID:                 29861(0x74a5)                      
2026-02-21T08:06:26.7984260Z   ASIC Revision:           1(0x1)                             
2026-02-21T08:06:26.7984425Z   Cacheline Size:          128(0x80)                          
2026-02-21T08:06:26.7984584Z   Max Clock Freq. (MHz):   2100                               
2026-02-21T08:06:26.7984682Z   BDFID:                   38144                              
2026-02-21T08:06:26.7984748Z   Internal Node ID:        9                                  
2026-02-21T08:06:26.7984813Z   Compute Unit:            304                                
2026-02-21T08:06:26.7984884Z   SIMDs per CU:            4                                  
2026-02-21T08:06:26.7984951Z   Shader Engines:          32                                 
2026-02-21T08:06:26.7985021Z   Shader Arrs. per Eng.:   1                                  
2026-02-21T08:06:26.7985091Z   WatchPts on Addr. Ranges:4                                  
2026-02-21T08:06:26.7985164Z   Coherent Host Access:    FALSE                              
2026-02-21T08:06:26.7985210Z   Memory Properties:       
2026-02-21T08:06:26.7985264Z   Features:                KERNEL_DISPATCH 
2026-02-21T08:06:26.7985334Z   Fast F16 Operation:      TRUE                               
2026-02-21T08:06:26.7985399Z   Wavefront Size:          64(0x40)                           
2026-02-21T08:06:26.7985468Z   Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7985519Z   Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7985575Z     x                        1024(0x400)                        
2026-02-21T08:06:26.7985630Z     y                        1024(0x400)                        
2026-02-21T08:06:26.7985684Z     z                        1024(0x400)                        
2026-02-21T08:06:26.7985753Z   Max Waves Per CU:        32(0x20)                           
2026-02-21T08:06:26.7985821Z   Max Work-item Per CU:    2048(0x800)                        
2026-02-21T08:06:26.7985885Z   Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7985934Z   Grid Max Size per Dimension:
2026-02-21T08:06:26.7985991Z     x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7986047Z     y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7986106Z     z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7986175Z   Max fbarriers/Workgrp:   32                                 
2026-02-21T08:06:26.7986249Z   Packet Processor uCode:: 185                                
2026-02-21T08:06:26.7986318Z   SDMA engine uCode::      24                                 
2026-02-21T08:06:26.7986389Z   IOMMU Support::          None                               
2026-02-21T08:06:26.7986432Z   Pool Info:               
2026-02-21T08:06:26.7986475Z     Pool 1                   
2026-02-21T08:06:26.7986550Z       Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
2026-02-21T08:06:26.7986613Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7986682Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7986787Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7986865Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7986933Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7987002Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7987047Z     Pool 2                   
2026-02-21T08:06:26.7987118Z       Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
2026-02-21T08:06:26.7987180Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7987249Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7987314Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7987388Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7987457Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7987527Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7987568Z     Pool 3                   
2026-02-21T08:06:26.7987674Z       Segment:                 GLOBAL; FLAGS: FINE GRAINED        
2026-02-21T08:06:26.7987733Z       Size:                    268419072(0xfffc000) KB            
2026-02-21T08:06:26.7987798Z       Allocatable:             TRUE                               
2026-02-21T08:06:26.7987864Z       Alloc Granule:           4KB                                
2026-02-21T08:06:26.7987940Z       Alloc Recommended Granule:2048KB                             
2026-02-21T08:06:26.7988007Z       Alloc Alignment:         4KB                                
2026-02-21T08:06:26.7988077Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7988119Z     Pool 4                   
2026-02-21T08:06:26.7988183Z       Segment:                 GROUP                              
2026-02-21T08:06:26.7988242Z       Size:                    64(0x40) KB                        
2026-02-21T08:06:26.7988311Z       Allocatable:             FALSE                              
2026-02-21T08:06:26.7988380Z       Alloc Granule:           0KB                                
2026-02-21T08:06:26.7988452Z       Alloc Recommended Granule:0KB                                
2026-02-21T08:06:26.7988519Z       Alloc Alignment:         0KB                                
2026-02-21T08:06:26.7988591Z       Accessible by all:       FALSE                              
2026-02-21T08:06:26.7988633Z   ISA Info:                
2026-02-21T08:06:26.7988672Z     ISA 1                    
2026-02-21T08:06:26.7988750Z       Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
2026-02-21T08:06:26.7988821Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7988888Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7988962Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7989034Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7989094Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7989164Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7989218Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7989275Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7989332Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7989449Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7989519Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7989570Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7989632Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7989690Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7989781Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7989854Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7989900Z     ISA 2                    
2026-02-21T08:06:26.7989984Z       Name:                    amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack-
2026-02-21T08:06:26.7990056Z       Machine Models:          HSA_MACHINE_MODEL_LARGE            
2026-02-21T08:06:26.7990127Z       Profiles:                HSA_PROFILE_BASE                   
2026-02-21T08:06:26.7990198Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7990269Z       Default Rounding Mode:   NEAR                               
2026-02-21T08:06:26.7990333Z       Fast f16:                TRUE                               
2026-02-21T08:06:26.7990402Z       Workgroup Max Size:      1024(0x400)                        
2026-02-21T08:06:26.7990452Z       Workgroup Max Size per Dimension:
2026-02-21T08:06:26.7990514Z         x                        1024(0x400)                        
2026-02-21T08:06:26.7990569Z         y                        1024(0x400)                        
2026-02-21T08:06:26.7990653Z         z                        1024(0x400)                        
2026-02-21T08:06:26.7990720Z       Grid Max Size:           4294967295(0xffffffff)             
2026-02-21T08:06:26.7990770Z       Grid Max Size per Dimension:
2026-02-21T08:06:26.7990827Z         x                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7990883Z         y                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7990942Z         z                        4294967295(0xffffffff)             
2026-02-21T08:06:26.7991010Z       FBarrier Max Size:       32                                 
2026-02-21T08:06:26.7991051Z *** Done ***             
2026-02-21T08:06:26.8230260Z ##[group]Run set -x
2026-02-21T08:06:26.8230384Z set -x
2026-02-21T08:06:26.8230430Z apt-get update
2026-02-21T08:06:26.8230478Z apt-get install -y git
2026-02-21T08:06:26.8230673Z shell: bash -l {0}
2026-02-21T08:06:26.8230713Z env:
2026-02-21T08:06:26.8230787Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:26.8230829Z ##[endgroup]
2026-02-21T08:06:26.9187670Z + apt-get update
2026-02-21T08:06:27.0285165Z Get:1 https://repo.radeon.com/amdgpu/6.4.4/ubuntu noble InRelease [5465 B]
2026-02-21T08:06:27.0349441Z Get:2 https://repo.radeon.com/rocm/apt/6.4.4 noble InRelease [2605 B]
2026-02-21T08:06:27.1615600Z Get:3 https://repo.radeon.com/amdgpu/6.4.4/ubuntu noble/main amd64 Packages [14.6 kB]
2026-02-21T08:06:27.2100417Z Get:4 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB]
2026-02-21T08:06:27.2550172Z Get:5 https://repo.radeon.com/rocm/apt/6.4.4 noble/main amd64 Packages [60.5 kB]
2026-02-21T08:06:27.2867807Z Get:6 http://archive.ubuntu.com/ubuntu noble InRelease [256 kB]
2026-02-21T08:06:27.3590950Z Get:7 http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [1207 kB]
2026-02-21T08:06:27.4693224Z Get:8 http://security.ubuntu.com/ubuntu noble-security/multiverse amd64 Packages [34.8 kB]
2026-02-21T08:06:27.4695877Z Get:9 http://security.ubuntu.com/ubuntu noble-security/restricted amd64 Packages [3196 kB]
2026-02-21T08:06:27.5181901Z Get:10 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [1857 kB]
2026-02-21T08:06:27.8715248Z Get:11 http://archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB]
2026-02-21T08:06:28.0108343Z Get:12 http://archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB]
2026-02-21T08:06:28.1507810Z Get:13 http://archive.ubuntu.com/ubuntu noble/restricted amd64 Packages [117 kB]
2026-02-21T08:06:28.1839001Z Get:14 http://archive.ubuntu.com/ubuntu noble/multiverse amd64 Packages [331 kB]
2026-02-21T08:06:28.2796584Z Get:15 http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages [19.3 MB]
2026-02-21T08:06:28.8452508Z Get:16 http://archive.ubuntu.com/ubuntu noble/main amd64 Packages [1808 kB]
2026-02-21T08:06:28.8666471Z Get:17 http://archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages [2016 kB]
2026-02-21T08:06:28.8878293Z Get:18 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [2240 kB]
2026-02-21T08:06:28.9154355Z Get:19 http://archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Packages [3381 kB]
2026-02-21T08:06:28.9622347Z Get:20 http://archive.ubuntu.com/ubuntu noble-updates/multiverse amd64 Packages [38.1 kB]
2026-02-21T08:06:28.9628201Z Get:21 http://archive.ubuntu.com/ubuntu noble-backports/universe amd64 Packages [34.6 kB]
2026-02-21T08:06:28.9628814Z Get:22 http://archive.ubuntu.com/ubuntu noble-backports/main amd64 Packages [49.5 kB]
2026-02-21T08:06:29.4088776Z Fetched 36.3 MB in 2s (14.7 MB/s)
2026-02-21T08:06:29.8538881Z Reading package lists...
2026-02-21T08:06:29.8652777Z W: https://repo.radeon.com/amdgpu/6.4.4/ubuntu/dists/noble/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
2026-02-21T08:06:29.8653796Z W: https://repo.radeon.com/rocm/apt/6.4.4/dists/noble/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details.
2026-02-21T08:06:29.8659736Z + apt-get install -y git
2026-02-21T08:06:30.3020570Z Reading package lists...
2026-02-21T08:06:30.4440502Z Building dependency tree...
2026-02-21T08:06:30.4441351Z Reading state information...
2026-02-21T08:06:30.5419366Z The following additional packages will be installed:
2026-02-21T08:06:30.5420925Z   git-man less libcbor0.10 libcurl3t64-gnutls liberror-perl libfido2-1
2026-02-21T08:06:30.5422473Z   libxmuu1 openssh-client xauth
2026-02-21T08:06:30.5425755Z Suggested packages:
2026-02-21T08:06:30.5426176Z   gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-email git-gui
2026-02-21T08:06:30.5427310Z   gitk gitweb git-cvs git-mediawiki git-svn keychain libpam-ssh monkeysphere
2026-02-21T08:06:30.5427558Z   ssh-askpass
2026-02-21T08:06:30.5641183Z The following NEW packages will be installed:
2026-02-21T08:06:30.5645533Z   git git-man less libcbor0.10 libcurl3t64-gnutls liberror-perl libfido2-1
2026-02-21T08:06:30.5647021Z   libxmuu1 openssh-client xauth
2026-02-21T08:06:30.7894174Z 0 upgraded, 10 newly installed, 0 to remove and 101 not upgraded.
2026-02-21T08:06:30.7894696Z Need to get 6330 kB of archives.
2026-02-21T08:06:30.7895109Z After this operation, 29.8 MB of additional disk space will be used.
2026-02-21T08:06:30.7895819Z Get:1 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 less amd64 590-2ubuntu2.1 [142 kB]
2026-02-21T08:06:31.2801027Z Get:2 http://archive.ubuntu.com/ubuntu noble/main amd64 libcbor0.10 amd64 0.10.2-1.2ubuntu2 [25.8 kB]
2026-02-21T08:06:31.2924893Z Get:3 http://archive.ubuntu.com/ubuntu noble/main amd64 libfido2-1 amd64 1.14.0-1build3 [83.5 kB]
2026-02-21T08:06:31.3397239Z Get:4 http://archive.ubuntu.com/ubuntu noble/main amd64 libxmuu1 amd64 2:1.1.3-3build2 [8958 B]
2026-02-21T08:06:31.3441066Z Get:5 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 openssh-client amd64 1:9.6p1-3ubuntu13.14 [906 kB]
2026-02-21T08:06:31.6257104Z Get:6 http://archive.ubuntu.com/ubuntu noble/main amd64 xauth amd64 1:1.1.2-1build1 [25.6 kB]
2026-02-21T08:06:31.6287906Z Get:7 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libcurl3t64-gnutls amd64 8.5.0-2ubuntu10.6 [333 kB]
2026-02-21T08:06:31.6660475Z Get:8 http://archive.ubuntu.com/ubuntu noble/main amd64 liberror-perl all 0.17029-2 [25.6 kB]
2026-02-21T08:06:31.6677790Z Get:9 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git-man all 1:2.43.0-1ubuntu7.3 [1100 kB]
2026-02-21T08:06:31.7542840Z Get:10 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git amd64 1:2.43.0-1ubuntu7.3 [3680 kB]
2026-02-21T08:06:31.9598989Z debconf: delaying package configuration, since apt-utils is not installed
2026-02-21T08:06:31.9777922Z Fetched 6330 kB in 1s (4814 kB/s)
2026-02-21T08:06:31.9895156Z Selecting previously unselected package less.
2026-02-21T08:06:31.9975347Z (Reading database ... 28634 files and directories currently installed.)
2026-02-21T08:06:31.9979532Z Preparing to unpack .../0-less_590-2ubuntu2.1_amd64.deb ...
2026-02-21T08:06:31.9991478Z Unpacking less (590-2ubuntu2.1) ...
2026-02-21T08:06:32.0084888Z Selecting previously unselected package libcbor0.10:amd64.
2026-02-21T08:06:32.0100044Z Preparing to unpack .../1-libcbor0.10_0.10.2-1.2ubuntu2_amd64.deb ...
2026-02-21T08:06:32.0104100Z Unpacking libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ...
2026-02-21T08:06:32.0188921Z Selecting previously unselected package libfido2-1:amd64.
2026-02-21T08:06:32.0207078Z Preparing to unpack .../2-libfido2-1_1.14.0-1build3_amd64.deb ...
2026-02-21T08:06:32.0209103Z Unpacking libfido2-1:amd64 (1.14.0-1build3) ...
2026-02-21T08:06:32.0306556Z Selecting previously unselected package libxmuu1:amd64.
2026-02-21T08:06:32.0321809Z Preparing to unpack .../3-libxmuu1_2%3a1.1.3-3build2_amd64.deb ...
2026-02-21T08:06:32.0323981Z Unpacking libxmuu1:amd64 (2:1.1.3-3build2) ...
2026-02-21T08:06:32.0394543Z Selecting previously unselected package openssh-client.
2026-02-21T08:06:32.0410812Z Preparing to unpack .../4-openssh-client_1%3a9.6p1-3ubuntu13.14_amd64.deb ...
2026-02-21T08:06:32.0441039Z Unpacking openssh-client (1:9.6p1-3ubuntu13.14) ...
2026-02-21T08:06:32.0607361Z Selecting previously unselected package xauth.
2026-02-21T08:06:32.0622317Z Preparing to unpack .../5-xauth_1%3a1.1.2-1build1_amd64.deb ...
2026-02-21T08:06:32.0623895Z Unpacking xauth (1:1.1.2-1build1) ...
2026-02-21T08:06:32.0691034Z Selecting previously unselected package libcurl3t64-gnutls:amd64.
2026-02-21T08:06:32.0705520Z Preparing to unpack .../6-libcurl3t64-gnutls_8.5.0-2ubuntu10.6_amd64.deb ...
2026-02-21T08:06:32.0707285Z Unpacking libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ...
2026-02-21T08:06:32.0793522Z Selecting previously unselected package liberror-perl.
2026-02-21T08:06:32.0808586Z Preparing to unpack .../7-liberror-perl_0.17029-2_all.deb ...
2026-02-21T08:06:32.0810230Z Unpacking liberror-perl (0.17029-2) ...
2026-02-21T08:06:32.0886856Z Selecting previously unselected package git-man.
2026-02-21T08:06:32.0902765Z Preparing to unpack .../8-git-man_1%3a2.43.0-1ubuntu7.3_all.deb ...
2026-02-21T08:06:32.0904274Z Unpacking git-man (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:32.1007913Z Selecting previously unselected package git.
2026-02-21T08:06:32.1022664Z Preparing to unpack .../9-git_1%3a2.43.0-1ubuntu7.3_amd64.deb ...
2026-02-21T08:06:32.1047645Z Unpacking git (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:32.1680502Z Setting up libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ...
2026-02-21T08:06:32.1685402Z Setting up libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ...
2026-02-21T08:06:32.1690000Z Setting up less (590-2ubuntu2.1) ...
2026-02-21T08:06:32.1717215Z Setting up liberror-perl (0.17029-2) ...
2026-02-21T08:06:32.1721648Z Setting up git-man (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:32.1726328Z Setting up libfido2-1:amd64 (1.14.0-1build3) ...
2026-02-21T08:06:32.1730798Z Setting up libxmuu1:amd64 (2:1.1.3-3build2) ...
2026-02-21T08:06:32.1735981Z Setting up openssh-client (1:9.6p1-3ubuntu13.14) ...
2026-02-21T08:06:32.1989138Z Setting up git (1:2.43.0-1ubuntu7.3) ...
2026-02-21T08:06:32.2020178Z Setting up xauth (1:1.1.2-1build1) ...
2026-02-21T08:06:32.2025878Z Processing triggers for libc-bin (2.39-0ubuntu8.6) ...
2026-02-21T08:06:32.2476291Z ##[group]Run actions/checkout@v6
2026-02-21T08:06:32.2476426Z with:
2026-02-21T08:06:32.2476519Z   repository: pytorch/helion
2026-02-21T08:06:32.2476696Z   token: ***
2026-02-21T08:06:32.2476785Z   ssh-strict: true
2026-02-21T08:06:32.2476872Z   ssh-user: git
2026-02-21T08:06:32.2476970Z   persist-credentials: true
2026-02-21T08:06:32.2477074Z   clean: true
2026-02-21T08:06:32.2477170Z   sparse-checkout-cone-mode: true
2026-02-21T08:06:32.2477283Z   fetch-depth: 1
2026-02-21T08:06:32.2477370Z   fetch-tags: false
2026-02-21T08:06:32.2477458Z   show-progress: true
2026-02-21T08:06:32.2477548Z   lfs: false
2026-02-21T08:06:32.2477628Z   submodules: false
2026-02-21T08:06:32.2477721Z   set-safe-directory: true
2026-02-21T08:06:32.2477986Z env:
2026-02-21T08:06:32.2478068Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:32.2478176Z ##[endgroup]
2026-02-21T08:06:32.2502266Z ##[command]/usr/bin/docker exec  9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:06:32.4892468Z Syncing repository: pytorch/helion
2026-02-21T08:06:32.4893426Z ##[group]Getting Git version info
2026-02-21T08:06:32.4893628Z Working directory is '/__w/helion/helion'
2026-02-21T08:06:32.4893901Z [command]/usr/bin/git version
2026-02-21T08:06:32.4894019Z git version 2.43.0
2026-02-21T08:06:32.4895009Z ##[endgroup]
2026-02-21T08:06:32.4904512Z Temporarily overriding HOME='/__w/_temp/bd5d00ae-b4c0-4db9-886a-94e3a2619ff1' before making global git config changes
2026-02-21T08:06:32.4904847Z Adding repository directory to the temporary git global config as a safe directory
2026-02-21T08:06:32.4907161Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion
2026-02-21T08:06:32.4929907Z Deleting the contents of '/__w/helion/helion'
2026-02-21T08:06:32.4931579Z ##[group]Initializing the repository
2026-02-21T08:06:32.4932751Z [command]/usr/bin/git init /__w/helion/helion
2026-02-21T08:06:32.4953495Z hint: Using 'master' as the name for the initial branch. This default branch name
2026-02-21T08:06:32.4953778Z hint: is subject to change. To configure the initial branch name to use in all
2026-02-21T08:06:32.4953997Z hint: of your new repositories, which will suppress this warning, call:
2026-02-21T08:06:32.4954154Z hint: 
2026-02-21T08:06:32.4954307Z hint: 	git config --global init.defaultBranch <name>
2026-02-21T08:06:32.4954450Z hint: 
2026-02-21T08:06:32.4954584Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2026-02-21T08:06:32.4954806Z hint: 'development'. The just-created branch can be renamed via this command:
2026-02-21T08:06:32.4954992Z hint: 
2026-02-21T08:06:32.4955083Z hint: 	git branch -m <name>
2026-02-21T08:06:32.4955739Z Initialized empty Git repository in /__w/helion/helion/.git/
2026-02-21T08:06:32.4961049Z [command]/usr/bin/git remote add origin https://github.com/pytorch/helion
2026-02-21T08:06:32.4979794Z ##[endgroup]
2026-02-21T08:06:32.4979997Z ##[group]Disabling automatic garbage collection
2026-02-21T08:06:32.4981285Z [command]/usr/bin/git config --local gc.auto 0
2026-02-21T08:06:32.4997946Z ##[endgroup]
2026-02-21T08:06:32.4998136Z ##[group]Setting up auth
2026-02-21T08:06:32.4998921Z Removing SSH command configuration
2026-02-21T08:06:32.5001471Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2026-02-21T08:06:32.5017141Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2026-02-21T08:06:32.5150674Z Removing HTTP extra header
2026-02-21T08:06:32.5151542Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2026-02-21T08:06:32.5164128Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2026-02-21T08:06:32.5279888Z Removing includeIf entries pointing to credentials config files
2026-02-21T08:06:32.5282016Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir:
2026-02-21T08:06:32.5296110Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url
2026-02-21T08:06:32.5422323Z [command]/usr/bin/git config --file /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config http.https://github.com/.extraheader AUTHORIZATION: basic ***
2026-02-21T08:06:32.5440483Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T08:06:32.5458919Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T08:06:32.5474714Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T08:06:32.5489063Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T08:06:32.5500881Z ##[endgroup]
2026-02-21T08:06:32.5501066Z ##[group]Fetching the repository
2026-02-21T08:06:32.5504618Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +874a7d0cadab18218a84ad3579d329dc95c51820:refs/remotes/origin/main
2026-02-21T08:06:33.0736891Z From https://github.com/pytorch/helion
2026-02-21T08:06:33.0737206Z  * [new ref]         874a7d0cadab18218a84ad3579d329dc95c51820 -> origin/main
2026-02-21T08:06:33.0753384Z [command]/usr/bin/git branch --list --remote origin/main
2026-02-21T08:06:33.0773358Z   origin/main
2026-02-21T08:06:33.0779133Z [command]/usr/bin/git rev-parse refs/remotes/origin/main
2026-02-21T08:06:33.0794774Z 874a7d0cadab18218a84ad3579d329dc95c51820
2026-02-21T08:06:33.0797217Z ##[endgroup]
2026-02-21T08:06:33.0797381Z ##[group]Determining the checkout info
2026-02-21T08:06:33.0800326Z ##[endgroup]
2026-02-21T08:06:33.0802059Z [command]/usr/bin/git sparse-checkout disable
2026-02-21T08:06:33.0821783Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2026-02-21T08:06:33.0836642Z ##[group]Checking out the ref
2026-02-21T08:06:33.0838701Z [command]/usr/bin/git checkout --progress --force -B main refs/remotes/origin/main
2026-02-21T08:06:33.1011840Z Switched to a new branch 'main'
2026-02-21T08:06:33.1012306Z branch 'main' set up to track 'origin/main'.
2026-02-21T08:06:33.1014055Z ##[endgroup]
2026-02-21T08:06:33.1040433Z [command]/usr/bin/git log -1 --format=%H
2026-02-21T08:06:33.1052039Z 874a7d0cadab18218a84ad3579d329dc95c51820
2026-02-21T08:06:33.1193214Z ##[group]Run actions/setup-python@v6
2026-02-21T08:06:33.1193358Z with:
2026-02-21T08:06:33.1193447Z   python-version: 3.12
2026-02-21T08:06:33.1193544Z   check-latest: false
2026-02-21T08:06:33.1193685Z   token: ***
2026-02-21T08:06:33.1193776Z   update-environment: true
2026-02-21T08:06:33.1193883Z   allow-prereleases: false
2026-02-21T08:06:33.1193995Z   freethreaded: false
2026-02-21T08:06:33.1194086Z env:
2026-02-21T08:06:33.1194172Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:33.1194283Z ##[endgroup]
2026-02-21T08:06:33.1196577Z ##[command]/usr/bin/docker exec  9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:06:33.3174070Z ##[group]Installed versions
2026-02-21T08:06:33.3179507Z Version 3.12 was not found in the local cache
2026-02-21T08:06:33.8491681Z Version 3.12 is available for downloading
2026-02-21T08:06:33.8492316Z Download from "https://github.com/actions/python-versions/releases/download/3.12.12-18393146713/python-3.12.12-linux-24.04-x64.tar.gz"
2026-02-21T08:06:34.8870495Z Extract downloaded archive
2026-02-21T08:06:34.8965100Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/3b2cb34e-d9a4-4ea8-ac0a-6dc84bdd17d8 -f /__w/_temp/6fe56298-1f17-4909-b978-5a49907e4cf3
2026-02-21T08:06:36.1661639Z Execute installation script
2026-02-21T08:06:36.1749346Z Check if Python hostedtoolcache folder exist...
2026-02-21T08:06:36.1750744Z Creating Python hostedtoolcache folder...
2026-02-21T08:06:36.1757295Z Create Python 3.12.12 folder
2026-02-21T08:06:36.1760418Z Copy Python binaries to hostedtoolcache folder
2026-02-21T08:06:36.6031754Z Create additional symlinks (Required for the UsePythonVersion Azure Pipelines task and the setup-python GitHub Action)
2026-02-21T08:06:36.6047920Z Upgrading pip...
2026-02-21T08:06:37.5936524Z Looking in links: /tmp/tmpzhzlgfrp
2026-02-21T08:06:37.5937235Z Requirement already satisfied: pip in /__w/_tool/Python/3.12.12/x64/lib/python3.12/site-packages (25.0.1)
2026-02-21T08:06:37.5974333Z ##[error]WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
2026-02-21T08:06:38.0154447Z ##[error]WARNING: The directory '/github/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
2026-02-21T08:06:38.1388143Z Collecting pip
2026-02-21T08:06:38.1764781Z Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB)
2026-02-21T08:06:38.1818550Z Downloading pip-26.0.1-py3-none-any.whl (1.8 MB)
2026-02-21T08:06:38.2049445Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 162.6 MB/s eta 0:00:00
2026-02-21T08:06:38.2049651Z 
2026-02-21T08:06:38.2122413Z Installing collected packages: pip
2026-02-21T08:06:38.2123620Z Attempting uninstall: pip
2026-02-21T08:06:38.2132261Z Found existing installation: pip 25.0.1
2026-02-21T08:06:38.2280191Z Uninstalling pip-25.0.1:
2026-02-21T08:06:38.2308249Z Successfully uninstalled pip-25.0.1
2026-02-21T08:06:38.6990413Z Successfully installed pip-26.0.1
2026-02-21T08:06:38.7271499Z Create complete file
2026-02-21T08:06:38.7491632Z Successfully set up CPython (3.12.12)
2026-02-21T08:06:38.7492015Z ##[endgroup]
2026-02-21T08:06:38.8098284Z ##[group]Run astral-sh/setup-uv@v7
2026-02-21T08:06:38.8098469Z with:
2026-02-21T08:06:38.8098597Z   activate-environment: false
2026-02-21T08:06:38.8098789Z   working-directory: /home/runner/_work/helion/helion
2026-02-21T08:06:38.8113532Z   github-token: ***
2026-02-21T08:06:38.8113661Z   enable-cache: auto
2026-02-21T08:06:38.8114020Z   cache-dependency-glob: **/*requirements*.txt
**/*requirements*.in
**/*constraints*.txt
**/*constraints*.in
**/pyproject.toml
**/uv.lock
**/*.py.lock

2026-02-21T08:06:38.8114456Z   restore-cache: true
2026-02-21T08:06:38.8114590Z   save-cache: true
2026-02-21T08:06:38.8114712Z   prune-cache: true
2026-02-21T08:06:38.8114842Z   cache-python: false
2026-02-21T08:06:38.8114983Z   ignore-nothing-to-cache: false
2026-02-21T08:06:38.8115148Z   ignore-empty-workdir: false
2026-02-21T08:06:38.8115314Z   add-problem-matchers: true
2026-02-21T08:06:38.8115468Z   resolution-strategy: highest
2026-02-21T08:06:38.8115618Z env:
2026-02-21T08:06:38.8115764Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:38.8115941Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:38.8116169Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:06:38.8116390Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:38.8116587Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:38.8116789Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:38.8116988Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T08:06:38.8117168Z ##[endgroup]
2026-02-21T08:06:38.8121922Z ##[command]/usr/bin/docker exec  9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T08:06:39.0298113Z (node:701) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
2026-02-21T08:06:39.0298475Z (Use `node --trace-deprecation ...` to show where the warning was created)
2026-02-21T08:06:39.0335857Z Trying to find version for uv in: /__w/helion/helion/uv.toml
2026-02-21T08:06:39.0336039Z Could not find file: /__w/helion/helion/uv.toml
2026-02-21T08:06:39.0336213Z Trying to find version for uv in: /__w/helion/helion/pyproject.toml
2026-02-21T08:06:39.0373196Z Could not determine uv version from uv.toml or pyproject.toml. Falling back to latest.
2026-02-21T08:06:39.0373486Z Getting latest version from GitHub API...
2026-02-21T08:06:39.2686829Z manifest-file not provided, reading from local file.
2026-02-21T08:06:39.2709579Z manifest-file does not contain version 0.10.4, arch x86_64, platform unknown-linux-gnu. Falling back to GitHub releases.
2026-02-21T08:06:39.2710749Z Downloading uv from "https://github.com/astral-sh/uv/releases/download/0.10.4/uv-x86_64-unknown-linux-gnu.tar.gz" ...
2026-02-21T08:06:40.2341718Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/416381ee-2d9f-4ce0-98e4-51a9429e21dc -f /__w/_temp/2cdffd75-6a4f-4df0-a6ed-6b2378cec22a
2026-02-21T08:06:40.4986843Z Added /github/home/.local/bin to the path
2026-02-21T08:06:40.4989351Z Added /__w/_tool/uv/0.10.4/x86_64 to the path
2026-02-21T08:06:40.4989847Z Set UV_PYTHON_INSTALL_DIR to /github/home/.local/share/uv/python
2026-02-21T08:06:40.4990348Z Added /github/home/.local/share/uv/python to the path
2026-02-21T08:06:40.5001804Z Successfully installed uv version 0.10.4
2026-02-21T08:06:40.6342867Z ##[group]Run uv venv --python 3.12
2026-02-21T08:06:40.6343053Z uv venv --python 3.12
2026-02-21T08:06:40.6343335Z shell: bash -l {0}
2026-02-21T08:06:40.6343425Z env:
2026-02-21T08:06:40.6343512Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:40.6343655Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:40.6343818Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:06:40.6343982Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:40.6344127Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:40.6344264Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:40.6344408Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T08:06:40.6344565Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:06:40.6344716Z ##[endgroup]
2026-02-21T08:06:40.7164082Z Using CPython 3.12.12 interpreter at: /__w/_tool/Python/3.12.12/x64/bin/python3.12
2026-02-21T08:06:40.7164314Z Creating virtual environment at: .venv
2026-02-21T08:06:40.7171937Z Activate with: source .venv/bin/activate
2026-02-21T08:06:40.7230159Z ##[group]Run source .venv/bin/activate
2026-02-21T08:06:40.7230342Z source .venv/bin/activate
2026-02-21T08:06:40.7230706Z uv pip install -U "torch==2.9.*" --index-url https://download.pytorch.org/whl/rocm6.4
2026-02-21T08:06:40.7231055Z shell: bash -l {0}
2026-02-21T08:06:40.7231164Z env:
2026-02-21T08:06:40.7231267Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:06:40.7231431Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:40.7231629Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:06:40.7231826Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:40.7231997Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:40.7232163Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:06:40.7232340Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T08:06:40.7232533Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:06:40.7232715Z ##[endgroup]
2026-02-21T08:06:42.7054904Z Resolved 11 packages in 1.91s
2026-02-21T08:06:42.7824116Z Downloading networkx (2.0MiB)
2026-02-21T08:06:42.7824484Z Downloading sympy (6.0MiB)
2026-02-21T08:06:42.8882806Z Downloading torch (4.2GiB)
2026-02-21T08:06:42.8883057Z Downloading pytorch-triton-rocm (261.8MiB)
2026-02-21T08:06:45.3117466Z  Downloaded networkx
2026-02-21T08:06:45.7762499Z  Downloaded sympy
2026-02-21T08:06:49.3716728Z  Downloaded pytorch-triton-rocm
2026-02-21T08:07:50.0456391Z  Downloaded torch
2026-02-21T08:07:50.0456620Z Prepared 11 packages in 1m 07s
2026-02-21T08:07:50.0654753Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:07:50.0655076Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:07:50.0655411Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:07:55.7777469Z Installed 11 packages in 5.73s
2026-02-21T08:07:55.7781566Z  + filelock==3.20.0
2026-02-21T08:07:55.7783143Z  + fsspec==2025.12.0
2026-02-21T08:07:55.7783498Z  + jinja2==3.1.6
2026-02-21T08:07:55.7783711Z  + markupsafe==3.0.2
2026-02-21T08:07:55.7783915Z  + mpmath==1.3.0
2026-02-21T08:07:55.7784108Z  + networkx==3.6.1
2026-02-21T08:07:55.7784364Z  + pytorch-triton-rocm==3.5.1
2026-02-21T08:07:55.7784628Z  + setuptools==70.2.0
2026-02-21T08:07:55.7784838Z  + sympy==1.14.0
2026-02-21T08:07:55.7785031Z  + torch==2.9.1+rocm6.4
2026-02-21T08:07:55.7785257Z  + typing-extensions==4.15.0
2026-02-21T08:07:55.7978618Z ##[group]Run source .venv/bin/activate
2026-02-21T08:07:55.7978815Z source .venv/bin/activate
2026-02-21T08:07:55.7978992Z SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0" uv pip install -e .'[dev]'
2026-02-21T08:07:55.7979208Z python -c "import helion; print(helion.__name__)"
2026-02-21T08:07:55.7979734Z shell: bash -l {0}
2026-02-21T08:07:55.7979831Z env:
2026-02-21T08:07:55.7979927Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:07:55.7980058Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:55.7980222Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:07:55.7980380Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:55.7980521Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:55.7980668Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:07:55.7980809Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T08:07:55.7980984Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:07:55.7981129Z ##[endgroup]
2026-02-21T08:07:56.6291367Z Resolved 30 packages in 547ms
2026-02-21T08:07:56.6300828Z    Building helion @ file:///__w/helion/helion
2026-02-21T08:07:56.6370043Z Downloading virtualenv (5.6MiB)
2026-02-21T08:07:56.6425396Z Downloading scipy (33.4MiB)
2026-02-21T08:07:56.6426372Z Downloading pygments (1.2MiB)
2026-02-21T08:07:56.6458179Z Downloading numpy (15.8MiB)
2026-02-21T08:07:56.6459811Z Downloading scikit-learn (8.5MiB)
2026-02-21T08:07:56.7307943Z       Built helion @ file:///__w/helion/helion
2026-02-21T08:07:56.7452331Z  Downloaded virtualenv
2026-02-21T08:07:56.7537267Z  Downloaded pygments
2026-02-21T08:07:56.9269436Z  Downloaded scikit-learn
2026-02-21T08:07:56.9314119Z  Downloaded numpy
2026-02-21T08:07:57.2801948Z  Downloaded scipy
2026-02-21T08:07:57.2803371Z Prepared 27 packages in 651ms
2026-02-21T08:07:57.2813521Z Uninstalled 1 package in 0.92ms
2026-02-21T08:07:57.2820008Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:07:57.2820330Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:07:57.2820641Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:07:57.3648757Z Installed 29 packages in 83ms
2026-02-21T08:07:57.3649476Z  + cfgv==3.5.0
2026-02-21T08:07:57.3649884Z  + distlib==0.4.0
2026-02-21T08:07:57.3651113Z  + expecttest==0.3.0
2026-02-21T08:07:57.3651308Z  + filecheck==1.0.3
2026-02-21T08:07:57.3651422Z  - filelock==3.20.0
2026-02-21T08:07:57.3651517Z  + filelock==3.24.3
2026-02-21T08:07:57.3651641Z  + helion==0.0.0 (from file:///__w/helion/helion)
2026-02-21T08:07:57.3651800Z  + hypothesis==6.151.9
2026-02-21T08:07:57.3651924Z  + identify==2.6.16
2026-02-21T08:07:57.3652024Z  + iniconfig==2.3.0
2026-02-21T08:07:57.3652114Z  + joblib==1.5.3
2026-02-21T08:07:57.3652217Z  + markdown-it-py==4.0.0
2026-02-21T08:07:57.3652378Z  + mdurl==0.1.2
2026-02-21T08:07:57.3652472Z  + nodeenv==1.10.0
2026-02-21T08:07:57.3652558Z  + numpy==2.4.2
2026-02-21T08:07:57.3652650Z  + packaging==26.0
2026-02-21T08:07:57.3652746Z  + platformdirs==4.9.2
2026-02-21T08:07:57.3652876Z  + pluggy==1.6.0
2026-02-21T08:07:57.3652967Z  + pre-commit==4.5.1
2026-02-21T08:07:57.3653065Z  + psutil==7.2.2
2026-02-21T08:07:57.3653748Z  + pygments==2.19.2
2026-02-21T08:07:57.3653854Z  + pytest==9.0.2
2026-02-21T08:07:57.3653944Z  + pytest-timeout==2.4.0
2026-02-21T08:07:57.3654046Z  + pyyaml==6.0.3
2026-02-21T08:07:57.3654132Z  + rich==14.3.3
2026-02-21T08:07:57.3654225Z  + scikit-learn==1.8.0
2026-02-21T08:07:57.3654323Z  + scipy==1.17.0
2026-02-21T08:07:57.3654418Z  + sortedcontainers==2.4.0
2026-02-21T08:07:57.3654528Z  + threadpoolctl==3.6.0
2026-02-21T08:07:57.3654624Z  + virtualenv==20.38.0
2026-02-21T08:08:12.5966938Z helion
2026-02-21T08:08:14.0216728Z ##[group]Run set -x
2026-02-21T08:08:14.0216885Z set -x
2026-02-21T08:08:14.0216984Z source .venv/bin/activate
2026-02-21T08:08:14.0217111Z uv pip install pip
2026-02-21T08:08:14.0217240Z uv pip install quack-kernels --no-deps
2026-02-21T08:08:14.0217402Z mkdir -p benchmarks/ && pushd benchmarks/
2026-02-21T08:08:14.0217588Z git clone https://github.com/pytorch-labs/tritonbench/
2026-02-21T08:08:14.0217766Z pushd tritonbench/
2026-02-21T08:08:14.0217895Z git submodule update --init --recursive
2026-02-21T08:08:14.0218039Z uv pip install -r requirements.txt
2026-02-21T08:08:14.0218175Z python install.py --liger
2026-02-21T08:08:14.0218295Z uv pip install -e . --no-deps
2026-02-21T08:08:14.0218415Z popd
2026-02-21T08:08:14.0218501Z popd
2026-02-21T08:08:14.0218731Z shell: bash -l {0}
2026-02-21T08:08:14.0218818Z env:
2026-02-21T08:08:14.0218931Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:08:14.0219060Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:14.0219227Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:08:14.0219386Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:14.0219527Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:14.0219669Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:14.0219807Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T08:08:14.0219971Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:08:14.0220114Z ##[endgroup]
2026-02-21T08:08:14.0578917Z + source .venv/bin/activate
2026-02-21T08:08:14.0579264Z ++ '[' -z '' ']'
2026-02-21T08:08:14.0579393Z ++ '[' -n x ']'
2026-02-21T08:08:14.0579535Z ++ SCRIPT_PATH=.venv/bin/activate
2026-02-21T08:08:14.0579755Z ++ '[' .venv/bin/activate = /__w/_temp/8b34363f-b31b-4bac-8d07-2e9aab508267.sh ']'
2026-02-21T08:08:14.0580062Z ++ deactivate nondestructive
2026-02-21T08:08:14.0580207Z ++ unset -f pydoc
2026-02-21T08:08:14.0580321Z ++ '[' -z '' ']'
2026-02-21T08:08:14.0580430Z ++ '[' -z '' ']'
2026-02-21T08:08:14.0580538Z ++ hash -r
2026-02-21T08:08:14.0580640Z ++ '[' -z '' ']'
2026-02-21T08:08:14.0580752Z ++ unset VIRTUAL_ENV
2026-02-21T08:08:14.0580881Z ++ unset VIRTUAL_ENV_PROMPT
2026-02-21T08:08:14.0581035Z ++ '[' '!' nondestructive = nondestructive ']'
2026-02-21T08:08:14.0581202Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv
2026-02-21T08:08:14.0581379Z ++ '[' linux-gnu = cygwin ']'
2026-02-21T08:08:14.0581515Z ++ '[' linux-gnu = msys ']'
2026-02-21T08:08:14.0581646Z ++ export VIRTUAL_ENV
2026-02-21T08:08:14.0581764Z ++ '[' -z '' ']'
2026-02-21T08:08:14.0581878Z ++ unset SCRIPT_PATH
2026-02-21T08:08:14.0582357Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:08:14.0583216Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T08:08:14.0583694Z ++ export PATH
2026-02-21T08:08:14.0583812Z ++ '[' xhelion '!=' x ']'
2026-02-21T08:08:14.0583949Z ++ VIRTUAL_ENV_PROMPT=helion
2026-02-21T08:08:14.0584091Z ++ export VIRTUAL_ENV_PROMPT
2026-02-21T08:08:14.0584670Z ++ '[' -z '' ']'
2026-02-21T08:08:14.0584773Z ++ '[' -z '' ']'
2026-02-21T08:08:14.0584883Z ++ _OLD_VIRTUAL_PS1=
2026-02-21T08:08:14.0585004Z ++ PS1='(helion) '
2026-02-21T08:08:14.0585124Z ++ export PS1
2026-02-21T08:08:14.0585228Z ++ alias pydoc
2026-02-21T08:08:14.0585338Z ++ true
2026-02-21T08:08:14.0585436Z ++ hash -r
2026-02-21T08:08:14.0585549Z + uv pip install pip
2026-02-21T08:08:14.1044019Z Resolved 1 package in 42ms
2026-02-21T08:08:14.1084769Z Downloading pip (1.7MiB)
2026-02-21T08:08:14.1437398Z  Downloaded pip
2026-02-21T08:08:14.1437547Z Prepared 1 package in 39ms
2026-02-21T08:08:14.1605257Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:08:14.1605584Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:08:14.1605897Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:08:14.1746640Z Installed 1 package in 30ms
2026-02-21T08:08:14.1747162Z  + pip==26.0.1
2026-02-21T08:08:14.1869287Z + uv pip install quack-kernels --no-deps
2026-02-21T08:08:14.2113847Z Resolved 1 package in 18ms
2026-02-21T08:08:14.2379151Z Prepared 1 package in 26ms
2026-02-21T08:08:14.2532502Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:08:14.2532980Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:08:14.2533417Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:08:14.2545921Z Installed 1 package in 16ms
2026-02-21T08:08:14.2546106Z  + quack-kernels==0.2.10
2026-02-21T08:08:14.2666968Z + mkdir -p benchmarks/
2026-02-21T08:08:14.2676690Z + pushd benchmarks/
2026-02-21T08:08:14.2678883Z /__w/helion/helion/benchmarks /__w/helion/helion
2026-02-21T08:08:14.2679245Z + git clone https://github.com/pytorch-labs/tritonbench/
2026-02-21T08:08:14.2688753Z Cloning into 'tritonbench'...
2026-02-21T08:08:14.9256902Z + pushd tritonbench/
2026-02-21T08:08:14.9257274Z + git submodule update --init --recursive
2026-02-21T08:08:14.9257771Z /__w/helion/helion/benchmarks/tritonbench /__w/helion/helion/benchmarks /__w/helion/helion
2026-02-21T08:08:14.9380445Z Submodule 'submodules/ThunderKittens' (https://github.com/HazyResearch/ThunderKittens.git) registered for path 'submodules/ThunderKittens'
2026-02-21T08:08:14.9381594Z Submodule 'submodules/aiter' (https://github.com/ROCm/aiter.git) registered for path 'submodules/aiter'
2026-02-21T08:08:14.9382494Z Submodule 'submodules/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/cutlass'
2026-02-21T08:08:14.9383545Z Submodule 'submodules/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/flash-attention'
2026-02-21T08:08:14.9384860Z Submodule 'submodules/generative-recommenders' (https://github.com/facebookresearch/generative-recommenders.git) registered for path 'submodules/generative-recommenders'
2026-02-21T08:08:14.9386141Z Submodule 'submodules/xformers' (https://github.com/facebookresearch/xformers.git) registered for path 'submodules/xformers'
2026-02-21T08:08:14.9398423Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/ThunderKittens'...
2026-02-21T08:08:16.2046436Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter'...
2026-02-21T08:08:22.2022356Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/cutlass'...
2026-02-21T08:08:24.8406348Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention'...
2026-02-21T08:08:25.4940945Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders'...
2026-02-21T08:08:25.9129179Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers'...
2026-02-21T08:08:26.8389395Z Submodule path 'submodules/ThunderKittens': checked out '25f7568450b412a1984a4f619fb28373df06fa1b'
2026-02-21T08:08:27.0674596Z Submodule path 'submodules/aiter': checked out '1f5b378dcc9d9b0bcd9456c8c767b7424a5e8190'
2026-02-21T08:08:27.0690797Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/aiter/3rdparty/composable_kernel'
2026-02-21T08:08:27.0706819Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter/3rdparty/composable_kernel'...
2026-02-21T08:08:29.7583619Z Submodule path 'submodules/aiter/3rdparty/composable_kernel': checked out 'e31a7a4f29b371c32ea9daf9211b6ae1fed2fa40'
2026-02-21T08:08:30.1055331Z Submodule path 'submodules/cutlass': checked out 'ad7b2f5e84fcfa124cb02b91d5bd26d238c0459e'
2026-02-21T08:08:30.1605805Z Submodule path 'submodules/flash-attention': checked out '43375aab2893018dfb7950db1cfa623c14946ad6'
2026-02-21T08:08:30.1616825Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/flash-attention/csrc/composable_kernel'
2026-02-21T08:08:30.1617444Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/flash-attention/csrc/cutlass'
2026-02-21T08:08:30.1631971Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/composable_kernel'...
2026-02-21T08:08:33.0968610Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/cutlass'...
2026-02-21T08:08:35.7971284Z Submodule path 'submodules/flash-attention/csrc/composable_kernel': checked out 'e8709c24f403173ad21a2da907d1347957e324fb'
2026-02-21T08:08:36.1649429Z Submodule path 'submodules/flash-attention/csrc/cutlass': checked out 'b1d6e2c9b334dfa811e4183dfbd02419249e4b52'
2026-02-21T08:08:36.1864044Z Submodule path 'submodules/generative-recommenders': checked out '88512dbd71b053226bc4ef8ec1630e3db53e55e5'
2026-02-21T08:08:36.1874867Z Submodule 'generative_recommenders/ops/cpp/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'
2026-02-21T08:08:36.1887930Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'...
2026-02-21T08:08:39.0701999Z Submodule path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass': checked out 'dc4817921edda44a549197ff3a9dcf5df0636e7b'
2026-02-21T08:08:39.1204716Z Submodule path 'submodules/xformers': checked out '8fc8ec5a4d6498ff81c0c418b89bbaf133ae3a44'
2026-02-21T08:08:39.1213592Z Submodule 'third_party/composable_kernel_tiled' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/composable_kernel_tiled'
2026-02-21T08:08:39.1214472Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/cutlass'
2026-02-21T08:08:39.1215295Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/xformers/third_party/flash-attention'
2026-02-21T08:08:39.1231537Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/composable_kernel_tiled'...
2026-02-21T08:08:41.6554037Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/cutlass'...
2026-02-21T08:08:44.2050770Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention'...
2026-02-21T08:08:45.1218340Z Submodule path 'submodules/xformers/third_party/composable_kernel_tiled': checked out '4f54fa30583704f34da2ac50372d524cae6bad7d'
2026-02-21T08:08:45.4552652Z Submodule path 'submodules/xformers/third_party/cutlass': checked out 'e9627ce55b42fd2599f58cd4396da9380954def0'
2026-02-21T08:08:45.4983453Z Submodule path 'submodules/xformers/third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5'
2026-02-21T08:08:45.5000532Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel'
2026-02-21T08:08:45.5002075Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/cutlass'
2026-02-21T08:08:45.5024656Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/composable_kernel'...
2026-02-21T08:08:48.2494124Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/cutlass'...
2026-02-21T08:08:51.0770868Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33'
2026-02-21T08:08:51.4054233Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420'
2026-02-21T08:08:51.4087577Z + uv pip install -r requirements.txt
2026-02-21T08:08:51.4146378Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:08:51.6581470Z Resolved 30 packages in 242ms
2026-02-21T08:08:51.6697141Z Downloading pillow (6.7MiB)
2026-02-21T08:08:51.6697298Z Downloading fonttools (4.7MiB)
2026-02-21T08:08:51.6697427Z Downloading hf-xet (3.2MiB)
2026-02-21T08:08:51.6697774Z Downloading transformers (10.3MiB)
2026-02-21T08:08:51.6698478Z Downloading tokenizers (3.0MiB)
2026-02-21T08:08:51.6728518Z Downloading matplotlib (8.3MiB)
2026-02-21T08:08:51.6729608Z Downloading kiwisolver (1.4MiB)
2026-02-21T08:08:51.7278960Z  Downloaded kiwisolver
2026-02-21T08:08:51.7751171Z  Downloaded tokenizers
2026-02-21T08:08:51.7808552Z  Downloaded hf-xet
2026-02-21T08:08:51.8445598Z  Downloaded pillow
2026-02-21T08:08:51.8450742Z  Downloaded fonttools
2026-02-21T08:08:51.8688435Z  Downloaded matplotlib
2026-02-21T08:08:51.9832275Z  Downloaded transformers
2026-02-21T08:08:51.9832643Z Prepared 23 packages in 324ms
2026-02-21T08:08:52.0023705Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:08:52.0024562Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:08:52.0025406Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:08:52.2505049Z Installed 23 packages in 263ms
2026-02-21T08:08:52.2505303Z  + certifi==2026.1.4
2026-02-21T08:08:52.2505501Z  + charset-normalizer==3.4.4
2026-02-21T08:08:52.2505654Z  + contourpy==1.3.3
2026-02-21T08:08:52.2505784Z  + cycler==0.12.1
2026-02-21T08:08:52.2505907Z  + fonttools==4.61.1
2026-02-21T08:08:52.2506037Z  + hf-xet==1.2.0
2026-02-21T08:08:52.2506172Z  + huggingface-hub==0.36.2
2026-02-21T08:08:52.2506315Z  + idna==3.11
2026-02-21T08:08:52.2506432Z  + kiwisolver==1.4.9
2026-02-21T08:08:52.2506566Z  + matplotlib==3.10.8
2026-02-21T08:08:52.2506703Z  + nvidia-ml-py==13.590.48
2026-02-21T08:08:52.2506864Z  + pillow==12.1.1
2026-02-21T08:08:52.2506991Z  + pyparsing==3.3.2
2026-02-21T08:08:52.2507127Z  + python-dateutil==2.9.0.post0
2026-02-21T08:08:52.2507278Z  + regex==2026.2.19
2026-02-21T08:08:52.2507397Z  + requests==2.32.5
2026-02-21T08:08:52.2507524Z  + safetensors==0.7.0
2026-02-21T08:08:52.2507648Z  + six==1.17.0
2026-02-21T08:08:52.2507768Z  + tabulate==0.9.0
2026-02-21T08:08:52.2507891Z  + tokenizers==0.21.4
2026-02-21T08:08:52.2508018Z  + tqdm==4.67.3
2026-02-21T08:08:52.2508138Z  + transformers==4.53.0
2026-02-21T08:08:52.2508276Z  + urllib3==2.6.3
2026-02-21T08:08:52.2786816Z + python install.py --liger
2026-02-21T08:08:56.7276050Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:08:56.7299382Z Audited 6 packages in 2ms
2026-02-21T08:08:56.7680197Z INFO:__main__:[tritonbench] installing liger-kernels...
2026-02-21T08:08:56.7720881Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:08:56.8515798Z Resolved 1 package in 78ms
2026-02-21T08:08:56.9000623Z Prepared 1 package in 48ms
2026-02-21T08:08:56.9176776Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:08:56.9177434Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:08:56.9178059Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:08:56.9214070Z Installed 1 package in 21ms
2026-02-21T08:08:56.9214632Z  + liger-kernel-nightly==0.7.0.dev20260219183429
2026-02-21T08:08:56.9296970Z INFO:__main__:[tritonbench] installation complete!
2026-02-21T08:08:57.7295511Z + uv pip install -e . --no-deps
2026-02-21T08:08:57.7612542Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv
2026-02-21T08:08:57.7635190Z Resolved 1 package in 1ms
2026-02-21T08:08:57.7639596Z    Building tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench
2026-02-21T08:08:58.4768113Z       Built tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench
2026-02-21T08:08:58.4965981Z Prepared 1 package in 733ms
2026-02-21T08:08:58.4970596Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.
2026-02-21T08:08:58.4971072Z          If the cache and target directories are on different filesystems, hardlinking may not be supported.
2026-02-21T08:08:58.4971546Z          If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.
2026-02-21T08:08:58.4974906Z Installed 1 package in 0.80ms
2026-02-21T08:08:58.4975154Z  + tritonbench==0.0.1 (from file:///__w/helion/helion/benchmarks/tritonbench)
2026-02-21T08:08:58.5324764Z + popd
2026-02-21T08:08:58.5325002Z + popd
2026-02-21T08:08:58.5325383Z /__w/helion/helion/benchmarks /__w/helion/helion
2026-02-21T08:08:58.5325617Z /__w/helion/helion
2026-02-21T08:08:58.5410480Z ##[group]Run rm -rf /tmp/torchinductor_*/ || true
2026-02-21T08:08:58.5410838Z rm -rf /tmp/torchinductor_*/ || true
2026-02-21T08:08:58.5411068Z
2026-02-21T08:08:58.5411207Z source .venv/bin/activate
2026-02-21T08:08:58.5411382Z
2026-02-21T08:08:58.5411539Z TEST_REPORTS_DIR=$(pwd)/test/test-reports
2026-02-21T08:08:58.5411766Z mkdir -p "$TEST_REPORTS_DIR"
2026-02-21T08:08:58.5411955Z echo "$TEST_REPORTS_DIR"
2026-02-21T08:08:58.5412165Z
2026-02-21T08:08:58.5412317Z KERNEL_LIST="int4_gemm,flash_attention"
2026-02-21T08:08:58.5412554Z for kernel in ${KERNEL_LIST//,/ }; do
2026-02-21T08:08:58.5412776Z   echo "=========================================="
2026-02-21T08:08:58.5413026Z   echo "Running benchmark for kernel: $kernel"
2026-02-21T08:08:58.5413265Z   echo "=========================================="
2026-02-21T08:08:58.5413487Z
2026-02-21T08:08:58.5430935Z   # Get available implementations and baseline for this kernel
2026-02-21T08:08:58.5431418Z   KERNEL_INFO=$(python benchmarks/run.py --list-impls-for-benchmark-ci --op $kernel | grep "^$kernel:")
2026-02-21T08:08:58.5431863Z   IMPLS=$(echo "$KERNEL_INFO" | sed -n 's/.*impls=\([^ ]*\).*/\1/p')
2026-02-21T08:08:58.5432202Z   BASELINE=$(echo "$KERNEL_INFO" | sed -n 's/.*baseline=\([^ ]*\).*/\1/p')
2026-02-21T08:08:58.5432465Z
2026-02-21T08:08:58.5432612Z   if [[ -z "$IMPLS" ]]; then
2026-02-21T08:08:58.5432892Z     echo "Warning: No implementations found for kernel $kernel, skipping..."
2026-02-21T08:08:58.5433180Z     continue
2026-02-21T08:08:58.5433341Z   fi
2026-02-21T08:08:58.5433505Z   if [[ -z "$BASELINE" ]]; then
2026-02-21T08:08:58.5433785Z     echo "Warning: No baseline found for kernel $kernel, skipping..."
2026-02-21T08:08:58.5434062Z     continue
2026-02-21T08:08:58.5434210Z   fi
2026-02-21T08:08:58.5434358Z   echo "Using baseline: $BASELINE"
2026-02-21T08:08:58.5434829Z   echo "Available implementations for $kernel: $IMPLS"
2026-02-21T08:08:58.5435052Z
2026-02-21T08:08:58.5435215Z   # Do autotuning but do not record the results
2026-02-21T08:08:58.5435434Z    python benchmarks/run.py \
2026-02-21T08:08:58.5435625Z       --op $kernel \
2026-02-21T08:08:58.5435810Z       --metrics speedup,accuracy \
2026-02-21T08:08:58.5436033Z       --latency-measure-mode triton_do_bench \
2026-02-21T08:08:58.5436251Z       --cudagraph \
2026-02-21T08:08:58.5436415Z       --only $IMPLS \
2026-02-21T08:08:58.5436619Z       --only-match-mode prefix-with-baseline \
2026-02-21T08:08:58.5436835Z       --baseline $BASELINE \
2026-02-21T08:08:58.5437018Z       --atol 1e-2 \
2026-02-21T08:08:58.5437177Z       --rtol 1e-2 \
2026-02-21T08:08:58.5437372Z       --input-sample-mode equally-spaced-k \
2026-02-21T08:08:58.5437589Z       --keep-going \
2026-02-21T08:08:58.5437743Z
2026-02-21T08:08:58.5437875Z
2026-02-21T08:08:58.5437999Z   # Relax the GPU
2026-02-21T08:08:58.5438160Z   sleep 2m
2026-02-21T08:08:58.5438294Z
2026-02-21T08:08:58.5438448Z   # Run again with cache and record results
2026-02-21T08:08:58.5438763Z    HELION_PRINT_OUTPUT_CODE=1 HELION_ASSERT_CACHE_HIT=1 python benchmarks/run.py \
2026-02-21T08:08:58.5439064Z       --op $kernel \
2026-02-21T08:08:58.5439243Z       --metrics speedup,accuracy \
2026-02-21T08:08:58.5439462Z       --latency-measure-mode triton_do_bench \
2026-02-21T08:08:58.5439674Z       --cudagraph \
2026-02-21T08:08:58.5439833Z       --only $IMPLS \
2026-02-21T08:08:58.5440177Z       --only-match-mode prefix-with-baseline \
2026-02-21T08:08:58.5440391Z       --baseline $BASELINE \
2026-02-21T08:08:58.5440579Z       --atol 1e-2 \
2026-02-21T08:08:58.5440740Z       --rtol 1e-2 \
2026-02-21T08:08:58.5440926Z       --input-sample-mode equally-spaced-k \
2026-02-21T08:08:58.5441178Z       --output "$TEST_REPORTS_DIR/helionbench.json" \
2026-02-21T08:08:58.5441408Z       --append-to-output \
2026-02-21T08:08:58.5441590Z [36;1m      --keep-going \[0m
2026-02-21T08:08:58.5441743Z [36;1m      [0m
2026-02-21T08:08:58.5441873Z [36;1m[0m
2026-02-21T08:08:58.5442047Z [36;1m  echo "✅ Completed benchmark for kernel: $kernel"[0m
2026-02-21T08:08:58.5442266Z [36;1mdone[0m
2026-02-21T08:08:58.5442398Z [36;1m[0m
2026-02-21T08:08:58.5442728Z [36;1mif [[ ! -s "$TEST_REPORTS_DIR/helionbench.json" ]]; then[0m
2026-02-21T08:08:58.5442998Z [36;1m  echo "❌ helionbench.json is missing or empty"[0m
2026-02-21T08:08:58.5443205Z [36;1m  exit 1[0m
2026-02-21T08:08:58.5443352Z [36;1mfi[0m
2026-02-21T08:08:58.5443506Z [36;1mcat "$TEST_REPORTS_DIR/helionbench.json"[0m
2026-02-21T08:08:58.5443900Z shell: bash -l {0}
2026-02-21T08:08:58.5444037Z env:
2026-02-21T08:08:58.5444170Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T08:08:58.5444373Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:58.5444617Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T08:08:58.5444868Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:58.5445083Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:58.5445305Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T08:08:58.5445532Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T08:08:58.5445779Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T08:08:58.5446009Z ##[endgroup]
2026-02-21T08:08:58.6572516Z /__w/helion/helion/test/test-reports
2026-02-21T08:08:58.6573541Z ==========================================
2026-02-21T08:08:58.6573927Z Running benchmark for kernel: int4_gemm
2026-02-21T08:08:58.6574265Z ==========================================
2026-02-21T08:09:10.1835146Z Using baseline: preprocessed_eager_int4_gemm
2026-02-21T08:09:10.1836055Z Available implementations for int4_gemm: helion_int4_gemm_tritonbench,preprocessed_torch_compile_int4_gemm,preprocessed_triton_int4_gemm
2026-02-21T08:09:20.5589727Z Applying custom args for int4_gemm: {'num_inputs': 10}
2026-02-21T08:09:20.5904775Z Running int4_gemm benchmark with Helion implementation...
2026-02-21T08:09:20.5905066Z 
2026-02-21T08:09:20.8084776Z Equally-spaced-k mode: Selected 10 equally spaced inputs (total available: 32)
2026-02-21T08:09:20.8085244Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 3, 7, 10, 14, 17, 21, 24, 28, 31]
2026-02-21T08:09:20.8090253Z 
2026-02-21T08:09:20.8098307Z   0%|          | 0/10 [00:00<?, ?it/s]WARNING:tritonbench.utils.triton_op:Running input ID 0:
2026-02-21T08:09:20.8098600Z x_val
2026-02-21T08:09:20.8098756Z ------------------
2026-02-21T08:09:20.8098909Z (1, 1, 1280, 8192)
2026-02-21T08:09:20.8428758Z INFO:tritonbench.utils.triton_op:Took 21.62ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T08:09:23.4276105Z INFO:tritonbench.utils.triton_op:Took 115.05ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T08:09:27.4639186Z INFO:tritonbench.utils.triton_op:Took 0.30ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T08:09:28.4838388Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:09:28.4838753Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:09:28.4838998Z               'dtype': 'torch.bfloat16',
2026-02-21T08:09:28.4839238Z               'shape': (1, 1, 8192),
2026-02-21T08:09:28.4839462Z               'stride': (8192, 8192, 1)},
2026-02-21T08:09:28.4839678Z             { 'device': 'cuda:0',
2026-02-21T08:09:28.4839901Z               'dtype': 'torch.int32',
2026-02-21T08:09:28.4840228Z               'shape': (8192, 1280),
2026-02-21T08:09:28.4841170Z               'stride': (1280, 1)}),
2026-02-21T08:09:28.4841377Z   'kwargs': {}}
2026-02-21T08:09:28.4841788Z INFO:tritonbench.utils.triton_op:Took 1.06ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T08:09:28.7387696Z [0s] Autotune random seed: 2134834638
2026-02-21T08:09:28.9211087Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:10:31.3773563Z [62s] Timeout after 60s compiling Config(block_sizes=[128, 1, 64], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, False], range_num_stages=[0, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:10:32.8446770Z [63s] Timeout after 60s compiling Config(block_sizes=[512, 1, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T08:10:34.8934941Z [65s] Timeout after 60s compiling Config(block_sizes=[64, 1, 1024], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='xyz', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:10:34.8969920Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T08:10:40.9845106Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.5 configs/s
2026-02-21T08:10:40.9852491Z [72s] Adaptive compile timeout: 30s (90% percentile=4.3s, bounds=[30.0s, 60s])
2026-02-21T08:10:42.0032877Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 911.2 configs/s
2026-02-21T08:10:42.2564283Z [73s] Initial random population of 100, 5 starting points: 
2026-02-21T08:10:42.2567129Z timeout=3
2026-02-21T08:10:42.2567337Z ok=97
2026-02-21T08:10:42.2567841Z min=0.0239
2026-02-21T08:10:42.2568139Z mid=0.1540
2026-02-21T08:10:42.2568350Z max=4.8293
2026-02-21T08:10:42.2568607Z best={'block_sizes': [1024, 1, 2],
2026-02-21T08:10:42.2569006Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T08:10:42.2569382Z  'l2_groupings': [4],
2026-02-21T08:10:42.2569660Z  'load_eviction_policies': ['', ''],
2026-02-21T08:10:42.2569972Z  'loop_orders': [[1, 0]],
2026-02-21T08:10:42.2570244Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:10:42.2573489Z  'num_sm_multiplier': 4,
2026-02-21T08:10:42.2573775Z  'num_stages': 1,
2026-02-21T08:10:42.2574036Z  'num_warps': 4,
2026-02-21T08:10:42.2574283Z  'pid_type': 'persistent_blocked',
2026-02-21T08:10:42.2574583Z  'range_flattens': [False, True],
2026-02-21T08:10:42.2574898Z  'range_multi_buffers': [None, False],
2026-02-21T08:10:42.2575179Z  'range_num_stages': [1, 0],
2026-02-21T08:10:42.2575442Z  'range_unroll_factors': [0, 3],
2026-02-21T08:10:42.2575710Z  'range_warp_specializes': [],
2026-02-21T08:10:42.2575963Z  'waves_per_eu': 3}
2026-02-21T08:10:42.2585531Z [73s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:10:43.4997412Z [74s] Generation 1 starting: 87 neighbors, 5 active search path(s)
2026-02-21T08:11:17.7999310Z [108s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 4], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='xyz', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:11:17.8010354Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 0.2 configs/s
2026-02-21T08:11:23.5437068Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 16.0 configs/s
2026-02-21T08:11:28.5063974Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 219.1 configs/s
2026-02-21T08:11:29.0021981Z [120s] Generation 1 complete: 
2026-02-21T08:11:29.0022362Z timeout=1
2026-02-21T08:11:29.0022505Z ok=91
2026-02-21T08:11:29.0022650Z min=0.0200
2026-02-21T08:11:29.0022792Z mid=0.0261
2026-02-21T08:11:29.0022928Z max=0.1342
2026-02-21T08:11:29.0023091Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:11:29.0023363Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:11:29.0023620Z  'l2_groupings': [64],
2026-02-21T08:11:29.0023819Z  'load_eviction_policies': ['', ''],
2026-02-21T08:11:29.0024088Z  'loop_orders': [[0, 1]],
2026-02-21T08:11:29.0024283Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:11:29.0024470Z  'num_stages': 1,
2026-02-21T08:11:29.0024651Z  'num_warps': 16,
2026-02-21T08:11:29.0024818Z  'pid_type': 'xyz',
2026-02-21T08:11:29.0024992Z  'range_flattens': [None, True],
2026-02-21T08:11:29.0025208Z  'range_multi_buffers': [None, True],
2026-02-21T08:11:29.0025415Z  'range_num_stages': [0, 2],
2026-02-21T08:11:29.0025615Z  'range_unroll_factors': [0, 0],
2026-02-21T08:11:29.0025810Z  'range_warp_specializes': [],
2026-02-21T08:11:29.0025986Z  'waves_per_eu': 2}
2026-02-21T08:11:29.1090633Z [120s] Fitting surrogate: 192 points, 192 targets
2026-02-21T08:11:29.9764032Z [121s] Generation 2 starting: 91 neighbors, 5 active search path(s)
2026-02-21T08:12:04.7456205Z [155s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 16], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[4, 2], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:12:04.7473426Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 0.4 configs/s
2026-02-21T08:12:10.9189188Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 15.0 configs/s
2026-02-21T08:12:16.2568867Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 187.5 configs/s
2026-02-21T08:12:16.8887538Z [167s] Generation 2 complete: 
2026-02-21T08:12:16.8887894Z timeout=1
2026-02-21T08:12:16.8888084Z ok=96
2026-02-21T08:12:16.8888263Z min=0.0198
2026-02-21T08:12:16.8888434Z mid=0.0234
2026-02-21T08:12:16.8888604Z max=0.4378
2026-02-21T08:12:16.8888793Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:12:16.8889148Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:12:16.8889462Z  'l2_groupings': [64],
2026-02-21T08:12:16.8889700Z  'load_eviction_policies': ['', ''],
2026-02-21T08:12:16.8889994Z  'loop_orders': [[0, 1]],
2026-02-21T08:12:16.8890229Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:12:16.8890461Z  'num_stages': 2,
2026-02-21T08:12:16.8890656Z  'num_warps': 8,
2026-02-21T08:12:16.8890857Z  'pid_type': 'xyz',
2026-02-21T08:12:16.8891078Z  'range_flattens': [None, True],
2026-02-21T08:12:16.8891341Z  'range_multi_buffers': [None, True],
2026-02-21T08:12:16.8892040Z  'range_num_stages': [0, 2],
2026-02-21T08:12:16.8892363Z  'range_unroll_factors': [0, 0],
2026-02-21T08:12:16.8892635Z  'range_warp_specializes': [],
2026-02-21T08:12:16.8892874Z  'waves_per_eu': 2}
2026-02-21T08:12:17.0741327Z [168s] Fitting surrogate: 289 points, 289 targets
2026-02-21T08:12:18.2667256Z [169s] Generation 3 starting: 82 neighbors, 5 active search path(s)
2026-02-21T08:12:28.0081463Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 1.4 configs/s
2026-02-21T08:12:33.4348106Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 15.7 configs/s
2026-02-21T08:12:38.5383490Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 207.9 configs/s
2026-02-21T08:12:39.0907965Z [190s] Generation 3 complete: 
2026-02-21T08:12:39.0908295Z ok=87
2026-02-21T08:12:39.0908460Z min=0.0198
2026-02-21T08:12:39.0908620Z mid=0.0228
2026-02-21T08:12:39.0908767Z max=0.2118
2026-02-21T08:12:39.0908937Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:12:39.0909210Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:12:39.0909478Z  'l2_groupings': [64],
2026-02-21T08:12:39.0909677Z  'load_eviction_policies': ['', ''],
2026-02-21T08:12:39.0909906Z  'loop_orders': [[0, 1]],
2026-02-21T08:12:39.0910114Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:12:39.0910311Z  'num_stages': 2,
2026-02-21T08:12:39.0910538Z  'num_warps': 8,
2026-02-21T08:12:39.0910709Z  'pid_type': 'xyz',
2026-02-21T08:12:39.0910896Z  'range_flattens': [None, True],
2026-02-21T08:12:39.0911139Z  'range_multi_buffers': [None, True],
2026-02-21T08:12:39.0911362Z  'range_num_stages': [0, 2],
2026-02-21T08:12:39.0911557Z  'range_unroll_factors': [0, 0],
2026-02-21T08:12:39.0911768Z  'range_warp_specializes': [],
2026-02-21T08:12:39.0911964Z  'waves_per_eu': 2}
2026-02-21T08:12:39.2445407Z [190s] Fitting surrogate: 376 points, 376 targets
2026-02-21T08:12:40.0557202Z [191s] Generation 4 starting: 80 neighbors, 5 active search path(s)
2026-02-21T08:12:57.1633968Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 0.5 configs/s
2026-02-21T08:13:02.6683894Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 14.9 configs/s
2026-02-21T08:13:07.2333100Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 218.8 configs/s
2026-02-21T08:13:07.7405091Z [218s] Generation 4 complete: 
2026-02-21T08:13:07.7405360Z ok=85
2026-02-21T08:13:07.7405947Z min=0.0198
2026-02-21T08:13:07.7406105Z mid=0.0230
2026-02-21T08:13:07.7406251Z max=0.1889
2026-02-21T08:13:07.7406419Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:13:07.7406697Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:13:07.7406966Z  'l2_groupings': [64],
2026-02-21T08:13:07.7407165Z  'load_eviction_policies': ['', ''],
2026-02-21T08:13:07.7407395Z  'loop_orders': [[0, 1]],
2026-02-21T08:13:07.7407597Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:13:07.7407798Z  'num_stages': 2,
2026-02-21T08:13:07.7407970Z  'num_warps': 8,
2026-02-21T08:13:07.7408140Z  'pid_type': 'xyz',
2026-02-21T08:13:07.7408328Z  'range_flattens': [None, True],
2026-02-21T08:13:07.7408551Z  'range_multi_buffers': [None, True],
2026-02-21T08:13:07.7408774Z  'range_num_stages': [0, 2],
2026-02-21T08:13:07.7408974Z  'range_unroll_factors': [0, 0],
2026-02-21T08:13:07.7409194Z  'range_warp_specializes': [],
2026-02-21T08:13:07.7409390Z  'waves_per_eu': 2}
2026-02-21T08:13:07.8800759Z [218s] Fitting surrogate: 461 points, 461 targets
2026-02-21T08:13:08.7479978Z [219s] Generation 5 starting: 57 neighbors, 4 active search path(s)
2026-02-21T08:13:40.0594480Z [251s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 8], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[1, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:13:40.8897112Z [251s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[1, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:13:41.0703518Z [252s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[1, 0], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:13:41.0722140Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 0.3 configs/s
2026-02-21T08:13:44.6695733Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 16.6 configs/s
2026-02-21T08:13:48.0226133Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 296.5 configs/s
2026-02-21T08:13:48.4610714Z [259s] Generation 5 complete: 
2026-02-21T08:13:48.4611099Z timeout=3
2026-02-21T08:13:48.4611305Z ok=59
2026-02-21T08:13:48.4611512Z min=0.0199
2026-02-21T08:13:48.4611717Z mid=0.0225
2026-02-21T08:13:48.4611919Z max=0.1166
2026-02-21T08:13:48.4612143Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:13:48.4612530Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:13:48.4612896Z  'l2_groupings': [64],
2026-02-21T08:13:48.4613174Z  'load_eviction_policies': ['', ''],
2026-02-21T08:13:48.4613489Z  'loop_orders': [[0, 1]],
2026-02-21T08:13:48.4613772Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:13:48.4614042Z  'num_stages': 2,
2026-02-21T08:13:48.4614250Z  'num_warps': 8,
2026-02-21T08:13:48.4614465Z  'pid_type': 'xyz',
2026-02-21T08:13:48.4614679Z  'range_flattens': [None, True],
2026-02-21T08:13:48.4614942Z  'range_multi_buffers': [None, True],
2026-02-21T08:13:48.4615214Z  'range_num_stages': [0, 2],
2026-02-21T08:13:48.4615455Z  'range_unroll_factors': [0, 0],
2026-02-21T08:13:48.4616075Z  'range_warp_specializes': [],
2026-02-21T08:13:48.4616312Z  'waves_per_eu': 2}
2026-02-21T08:13:48.5613602Z [259s] Fitting surrogate: 523 points, 523 targets
2026-02-21T08:13:49.5097993Z [260s] Generation 6 starting: 63 neighbors, 4 active search path(s)
2026-02-21T08:14:20.5423025Z [291s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 8], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[1, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:14:21.6453904Z [292s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[1, 0], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:14:21.6470692Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 0.4 configs/s
2026-02-21T08:14:25.7023624Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 16.2 configs/s
2026-02-21T08:14:28.9932635Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 333.6 configs/s
2026-02-21T08:14:29.2711059Z [300s] Generation 6 complete: 
2026-02-21T08:14:29.2711377Z timeout=2
2026-02-21T08:14:29.2711550Z ok=66
2026-02-21T08:14:29.2711717Z min=0.0198
2026-02-21T08:14:29.2711887Z mid=0.0253
2026-02-21T08:14:29.2712361Z max=0.4384
2026-02-21T08:14:29.2712552Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:14:29.2712858Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:14:29.2713177Z  'l2_groupings': [64],
2026-02-21T08:14:29.2713405Z  'load_eviction_policies': ['', ''],
2026-02-21T08:14:29.2713660Z  'loop_orders': [[0, 1]],
2026-02-21T08:14:29.2713882Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:14:29.2714106Z  'num_stages': 2,
2026-02-21T08:14:29.2714299Z  'num_warps': 8,
2026-02-21T08:14:29.2714492Z  'pid_type': 'xyz',
2026-02-21T08:14:29.2714704Z  'range_flattens': [None, True],
2026-02-21T08:14:29.2714950Z  'range_multi_buffers': [None, True],
2026-02-21T08:14:29.2715201Z  'range_num_stages': [0, 2],
2026-02-21T08:14:29.2715427Z  'range_unroll_factors': [0, 0],
2026-02-21T08:14:29.2715666Z  'range_warp_specializes': [],
2026-02-21T08:14:29.2715883Z  'waves_per_eu': 2}
2026-02-21T08:14:29.3585838Z [300s] Fitting surrogate: 591 points, 591 targets
2026-02-21T08:14:29.9735647Z [301s] Generation 7 starting: 62 neighbors, 4 active search path(s)
2026-02-21T08:14:37.5664009Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66/66 9.5 configs/s
2026-02-21T08:14:41.8532248Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 66/66 15.7 configs/s
2026-02-21T08:14:45.8715267Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 271.3 configs/s
2026-02-21T08:14:46.3107034Z [317s] Generation 7 complete: 
2026-02-21T08:14:46.3107260Z ok=67
2026-02-21T08:14:46.3107387Z min=0.0199
2026-02-21T08:14:46.3107521Z mid=0.0236
2026-02-21T08:14:46.3107649Z max=0.0920
2026-02-21T08:14:46.3107797Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:14:46.3108030Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:14:46.3108272Z  'l2_groupings': [64],
2026-02-21T08:14:46.3108449Z  'load_eviction_policies': ['', ''],
2026-02-21T08:14:46.3108693Z  'loop_orders': [[0, 1]],
2026-02-21T08:14:46.3108877Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:14:46.3109048Z  'num_stages': 2,
2026-02-21T08:14:46.3109475Z  'num_warps': 8,
2026-02-21T08:14:46.3109621Z  'pid_type': 'xyz',
2026-02-21T08:14:46.3109786Z  'range_flattens': [None, True],
2026-02-21T08:14:46.3109979Z  'range_multi_buffers': [None, True],
2026-02-21T08:14:46.3110157Z  'range_num_stages': [0, 2],
2026-02-21T08:14:46.3110322Z  'range_unroll_factors': [0, 0],
2026-02-21T08:14:46.3110511Z  'range_warp_specializes': [],
2026-02-21T08:14:46.3110693Z  'waves_per_eu': 2}
2026-02-21T08:14:46.4158081Z [317s] Fitting surrogate: 658 points, 658 targets
2026-02-21T08:14:47.0182854Z [318s] Generation 8 starting: 60 neighbors, 4 active search path(s)
2026-02-21T08:14:56.5928041Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 1.7 configs/s
2026-02-21T08:15:00.7667777Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 63/63 15.6 configs/s
2026-02-21T08:15:04.4747696Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 268.7 configs/s
2026-02-21T08:15:04.9104618Z [335s] Generation 8 complete: 
2026-02-21T08:15:04.9105002Z ok=65
2026-02-21T08:15:04.9105228Z min=0.0199
2026-02-21T08:15:04.9105441Z mid=0.0230
2026-02-21T08:15:04.9105617Z max=0.1421
2026-02-21T08:15:04.9105851Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:15:04.9106254Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:15:04.9106618Z  'l2_groupings': [64],
2026-02-21T08:15:04.9106907Z  'load_eviction_policies': ['', ''],
2026-02-21T08:15:04.9107223Z  'loop_orders': [[0, 1]],
2026-02-21T08:15:04.9107502Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:15:04.9107786Z  'num_stages': 2,
2026-02-21T08:15:04.9108016Z  'num_warps': 8,
2026-02-21T08:15:04.9108259Z  'pid_type': 'xyz',
2026-02-21T08:15:04.9108522Z  'range_flattens': [None, True],
2026-02-21T08:15:04.9108834Z  'range_multi_buffers': [None, True],
2026-02-21T08:15:04.9109175Z  'range_num_stages': [0, 2],
2026-02-21T08:15:04.9109457Z  'range_unroll_factors': [0, 0],
2026-02-21T08:15:04.9109757Z  'range_warp_specializes': [],
2026-02-21T08:15:04.9110046Z  'waves_per_eu': 2}
2026-02-21T08:15:05.0129600Z [336s] Fitting surrogate: 723 points, 723 targets
2026-02-21T08:15:05.9546165Z [337s] Generation 9 starting: 60 neighbors, 4 active search path(s)
2026-02-21T08:15:15.5164412Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 1.9 configs/s
2026-02-21T08:15:19.5553609Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 62/62 15.9 configs/s
2026-02-21T08:15:23.0863501Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 279.6 configs/s
2026-02-21T08:15:23.5759088Z [354s] Generation 9 complete: 
2026-02-21T08:15:23.5759270Z ok=65
2026-02-21T08:15:23.5759352Z min=0.0196
2026-02-21T08:15:23.5759437Z mid=0.0213
2026-02-21T08:15:23.5759548Z max=0.1341
2026-02-21T08:15:23.5759651Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:15:23.5760176Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:15:23.5760314Z  'l2_groupings': [16],
2026-02-21T08:15:23.5760418Z  'load_eviction_policies': ['', ''],
2026-02-21T08:15:23.5760537Z  'loop_orders': [[1, 0]],
2026-02-21T08:15:23.5760640Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:15:23.5760741Z  'num_stages': 4,
2026-02-21T08:15:23.5760828Z  'num_warps': 8,
2026-02-21T08:15:23.5760931Z  'pid_type': 'flat',
2026-02-21T08:15:23.5761033Z  'range_flattens': [None, False],
2026-02-21T08:15:23.5761145Z  'range_multi_buffers': [None, None],
2026-02-21T08:15:23.5762280Z  'range_num_stages': [0, 2],
2026-02-21T08:15:23.5763508Z  'range_unroll_factors': [0, 0],
2026-02-21T08:15:23.5763779Z  'range_warp_specializes': [],
2026-02-21T08:15:23.5763900Z  'waves_per_eu': 2}
2026-02-21T08:15:23.7046189Z [354s] Fitting surrogate: 788 points, 788 targets
2026-02-21T08:15:24.8581152Z [355s] Generation 10 starting: 61 neighbors, 4 active search path(s)
2026-02-21T08:15:57.1614315Z [388s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:15:57.1631528Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62/62 0.3 configs/s
2026-02-21T08:16:01.2909524Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 62/62 15.2 configs/s
2026-02-21T08:16:04.5316886Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 306.2 configs/s
2026-02-21T08:16:04.9678919Z [396s] Generation 10 complete: 
2026-02-21T08:16:04.9679589Z timeout=1
2026-02-21T08:16:04.9679677Z ok=64
2026-02-21T08:16:04.9679759Z min=0.0196
2026-02-21T08:16:04.9679866Z mid=0.0209
2026-02-21T08:16:04.9679942Z max=0.1386
2026-02-21T08:16:04.9680029Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:16:04.9680173Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:16:04.9680307Z  'l2_groupings': [8],
2026-02-21T08:16:04.9680416Z  'load_eviction_policies': ['', ''],
2026-02-21T08:16:04.9680530Z  'loop_orders': [[1, 0]],
2026-02-21T08:16:04.9680676Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:16:04.9680824Z  'num_stages': 3,
2026-02-21T08:16:04.9680953Z  'num_warps': 8,
2026-02-21T08:16:04.9681054Z  'pid_type': 'xyz',
2026-02-21T08:16:04.9681146Z  'range_flattens': [None, False],
2026-02-21T08:16:04.9681260Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:04.9681373Z  'range_num_stages': [0, 2],
2026-02-21T08:16:04.9681478Z  'range_unroll_factors': [0, 0],
2026-02-21T08:16:04.9681588Z  'range_warp_specializes': [],
2026-02-21T08:16:04.9681694Z  'waves_per_eu': 1}
2026-02-21T08:16:05.0803610Z [396s] Fitting surrogate: 853 points, 853 targets
2026-02-21T08:16:06.2079286Z [397s] Generation 11 starting: 42 neighbors, 3 active search path(s)
2026-02-21T08:16:15.3251998Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43/43 1.1 configs/s
2026-02-21T08:16:18.2139257Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 43/43 15.6 configs/s
2026-02-21T08:16:20.4740753Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 435.9 configs/s
2026-02-21T08:16:20.7984565Z [411s] Generation 11 complete: 
2026-02-21T08:16:20.7984731Z ok=46
2026-02-21T08:16:20.7984813Z min=0.0198
2026-02-21T08:16:20.7984904Z mid=0.0236
2026-02-21T08:16:20.7984979Z max=0.1377
2026-02-21T08:16:20.7985065Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:16:20.7985208Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:16:20.7985364Z  'l2_groupings': [8],
2026-02-21T08:16:20.7985468Z  'load_eviction_policies': ['', ''],
2026-02-21T08:16:20.7985952Z  'loop_orders': [[1, 0]],
2026-02-21T08:16:20.7986056Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:16:20.7986156Z  'num_stages': 3,
2026-02-21T08:16:20.7986241Z  'num_warps': 8,
2026-02-21T08:16:20.7986329Z  'pid_type': 'xyz',
2026-02-21T08:16:20.7986426Z  'range_flattens': [None, False],
2026-02-21T08:16:20.7986540Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:20.7986657Z  'range_num_stages': [0, 2],
2026-02-21T08:16:20.7986760Z  'range_unroll_factors': [0, 0],
2026-02-21T08:16:20.7986871Z  'range_warp_specializes': [],
2026-02-21T08:16:20.7986975Z  'waves_per_eu': 1}
2026-02-21T08:16:20.8593958Z [411s] Fitting surrogate: 899 points, 899 targets
2026-02-21T08:16:21.0979893Z [412s] Generation 12 starting: 16 neighbors, 1 active search path(s)
2026-02-21T08:16:23.2421291Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 13.6 configs/s
2026-02-21T08:16:24.4145221Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 15.0 configs/s
2026-02-21T08:16:25.3085486Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1051.8        
2026-02-21T08:16:25.3085763Z                                                                   configs/s     
2026-02-21T08:16:25.5485210Z [416s] Generation 12 complete: 
2026-02-21T08:16:25.5485611Z ok=18
2026-02-21T08:16:25.5485822Z min=0.0196
2026-02-21T08:16:25.5486032Z mid=0.0253
2026-02-21T08:16:25.5486238Z max=0.0370
2026-02-21T08:16:25.5486480Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:16:25.5486861Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:16:25.5487226Z  'l2_groupings': [8],
2026-02-21T08:16:25.5487496Z  'load_eviction_policies': ['', ''],
2026-02-21T08:16:25.5487809Z  'loop_orders': [[1, 0]],
2026-02-21T08:16:25.5488087Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:16:25.5488355Z  'num_stages': 3,
2026-02-21T08:16:25.5488581Z  'num_warps': 8,
2026-02-21T08:16:25.5489376Z  'pid_type': 'xyz',
2026-02-21T08:16:25.5489632Z  'range_flattens': [None, False],
2026-02-21T08:16:25.5489939Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:25.5490267Z  'range_num_stages': [0, 2],
2026-02-21T08:16:25.5490544Z  'range_unroll_factors': [0, 0],
2026-02-21T08:16:25.5490836Z  'range_warp_specializes': [],
2026-02-21T08:16:25.5491110Z  'waves_per_eu': 1}
2026-02-21T08:16:25.5707294Z [416s] Fitting surrogate: 917 points, 917 targets
2026-02-21T08:16:25.8327594Z [416s] Generation 13 starting: 18 neighbors, 1 active search path(s)
2026-02-21T08:16:28.3472186Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 14.3 configs/s
2026-02-21T08:16:30.1191067Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 11.5 configs/s
2026-02-21T08:16:30.6543670Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1753.0        
2026-02-21T08:16:30.6543941Z                                                                   configs/s     
2026-02-21T08:16:30.8686816Z [421s] Generation 13 complete: 
2026-02-21T08:16:30.8687009Z ok=20
2026-02-21T08:16:30.8687098Z min=0.0196
2026-02-21T08:16:30.8687198Z mid=0.0292
2026-02-21T08:16:30.8687272Z max=0.0603
2026-02-21T08:16:30.8687357Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:16:30.8687507Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:16:30.8687642Z  'l2_groupings': [8],
2026-02-21T08:16:30.8687743Z  'load_eviction_policies': ['', ''],
2026-02-21T08:16:30.8687857Z  'loop_orders': [[1, 0]],
2026-02-21T08:16:30.8687962Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:16:30.8688061Z  'num_stages': 3,
2026-02-21T08:16:30.8688146Z  'num_warps': 8,
2026-02-21T08:16:30.8688234Z  'pid_type': 'xyz',
2026-02-21T08:16:30.8688329Z  'range_flattens': [None, False],
2026-02-21T08:16:30.8688439Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:30.8688567Z  'range_num_stages': [0, 2],
2026-02-21T08:16:30.8688674Z  'range_unroll_factors': [0, 0],
2026-02-21T08:16:30.8688783Z  'range_warp_specializes': [],
2026-02-21T08:16:30.8688891Z  'waves_per_eu': 1}
2026-02-21T08:16:30.8872846Z [421s] Fitting surrogate: 937 points, 937 targets
2026-02-21T08:16:31.1564945Z [422s] Generation 14 starting: 15 neighbors, 1 active search path(s)
2026-02-21T08:16:33.4719794Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 11.1 configs/s
2026-02-21T08:16:34.5977823Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.3 configs/s
2026-02-21T08:16:35.3764005Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1191.5        
2026-02-21T08:16:35.3764281Z                                                                   configs/s     
2026-02-21T08:16:35.6062660Z [426s] Generation 14 complete: 
2026-02-21T08:16:35.6062847Z ok=17
2026-02-21T08:16:35.6062930Z min=0.0196
2026-02-21T08:16:35.6063012Z mid=0.0257
2026-02-21T08:16:35.6063095Z max=0.0796
2026-02-21T08:16:35.6063181Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:16:35.6063324Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:16:35.6063498Z  'l2_groupings': [8],
2026-02-21T08:16:35.6063603Z  'load_eviction_policies': ['', ''],
2026-02-21T08:16:35.6063742Z  'loop_orders': [[1, 0]],
2026-02-21T08:16:35.6063846Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:16:35.6063944Z  'num_stages': 3,
2026-02-21T08:16:35.6064031Z  'num_warps': 8,
2026-02-21T08:16:35.6064120Z  'pid_type': 'xyz',
2026-02-21T08:16:35.6064216Z  'range_flattens': [None, False],
2026-02-21T08:16:35.6064360Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:35.6064474Z  'range_num_stages': [0, 2],
2026-02-21T08:16:35.6064583Z  'range_unroll_factors': [0, 0],
2026-02-21T08:16:35.6064693Z  'range_warp_specializes': [],
2026-02-21T08:16:35.6064799Z  'waves_per_eu': 1}
2026-02-21T08:16:35.6327608Z [426s] Fitting surrogate: 954 points, 954 targets
2026-02-21T08:16:35.8852478Z [426s] Generation 15 starting: 15 neighbors, 1 active search path(s)
2026-02-21T08:16:38.5785745Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 3.5 configs/s
2026-02-21T08:16:39.7252759Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 15.9 configs/s
2026-02-21T08:16:40.6968283Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 980.5         
2026-02-21T08:16:40.6968556Z                                                                   configs/s     
2026-02-21T08:16:40.9305869Z [432s] Generation 15 complete: 
2026-02-21T08:16:40.9306107Z ok=17
2026-02-21T08:16:40.9306197Z min=0.0194
2026-02-21T08:16:40.9306281Z mid=0.0226
2026-02-21T08:16:40.9306395Z max=0.0485
2026-02-21T08:16:40.9306489Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:16:40.9306639Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:16:40.9306776Z  'l2_groupings': [8],
2026-02-21T08:16:40.9306877Z  'load_eviction_policies': ['', ''],
2026-02-21T08:16:40.9306994Z  'loop_orders': [[1, 0]],
2026-02-21T08:16:40.9307097Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:16:40.9307200Z  'num_stages': 3,
2026-02-21T08:16:40.9307286Z  'num_warps': 8,
2026-02-21T08:16:40.9307413Z  'pid_type': 'xyz',
2026-02-21T08:16:40.9307512Z  'range_flattens': [None, False],
2026-02-21T08:16:40.9307642Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:40.9307757Z  'range_num_stages': [0, 2],
2026-02-21T08:16:40.9307861Z  'range_unroll_factors': [0, 0],
2026-02-21T08:16:40.9307972Z  'range_warp_specializes': [],
2026-02-21T08:16:40.9308073Z  'waves_per_eu': 1}
2026-02-21T08:16:40.9570611Z [432s] Fitting surrogate: 971 points, 971 targets
2026-02-21T08:16:41.2346057Z [432s] Generation 16 starting: 13 neighbors, 1 active search path(s)
2026-02-21T08:16:45.4474167Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 1.3 configs/s
2026-02-21T08:16:46.4664118Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 16.0 configs/s
2026-02-21T08:16:47.1502749Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1364.9        
2026-02-21T08:16:47.1503005Z                                                                   configs/s     
2026-02-21T08:16:47.3820215Z [438s] Generation 16 complete: 
2026-02-21T08:16:47.3820457Z ok=15
2026-02-21T08:16:47.3820541Z min=0.0194
2026-02-21T08:16:47.3820990Z mid=0.0253
2026-02-21T08:16:47.3821066Z max=0.1246
2026-02-21T08:16:47.3821158Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:16:47.3821309Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:16:47.3821447Z  'l2_groupings': [8],
2026-02-21T08:16:47.3821555Z  'load_eviction_policies': ['', ''],
2026-02-21T08:16:47.3821671Z  'loop_orders': [[1, 0]],
2026-02-21T08:16:47.3821778Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:16:47.3821878Z  'num_stages': 3,
2026-02-21T08:16:47.3821967Z  'num_warps': 8,
2026-02-21T08:16:47.3822056Z  'pid_type': 'xyz',
2026-02-21T08:16:47.3822153Z  'range_flattens': [None, False],
2026-02-21T08:16:47.3822267Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:47.3822385Z  'range_num_stages': [0, 2],
2026-02-21T08:16:47.3822491Z  'range_unroll_factors': [0, 0],
2026-02-21T08:16:47.3822751Z  'range_warp_specializes': [],
2026-02-21T08:16:47.3822857Z  'waves_per_eu': 1}
2026-02-21T08:16:47.3978236Z [438s] Fitting surrogate: 986 points, 986 targets
2026-02-21T08:16:47.6434947Z [438s] Generation 17 starting: 13 neighbors, 1 active search path(s)
2026-02-21T08:16:50.2447965Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 10.3 configs/s
2026-02-21T08:16:51.2581835Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 16.1 configs/s
2026-02-21T08:16:51.7313473Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1927.2        
2026-02-21T08:16:51.7313792Z                                                                   configs/s     
2026-02-21T08:16:51.9400691Z [443s] Generation 17 complete: 
2026-02-21T08:16:51.9400909Z ok=15
2026-02-21T08:16:51.9401003Z min=0.0194
2026-02-21T08:16:51.9401128Z mid=0.0265
2026-02-21T08:16:51.9401235Z max=0.0359
2026-02-21T08:16:51.9401327Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:16:51.9401520Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:16:51.9401660Z  'l2_groupings': [8],
2026-02-21T08:16:51.9401776Z  'load_eviction_policies': ['', ''],
2026-02-21T08:16:51.9401953Z  'loop_orders': [[1, 0]],
2026-02-21T08:16:51.9402061Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:16:51.9402165Z  'num_stages': 3,
2026-02-21T08:16:51.9402253Z  'num_warps': 8,
2026-02-21T08:16:51.9402347Z  'pid_type': 'xyz',
2026-02-21T08:16:51.9402442Z  'range_flattens': [None, False],
2026-02-21T08:16:51.9402653Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:51.9402776Z  'range_num_stages': [0, 2],
2026-02-21T08:16:51.9402884Z  'range_unroll_factors': [0, 0],
2026-02-21T08:16:51.9402994Z  'range_warp_specializes': [],
2026-02-21T08:16:51.9403102Z  'waves_per_eu': 1}
2026-02-21T08:16:51.9531617Z [443s] Fitting surrogate: 1001 points, 1001 targets
2026-02-21T08:16:52.2039657Z [443s] Generation 18 starting: 14 neighbors, 1 active search path(s)
2026-02-21T08:16:56.6365171Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 0.9 configs/s
2026-02-21T08:16:57.7229271Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 15.9 configs/s
2026-02-21T08:16:58.5514125Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1143.9        
2026-02-21T08:16:58.5514430Z                                                                   configs/s     
2026-02-21T08:16:58.7744411Z [449s] Generation 18 complete: 
2026-02-21T08:16:58.7744594Z ok=16
2026-02-21T08:16:58.7744691Z min=0.0195
2026-02-21T08:16:58.7744789Z mid=0.0232
2026-02-21T08:16:58.7744877Z max=0.0922
2026-02-21T08:16:58.7745005Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:16:58.7745165Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:16:58.7745310Z  'l2_groupings': [8],
2026-02-21T08:16:58.7745424Z  'load_eviction_policies': ['', ''],
2026-02-21T08:16:58.7745543Z  'loop_orders': [[1, 0]],
2026-02-21T08:16:58.7745659Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:16:58.7745767Z  'num_stages': 3,
2026-02-21T08:16:58.7745866Z  'num_warps': 8,
2026-02-21T08:16:58.7746023Z  'pid_type': 'xyz',
2026-02-21T08:16:58.7746131Z  'range_flattens': [None, False],
2026-02-21T08:16:58.7746277Z  'range_multi_buffers': [None, True],
2026-02-21T08:16:58.7746408Z  'range_num_stages': [0, 2],
2026-02-21T08:16:58.7746526Z  'range_unroll_factors': [0, 0],
2026-02-21T08:16:58.7746646Z  'range_warp_specializes': [],
2026-02-21T08:16:58.7746766Z  'waves_per_eu': 1}
2026-02-21T08:16:58.7962464Z [449s] Fitting surrogate: 1017 points, 1017 targets
2026-02-21T08:16:59.0134011Z [450s] Generation 19 starting: 13 neighbors, 1 active search path(s)
2026-02-21T08:17:01.0412760Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 18.3 configs/s
2026-02-21T08:17:02.0044385Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 15.1 configs/s
2026-02-21T08:17:02.5673170Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1722.7        
2026-02-21T08:17:02.5674211Z                                                                   configs/s     
2026-02-21T08:17:02.7725071Z [453s] Generation 19 complete: 
2026-02-21T08:17:02.7725291Z ok=15
2026-02-21T08:17:02.7725449Z min=0.0194
2026-02-21T08:17:02.7725539Z mid=0.0255
2026-02-21T08:17:02.7725614Z max=0.0400
2026-02-21T08:17:02.7725702Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:17:02.7725848Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:17:02.7725991Z  'l2_groupings': [8],
2026-02-21T08:17:02.7726097Z  'load_eviction_policies': ['', ''],
2026-02-21T08:17:02.7726214Z  'loop_orders': [[1, 0]],
2026-02-21T08:17:02.7726321Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:17:02.7726421Z  'num_stages': 3,
2026-02-21T08:17:02.7726509Z  'num_warps': 8,
2026-02-21T08:17:02.7726596Z  'pid_type': 'xyz',
2026-02-21T08:17:02.7726693Z  'range_flattens': [None, False],
2026-02-21T08:17:02.7726806Z  'range_multi_buffers': [None, True],
2026-02-21T08:17:02.7726924Z  'range_num_stages': [0, 2],
2026-02-21T08:17:02.7727027Z  'range_unroll_factors': [0, 0],
2026-02-21T08:17:02.7727149Z  'range_warp_specializes': [],
2026-02-21T08:17:02.7727255Z  'waves_per_eu': 1}
2026-02-21T08:17:02.7895613Z [453s] Fitting surrogate: 1032 points, 1032 targets
2026-02-21T08:17:03.0407476Z [454s] Generation 20 starting: 15 neighbors, 1 active search path(s)
2026-02-21T08:17:08.0909210Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 0.9 configs/s
2026-02-21T08:17:09.2297865Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.0 configs/s
2026-02-21T08:17:10.0650904Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1127.9        
2026-02-21T08:17:10.0651246Z                                                                   configs/s     
2026-02-21T08:17:10.2972138Z [461s] Generation 20 complete: 
2026-02-21T08:17:10.2972277Z ok=17
2026-02-21T08:17:10.2972360Z min=0.0196
2026-02-21T08:17:10.2972441Z mid=0.0254
2026-02-21T08:17:10.2972519Z max=0.0818
2026-02-21T08:17:10.2972610Z best={'block_sizes': [1024, 1, 8],
2026-02-21T08:17:10.2972772Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T08:17:10.2972908Z  'l2_groupings': [8],
2026-02-21T08:17:10.2973028Z  'load_eviction_policies': ['', ''],
2026-02-21T08:17:10.2973398Z  'loop_orders': [[1, 0]],
2026-02-21T08:17:10.2973502Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:17:10.2973606Z  'num_stages': 3,
2026-02-21T08:17:10.2973690Z  'num_warps': 8,
2026-02-21T08:17:10.2973783Z  'pid_type': 'xyz',
2026-02-21T08:17:10.2973878Z  'range_flattens': [None, False],
2026-02-21T08:17:10.2973993Z  'range_multi_buffers': [None, True],
2026-02-21T08:17:10.2974105Z  'range_num_stages': [0, 2],
2026-02-21T08:17:10.2974209Z  'range_unroll_factors': [0, 0],
2026-02-21T08:17:10.2974322Z  'range_warp_specializes': [],
2026-02-21T08:17:10.2974426Z  'waves_per_eu': 1}
2026-02-21T08:17:10.3137920Z [461s] Fitting surrogate: 1049 points, 1049 targets
2026-02-21T08:17:10.4298732Z [461s] Autotuning complete in 461.5s after searching 997 configs.
2026-02-21T08:17:10.4299023Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:17:10.4299718Z     @helion.kernel(config=helion.Config(block_sizes=[1024, 1, 8], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=8, pid_type='xyz', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:17:10.4300371Z 
2026-02-21T08:17:10.4300543Z [461s] Code of selected kernel: /tmp/torchinductor_root/dx/cdxavlqqdkao7ml4v4v26uakvmiovofyjx7f353piuuiza72pk5k.py
2026-02-21T08:17:11.1878211Z WARNING:tritonbench.utils.triton_op:Completed input ID 0:
2026-02-21T08:17:11.1878436Z x_val
2026-02-21T08:17:11.1878519Z ------------------
2026-02-21T08:17:11.1878617Z (1, 1, 1280, 8192)
2026-02-21T08:17:11.1882749Z 
2026-02-21T08:17:11.1894627Z  10%|█         | 1/10 [07:50<1:10:33, 470.38s/it]WARNING:tritonbench.utils.triton_op:Running input ID 3:
2026-02-21T08:17:11.1894834Z x_val
2026-02-21T08:17:11.1894912Z ------------------
2026-02-21T08:17:11.1894995Z (1, 1, 8192, 3584)
2026-02-21T08:17:11.1902587Z INFO:tritonbench.utils.triton_op:Took 0.63ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T08:17:12.3886004Z INFO:tritonbench.utils.triton_op:Took 4.12ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T08:17:16.2071859Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T08:17:16.2104868Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:17:16.2105027Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:17:16.2105205Z               'dtype': 'torch.bfloat16',
2026-02-21T08:17:16.2105329Z               'shape': (1, 1, 3584),
2026-02-21T08:17:16.2105445Z               'stride': (3584, 3584, 1)},
2026-02-21T08:17:16.2105568Z             { 'device': 'cuda:0',
2026-02-21T08:17:16.2105678Z               'dtype': 'torch.int32',
2026-02-21T08:17:16.2105859Z               'shape': (3584, 8192),
2026-02-21T08:17:16.2105969Z               'stride': (8192, 1)}),
2026-02-21T08:17:16.2106098Z   'kwargs': {}}
2026-02-21T08:17:16.2136058Z INFO:tritonbench.utils.triton_op:Took 3.31ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T08:17:16.4478090Z [0s] Autotune random seed: 2134834638
2026-02-21T08:17:16.5479753Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:17:51.6993736Z [35s] Timeout after 30s compiling Config(block_sizes=[128, 1, 4096], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[2, 0], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:17:53.7866090Z [37s] Timeout after 30s compiling Config(block_sizes=[512, 1, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T08:17:55.8345878Z [39s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:17:55.9559053Z [39s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=3, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[2, 1], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:17:57.8237447Z [41s] Timeout after 30s compiling Config(block_sizes=[8, 1, 8192], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=2, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:17:58.6825855Z [42s] Timeout after 30s compiling Config(block_sizes=[256, 1, 1024], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=2, num_warps=8, pid_type='xyz', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T08:17:58.6843176Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T08:18:04.7289017Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.7 configs/s
2026-02-21T08:18:04.7298811Z [48s] Adaptive compile timeout: 30s (90% percentile=11.1s, bounds=[30.0s, 30s])
2026-02-21T08:18:05.7830629Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 866.7 configs/s
2026-02-21T08:18:06.0572171Z [49s] Initial random population of 100, 5 starting points: 
2026-02-21T08:18:06.0572467Z timeout=6
2026-02-21T08:18:06.0572580Z ok=94
2026-02-21T08:18:06.0572696Z min=0.0361
2026-02-21T08:18:06.0572807Z mid=0.1848
2026-02-21T08:18:06.0572926Z max=11.8725
2026-02-21T08:18:06.0573062Z best={'block_sizes': [64, 1, 8],
2026-02-21T08:18:06.0573300Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T08:18:06.0573491Z  'l2_groupings': [8],
2026-02-21T08:18:06.0573657Z  'load_eviction_policies': ['', ''],
2026-02-21T08:18:06.0573822Z  'loop_orders': [[0, 1]],
2026-02-21T08:18:06.0573966Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:18:06.0574113Z  'num_stages': 4,
2026-02-21T08:18:06.0574237Z  'num_warps': 4,
2026-02-21T08:18:06.0574363Z  'pid_type': 'flat',
2026-02-21T08:18:06.0574499Z  'range_flattens': [None, False],
2026-02-21T08:18:06.0574668Z  'range_multi_buffers': [None, False],
2026-02-21T08:18:06.0574835Z  'range_num_stages': [0, 3],
2026-02-21T08:18:06.0574982Z  'range_unroll_factors': [0, 0],
2026-02-21T08:18:06.0575141Z  'range_warp_specializes': [],
2026-02-21T08:18:06.0575286Z  'waves_per_eu': 4}
2026-02-21T08:18:06.0783908Z [49s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:18:07.0715841Z [50s] Generation 1 starting: 80 neighbors, 5 active search path(s)
2026-02-21T08:18:21.3232545Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 2.4 configs/s
2026-02-21T08:18:26.8537475Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 15.2 configs/s
2026-02-21T08:18:32.9337825Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 165.9         
2026-02-21T08:18:32.9338106Z                                                                   configs/s     
2026-02-21T08:18:33.5318148Z [76s] Generation 1 complete: 
2026-02-21T08:18:33.5318307Z ok=85
2026-02-21T08:18:33.5318395Z min=0.0351
2026-02-21T08:18:33.5318482Z mid=0.0387
2026-02-21T08:18:33.5318563Z max=0.2139
2026-02-21T08:18:33.5318660Z best={'block_sizes': [128, 1, 16],
2026-02-21T08:18:33.5318809Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:18:33.5318953Z  'l2_groupings': [8],
2026-02-21T08:18:33.5319058Z  'load_eviction_policies': ['', ''],
2026-02-21T08:18:33.5319183Z  'loop_orders': [[0, 1]],
2026-02-21T08:18:33.5319299Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:18:33.5319431Z  'num_stages': 3,
2026-02-21T08:18:33.5319525Z  'num_warps': 8,
2026-02-21T08:18:33.5319617Z  'pid_type': 'flat',
2026-02-21T08:18:33.5319747Z  'range_flattens': [None, False],
2026-02-21T08:18:33.5319867Z  'range_multi_buffers': [None, False],
2026-02-21T08:18:33.5319990Z  'range_num_stages': [0, 2],
2026-02-21T08:18:33.5320101Z  'range_unroll_factors': [0, 4],
2026-02-21T08:18:33.5320224Z  'range_warp_specializes': [],
2026-02-21T08:18:33.5320331Z  'waves_per_eu': 1}
2026-02-21T08:18:33.7303610Z [77s] Fitting surrogate: 185 points, 185 targets
2026-02-21T08:18:35.1544313Z [78s] Generation 2 starting: 77 neighbors, 5 active search path(s)
2026-02-21T08:18:45.0854689Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 12.1 configs/s
2026-02-21T08:18:50.1504893Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 15.6 configs/s
2026-02-21T08:18:55.0951748Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 200.0         
2026-02-21T08:18:55.0953171Z                                                                   configs/s     
2026-02-21T08:18:55.6964865Z [99s] Generation 2 complete: 
2026-02-21T08:18:55.6965170Z ok=82
2026-02-21T08:18:55.6965261Z min=0.0273
2026-02-21T08:18:55.6965353Z mid=0.0372
2026-02-21T08:18:55.6965433Z max=0.4209
2026-02-21T08:18:55.6965526Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:18:55.6965670Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:18:55.6965809Z  'l2_groupings': [2],
2026-02-21T08:18:55.6965915Z  'load_eviction_policies': ['', ''],
2026-02-21T08:18:55.6966035Z  'loop_orders': [[1, 0]],
2026-02-21T08:18:55.6966140Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:18:55.6966247Z  'num_stages': 1,
2026-02-21T08:18:55.6966335Z  'num_warps': 16,
2026-02-21T08:18:55.6966427Z  'pid_type': 'flat',
2026-02-21T08:18:55.6966530Z  'range_flattens': [None, None],
2026-02-21T08:18:55.6966644Z  'range_multi_buffers': [None, None],
2026-02-21T08:18:55.6966765Z  'range_num_stages': [0, 2],
2026-02-21T08:18:55.6966886Z  'range_unroll_factors': [0, 1],
2026-02-21T08:18:55.6967001Z  'range_warp_specializes': [],
2026-02-21T08:18:55.6967106Z  'waves_per_eu': 4}
2026-02-21T08:18:55.8603218Z [99s] Fitting surrogate: 267 points, 267 targets
2026-02-21T08:18:57.0918207Z [100s] Generation 3 starting: 77 neighbors, 5 active search path(s)
2026-02-21T08:19:36.2118558Z [139s] Timeout after 30s compiling Config(block_sizes=[512, 1, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:19:36.6834681Z [140s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 16], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:19:36.6853404Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.2 configs/s
2026-02-21T08:19:41.5393875Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 16.4 configs/s
2026-02-21T08:19:46.0844370Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 237.7         
2026-02-21T08:19:46.0844978Z                                                                   configs/s     
2026-02-21T08:19:46.5610308Z [150s] Generation 3 complete: 
2026-02-21T08:19:46.5610769Z timeout=2
2026-02-21T08:19:46.5610978Z ok=81
2026-02-21T08:19:46.5611178Z min=0.0273
2026-02-21T08:19:46.5611390Z mid=0.0360
2026-02-21T08:19:46.5611594Z max=0.4211
2026-02-21T08:19:46.5611816Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:19:46.5612193Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:19:46.5612602Z  'l2_groupings': [2],
2026-02-21T08:19:46.5612881Z  'load_eviction_policies': ['', ''],
2026-02-21T08:19:46.5613195Z  'loop_orders': [[1, 0]],
2026-02-21T08:19:46.5613502Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:19:46.5613766Z  'num_stages': 1,
2026-02-21T08:19:46.5613997Z  'num_warps': 16,
2026-02-21T08:19:46.5614223Z  'pid_type': 'flat',
2026-02-21T08:19:46.5614485Z  'range_flattens': [None, None],
2026-02-21T08:19:46.5614790Z  'range_multi_buffers': [None, None],
2026-02-21T08:19:46.5614942Z  'range_num_stages': [0, 2],
2026-02-21T08:19:46.5615046Z  'range_unroll_factors': [0, 1],
2026-02-21T08:19:46.5615155Z  'range_warp_specializes': [],
2026-02-21T08:19:46.5615261Z  'waves_per_eu': 4}
2026-02-21T08:19:46.6900800Z [150s] Fitting surrogate: 350 points, 350 targets
2026-02-21T08:19:47.5130733Z [150s] Generation 4 starting: 78 neighbors, 5 active search path(s)
2026-02-21T08:20:25.2787791Z [188s] Timeout after 30s compiling Config(block_sizes=[512, 1, 16], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='xyz', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:20:25.2810039Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.3 configs/s
2026-02-21T08:20:30.4664193Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 15.4 configs/s
2026-02-21T08:20:34.9474905Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 222.8         
2026-02-21T08:20:34.9475474Z                                                                   configs/s     
2026-02-21T08:20:35.4762948Z [198s] Generation 4 complete: 
2026-02-21T08:20:35.4763182Z timeout=1
2026-02-21T08:20:35.4763261Z ok=83
2026-02-21T08:20:35.4763348Z min=0.0273
2026-02-21T08:20:35.4763430Z mid=0.0342
2026-02-21T08:20:35.4763506Z max=0.9757
2026-02-21T08:20:35.4763630Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:20:35.4763773Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:20:35.4764329Z  'l2_groupings': [2],
2026-02-21T08:20:35.4764433Z  'load_eviction_policies': ['', ''],
2026-02-21T08:20:35.4764547Z  'loop_orders': [[1, 0]],
2026-02-21T08:20:35.4764650Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:20:35.4764751Z  'num_stages': 1,
2026-02-21T08:20:35.4764834Z  'num_warps': 16,
2026-02-21T08:20:35.4764920Z  'pid_type': 'flat',
2026-02-21T08:20:35.4765016Z  'range_flattens': [None, None],
2026-02-21T08:20:35.4765131Z  'range_multi_buffers': [None, None],
2026-02-21T08:20:35.4765243Z  'range_num_stages': [0, 2],
2026-02-21T08:20:35.4765345Z  'range_unroll_factors': [0, 1],
2026-02-21T08:20:35.4765451Z  'range_warp_specializes': [],
2026-02-21T08:20:35.4765554Z  'waves_per_eu': 4}
2026-02-21T08:20:35.6124721Z [199s] Fitting surrogate: 434 points, 434 targets
2026-02-21T08:20:37.0096028Z [200s] Generation 5 starting: 78 neighbors, 5 active search path(s)
2026-02-21T08:21:11.8511575Z [235s] Timeout after 30s compiling Config(block_sizes=[256, 1, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[3, 2], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:21:11.8530512Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.5 configs/s
2026-02-21T08:21:16.8754340Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 15.9 configs/s
2026-02-21T08:21:21.4340505Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 217.9         
2026-02-21T08:21:21.4340779Z                                                                   configs/s     
2026-02-21T08:21:21.9521382Z [245s] Generation 5 complete: 
2026-02-21T08:21:21.9521549Z timeout=1
2026-02-21T08:21:21.9521958Z ok=83
2026-02-21T08:21:21.9522037Z min=0.0269
2026-02-21T08:21:21.9522113Z mid=0.0317
2026-02-21T08:21:21.9522212Z max=0.4455
2026-02-21T08:21:21.9522298Z best={'block_sizes': [256, 1, 32],
2026-02-21T08:21:21.9522437Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:21:21.9522639Z  'l2_groupings': [8],
2026-02-21T08:21:21.9522749Z  'load_eviction_policies': ['', ''],
2026-02-21T08:21:21.9522865Z  'loop_orders': [[1, 0]],
2026-02-21T08:21:21.9522968Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:21:21.9523071Z  'num_stages': 2,
2026-02-21T08:21:21.9523154Z  'num_warps': 4,
2026-02-21T08:21:21.9523244Z  'pid_type': 'flat',
2026-02-21T08:21:21.9523343Z  'range_flattens': [None, False],
2026-02-21T08:21:21.9523459Z  'range_multi_buffers': [None, False],
2026-02-21T08:21:21.9523577Z  'range_num_stages': [0, 3],
2026-02-21T08:21:21.9523682Z  'range_unroll_factors': [0, 0],
2026-02-21T08:21:21.9523791Z  'range_warp_specializes': [],
2026-02-21T08:21:21.9523894Z  'waves_per_eu': 1}
2026-02-21T08:21:22.1119831Z [245s] Fitting surrogate: 518 points, 518 targets
2026-02-21T08:21:23.5641191Z [247s] Generation 6 starting: 69 neighbors, 5 active search path(s)
2026-02-21T08:21:55.9007763Z [279s] Timeout after 30s compiling Config(block_sizes=[256, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:22:00.5769475Z [284s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='xyz', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:22:01.2567632Z [284s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='xyz', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:22:01.2581711Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 0.2 configs/s
2026-02-21T08:22:05.3735144Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 16.9 configs/s
2026-02-21T08:22:09.0955241Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 265.6 configs/s
2026-02-21T08:22:09.5582972Z [293s] Generation 6 complete: 
2026-02-21T08:22:09.5583129Z timeout=3
2026-02-21T08:22:09.5583228Z ok=71
2026-02-21T08:22:09.5583313Z min=0.0262
2026-02-21T08:22:09.5583391Z mid=0.0312
2026-02-21T08:22:09.5583487Z max=0.4934
2026-02-21T08:22:09.5583572Z best={'block_sizes': [256, 1, 32],
2026-02-21T08:22:09.5583710Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:22:09.5583838Z  'l2_groupings': [8],
2026-02-21T08:22:09.5583957Z  'load_eviction_policies': ['', ''],
2026-02-21T08:22:09.5584074Z  'loop_orders': [[1, 0]],
2026-02-21T08:22:09.5584176Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:22:09.5584279Z  'num_stages': 2,
2026-02-21T08:22:09.5584363Z  'num_warps': 4,
2026-02-21T08:22:09.5584453Z  'pid_type': 'flat',
2026-02-21T08:22:09.5584552Z  'range_flattens': [None, False],
2026-02-21T08:22:09.5584678Z  'range_multi_buffers': [None, False],
2026-02-21T08:22:09.5584793Z  'range_num_stages': [0, 2],
2026-02-21T08:22:09.5584902Z  'range_unroll_factors': [0, 0],
2026-02-21T08:22:09.5585009Z  'range_warp_specializes': [],
2026-02-21T08:22:09.5585436Z  'waves_per_eu': 1}
2026-02-21T08:22:09.6807971Z [293s] Fitting surrogate: 592 points, 592 targets
2026-02-21T08:22:10.8757376Z [294s] Generation 7 starting: 71 neighbors, 5 active search path(s)
2026-02-21T08:22:44.2674561Z [327s] Timeout after 30s compiling Config(block_sizes=[256, 1, 64], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:22:49.7083206Z [333s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='xyz', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:22:49.7101980Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 0.2 configs/s
2026-02-21T08:22:54.3014817Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.5 configs/s
2026-02-21T08:22:58.1283580Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 258.5 configs/s
2026-02-21T08:22:58.5894161Z [342s] Generation 7 complete: 
2026-02-21T08:22:58.5894415Z timeout=2
2026-02-21T08:22:58.5894554Z ok=74
2026-02-21T08:22:58.5894691Z min=0.0250
2026-02-21T08:22:58.5894864Z mid=0.0293
2026-02-21T08:22:58.5895000Z max=0.2130
2026-02-21T08:22:58.5895151Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:22:58.5895392Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:22:58.5895634Z  'l2_groupings': [8],
2026-02-21T08:22:58.5895815Z  'load_eviction_policies': ['', ''],
2026-02-21T08:22:58.5896060Z  'loop_orders': [[1, 0]],
2026-02-21T08:22:58.5896249Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:22:58.5896431Z  'num_stages': 2,
2026-02-21T08:22:58.5897089Z  'num_warps': 8,
2026-02-21T08:22:58.5897254Z  'pid_type': 'flat',
2026-02-21T08:22:58.5897428Z  'range_flattens': [None, False],
2026-02-21T08:22:58.5897634Z  'range_multi_buffers': [None, False],
2026-02-21T08:22:58.5897837Z  'range_num_stages': [0, 2],
2026-02-21T08:22:58.5898001Z  'range_unroll_factors': [0, 0],
2026-02-21T08:22:58.5898203Z  'range_warp_specializes': [],
2026-02-21T08:22:58.5898386Z  'waves_per_eu': 1}
2026-02-21T08:22:58.7060331Z [342s] Fitting surrogate: 668 points, 668 targets
2026-02-21T08:22:59.7596845Z [343s] Generation 8 starting: 58 neighbors, 4 active search path(s)
2026-02-21T08:23:25.5053568Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 0.3 configs/s
2026-02-21T08:23:29.4679765Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 16.0 configs/s
2026-02-21T08:23:32.6152446Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 314.9 configs/s
2026-02-21T08:23:33.0455538Z [376s] Generation 8 complete: 
2026-02-21T08:23:33.0455729Z ok=63
2026-02-21T08:23:33.0455832Z min=0.0250
2026-02-21T08:23:33.0458607Z mid=0.0320
2026-02-21T08:23:33.0459602Z max=0.6753
2026-02-21T08:23:33.0459864Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:23:33.0460187Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:23:33.0460434Z  'l2_groupings': [8],
2026-02-21T08:23:33.0460623Z  'load_eviction_policies': ['', ''],
2026-02-21T08:23:33.0460840Z  'loop_orders': [[1, 0]],
2026-02-21T08:23:33.0461028Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:23:33.0461208Z  'num_stages': 2,
2026-02-21T08:23:33.0461363Z  'num_warps': 8,
2026-02-21T08:23:33.0461520Z  'pid_type': 'flat',
2026-02-21T08:23:33.0461701Z  'range_flattens': [None, False],
2026-02-21T08:23:33.0462345Z  'range_multi_buffers': [None, False],
2026-02-21T08:23:33.0462549Z  'range_num_stages': [0, 2],
2026-02-21T08:23:33.0462821Z  'range_unroll_factors': [0, 0],
2026-02-21T08:23:33.0463038Z  'range_warp_specializes': [],
2026-02-21T08:23:33.0463220Z  'waves_per_eu': 1}
2026-02-21T08:23:33.1387652Z [376s] Fitting surrogate: 731 points, 731 targets
2026-02-21T08:23:33.7570488Z [377s] Generation 9 starting: 52 neighbors, 4 active search path(s)
2026-02-21T08:23:53.7249790Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 0.3 configs/s
2026-02-21T08:23:57.6939015Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 14.1 configs/s
2026-02-21T08:24:00.3816214Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 363.2 configs/s
2026-02-21T08:24:00.7726762Z [404s] Generation 9 complete: 
2026-02-21T08:24:00.7727150Z ok=57
2026-02-21T08:24:00.7727370Z min=0.0246
2026-02-21T08:24:00.7727638Z mid=0.0358
2026-02-21T08:24:00.7727841Z max=0.1927
2026-02-21T08:24:00.7728079Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:24:00.7728496Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:24:00.7728896Z  'l2_groupings': [4],
2026-02-21T08:24:00.7729178Z  'load_eviction_policies': ['', ''],
2026-02-21T08:24:00.7729498Z  'loop_orders': [[1, 0]],
2026-02-21T08:24:00.7729793Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:24:00.7730062Z  'num_stages': 2,
2026-02-21T08:24:00.7730301Z  'num_warps': 8,
2026-02-21T08:24:00.7730531Z  'pid_type': 'flat',
2026-02-21T08:24:00.7730798Z  'range_flattens': [None, True],
2026-02-21T08:24:00.7731103Z  'range_multi_buffers': [None, None],
2026-02-21T08:24:00.7731421Z  'range_num_stages': [0, 2],
2026-02-21T08:24:00.7731700Z  'range_unroll_factors': [0, 0],
2026-02-21T08:24:00.7731999Z  'range_warp_specializes': [],
2026-02-21T08:24:00.7732262Z  'waves_per_eu': 1}
2026-02-21T08:24:00.8450920Z [404s] Fitting surrogate: 788 points, 788 targets
2026-02-21T08:24:01.3534630Z [404s] Generation 10 starting: 46 neighbors, 3 active search path(s)
2026-02-21T08:24:36.2378657Z [439s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:24:36.4302566Z [439s] Timeout after 30s compiling Config(block_sizes=[512, 1, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=4, pid_type='xyz', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:24:36.4323828Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 0.4 configs/s
2026-02-21T08:24:39.3182514Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 48/48 16.9 configs/s
2026-02-21T08:24:41.5230414Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 439.6 configs/s
2026-02-21T08:24:41.8496957Z [445s] Generation 10 complete: 
2026-02-21T08:24:41.8497332Z timeout=2
2026-02-21T08:24:41.8497498Z ok=47
2026-02-21T08:24:41.8497662Z min=0.0248
2026-02-21T08:24:41.8497825Z mid=0.0320
2026-02-21T08:24:41.8497982Z max=0.7386
2026-02-21T08:24:41.8498164Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:24:41.8498458Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:24:41.8498739Z  'l2_groupings': [4],
2026-02-21T08:24:41.8498954Z  'load_eviction_policies': ['', ''],
2026-02-21T08:24:41.8499202Z  'loop_orders': [[1, 0]],
2026-02-21T08:24:41.8499421Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:24:41.8500147Z  'num_stages': 2,
2026-02-21T08:24:41.8500343Z  'num_warps': 8,
2026-02-21T08:24:41.8500533Z  'pid_type': 'flat',
2026-02-21T08:24:41.8500757Z  'range_flattens': [None, True],
2026-02-21T08:24:41.8501001Z  'range_multi_buffers': [None, None],
2026-02-21T08:24:41.8501240Z  'range_num_stages': [0, 2],
2026-02-21T08:24:41.8501458Z  'range_unroll_factors': [0, 0],
2026-02-21T08:24:41.8501684Z  'range_warp_specializes': [],
2026-02-21T08:24:41.8501906Z  'waves_per_eu': 1}
2026-02-21T08:24:41.9084062Z [445s] Fitting surrogate: 837 points, 837 targets
2026-02-21T08:24:42.2781328Z [445s] Generation 11 starting: 30 neighbors, 2 active search path(s)
2026-02-21T08:25:13.3267374Z [476s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:25:13.3285070Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 0.4 configs/s
2026-02-21T08:25:15.2210464Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 32/32 16.8 configs/s
2026-02-21T08:25:16.1503668Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 963.0 configs/s
2026-02-21T08:25:16.3915920Z [479s] Generation 11 complete: 
2026-02-21T08:25:16.3916323Z timeout=1
2026-02-21T08:25:16.3916512Z ok=32
2026-02-21T08:25:16.3916701Z min=0.0247
2026-02-21T08:25:16.3916894Z mid=0.0397
2026-02-21T08:25:16.3917079Z max=0.7391
2026-02-21T08:25:16.3917288Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:25:16.3917636Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:25:16.3917978Z  'l2_groupings': [4],
2026-02-21T08:25:16.3918311Z  'load_eviction_policies': ['', ''],
2026-02-21T08:25:16.3918598Z  'loop_orders': [[1, 0]],
2026-02-21T08:25:16.3918859Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:25:16.3919762Z  'num_stages': 2,
2026-02-21T08:25:16.3919972Z  'num_warps': 8,
2026-02-21T08:25:16.3920189Z  'pid_type': 'flat',
2026-02-21T08:25:16.3920426Z  'range_flattens': [None, True],
2026-02-21T08:25:16.3920709Z  'range_multi_buffers': [None, None],
2026-02-21T08:25:16.3920985Z  'range_num_stages': [0, 2],
2026-02-21T08:25:16.3921241Z  'range_unroll_factors': [0, 0],
2026-02-21T08:25:16.3921506Z  'range_warp_specializes': [],
2026-02-21T08:25:16.3921761Z  'waves_per_eu': 1}
2026-02-21T08:25:16.4093046Z [479s] Fitting surrogate: 870 points, 870 targets
2026-02-21T08:25:16.7760217Z [480s] Generation 12 starting: 31 neighbors, 2 active search path(s)
2026-02-21T08:25:47.3177337Z [510s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:25:47.8595625Z [511s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='xyz', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:25:48.3650804Z [511s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:25:48.7962478Z [512s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='xyz', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:25:48.7975761Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 0.4 configs/s
2026-02-21T08:25:50.6095614Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 33/33 18.1 configs/s
2026-02-21T08:25:51.3009280Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1295.2 configs/s
2026-02-21T08:25:51.5357345Z [514s] Generation 12 complete: 
2026-02-21T08:25:51.5357563Z timeout=4
2026-02-21T08:25:51.5357650Z ok=30
2026-02-21T08:25:51.5357731Z min=0.0248
2026-02-21T08:25:51.5358216Z mid=0.0482
2026-02-21T08:25:51.5358291Z max=0.5346
2026-02-21T08:25:51.5358375Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:25:51.5358517Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:25:51.5358662Z  'l2_groupings': [4],
2026-02-21T08:25:51.5358766Z  'load_eviction_policies': ['', ''],
2026-02-21T08:25:51.5358882Z  'loop_orders': [[1, 0]],
2026-02-21T08:25:51.5358985Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:25:51.5359087Z  'num_stages': 2,
2026-02-21T08:25:51.5359171Z  'num_warps': 8,
2026-02-21T08:25:51.5359262Z  'pid_type': 'flat',
2026-02-21T08:25:51.5359359Z  'range_flattens': [None, True],
2026-02-21T08:25:51.5359498Z  'range_multi_buffers': [None, None],
2026-02-21T08:25:51.5359619Z  'range_num_stages': [0, 2],
2026-02-21T08:25:51.5359746Z  'range_unroll_factors': [0, 0],
2026-02-21T08:25:51.5359861Z  'range_warp_specializes': [],
2026-02-21T08:25:51.5359962Z  'waves_per_eu': 1}
2026-02-21T08:25:51.5542954Z [515s] Fitting surrogate: 904 points, 904 targets
2026-02-21T08:25:51.9730246Z [515s] Generation 13 starting: 31 neighbors, 2 active search path(s)
2026-02-21T08:26:21.4054607Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 0.3 configs/s
2026-02-21T08:26:23.5822005Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 33/33 16.2 configs/s
2026-02-21T08:26:24.9764911Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 681.4 configs/s
2026-02-21T08:26:25.2807151Z [548s] Generation 13 complete: 
2026-02-21T08:26:25.2807579Z ok=34
2026-02-21T08:26:25.2807796Z min=0.0246
2026-02-21T08:26:25.2808015Z mid=0.0345
2026-02-21T08:26:25.2808203Z max=0.5454
2026-02-21T08:26:25.2808437Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:26:25.2826548Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:26:25.2826740Z  'l2_groupings': [4],
2026-02-21T08:26:25.2826912Z  'load_eviction_policies': ['', ''],
2026-02-21T08:26:25.2827084Z  'loop_orders': [[1, 0]],
2026-02-21T08:26:25.2827227Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:26:25.2827371Z  'num_stages': 2,
2026-02-21T08:26:25.2827488Z  'num_warps': 8,
2026-02-21T08:26:25.2827607Z  'pid_type': 'flat',
2026-02-21T08:26:25.2827736Z  'range_flattens': [None, True],
2026-02-21T08:26:25.2827889Z  'range_multi_buffers': [None, None],
2026-02-21T08:26:25.2828040Z  'range_num_stages': [0, 2],
2026-02-21T08:26:25.2828178Z  'range_unroll_factors': [0, 0],
2026-02-21T08:26:25.2828320Z  'range_warp_specializes': [],
2026-02-21T08:26:25.2828459Z  'waves_per_eu': 1}
2026-02-21T08:26:25.3154652Z [548s] Fitting surrogate: 938 points, 938 targets
2026-02-21T08:26:25.6884826Z [549s] Generation 14 starting: 27 neighbors, 2 active search path(s)
2026-02-21T08:26:30.8324813Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/29 4.2 configs/s
2026-02-21T08:26:33.4752443Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 29/29 11.5 configs/s
2026-02-21T08:26:34.8961973Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 667.0 configs/s
2026-02-21T08:26:35.1963151Z [558s] Generation 14 complete: 
2026-02-21T08:26:35.1963372Z ok=30
2026-02-21T08:26:35.1963459Z min=0.0246
2026-02-21T08:26:35.1963543Z mid=0.0359
2026-02-21T08:26:35.1963625Z max=0.1114
2026-02-21T08:26:35.1963709Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:26:35.1963857Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:26:35.1963997Z  'l2_groupings': [4],
2026-02-21T08:26:35.1964103Z  'load_eviction_policies': ['', ''],
2026-02-21T08:26:35.1964220Z  'loop_orders': [[1, 0]],
2026-02-21T08:26:35.1964328Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:26:35.1964427Z  'num_stages': 2,
2026-02-21T08:26:35.1964520Z  'num_warps': 8,
2026-02-21T08:26:35.1964654Z  'pid_type': 'flat',
2026-02-21T08:26:35.1964754Z  'range_flattens': [None, True],
2026-02-21T08:26:35.1965298Z  'range_multi_buffers': [None, None],
2026-02-21T08:26:35.1965413Z  'range_num_stages': [0, 2],
2026-02-21T08:26:35.1965518Z  'range_unroll_factors': [0, 0],
2026-02-21T08:26:35.1965626Z  'range_warp_specializes': [],
2026-02-21T08:26:35.1965749Z  'waves_per_eu': 1}
2026-02-21T08:26:35.2306836Z [558s] Fitting surrogate: 968 points, 968 targets
2026-02-21T08:26:35.6124274Z [559s] Generation 15 starting: 29 neighbors, 2 active search path(s)
2026-02-21T08:26:40.3368772Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 5.9 configs/s
2026-02-21T08:26:42.4307930Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 15.9 configs/s
2026-02-21T08:26:43.9159483Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 638.8 configs/s
2026-02-21T08:26:44.2163222Z [567s] Generation 15 complete: 
2026-02-21T08:26:44.2163578Z ok=32
2026-02-21T08:26:44.2163795Z min=0.0246
2026-02-21T08:26:44.2164051Z mid=0.0364
2026-02-21T08:26:44.2164257Z max=0.0950
2026-02-21T08:26:44.2164500Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:26:44.2164881Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:26:44.2165249Z  'l2_groupings': [4],
2026-02-21T08:26:44.2165529Z  'load_eviction_policies': ['', ''],
2026-02-21T08:26:44.2165854Z  'loop_orders': [[1, 0]],
2026-02-21T08:26:44.2166140Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:26:44.2166429Z  'num_stages': 2,
2026-02-21T08:26:44.2166667Z  'num_warps': 8,
2026-02-21T08:26:44.2166904Z  'pid_type': 'flat',
2026-02-21T08:26:44.2167176Z  'range_flattens': [None, True],
2026-02-21T08:26:44.2167487Z  'range_multi_buffers': [None, None],
2026-02-21T08:26:44.2167806Z  'range_num_stages': [0, 2],
2026-02-21T08:26:44.2168084Z  'range_unroll_factors': [0, 0],
2026-02-21T08:26:44.2168386Z  'range_warp_specializes': [],
2026-02-21T08:26:44.2169188Z  'waves_per_eu': 1}
2026-02-21T08:26:44.2495291Z [567s] Fitting surrogate: 1000 points, 1000 targets
2026-02-21T08:26:44.6579541Z [568s] Generation 16 starting: 30 neighbors, 2 active search path(s)
2026-02-21T08:27:16.0206346Z [599s] Timeout after 30s compiling Config(block_sizes=[512, 1, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[3, 3], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:27:16.0222080Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 0.3 configs/s
2026-02-21T08:27:17.9661934Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 32/32 16.8 configs/s
2026-02-21T08:27:18.8926894Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 995.6 configs/s
2026-02-21T08:27:19.1553523Z [602s] Generation 16 complete: 
2026-02-21T08:27:19.1553876Z timeout=1
2026-02-21T08:27:19.1554120Z ok=32
2026-02-21T08:27:19.1554326Z min=0.0246
2026-02-21T08:27:19.1554541Z mid=0.0387
2026-02-21T08:27:19.1554746Z max=0.3619
2026-02-21T08:27:19.1554973Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:27:19.1555349Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:27:19.1555704Z  'l2_groupings': [4],
2026-02-21T08:27:19.1555982Z  'load_eviction_policies': ['', ''],
2026-02-21T08:27:19.1556297Z  'loop_orders': [[1, 0]],
2026-02-21T08:27:19.1556583Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:27:19.1556861Z  'num_stages': 2,
2026-02-21T08:27:19.1557100Z  'num_warps': 8,
2026-02-21T08:27:19.1557333Z  'pid_type': 'flat',
2026-02-21T08:27:19.1557597Z  'range_flattens': [None, True],
2026-02-21T08:27:19.1557906Z  'range_multi_buffers': [None, None],
2026-02-21T08:27:19.1558237Z  'range_num_stages': [0, 2],
2026-02-21T08:27:19.1558520Z  'range_unroll_factors': [0, 0],
2026-02-21T08:27:19.1559223Z  'range_warp_specializes': [],
2026-02-21T08:27:19.1559506Z  'waves_per_eu': 1}
2026-02-21T08:27:19.1797726Z [602s] Fitting surrogate: 1033 points, 1033 targets
2026-02-21T08:27:19.5720741Z [603s] Generation 17 starting: 29 neighbors, 2 active search path(s)
2026-02-21T08:27:50.6303637Z [634s] Timeout after 30s compiling Config(block_sizes=[512, 1, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:27:53.0709375Z [636s] Timeout after 30s compiling Config(block_sizes=[512, 1, 128], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:27:53.0727971Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0.2 configs/s
2026-02-21T08:27:54.8972194Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 17.4 configs/s
2026-02-21T08:27:56.7351849Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 804.7 configs/s
2026-02-21T08:27:57.0010664Z [640s] Generation 17 complete: 
2026-02-21T08:27:57.0011011Z timeout=2
2026-02-21T08:27:57.0011236Z ok=30
2026-02-21T08:27:57.0011442Z min=0.0246
2026-02-21T08:27:57.0011651Z mid=0.0355
2026-02-21T08:27:57.0011852Z max=0.7757
2026-02-21T08:27:57.0012540Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:27:57.0012917Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:27:57.0013270Z  'l2_groupings': [4],
2026-02-21T08:27:57.0013584Z  'load_eviction_policies': ['', ''],
2026-02-21T08:27:57.0013896Z  'loop_orders': [[1, 0]],
2026-02-21T08:27:57.0014176Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:27:57.0014449Z  'num_stages': 2,
2026-02-21T08:27:57.0014677Z  'num_warps': 8,
2026-02-21T08:27:57.0014919Z  'pid_type': 'flat',
2026-02-21T08:27:57.0015189Z  'range_flattens': [None, True],
2026-02-21T08:27:57.0015494Z  'range_multi_buffers': [None, None],
2026-02-21T08:27:57.0015804Z  'range_num_stages': [0, 2],
2026-02-21T08:27:57.0016088Z  'range_unroll_factors': [0, 0],
2026-02-21T08:27:57.0016413Z  'range_warp_specializes': [],
2026-02-21T08:27:57.0016677Z  'waves_per_eu': 1}
2026-02-21T08:27:57.0315408Z [640s] Fitting surrogate: 1065 points, 1065 targets
2026-02-21T08:27:57.4409038Z [640s] Generation 18 starting: 30 neighbors, 2 active search path(s)
2026-02-21T08:28:28.0648980Z [671s] Timeout after 30s compiling Config(block_sizes=[512, 1, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:28:28.3006311Z [671s] Timeout after 30s compiling Config(block_sizes=[512, 1, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:28:29.4919727Z [672s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:28:31.6567331Z [675s] Timeout after 30s compiling Config(block_sizes=[512, 1, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:28:31.6587475Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 0.2 configs/s
2026-02-21T08:28:33.4277626Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 32/32 18.0 configs/s
2026-02-21T08:28:34.4369577Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 920.9 configs/s
2026-02-21T08:28:34.6991765Z [678s] Generation 18 complete: 
2026-02-21T08:28:34.6992204Z timeout=4
2026-02-21T08:28:34.6992410Z ok=29
2026-02-21T08:28:34.6992612Z min=0.0246
2026-02-21T08:28:34.6992814Z mid=0.0368
2026-02-21T08:28:34.6993009Z max=0.3607
2026-02-21T08:28:34.6993232Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:28:34.6993605Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:28:34.6993951Z  'l2_groupings': [4],
2026-02-21T08:28:34.6994227Z  'load_eviction_policies': ['', ''],
2026-02-21T08:28:34.6994534Z  'loop_orders': [[1, 0]],
2026-02-21T08:28:34.6994815Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:28:34.6995090Z  'num_stages': 2,
2026-02-21T08:28:34.6995313Z  'num_warps': 8,
2026-02-21T08:28:34.6995544Z  'pid_type': 'flat',
2026-02-21T08:28:34.6995799Z  'range_flattens': [None, True],
2026-02-21T08:28:34.6996787Z  'range_multi_buffers': [None, None],
2026-02-21T08:28:34.6997090Z  'range_num_stages': [0, 2],
2026-02-21T08:28:34.6997382Z  'range_unroll_factors': [0, 0],
2026-02-21T08:28:34.6997669Z  'range_warp_specializes': [],
2026-02-21T08:28:34.6997948Z  'waves_per_eu': 1}
2026-02-21T08:28:34.7249609Z [678s] Fitting surrogate: 1098 points, 1098 targets
2026-02-21T08:28:35.1737156Z [678s] Generation 19 starting: 27 neighbors, 2 active search path(s)
2026-02-21T08:29:05.9741257Z [709s] Timeout after 30s compiling Config(block_sizes=[512, 1, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:29:05.9754798Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/29 0.2 configs/s
2026-02-21T08:29:07.7182123Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 29/29 17.1 configs/s
2026-02-21T08:29:08.8061218Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 851.1 configs/s
2026-02-21T08:29:09.0632308Z [712s] Generation 19 complete: 
2026-02-21T08:29:09.0632472Z timeout=1
2026-02-21T08:29:09.0632552Z ok=29
2026-02-21T08:29:09.0632639Z min=0.0246
2026-02-21T08:29:09.0632714Z mid=0.0358
2026-02-21T08:29:09.0632792Z max=0.3579
2026-02-21T08:29:09.0632877Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:29:09.0633023Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:29:09.0633158Z  'l2_groupings': [4],
2026-02-21T08:29:09.0633257Z  'load_eviction_policies': ['', ''],
2026-02-21T08:29:09.0633372Z  'loop_orders': [[1, 0]],
2026-02-21T08:29:09.0633476Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:29:09.0633576Z  'num_stages': 2,
2026-02-21T08:29:09.0633676Z  'num_warps': 8,
2026-02-21T08:29:09.0633764Z  'pid_type': 'flat',
2026-02-21T08:29:09.0633860Z  'range_flattens': [None, True],
2026-02-21T08:29:09.0634223Z  'range_multi_buffers': [None, None],
2026-02-21T08:29:09.0634333Z  'range_num_stages': [0, 2],
2026-02-21T08:29:09.0634436Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:09.0634546Z  'range_warp_specializes': [],
2026-02-21T08:29:09.0634647Z  'waves_per_eu': 1}
2026-02-21T08:29:09.0848123Z [712s] Fitting surrogate: 1128 points, 1128 targets
2026-02-21T08:29:09.4983089Z [712s] Generation 20 starting: 31 neighbors, 2 active search path(s)
2026-02-21T08:29:41.9330090Z [745s] Timeout after 30s compiling Config(block_sizes=[512, 1, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[3, 3], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:29:42.3406676Z [745s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:29:42.3426645Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 0.3 configs/s
2026-02-21T08:29:44.3332997Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 33/33 17.0 configs/s
2026-02-21T08:29:45.4993496Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 806.1 configs/s
2026-02-21T08:29:45.7734518Z [749s] Generation 20 complete: 
2026-02-21T08:29:45.7735574Z timeout=2
2026-02-21T08:29:45.7735786Z ok=32
2026-02-21T08:29:45.7739575Z min=0.0246
2026-02-21T08:29:45.7739875Z mid=0.0354
2026-02-21T08:29:45.7740085Z max=0.1502
2026-02-21T08:29:45.7740318Z best={'block_sizes': [512, 1, 32],
2026-02-21T08:29:45.7740725Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:29:45.7741085Z  'l2_groupings': [4],
2026-02-21T08:29:45.7741353Z  'load_eviction_policies': ['', ''],
2026-02-21T08:29:45.7741663Z  'loop_orders': [[1, 0]],
2026-02-21T08:29:45.7741940Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:29:45.7742211Z  'num_stages': 2,
2026-02-21T08:29:45.7742434Z  'num_warps': 8,
2026-02-21T08:29:45.7742669Z  'pid_type': 'flat',
2026-02-21T08:29:45.7742929Z  'range_flattens': [None, True],
2026-02-21T08:29:45.7743230Z  'range_multi_buffers': [None, None],
2026-02-21T08:29:45.7743530Z  'range_num_stages': [0, 2],
2026-02-21T08:29:45.7743807Z  'range_unroll_factors': [0, 0],
2026-02-21T08:29:45.7744101Z  'range_warp_specializes': [],
2026-02-21T08:29:45.7744382Z  'waves_per_eu': 1}
2026-02-21T08:29:45.8025245Z [749s] Fitting surrogate: 1162 points, 1162 targets
2026-02-21T08:29:45.9499144Z [749s] Autotuning complete in 749.4s after searching 1086 configs.
2026-02-21T08:29:45.9499388Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:29:45.9500089Z     @helion.kernel(config=helion.Config(block_sizes=[512, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:29:45.9500726Z 
2026-02-21T08:29:45.9500897Z [749s] Code of selected kernel: /tmp/torchinductor_root/wk/cwkmlt5dlqdr7tbbndynkhebku5gl33vony35lt74jvn3deniwxf.py
2026-02-21T08:29:46.9349173Z WARNING:tritonbench.utils.triton_op:Completed input ID 3:
2026-02-21T08:29:46.9349698Z x_val
2026-02-21T08:29:46.9349933Z ------------------
2026-02-21T08:29:46.9350195Z (1, 1, 8192, 3584)
2026-02-21T08:29:46.9350358Z 
2026-02-21T08:29:46.9368659Z  20%|██        | 2/10 [20:26<1:25:05, 638.24s/it]WARNING:tritonbench.utils.triton_op:Running input ID 7:
2026-02-21T08:29:46.9369009Z x_val
2026-02-21T08:29:46.9369156Z ------------------
2026-02-21T08:29:46.9369306Z (4, 1, 8192, 3584)
2026-02-21T08:29:46.9371142Z INFO:tritonbench.utils.triton_op:Took 0.16ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T08:29:48.0792610Z INFO:tritonbench.utils.triton_op:Took 5.23ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T08:29:49.3571317Z Autotune Choices Stats:
2026-02-21T08:29:49.3572186Z {"num_choices": 28, "num_triton_choices": 27, "best_kernel": "triton_mm_11", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.023959999904036522, "best_triton_pos": 0}
2026-02-21T08:29:49.3579934Z AUTOTUNE mm(4x3584, 3584x8192)
2026-02-21T08:29:49.3580114Z strides: [3584, 1], [8192, 1]
2026-02-21T08:29:49.3581370Z dtypes: torch.bfloat16, torch.bfloat16
2026-02-21T08:29:49.3581913Z   triton_mm_11 0.0240 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:29:49.3582348Z   mm 0.0246 ms 97.2% 
2026-02-21T08:29:49.3582702Z   triton_mm_10 0.0256 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:29:49.3583749Z   triton_mm_18 0.0256 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:29:49.3584366Z   triton_mm_3 0.0281 ms 85.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=2
2026-02-21T08:29:49.3585086Z   triton_mm_7 0.0281 ms 85.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=2
2026-02-21T08:29:49.3585762Z   triton_mm_9 0.0306 ms 78.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:29:49.3586343Z   triton_mm_4 0.0307 ms 78.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:29:49.3587096Z   triton_mm_17 0.0307 ms 78.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:29:49.3587686Z   triton_mm_24 0.0320 ms 74.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:29:49.3588144Z SingleProcess AUTOTUNE benchmarking takes 0.4020 seconds and 0.1629 seconds precompiling for 28 choices
2026-02-21T08:29:52.1466815Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T08:29:53.7844249Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:29:53.7844515Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:29:53.7844641Z               'dtype': 'torch.bfloat16',
2026-02-21T08:29:53.7844762Z               'shape': (4, 1, 3584),
2026-02-21T08:29:53.7844902Z               'stride': (3584, 3584, 1)},
2026-02-21T08:29:53.7845022Z             { 'device': 'cuda:0',
2026-02-21T08:29:53.7845134Z               'dtype': 'torch.int32',
2026-02-21T08:29:53.7845246Z               'shape': (3584, 8192),
2026-02-21T08:29:53.7845357Z               'stride': (8192, 1)}),
2026-02-21T08:29:53.7845465Z   'kwargs': {}}
2026-02-21T08:29:53.7877472Z INFO:tritonbench.utils.triton_op:Took 3.83ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T08:29:53.9635299Z [0s] Autotune random seed: 2134834638
2026-02-21T08:29:54.1525474Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:30:27.1395869Z [32s] Timeout after 30s compiling Config(block_sizes=[128, 1, 2048], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=4, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T08:30:30.0009572Z [35s] Timeout after 30s compiling Config(block_sizes=[8, 2, 8192], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[0, 4], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T08:30:31.4825981Z [37s] Timeout after 30s compiling Config(block_sizes=[512, 1, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T08:30:32.8888076Z [38s] Timeout after 30s compiling Config(block_sizes=[8, 2, 2048], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[0, 1], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:30:37.0087786Z [42s] Timeout after 30s compiling Config(block_sizes=[512, 1, 1024], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=2, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[4, 0], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:30:37.0102379Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s
2026-02-21T08:30:43.0826719Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.6 configs/s
2026-02-21T08:30:43.0835430Z [48s] Adaptive compile timeout: 30s (90% percentile=11.3s, bounds=[30.0s, 30s])
2026-02-21T08:30:43.4061529Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 3852.3 configs/s
2026-02-21T08:30:43.6725237Z [49s] Initial random population of 100, 5 starting points: 
2026-02-21T08:30:43.6725442Z error=1
2026-02-21T08:30:43.6725536Z timeout=5
2026-02-21T08:30:43.6725619Z ok=94
2026-02-21T08:30:43.6725707Z min=0.0502
2026-02-21T08:30:43.6725792Z mid=0.2939
2026-02-21T08:30:43.6725876Z max=21.1667
2026-02-21T08:30:43.6725971Z best={'block_sizes': [64, 2, 8],
2026-02-21T08:30:43.6726159Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:30:43.6726292Z  'l2_groupings': [16],
2026-02-21T08:30:43.6726405Z  'load_eviction_policies': ['', ''],
2026-02-21T08:30:43.6726548Z  'loop_orders': [[1, 0]],
2026-02-21T08:30:43.6726660Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:30:43.6726773Z  'num_sm_multiplier': 8,
2026-02-21T08:30:43.6726874Z  'num_stages': 2,
2026-02-21T08:30:43.6726970Z  'num_warps': 2,
2026-02-21T08:30:43.6727072Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:30:43.6727201Z  'range_flattens': [None, True],
2026-02-21T08:30:43.6727317Z  'range_multi_buffers': [None, True],
2026-02-21T08:30:43.6727438Z  'range_num_stages': [0, 4],
2026-02-21T08:30:43.6727543Z  'range_unroll_factors': [0, 2],
2026-02-21T08:30:43.6727659Z  'range_warp_specializes': [],
2026-02-21T08:30:43.6727770Z  'waves_per_eu': 2}
2026-02-21T08:30:43.6787070Z [49s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:30:44.7367017Z [50s] Generation 1 starting: 94 neighbors, 5 active search path(s)
2026-02-21T08:31:11.2196450Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 0.3 configs/s
2026-02-21T08:31:17.3067355Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 16.3 configs/s
2026-02-21T08:31:19.6114883Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 411.7 configs/s
2026-02-21T08:31:20.0156322Z [85s] Generation 1 complete: 
2026-02-21T08:31:20.0156696Z ok=100
2026-02-21T08:31:20.0156899Z min=0.0443
2026-02-21T08:31:20.0157061Z mid=0.0796
2026-02-21T08:31:20.0158875Z max=1.0174
2026-02-21T08:31:20.0159051Z best={'block_sizes': [128, 2, 8],
2026-02-21T08:31:20.0159343Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:31:20.0159614Z  'l2_groupings': [8],
2026-02-21T08:31:20.0159818Z  'load_eviction_policies': ['', ''],
2026-02-21T08:31:20.0160049Z  'loop_orders': [[1, 0]],
2026-02-21T08:31:20.0160254Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:31:20.0160497Z  'num_sm_multiplier': 8,
2026-02-21T08:31:20.0160695Z  'num_stages': 4,
2026-02-21T08:31:20.0160862Z  'num_warps': 2,
2026-02-21T08:31:20.0161077Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:31:20.0161314Z  'range_flattens': [None, False],
2026-02-21T08:31:20.0161539Z  'range_multi_buffers': [None, True],
2026-02-21T08:31:20.0161764Z  'range_num_stages': [2, 3],
2026-02-21T08:31:20.0161977Z  'range_unroll_factors': [4, 3],
2026-02-21T08:31:20.0162200Z  'range_warp_specializes': [],
2026-02-21T08:31:20.0162406Z  'waves_per_eu': 4}
2026-02-21T08:31:20.0641054Z [85s] Fitting surrogate: 200 points, 200 targets
2026-02-21T08:31:21.0840492Z [86s] Generation 2 starting: 96 neighbors, 5 active search path(s)
2026-02-21T08:31:53.9038567Z [119s] Timeout after 30s compiling Config(block_sizes=[512, 2, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[2, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T08:31:55.0699376Z [120s] Timeout after 30s compiling Config(block_sizes=[512, 2, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[2, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T08:31:55.0722799Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 0.4 configs/s
2026-02-21T08:32:01.5190307Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 15.4 configs/s
2026-02-21T08:32:06.0911240Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 253.7 configs/s
2026-02-21T08:32:06.4259875Z [132s] Generation 2 complete: 
2026-02-21T08:32:06.4260097Z timeout=2
2026-02-21T08:32:06.4260188Z ok=99
2026-02-21T08:32:06.4262058Z min=0.0411
2026-02-21T08:32:06.4262190Z mid=0.0676
2026-02-21T08:32:06.4262274Z max=0.9163
2026-02-21T08:32:06.4262386Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:32:06.4262536Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:32:06.4262687Z  'l2_groupings': [8],
2026-02-21T08:32:06.4262802Z  'load_eviction_policies': ['', ''],
2026-02-21T08:32:06.4262926Z  'loop_orders': [[1, 0]],
2026-02-21T08:32:06.4263037Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:32:06.4263146Z  'num_sm_multiplier': 4,
2026-02-21T08:32:06.4263259Z  'num_stages': 2,
2026-02-21T08:32:06.4263352Z  'num_warps': 2,
2026-02-21T08:32:06.4263464Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:32:06.4263593Z  'range_flattens': [None, True],
2026-02-21T08:32:06.4264457Z  'range_multi_buffers': [None, True],
2026-02-21T08:32:06.4264588Z  'range_num_stages': [0, 4],
2026-02-21T08:32:06.4264712Z  'range_unroll_factors': [0, 2],
2026-02-21T08:32:06.4264829Z  'range_warp_specializes': [],
2026-02-21T08:32:06.4264935Z  'waves_per_eu': 2}
2026-02-21T08:32:06.5067832Z [132s] Fitting surrogate: 301 points, 301 targets
2026-02-21T08:32:07.5521231Z [133s] Generation 3 starting: 99 neighbors, 5 active search path(s)
2026-02-21T08:32:40.7230880Z [166s] Timeout after 30s compiling Config(block_sizes=[512, 2, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[2, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T08:32:45.0604565Z [170s] Timeout after 30s compiling Config(block_sizes=[256, 2, 64], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:32:45.3440813Z [171s] Timeout after 30s compiling Config(block_sizes=[256, 2, 32], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:32:45.4476398Z [171s] Timeout after 30s compiling Config(block_sizes=[512, 1, 64], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[1, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:32:45.8430690Z [171s] Timeout after 30s compiling Config(block_sizes=[128, 1, 64], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:32:45.9747757Z [171s] Timeout after 30s compiling Config(block_sizes=[256, 1, 64], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[1, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:32:45.9761082Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 101/101 0.8 configs/s
2026-02-21T08:32:52.1205032Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 101/101 16.6 configs/s
2026-02-21T08:32:57.1522870Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 221.4 configs/s
2026-02-21T08:32:57.6562892Z [183s] Generation 3 complete: 
2026-02-21T08:32:57.6563105Z timeout=6
2026-02-21T08:32:57.6563184Z ok=98
2026-02-21T08:32:57.6563266Z min=0.0393
2026-02-21T08:32:57.6563377Z mid=0.0638
2026-02-21T08:32:57.6563452Z max=1.3491
2026-02-21T08:32:57.6564011Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:32:57.6564151Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:32:57.6564308Z  'l2_groupings': [2],
2026-02-21T08:32:57.6564413Z  'load_eviction_policies': ['', ''],
2026-02-21T08:32:57.6564529Z  'loop_orders': [[1, 0]],
2026-02-21T08:32:57.6564633Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:32:57.6564739Z  'num_stages': 1,
2026-02-21T08:32:57.6565614Z  'num_warps': 4,
2026-02-21T08:32:57.6565811Z  'pid_type': 'xyz',
2026-02-21T08:32:57.6565926Z  'range_flattens': [None, None],
2026-02-21T08:32:57.6566078Z  'range_multi_buffers': [None, False],
2026-02-21T08:32:57.6566200Z  'range_num_stages': [0, 2],
2026-02-21T08:32:57.6566311Z  'range_unroll_factors': [0, 0],
2026-02-21T08:32:57.6566424Z  'range_warp_specializes': [],
2026-02-21T08:32:57.6566525Z  'waves_per_eu': 2}
2026-02-21T08:32:57.7588991Z [183s] Fitting surrogate: 405 points, 405 targets
2026-02-21T08:32:58.7112959Z [184s] Generation 4 starting: 95 neighbors, 5 active search path(s)
2026-02-21T08:33:41.7590397Z [227s] Timeout after 30s compiling Config(block_sizes=[256, 2, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:33:42.4277830Z [228s] Timeout after 30s compiling Config(block_sizes=[1024, 2, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:33:42.4298527Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 0.5 configs/s
2026-02-21T08:33:48.6702472Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 15.6 configs/s
2026-02-21T08:33:53.6162488Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 200.1 configs/s
2026-02-21T08:33:54.1523591Z [239s] Generation 4 complete: 
2026-02-21T08:33:54.1524008Z timeout=2
2026-02-21T08:33:54.1524223Z ok=98
2026-02-21T08:33:54.1524437Z min=0.0391
2026-02-21T08:33:54.1524643Z mid=0.0497
2026-02-21T08:33:54.1524857Z max=19.7939
2026-02-21T08:33:54.1525097Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:33:54.1525468Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:33:54.1525825Z  'l2_groupings': [8],
2026-02-21T08:33:54.1526106Z  'load_eviction_policies': ['', ''],
2026-02-21T08:33:54.1526448Z  'loop_orders': [[1, 0]],
2026-02-21T08:33:54.1526730Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:33:54.1527077Z  'num_stages': 2,
2026-02-21T08:33:54.1527310Z  'num_warps': 4,
2026-02-21T08:33:54.1527544Z  'pid_type': 'xyz',
2026-02-21T08:33:54.1527826Z  'range_flattens': [None, True],
2026-02-21T08:33:54.1528136Z  'range_multi_buffers': [None, True],
2026-02-21T08:33:54.1528444Z  'range_num_stages': [0, 4],
2026-02-21T08:33:54.1528728Z  'range_unroll_factors': [0, 2],
2026-02-21T08:33:54.1529032Z  'range_warp_specializes': [],
2026-02-21T08:33:54.1529309Z  'waves_per_eu': 2}
2026-02-21T08:33:54.2730668Z [240s] Fitting surrogate: 505 points, 505 targets
2026-02-21T08:33:55.0626620Z [240s] Generation 5 starting: 72 neighbors, 4 active search path(s)
2026-02-21T08:34:34.5485757Z [280s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T08:34:35.5634879Z [281s] Timeout after 30s compiling Config(block_sizes=[512, 4, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T08:34:35.5720544Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.4 configs/s
2026-02-21T08:34:39.9860838Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 16.7 configs/s
2026-02-21T08:34:44.0500011Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 241.8 configs/s
2026-02-21T08:34:44.5620827Z [290s] Generation 5 complete: 
2026-02-21T08:34:44.5621044Z timeout=2
2026-02-21T08:34:44.5621214Z ok=74
2026-02-21T08:34:44.5621299Z min=0.0387
2026-02-21T08:34:44.5621387Z mid=0.0508
2026-02-21T08:34:44.5621466Z max=1.0559
2026-02-21T08:34:44.5621561Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:34:44.5621716Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:34:44.5621857Z  'l2_groupings': [2],
2026-02-21T08:34:44.5621968Z  'load_eviction_policies': ['', ''],
2026-02-21T08:34:44.5622090Z  'loop_orders': [[1, 0]],
2026-02-21T08:34:44.5622204Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:34:44.5622311Z  'num_sm_multiplier': 4,
2026-02-21T08:34:44.5622416Z  'num_stages': 1,
2026-02-21T08:34:44.5622505Z  'num_warps': 4,
2026-02-21T08:34:44.5622612Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:34:44.5622735Z  'range_flattens': [None, None],
2026-02-21T08:34:44.5622855Z  'range_multi_buffers': [False, False],
2026-02-21T08:34:44.5622992Z  'range_num_stages': [2, 2],
2026-02-21T08:34:44.5623097Z  'range_unroll_factors': [0, 0],
2026-02-21T08:34:44.5623216Z  'range_warp_specializes': [],
2026-02-21T08:34:44.5623801Z  'waves_per_eu': 2}
2026-02-21T08:34:44.6598767Z [290s] Fitting surrogate: 581 points, 581 targets
2026-02-21T08:34:45.5073634Z [291s] Generation 6 starting: 84 neighbors, 4 active search path(s)
2026-02-21T08:35:08.0283491Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 0.6 configs/s
2026-02-21T08:35:13.5365635Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 15.8 configs/s
2026-02-21T08:35:19.4658644Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 187.1 configs/s
2026-02-21T08:35:19.9878231Z [325s] Generation 6 complete: 
2026-02-21T08:35:19.9878647Z ok=88
2026-02-21T08:35:19.9878853Z min=0.0384
2026-02-21T08:35:19.9879115Z mid=0.0442
2026-02-21T08:35:19.9879310Z max=0.6796
2026-02-21T08:35:19.9879584Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:35:19.9879956Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:35:19.9880359Z  'l2_groupings': [2],
2026-02-21T08:35:19.9880633Z  'load_eviction_policies': ['', ''],
2026-02-21T08:35:19.9880946Z  'loop_orders': [[1, 0]],
2026-02-21T08:35:19.9881225Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:35:19.9881503Z  'num_sm_multiplier': 4,
2026-02-21T08:35:19.9881776Z  'num_stages': 1,
2026-02-21T08:35:19.9882003Z  'num_warps': 4,
2026-02-21T08:35:19.9882265Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:35:19.9882681Z  'range_flattens': [None, None],
2026-02-21T08:35:19.9882988Z  'range_multi_buffers': [False, False],
2026-02-21T08:35:19.9883297Z  'range_num_stages': [2, 2],
2026-02-21T08:35:19.9883569Z  'range_unroll_factors': [0, 0],
2026-02-21T08:35:19.9883857Z  'range_warp_specializes': [],
2026-02-21T08:35:19.9884128Z  'waves_per_eu': 2}
2026-02-21T08:35:20.1210496Z [325s] Fitting surrogate: 669 points, 669 targets
2026-02-21T08:35:20.7812634Z [326s] Generation 7 starting: 57 neighbors, 3 active search path(s)
2026-02-21T08:35:55.2455668Z [361s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[3, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:35:55.9183077Z [361s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:35:58.8862421Z [364s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 2], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:35:58.8878956Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58/58 0.5 configs/s
2026-02-21T08:36:02.2343054Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 58/58 17.6 configs/s
2026-02-21T08:36:04.5922873Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 405.8 configs/s
2026-02-21T08:36:04.9431451Z [370s] Generation 7 complete: 
2026-02-21T08:36:04.9431698Z timeout=3
2026-02-21T08:36:04.9431797Z ok=58
2026-02-21T08:36:04.9431902Z min=0.0384
2026-02-21T08:36:04.9432439Z mid=0.0600
2026-02-21T08:36:04.9432534Z max=0.7736
2026-02-21T08:36:04.9432652Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:36:04.9432830Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:36:04.9433007Z  'l2_groupings': [2],
2026-02-21T08:36:04.9433142Z  'load_eviction_policies': ['', ''],
2026-02-21T08:36:04.9433295Z  'loop_orders': [[1, 0]],
2026-02-21T08:36:04.9433426Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:36:04.9433561Z  'num_sm_multiplier': 4,
2026-02-21T08:36:04.9433685Z  'num_stages': 1,
2026-02-21T08:36:04.9433795Z  'num_warps': 4,
2026-02-21T08:36:04.9433921Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:36:04.9434079Z  'range_flattens': [None, None],
2026-02-21T08:36:04.9434218Z  'range_multi_buffers': [False, False],
2026-02-21T08:36:04.9434368Z  'range_num_stages': [2, 2],
2026-02-21T08:36:04.9434510Z  'range_unroll_factors': [0, 0],
2026-02-21T08:36:04.9434653Z  'range_warp_specializes': [],
2026-02-21T08:36:04.9434779Z  'waves_per_eu': 2}
2026-02-21T08:36:05.0044087Z [370s] Fitting surrogate: 730 points, 730 targets
2026-02-21T08:36:05.5815251Z [371s] Generation 8 starting: 54 neighbors, 3 active search path(s)
2026-02-21T08:36:32.9478814Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 0.3 configs/s
2026-02-21T08:36:36.3921498Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 16.4 configs/s
2026-02-21T08:36:39.4755828Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 315.2 configs/s
2026-02-21T08:36:39.8735515Z [405s] Generation 8 complete: 
2026-02-21T08:36:39.8735957Z ok=58
2026-02-21T08:36:39.8736164Z min=0.0388
2026-02-21T08:36:39.8736383Z mid=0.0416
2026-02-21T08:36:39.8736578Z max=1.0584
2026-02-21T08:36:39.8736810Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:36:39.8737972Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:36:39.8738350Z  'l2_groupings': [2],
2026-02-21T08:36:39.8738654Z  'load_eviction_policies': ['', ''],
2026-02-21T08:36:39.8738971Z  'loop_orders': [[1, 0]],
2026-02-21T08:36:39.8739249Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:36:39.8739540Z  'num_sm_multiplier': 4,
2026-02-21T08:36:39.8739806Z  'num_stages': 1,
2026-02-21T08:36:39.8740030Z  'num_warps': 4,
2026-02-21T08:36:39.8756368Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:36:39.8756576Z  'range_flattens': [None, None],
2026-02-21T08:36:39.8756743Z  'range_multi_buffers': [False, False],
2026-02-21T08:36:39.8756908Z  'range_num_stages': [2, 2],
2026-02-21T08:36:39.8757057Z  'range_unroll_factors': [0, 0],
2026-02-21T08:36:39.8757248Z  'range_warp_specializes': [],
2026-02-21T08:36:39.8757418Z  'waves_per_eu': 2}
2026-02-21T08:36:39.9531394Z [405s] Fitting surrogate: 788 points, 788 targets
2026-02-21T08:36:40.5704786Z [406s] Generation 9 starting: 45 neighbors, 3 active search path(s)
2026-02-21T08:36:48.6580086Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 3.8 configs/s
2026-02-21T08:36:51.6451235Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 45/45 15.8 configs/s
2026-02-21T08:36:55.1770975Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 277.1 configs/s
2026-02-21T08:36:55.5919266Z [421s] Generation 9 complete: 
2026-02-21T08:36:55.5919544Z ok=49
2026-02-21T08:36:55.5919713Z min=0.0386
2026-02-21T08:36:55.5919880Z mid=0.0398
2026-02-21T08:36:55.5920038Z max=0.2858
2026-02-21T08:36:55.5920237Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:36:55.5920531Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:36:55.5920817Z  'l2_groupings': [2],
2026-02-21T08:36:55.5921037Z  'load_eviction_policies': ['', ''],
2026-02-21T08:36:55.5921280Z  'loop_orders': [[1, 0]],
2026-02-21T08:36:55.5922247Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:36:55.5922477Z  'num_sm_multiplier': 4,
2026-02-21T08:36:55.5922790Z  'num_stages': 1,
2026-02-21T08:36:55.5922973Z  'num_warps': 4,
2026-02-21T08:36:55.5923178Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:36:55.5923434Z  'range_flattens': [None, None],
2026-02-21T08:36:55.5923667Z  'range_multi_buffers': [False, False],
2026-02-21T08:36:55.5923914Z  'range_num_stages': [2, 2],
2026-02-21T08:36:55.5924127Z  'range_unroll_factors': [0, 0],
2026-02-21T08:36:55.5924364Z  'range_warp_specializes': [],
2026-02-21T08:36:55.5924579Z  'waves_per_eu': 2}
2026-02-21T08:36:55.6858306Z [421s] Fitting surrogate: 837 points, 837 targets
2026-02-21T08:36:56.3076159Z [422s] Generation 10 starting: 55 neighbors, 3 active search path(s)
2026-02-21T08:37:06.4634385Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55/55 4.0 configs/s
2026-02-21T08:37:10.0325323Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 55/55 16.0 configs/s
2026-02-21T08:37:14.1833280Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 235.7 configs/s
2026-02-21T08:37:14.6577212Z [440s] Generation 10 complete: 
2026-02-21T08:37:14.6577581Z ok=59
2026-02-21T08:37:14.6577970Z min=0.0383
2026-02-21T08:37:14.6578194Z mid=0.0430
2026-02-21T08:37:14.6579451Z max=0.2860
2026-02-21T08:37:14.6579812Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:37:14.6580262Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:37:14.6580652Z  'l2_groupings': [2],
2026-02-21T08:37:14.6580928Z  'load_eviction_policies': ['', ''],
2026-02-21T08:37:14.6581240Z  'loop_orders': [[1, 0]],
2026-02-21T08:37:14.6581514Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:37:14.6581801Z  'num_sm_multiplier': 4,
2026-02-21T08:37:14.6582063Z  'num_stages': 1,
2026-02-21T08:37:14.6582292Z  'num_warps': 4,
2026-02-21T08:37:14.6582553Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:37:14.6582938Z  'range_flattens': [None, None],
2026-02-21T08:37:14.6583242Z  'range_multi_buffers': [False, False],
2026-02-21T08:37:14.6583578Z  'range_num_stages': [2, 2],
2026-02-21T08:37:14.6583860Z  'range_unroll_factors': [0, 0],
2026-02-21T08:37:14.6584149Z  'range_warp_specializes': [],
2026-02-21T08:37:14.6584427Z  'waves_per_eu': 2}
2026-02-21T08:37:14.7565661Z [440s] Fitting surrogate: 896 points, 896 targets
2026-02-21T08:37:15.3350142Z [441s] Generation 11 starting: 52 neighbors, 3 active search path(s)
2026-02-21T08:37:23.0509433Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 7.1 configs/s
2026-02-21T08:37:26.4490401Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 52/52 16.0 configs/s
2026-02-21T08:37:30.0698595Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 272.4 configs/s
2026-02-21T08:37:30.4931032Z [456s] Generation 11 complete: 
2026-02-21T08:37:30.4931291Z ok=56
2026-02-21T08:37:30.4931372Z min=0.0386
2026-02-21T08:37:30.4931460Z mid=0.0407
2026-02-21T08:37:30.4931968Z max=0.3934
2026-02-21T08:37:30.4932056Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:37:30.4932198Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:37:30.4932338Z  'l2_groupings': [2],
2026-02-21T08:37:30.4932438Z  'load_eviction_policies': ['', ''],
2026-02-21T08:37:30.4932555Z  'loop_orders': [[1, 0]],
2026-02-21T08:37:30.4932662Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:37:30.4932764Z  'num_sm_multiplier': 4,
2026-02-21T08:37:30.4932863Z  'num_stages': 1,
2026-02-21T08:37:30.4932945Z  'num_warps': 4,
2026-02-21T08:37:30.4933044Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:37:30.4933163Z  'range_flattens': [None, None],
2026-02-21T08:37:30.4933276Z  'range_multi_buffers': [False, False],
2026-02-21T08:37:30.4933391Z  'range_num_stages': [2, 2],
2026-02-21T08:37:30.4933497Z  'range_unroll_factors': [0, 0],
2026-02-21T08:37:30.4933606Z  'range_warp_specializes': [],
2026-02-21T08:37:30.4933710Z  'waves_per_eu': 2}
2026-02-21T08:37:30.5782085Z [456s] Fitting surrogate: 952 points, 952 targets
2026-02-21T08:37:31.8942525Z [457s] Generation 12 starting: 36 neighbors, 2 active search path(s)
2026-02-21T08:37:38.7733573Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 4.7 configs/s
2026-02-21T08:37:41.1594771Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 36/36 16.0 configs/s
2026-02-21T08:37:43.4481546Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 424.4 configs/s
2026-02-21T08:37:43.8070360Z [469s] Generation 12 complete: 
2026-02-21T08:37:43.8070714Z ok=39
2026-02-21T08:37:43.8070930Z min=0.0386
2026-02-21T08:37:43.8071143Z mid=0.0436
2026-02-21T08:37:43.8071344Z max=0.1632
2026-02-21T08:37:43.8071575Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:37:43.8072368Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:37:43.8072737Z  'l2_groupings': [2],
2026-02-21T08:37:43.8073015Z  'load_eviction_policies': ['', ''],
2026-02-21T08:37:43.8073357Z  'loop_orders': [[1, 0]],
2026-02-21T08:37:43.8073546Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:37:43.8073654Z  'num_sm_multiplier': 4,
2026-02-21T08:37:43.8073756Z  'num_stages': 1,
2026-02-21T08:37:43.8073840Z  'num_warps': 4,
2026-02-21T08:37:43.8073939Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:37:43.8074057Z  'range_flattens': [None, None],
2026-02-21T08:37:43.8074170Z  'range_multi_buffers': [False, False],
2026-02-21T08:37:43.8074283Z  'range_num_stages': [2, 2],
2026-02-21T08:37:43.8074387Z  'range_unroll_factors': [0, 0],
2026-02-21T08:37:43.8074496Z  'range_warp_specializes': [],
2026-02-21T08:37:43.8074602Z  'waves_per_eu': 2}
2026-02-21T08:37:43.8701979Z [469s] Fitting surrogate: 991 points, 991 targets
2026-02-21T08:37:44.2775942Z [470s] Generation 13 starting: 32 neighbors, 2 active search path(s)
2026-02-21T08:37:50.1533920Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 4.6 configs/s
2026-02-21T08:37:52.3210443Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 32/32 15.8 configs/s
2026-02-21T08:37:54.7528114Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 396.6 configs/s
2026-02-21T08:37:55.1282123Z [480s] Generation 13 complete: 
2026-02-21T08:37:55.1284767Z ok=35
2026-02-21T08:37:55.1284990Z min=0.0387
2026-02-21T08:37:55.1285123Z mid=0.0398
2026-02-21T08:37:55.1285223Z max=0.1613
2026-02-21T08:37:55.1285334Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:37:55.1285555Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:37:55.1285733Z  'l2_groupings': [2],
2026-02-21T08:37:55.1285859Z  'load_eviction_policies': ['', ''],
2026-02-21T08:37:55.1286004Z  'loop_orders': [[1, 0]],
2026-02-21T08:37:55.1286142Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:37:55.1286312Z  'num_sm_multiplier': 4,
2026-02-21T08:37:55.1286437Z  'num_stages': 1,
2026-02-21T08:37:55.1287009Z  'num_warps': 4,
2026-02-21T08:37:55.1287126Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:37:55.1287281Z  'range_flattens': [None, None],
2026-02-21T08:37:55.1287422Z  'range_multi_buffers': [False, False],
2026-02-21T08:37:55.1287563Z  'range_num_stages': [2, 2],
2026-02-21T08:37:55.1287689Z  'range_unroll_factors': [0, 0],
2026-02-21T08:37:55.1287827Z  'range_warp_specializes': [],
2026-02-21T08:37:55.1287952Z  'waves_per_eu': 2}
2026-02-21T08:37:55.1833820Z [481s] Fitting surrogate: 1026 points, 1026 targets
2026-02-21T08:37:55.4817958Z [481s] Generation 14 starting: 19 neighbors, 1 active search path(s)
2026-02-21T08:38:06.0584988Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 1.1 configs/s
2026-02-21T08:38:07.3963835Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 15.9 configs/s
2026-02-21T08:38:08.6461225Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 741.7 configs/s
2026-02-21T08:38:08.9634774Z [494s] Generation 14 complete: 
2026-02-21T08:38:08.9634998Z ok=21
2026-02-21T08:38:08.9635091Z min=0.0388
2026-02-21T08:38:08.9635227Z mid=0.0470
2026-02-21T08:38:08.9635308Z max=0.2357
2026-02-21T08:38:08.9635410Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:38:08.9635562Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:38:08.9635706Z  'l2_groupings': [2],
2026-02-21T08:38:08.9635817Z  'load_eviction_policies': ['', ''],
2026-02-21T08:38:08.9635939Z  'loop_orders': [[1, 0]],
2026-02-21T08:38:08.9636050Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:38:08.9636157Z  'num_sm_multiplier': 4,
2026-02-21T08:38:08.9636261Z  'num_stages': 1,
2026-02-21T08:38:08.9636351Z  'num_warps': 4,
2026-02-21T08:38:08.9636457Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:38:08.9637230Z  'range_flattens': [None, None],
2026-02-21T08:38:08.9637350Z  'range_multi_buffers': [False, False],
2026-02-21T08:38:08.9637475Z  'range_num_stages': [2, 2],
2026-02-21T08:38:08.9637603Z  'range_unroll_factors': [0, 0],
2026-02-21T08:38:08.9637722Z  'range_warp_specializes': [],
2026-02-21T08:38:08.9637830Z  'waves_per_eu': 2}
2026-02-21T08:38:09.0127488Z [494s] Fitting surrogate: 1047 points, 1047 targets
2026-02-21T08:38:09.4013481Z [495s] Generation 15 starting: 19 neighbors, 1 active search path(s)
2026-02-21T08:38:13.4129516Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 5.9 configs/s
2026-02-21T08:38:14.6929854Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 15.8 configs/s
2026-02-21T08:38:17.0435044Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 635.0 configs/s
2026-02-21T08:38:17.3378163Z [503s] Generation 15 complete: 
2026-02-21T08:38:17.3378330Z ok=21
2026-02-21T08:38:17.3378619Z min=0.0383
2026-02-21T08:38:17.3378779Z mid=0.0418
2026-02-21T08:38:17.3378897Z max=0.2641
2026-02-21T08:38:17.3379057Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:38:17.3379281Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:38:17.3379485Z  'l2_groupings': [2],
2026-02-21T08:38:17.3379638Z  'load_eviction_policies': ['', ''],
2026-02-21T08:38:17.3379832Z  'loop_orders': [[1, 0]],
2026-02-21T08:38:17.3379977Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:38:17.3380137Z  'num_sm_multiplier': 4,
2026-02-21T08:38:17.3380279Z  'num_stages': 1,
2026-02-21T08:38:17.3380409Z  'num_warps': 4,
2026-02-21T08:38:17.3380541Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:38:17.3380729Z  'range_flattens': [None, None],
2026-02-21T08:38:17.3380895Z  'range_multi_buffers': [False, False],
2026-02-21T08:38:17.3383324Z  'range_num_stages': [2, 2],
2026-02-21T08:38:17.3383509Z  'range_unroll_factors': [0, 0],
2026-02-21T08:38:17.3383641Z  'range_warp_specializes': [],
2026-02-21T08:38:17.3383809Z  'waves_per_eu': 2}
2026-02-21T08:38:17.3724796Z [503s] Fitting surrogate: 1068 points, 1068 targets
2026-02-21T08:38:17.6168749Z [503s] Generation 16 starting: 15 neighbors, 1 active search path(s)
2026-02-21T08:38:24.3592974Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 0.9 configs/s
2026-02-21T08:38:25.4313585Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 16.1 configs/s
2026-02-21T08:38:26.5013174Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 871.4 configs/s
2026-02-21T08:38:26.7635153Z [512s] Generation 16 complete: 
2026-02-21T08:38:26.7635499Z ok=17
2026-02-21T08:38:26.7635714Z min=0.0384
2026-02-21T08:38:26.7635926Z mid=0.0419
2026-02-21T08:38:26.7636126Z max=0.2729
2026-02-21T08:38:26.7636353Z best={'block_sizes': [256, 4, 8],
2026-02-21T08:38:26.7636725Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T08:38:26.7637132Z  'l2_groupings': [2],
2026-02-21T08:38:26.7637411Z  'load_eviction_policies': ['', ''],
2026-02-21T08:38:26.7637754Z  'loop_orders': [[1, 0]],
2026-02-21T08:38:26.7638030Z  'matrix_instr_nonkdim': 32,
2026-02-21T08:38:26.7638315Z  'num_sm_multiplier': 4,
2026-02-21T08:38:26.7638572Z  'num_stages': 1,
2026-02-21T08:38:26.7638801Z  'num_warps': 4,
2026-02-21T08:38:26.7638961Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:38:26.7639082Z  'range_flattens': [None, None],
2026-02-21T08:38:26.7639193Z  'range_multi_buffers': [False, False],
2026-02-21T08:38:26.7639309Z  'range_num_stages': [2, 2],
2026-02-21T08:38:26.7639411Z  'range_unroll_factors': [0, 0],
2026-02-21T08:38:26.7639528Z  'range_warp_specializes': [],
2026-02-21T08:38:26.7639632Z  'waves_per_eu': 2}
2026-02-21T08:38:26.7848405Z [512s] Fitting surrogate: 1085 points, 1085 targets
2026-02-21T08:38:26.9225837Z [512s] Autotuning complete in 512.8s after searching 1017 configs.
2026-02-21T08:38:26.9226714Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:38:26.9227793Z     @helion.kernel(config=helion.Config(block_sizes=[256, 4, 8], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[2, 2], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T08:38:26.9228821Z 
2026-02-21T08:38:26.9229055Z [512s] Code of selected kernel: /tmp/torchinductor_root/vb/cvblzbuik7b7vaeoggd3uv3xxkiue2vr65rtb75vreoiajhkb5wg.py
2026-02-21T08:38:28.0290635Z WARNING:tritonbench.utils.triton_op:Completed input ID 7:
2026-02-21T08:38:28.0290948Z x_val
2026-02-21T08:38:28.0291087Z ------------------
2026-02-21T08:38:28.0291233Z (4, 1, 8192, 3584)
2026-02-21T08:38:28.0291325Z 
2026-02-21T08:38:28.0304746Z  30%|███       | 3/10 [29:07<1:08:13, 584.75s/it]WARNING:tritonbench.utils.triton_op:Running input ID 10:
2026-02-21T08:38:28.0305061Z x_val
2026-02-21T08:38:28.0305614Z -------------------
2026-02-21T08:38:28.0305747Z (16, 1, 7168, 8192)
2026-02-21T08:38:28.0307118Z INFO:tritonbench.utils.triton_op:Took 0.15ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T08:38:29.0834151Z INFO:tritonbench.utils.triton_op:Took 2.80ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T08:38:30.1428810Z Autotune Choices Stats:
2026-02-21T08:38:30.1430069Z {"num_choices": 28, "num_triton_choices": 27, "best_kernel": "mm", "best_time": 0.03171899914741516, "best_triton_pos": 1, "best_triton_time": 0.053279001265764236, "best_triton_kernel": "triton_mm_38", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4"}
2026-02-21T08:38:30.1457334Z AUTOTUNE mm(16x8192, 8192x7168)
2026-02-21T08:38:30.1457524Z strides: [8192, 1], [7168, 1]
2026-02-21T08:38:30.1457733Z dtypes: torch.bfloat16, torch.bfloat16
2026-02-21T08:38:30.1457931Z   mm 0.0317 ms 100.0% 
2026-02-21T08:38:30.1458530Z   triton_mm_38 0.0533 ms 59.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:38:30.1459554Z   triton_mm_30 0.0563 ms 56.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=2
2026-02-21T08:38:30.1460556Z   triton_mm_37 0.0615 ms 51.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:38:30.1461572Z   triton_mm_34 0.0619 ms 51.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=2
2026-02-21T08:38:30.1462588Z   triton_mm_45 0.0619 ms 51.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:38:30.1463623Z   triton_mm_31 0.0738 ms 43.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:38:30.1464635Z   triton_mm_44 0.0741 ms 42.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:38:30.1466027Z   triton_mm_36 0.0742 ms 42.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=16, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:38:30.1467067Z   triton_mm_51 0.0757 ms 41.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:38:30.1467825Z SingleProcess AUTOTUNE benchmarking takes 0.7212 seconds and 0.1618 seconds precompiling for 28 choices
2026-02-21T08:38:32.7799334Z INFO:tritonbench.utils.triton_op:Took 0.18ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T08:38:33.7838220Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:38:33.7838520Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:38:33.7838696Z               'dtype': 'torch.bfloat16',
2026-02-21T08:38:33.7838900Z               'shape': (16, 1, 8192),
2026-02-21T08:38:33.7839023Z               'stride': (8192, 8192, 1)},
2026-02-21T08:38:33.7839580Z             { 'device': 'cuda:0',
2026-02-21T08:38:33.7839687Z               'dtype': 'torch.int32',
2026-02-21T08:38:33.7839808Z               'shape': (8192, 7168),
2026-02-21T08:38:33.7839947Z               'stride': (7168, 1)}),
2026-02-21T08:38:33.7840057Z   'kwargs': {}}
2026-02-21T08:38:33.7882418Z INFO:tritonbench.utils.triton_op:Took 4.97ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T08:38:33.9913654Z [0s] Autotune random seed: 2134834638
2026-02-21T08:38:34.0836902Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:39:14.5341555Z [40s] Timeout after 30s compiling Config(block_sizes=[256, 2, 128], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[2, 1], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:39:14.5365521Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T08:39:15.2133791Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:39:15.2139116Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}>
2026-02-21T08:39:15.2142872Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T08:39:15.2143271Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}>
2026-02-21T08:39:15.2143623Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:39:15.2143922Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:39:15.2144171Z #smem = #ttg.shared_memory
2026-02-21T08:39:15.2144412Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:39:15.2144981Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:39:15.2145407Z     %cst = arith.constant dense<16> : tensor<16x1xi64, #mma>
2026-02-21T08:39:15.2145578Z     %cst_0 = arith.constant dense<0> : tensor<16x1xi64, #mma>
2026-02-21T08:39:15.2145756Z     %cst_1 = arith.constant dense<7168> : tensor<16x1xi64, #mma>
2026-02-21T08:39:15.2146537Z     %cst_2 = arith.constant dense<7168> : tensor<1x256xi64, #mma>
2026-02-21T08:39:15.2146713Z     %cst_3 = arith.constant dense<0> : tensor<1x256xi64, #mma>
2026-02-21T08:39:15.2146899Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:39:15.2147073Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:39:15.2147254Z     %cst_6 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T08:39:15.2147408Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:39:15.2147566Z     %cst_7 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T08:39:15.2147730Z     %c7168_i64 = arith.constant 7168 : i64
2026-02-21T08:39:15.2147855Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:39:15.2147979Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:39:15.2148100Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:39:15.2148219Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:39:15.2148338Z     %c28_i32 = arith.constant 28 : i32
2026-02-21T08:39:15.2148485Z     %cst_8 = arith.constant dense<0> : tensor<1x256xi8, #blocked2>
2026-02-21T08:39:15.2148804Z     %cst_9 = arith.constant dense<7168> : tensor<1x256xi64, #blocked2>
2026-02-21T08:39:15.2148991Z     %cst_10 = arith.constant dense<0> : tensor<1x256xi64, #blocked2>
2026-02-21T08:39:15.2149171Z     %cst_11 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked>
2026-02-21T08:39:15.2149322Z     %c112_i32 = arith.constant 112 : i32
2026-02-21T08:39:15.2149447Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:39:15.2149563Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:39:15.2149686Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:39:15.2149804Z     %c4864_i32 = arith.constant 4864 : i32
2026-02-21T08:39:15.2149993Z     %cst_12 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:15.2150191Z     %0 = tt.get_program_id x : i32
2026-02-21T08:39:15.2150395Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:39:15.2150673Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:39:15.2150945Z     %3 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:39:15.2151202Z     %4 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:39:15.2151406Z     %5 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T08:39:15.2151660Z     %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:39:15.2152001Z     %7 = arith.extsi %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:39:15.2152366Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:39:15.2152785Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:39:15.2153194Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:39:15.2153450Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:39:15.2153655Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T08:39:15.2153852Z     %13 = arith.cmpi eq, %10, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:39:15.2154048Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T08:39:15.2154261Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:39:15.2154716Z     %16 = arith.extsi %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:39:15.2155024Z     %17 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:39:15.2155329Z     %18 = arith.extsi %17 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:39:15.2155589Z     scf.for %arg3 = %0 to %c28_i32 step %c4864_i32  : i32 {
2026-02-21T08:39:15.2155745Z       %19 = arith.divsi %arg3, %c112_i32 : i32
2026-02-21T08:39:15.2155868Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T08:39:15.2155993Z       %21 = arith.subi %c1_i32, %20 : i32
2026-02-21T08:39:15.2156111Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T08:39:15.2156237Z       %23 = arith.remsi %arg3, %c112_i32 : i32
2026-02-21T08:39:15.2156357Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T08:39:15.2156475Z       %25 = arith.addi %20, %24 : i32
2026-02-21T08:39:15.2156590Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T08:39:15.2156712Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T08:39:15.2156931Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:39:15.2157157Z       %29 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:39:15.2157334Z       %30 = arith.muli %26, %c256_i32 : i32
2026-02-21T08:39:15.2157559Z       %31 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:39:15.2157821Z       %32 = arith.muli %31, %cst_6 : tensor<16x1xi32, #blocked1>
2026-02-21T08:39:15.2158021Z       %33 = tt.broadcast %32 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:39:15.2158198Z       %34 = arith.extsi %30 : i32 to i64
2026-02-21T08:39:15.2158374Z       %35 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:39:15.2158647Z       %36 = arith.addi %35, %7 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:39:15.2158935Z       %37 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi64, #blocked2>
2026-02-21T08:39:15.2159203Z       %38 = arith.cmpi sge, %37, %cst_10 : tensor<1x256xi64, #blocked2>
2026-02-21T08:39:15.2159383Z       %39 = arith.cmpi slt, %37, %cst_9 : tensor<1x256xi64, #blocked2>
2026-02-21T08:39:15.2159553Z       %40 = arith.andi %38, %39 : tensor<1x256xi1, #blocked2>
2026-02-21T08:39:15.2159791Z       %41 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c2_i32 iter_args(%arg5 = %cst_7) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T08:39:15.2160015Z         %64 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:39:15.2160189Z         %65 = tt.splat %64 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:39:15.2160412Z         %66 = arith.addi %65, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:39:15.2160690Z         %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:39:15.2160969Z         %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:39:15.2161164Z         %69 = arith.addi %33, %68 : tensor<16x2xi32, #blocked1>
2026-02-21T08:39:15.2161378Z         %70 = tt.addptr %4, %69 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:39:15.2161588Z         %71 = tt.load %70 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:39:15.2161860Z         %72 = ttg.convert_layout %71 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:39:15.2162274Z         %73 = arith.extf %72 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:39:15.2162721Z         %74 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:39:15.2162853Z         %75 = arith.muli %74, %c7168_i64 : i64
2026-02-21T08:39:15.2162996Z         %76 = tt.splat %75 : i64 -> tensor<1x256xi64, #blocked2>
2026-02-21T08:39:15.2163158Z         %77 = arith.addi %76, %37 : tensor<1x256xi64, #blocked2>
2026-02-21T08:39:15.2163355Z         %78 = tt.addptr %5, %77 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi64, #blocked2>
2026-02-21T08:39:15.2163573Z         %79 = tt.load %78, %40, %cst_8 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T08:39:15.2163839Z         %80 = ttg.convert_layout %79 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:15.2164126Z         %81 = arith.shli %80, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:15.2164370Z         %82 = arith.shrsi %81, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:15.2164613Z         %83 = arith.shrsi %80, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:15.2164912Z         %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:39:15.2165296Z         %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:39:15.2165582Z         %86 = tt.broadcast %84 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:39:15.2165828Z         %87 = arith.select %12, %86, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:39:15.2166070Z         %88 = tt.broadcast %85 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:39:15.2166311Z         %89 = arith.select %14, %88, %87 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:39:15.2166545Z         %90 = tt.reshape %89 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T08:39:15.2166769Z         %91 = arith.sitofp %90 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T08:39:15.2167026Z         %92 = ttg.local_alloc %91 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T08:39:15.2167353Z         %93 = ttg.local_load %92 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:39:15.2167840Z         %94 = tt.dot %73, %93, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T08:39:15.2168215Z         %95 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T08:39:15.2168341Z         %96 = arith.muli %95, %c2_i32 : i32
2026-02-21T08:39:15.2168518Z         %97 = tt.splat %96 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:39:15.2168746Z         %98 = arith.addi %97, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:39:15.2169021Z         %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:39:15.2169313Z         %100 = tt.broadcast %99 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:39:15.2169511Z         %101 = arith.addi %33, %100 : tensor<16x2xi32, #blocked1>
2026-02-21T08:39:15.2169719Z         %102 = tt.addptr %4, %101 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:39:15.2169928Z         %103 = tt.load %102 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:39:15.2170202Z         %104 = ttg.convert_layout %103 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:39:15.2170646Z         %105 = arith.extf %104 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:39:15.2170930Z         %106 = arith.extsi %95 : i32 to i64
2026-02-21T08:39:15.2171062Z         %107 = arith.muli %106, %c7168_i64 : i64
2026-02-21T08:39:15.2171210Z         %108 = tt.splat %107 : i64 -> tensor<1x256xi64, #blocked2>
2026-02-21T08:39:15.2171376Z         %109 = arith.addi %108, %37 : tensor<1x256xi64, #blocked2>
2026-02-21T08:39:15.2171580Z         %110 = tt.addptr %5, %109 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi64, #blocked2>
2026-02-21T08:39:15.2171775Z         %111 = arith.cmpi slt, %106, %c4096_i64 : i64
2026-02-21T08:39:15.2171929Z         %112 = tt.splat %111 : i1 -> tensor<1x256xi1, #blocked2>
2026-02-21T08:39:15.2172088Z         %113 = arith.andi %112, %40 : tensor<1x256xi1, #blocked2>
2026-02-21T08:39:15.2172264Z         %114 = tt.load %110, %113, %cst_8 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T08:39:15.2172534Z         %115 = ttg.convert_layout %114 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:15.2172824Z         %116 = arith.shli %115, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:15.2173128Z         %117 = arith.shrsi %116, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:15.2173369Z         %118 = arith.shrsi %115, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:15.2173668Z         %119 = tt.expand_dims %117 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:39:15.2174016Z         %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:39:15.2174310Z         %121 = tt.broadcast %119 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:39:15.2174568Z         %122 = arith.select %12, %121, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:39:15.2174821Z         %123 = tt.broadcast %120 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:39:15.2175067Z         %124 = arith.select %14, %123, %122 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:39:15.2175305Z         %125 = tt.reshape %124 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T08:39:15.2175537Z         %126 = arith.sitofp %125 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T08:39:15.2175800Z         %127 = ttg.local_alloc %126 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T08:39:15.2176132Z         %128 = ttg.local_load %127 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:39:15.2176612Z         %129 = tt.dot %105, %128, %94, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T08:39:15.2176974Z         scf.yield %129 : tensor<16x256xf32, #mma>
2026-02-21T08:39:15.2177132Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:39:15.2177328Z       %42 = arith.truncf %41 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T08:39:15.2177503Z       %43 = arith.extsi %27 : i32 to i64
2026-02-21T08:39:15.2177670Z       %44 = tt.splat %43 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:39:15.2177882Z       %45 = arith.addi %44, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:39:15.2178143Z       %46 = tt.expand_dims %45 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma>
2026-02-21T08:39:15.2178386Z       %47 = arith.muli %46, %cst_1 : tensor<16x1xi64, #mma>
2026-02-21T08:39:15.2178597Z       %48 = tt.broadcast %47 : tensor<16x1xi64, #mma> -> tensor<16x256xi64, #mma>
2026-02-21T08:39:15.2178805Z       %49 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:39:15.2179019Z       %50 = arith.addi %49, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:39:15.2179286Z       %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T08:39:15.2179550Z       %52 = tt.broadcast %51 : tensor<1x256xi64, #mma> -> tensor<16x256xi64, #mma>
2026-02-21T08:39:15.2179729Z       %53 = arith.addi %48, %52 : tensor<16x256xi64, #mma>
2026-02-21T08:39:15.2179920Z       %54 = tt.addptr %15, %53 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi64, #mma>
2026-02-21T08:39:15.2180119Z       %55 = arith.cmpi sge, %46, %cst_0 : tensor<16x1xi64, #mma>
2026-02-21T08:39:15.2180285Z       %56 = arith.cmpi slt, %46, %cst : tensor<16x1xi64, #mma>
2026-02-21T08:39:15.2180441Z       %57 = arith.andi %55, %56 : tensor<16x1xi1, #mma>
2026-02-21T08:39:15.2180609Z       %58 = tt.broadcast %57 : tensor<16x1xi1, #mma> -> tensor<16x256xi1, #mma>
2026-02-21T08:39:15.2180795Z       %59 = arith.cmpi sge, %51, %cst_3 : tensor<1x256xi64, #mma>
2026-02-21T08:39:15.2180990Z       %60 = arith.cmpi slt, %51, %cst_2 : tensor<1x256xi64, #mma>
2026-02-21T08:39:15.2181151Z       %61 = arith.andi %59, %60 : tensor<1x256xi1, #mma>
2026-02-21T08:39:15.2181320Z       %62 = tt.broadcast %61 : tensor<1x256xi1, #mma> -> tensor<16x256xi1, #mma>
2026-02-21T08:39:15.2181495Z       %63 = arith.andi %58, %62 : tensor<16x256xi1, #mma>
2026-02-21T08:39:15.2181655Z       tt.store %54, %42, %63 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:39:15.2181864Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T08:39:15.2182055Z     tt.return
2026-02-21T08:39:15.2182138Z   }
2026-02-21T08:39:15.2182225Z }
2026-02-21T08:39:15.2182271Z 
2026-02-21T08:39:15.2182304Z {-#
2026-02-21T08:39:15.2182392Z   external_resources: {
2026-02-21T08:39:15.2182502Z     mlir_reproducer: {
2026-02-21T08:39:15.2183518Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:39:15.2184539Z       disable_threading: false,
2026-02-21T08:39:15.2184652Z       verify_each: true
2026-02-21T08:39:15.2184744Z     }
2026-02-21T08:39:15.2184822Z   }
2026-02-21T08:39:15.2184895Z #-}
2026-02-21T08:39:15.2185188Z /tmp/torchinductor_root/t3/ct32ivh7iogp24wx3jtpjtpf3jmhln2btps5madsnxbgqvcxyyhy.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:39:15.2185942Z /tmp/torchinductor_root/t3/ct32ivh7iogp24wx3jtpjtpf3jmhln2btps5madsnxbgqvcxyyhy.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:39:15.2186512Z [41s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:39:15.2187309Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[2, 4], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T08:39:15.2188063Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:39:15.2188234Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:39:18.1613090Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:39:18.1616000Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}>
2026-02-21T08:39:18.1616822Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T08:39:18.1617533Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}>
2026-02-21T08:39:18.1618156Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:39:18.1618788Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:39:18.1619190Z #smem = #ttg.shared_memory
2026-02-21T08:39:18.1620414Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:39:18.1621466Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:39:18.1622358Z     %cst = arith.constant dense<16> : tensor<16x1xi64, #mma>
2026-02-21T08:39:18.1622794Z     %cst_0 = arith.constant dense<0> : tensor<16x1xi64, #mma>
2026-02-21T08:39:18.1623163Z     %cst_1 = arith.constant dense<7168> : tensor<16x1xi64, #mma>
2026-02-21T08:39:18.1623532Z     %cst_2 = arith.constant dense<7168> : tensor<1x512xi64, #mma>
2026-02-21T08:39:18.1623900Z     %cst_3 = arith.constant dense<0> : tensor<1x512xi64, #mma>
2026-02-21T08:39:18.1624276Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:39:18.1624665Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:39:18.1625057Z     %cst_6 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T08:39:18.1625353Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:39:18.1625603Z     %cst_7 = arith.constant dense<0.000000e+00> : tensor<16x512xf32, #mma>
2026-02-21T08:39:18.1625870Z     %c7168_i64 = arith.constant 7168 : i64
2026-02-21T08:39:18.1626072Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:39:18.1626263Z     %c14_i32 = arith.constant 14 : i32
2026-02-21T08:39:18.1626499Z     %cst_8 = arith.constant dense<0> : tensor<1x512xi8, #blocked2>
2026-02-21T08:39:18.1626797Z     %cst_9 = arith.constant dense<7168> : tensor<1x512xi64, #blocked2>
2026-02-21T08:39:18.1627092Z     %cst_10 = arith.constant dense<0> : tensor<1x512xi64, #blocked2>
2026-02-21T08:39:18.1627386Z     %cst_11 = arith.constant dense<0> : tensor<1x2x512xi8, #blocked>
2026-02-21T08:39:18.1627622Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:39:18.1627822Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:39:18.1628000Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:39:18.1628288Z     %cst_12 = arith.constant dense<4> : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:18.1628606Z     %0 = tt.get_program_id x : i32
2026-02-21T08:39:18.1628791Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T08:39:18.1628975Z     %2 = arith.minsi %1, %c14_i32 : i32
2026-02-21T08:39:18.1629293Z     %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:39:18.1629738Z     %4 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:39:18.1630181Z     %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:39:18.1630895Z     %6 = tt.expand_dims %5 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:39:18.1631312Z     %7 = arith.muli %6, %cst_6 : tensor<16x1xi32, #blocked1>
2026-02-21T08:39:18.1631619Z     %8 = tt.broadcast %7 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:39:18.1631969Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:39:18.1632285Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T08:39:18.1632690Z     %11 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:39:18.1633240Z     %12 = arith.extsi %11 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:39:18.1633824Z     %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:39:18.1634501Z     %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:39:18.1645419Z     %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:39:18.1645670Z     %16 = arith.cmpi eq, %15, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:39:18.1645872Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked>
2026-02-21T08:39:18.1646072Z     %18 = arith.cmpi eq, %15, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:39:18.1646265Z     %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked>
2026-02-21T08:39:18.1646477Z     %20 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x512x!tt.ptr<bf16>, #mma>
2026-02-21T08:39:18.1646752Z     %21 = arith.extsi %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:39:18.1647079Z     %22 = tt.expand_dims %21 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma>
2026-02-21T08:39:18.1647311Z     %23 = arith.muli %22, %cst_1 : tensor<16x1xi64, #mma>
2026-02-21T08:39:18.1647488Z     %24 = tt.broadcast %23 : tensor<16x1xi64, #mma> -> tensor<16x512xi64, #mma>
2026-02-21T08:39:18.1647728Z     %25 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:39:18.1648032Z     %26 = arith.extsi %25 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:39:18.1648288Z     %27 = arith.cmpi sge, %22, %cst_0 : tensor<16x1xi64, #mma>
2026-02-21T08:39:18.1648447Z     %28 = arith.cmpi slt, %22, %cst : tensor<16x1xi64, #mma>
2026-02-21T08:39:18.1648603Z     %29 = arith.andi %27, %28 : tensor<16x1xi1, #mma>
2026-02-21T08:39:18.1648771Z     %30 = tt.broadcast %29 : tensor<16x1xi1, #mma> -> tensor<16x512xi1, #mma>
2026-02-21T08:39:18.1648952Z     scf.for %arg3 = %0 to %2 step %c1_i32  : i32 {
2026-02-21T08:39:18.1649091Z       %31 = arith.muli %arg3, %c512_i32 : i32
2026-02-21T08:39:18.1649217Z       %32 = arith.extsi %31 : i32 to i64
2026-02-21T08:39:18.1649391Z       %33 = tt.splat %32 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:39:18.1649614Z       %34 = arith.addi %33, %12 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:39:18.1649898Z       %35 = tt.expand_dims %34 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x512xi64, #blocked2>
2026-02-21T08:39:18.1650163Z       %36 = arith.cmpi sge, %35, %cst_10 : tensor<1x512xi64, #blocked2>
2026-02-21T08:39:18.1650339Z       %37 = arith.cmpi slt, %35, %cst_9 : tensor<1x512xi64, #blocked2>
2026-02-21T08:39:18.1650562Z       %38 = arith.andi %36, %37 : tensor<1x512xi1, #blocked2>
2026-02-21T08:39:18.1650798Z       %39 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c1_i32 iter_args(%arg5 = %cst_7) -> (tensor<16x512xf32, #mma>)  : i32 {
2026-02-21T08:39:18.1651031Z         %52 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:39:18.1651198Z         %53 = tt.splat %52 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:39:18.1651425Z         %54 = arith.addi %53, %4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:39:18.1651706Z         %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:39:18.1651982Z         %56 = tt.broadcast %55 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:39:18.1652178Z         %57 = arith.addi %8, %56 : tensor<16x2xi32, #blocked1>
2026-02-21T08:39:18.1652376Z         %58 = tt.addptr %9, %57 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:39:18.1652591Z         %59 = tt.load %58 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:39:18.1652865Z         %60 = ttg.convert_layout %59 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:39:18.1653304Z         %61 = arith.extf %60 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:39:18.1653591Z         %62 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:39:18.1653716Z         %63 = arith.muli %62, %c7168_i64 : i64
2026-02-21T08:39:18.1653861Z         %64 = tt.splat %63 : i64 -> tensor<1x512xi64, #blocked2>
2026-02-21T08:39:18.1654024Z         %65 = arith.addi %64, %35 : tensor<1x512xi64, #blocked2>
2026-02-21T08:39:18.1654224Z         %66 = tt.addptr %10, %65 : tensor<1x512x!tt.ptr<i8>, #blocked2>, tensor<1x512xi64, #blocked2>
2026-02-21T08:39:18.1654444Z         %67 = tt.load %66, %38, %cst_8 : tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T08:39:18.1654704Z         %68 = ttg.convert_layout %67 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:18.1654994Z         %69 = arith.shli %68, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:18.1655236Z         %70 = arith.shrsi %69, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:18.1655499Z         %71 = arith.shrsi %68, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:39:18.1655797Z         %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T08:39:18.1656155Z         %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T08:39:18.1656447Z         %74 = tt.broadcast %72 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T08:39:18.1656702Z         %75 = arith.select %17, %74, %cst_11 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T08:39:18.1656949Z         %76 = tt.broadcast %73 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T08:39:18.1657191Z         %77 = arith.select %19, %76, %75 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T08:39:18.1657425Z         %78 = tt.reshape %77 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked2>
2026-02-21T08:39:18.1657648Z         %79 = arith.sitofp %78 : tensor<2x512xi8, #blocked2> to tensor<2x512xf32, #blocked2>
2026-02-21T08:39:18.1657902Z         %80 = ttg.local_alloc %79 : (tensor<2x512xf32, #blocked2>) -> !ttg.memdesc<2x512xf32, #shared, #smem>
2026-02-21T08:39:18.1658228Z         %81 = ttg.local_load %80 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:39:18.1658751Z         %82 = tt.dot %61, %81, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x512xf32, #mma>
2026-02-21T08:39:18.1659111Z         scf.yield %82 : tensor<16x512xf32, #mma>
2026-02-21T08:39:18.1659280Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T08:39:18.1659570Z       %40 = arith.truncf %39 : tensor<16x512xf32, #mma> to tensor<16x512xbf16, #mma>
2026-02-21T08:39:18.1659825Z       %41 = tt.splat %32 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:39:18.1660069Z       %42 = arith.addi %41, %26 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:39:18.1660412Z       %43 = tt.expand_dims %42 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T08:39:18.1660680Z       %44 = tt.broadcast %43 : tensor<1x512xi64, #mma> -> tensor<16x512xi64, #mma>
2026-02-21T08:39:18.1660869Z       %45 = arith.addi %24, %44 : tensor<16x512xi64, #mma>
2026-02-21T08:39:18.1661055Z       %46 = tt.addptr %20, %45 : tensor<16x512x!tt.ptr<bf16>, #mma>, tensor<16x512xi64, #mma>
2026-02-21T08:39:18.1661308Z       %47 = arith.cmpi sge, %43, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T08:39:18.1661472Z       %48 = arith.cmpi slt, %43, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T08:39:18.1661635Z       %49 = arith.andi %47, %48 : tensor<1x512xi1, #mma>
2026-02-21T08:39:18.1661808Z       %50 = tt.broadcast %49 : tensor<1x512xi1, #mma> -> tensor<16x512xi1, #mma>
2026-02-21T08:39:18.1661986Z       %51 = arith.andi %30, %50 : tensor<16x512xi1, #mma>
2026-02-21T08:39:18.1662146Z       tt.store %46, %40, %51 : tensor<16x512x!tt.ptr<bf16>, #mma>
2026-02-21T08:39:18.1662352Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32}
2026-02-21T08:39:18.1662545Z     tt.return
2026-02-21T08:39:18.1662633Z   }
2026-02-21T08:39:18.1662721Z }
2026-02-21T08:39:18.1662767Z 
2026-02-21T08:39:18.1662802Z {-#
2026-02-21T08:39:18.1662891Z   external_resources: {
2026-02-21T08:39:18.1663000Z     mlir_reproducer: {
2026-02-21T08:39:18.1664011Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:39:18.1665026Z       disable_threading: false,
2026-02-21T08:39:18.1665140Z       verify_each: true
2026-02-21T08:39:18.1665236Z     }
2026-02-21T08:39:18.1665316Z   }
2026-02-21T08:39:18.1665389Z #-}
2026-02-21T08:39:18.1665677Z /tmp/torchinductor_root/43/c43dkbophld72d3ragneufisw3cdtnrk2n357ly3meonnfakpgjg.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:39:18.1666433Z /tmp/torchinductor_root/43/c43dkbophld72d3ragneufisw3cdtnrk2n357ly3meonnfakpgjg.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:39:18.1666992Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:39:18.1667844Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:39:18.1668557Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:39:18.1668729Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:39:21.5743264Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.3 configs/s
2026-02-21T08:39:21.5755537Z [47s] Adaptive compile timeout: 30s (90% percentile=12.6s, bounds=[30.0s, 30s])
2026-02-21T08:39:22.1725236Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 919.9 configs/s
2026-02-21T08:39:22.7944532Z [48s] Initial random population of 100, 5 starting points: 
2026-02-21T08:39:22.7944762Z error=4
2026-02-21T08:39:22.7944851Z timeout=1
2026-02-21T08:39:22.7944940Z ok=95
2026-02-21T08:39:22.7945027Z min=0.1533
2026-02-21T08:39:22.7945110Z mid=1.2324
2026-02-21T08:39:22.7945198Z max=16.4603
2026-02-21T08:39:22.7945293Z best={'block_sizes': [64, 16, 32],
2026-02-21T08:39:22.7945467Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T08:39:22.7945606Z  'l2_groupings': [32],
2026-02-21T08:39:22.7945715Z  'load_eviction_policies': ['', ''],
2026-02-21T08:39:22.7946171Z  'loop_orders': [[0, 1]],
2026-02-21T08:39:22.7946286Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:39:22.7946397Z  'num_sm_multiplier': 1,
2026-02-21T08:39:22.7946509Z  'num_stages': 4,
2026-02-21T08:39:22.7946602Z  'num_warps': 8,
2026-02-21T08:39:22.7946707Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:39:22.7946917Z  'range_flattens': [None, True],
2026-02-21T08:39:22.7947035Z  'range_multi_buffers': [True, True],
2026-02-21T08:39:22.7947157Z  'range_num_stages': [0, 1],
2026-02-21T08:39:22.7947266Z  'range_unroll_factors': [2, 2],
2026-02-21T08:39:22.7947386Z  'range_warp_specializes': [],
2026-02-21T08:39:22.7947491Z  'waves_per_eu': 3}
2026-02-21T08:39:22.7986893Z [48s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:39:24.0068564Z [49s] Generation 1 starting: 102 neighbors, 5 active search path(s)
2026-02-21T08:40:05.4727834Z [91s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, False], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:40:05.9819837Z [91s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:40:06.6799846Z [92s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:40:06.6816429Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 105/105 0.4 configs/s
2026-02-21T08:40:07.1877369Z /tmp/torchinductor_root/ui/cuidcmuoourevkgc4xs7zqtizjfdsxlfzqlgpapqy2ekhhslavhx.py:54:87: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T08:40:07.1879689Z             b_tile = tl.load(B + (indices_3[:, None] * 7168 + indices_2[None, :] * 1), None)
2026-02-21T08:40:07.1885487Z                                                                                       ^
2026-02-21T08:40:07.1887768Z /tmp/torchinductor_root/ui/cuidcmuoourevkgc4xs7zqtizjfdsxlfzqlgpapqy2ekhhslavhx.py:67:45: note: - use: %348 = "ttg.convert_layout"(<<UNKNOWN SSA VALUE>>) : (tensor<256x16xi8, #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}>>) -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}>}>>
2026-02-21T08:40:07.1888997Z 
2026-02-21T08:40:07.1889100Z             expanded_1 = tl.expand_dims(v_6, 1)
2026-02-21T08:40:07.1889246Z                                             ^
2026-02-21T08:40:07.1890228Z /tmp/torchinductor_root/ui/cuidcmuoourevkgc4xs7zqtizjfdsxlfzqlgpapqy2ekhhslavhx.py:66:45: note: - use: %349 = "ttg.convert_layout"(<<UNKNOWN SSA VALUE>>) : (tensor<256x16xi8, #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}>>) -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}>}>>
2026-02-21T08:40:07.1891216Z 
2026-02-21T08:40:07.1891288Z             expanded_0 = tl.expand_dims(v_4, 1)
2026-02-21T08:40:07.1892069Z                                             ^
2026-02-21T08:40:07.1892271Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T08:40:07.1924311Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T08:40:07.1924912Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T08:40:07.1925486Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:40:07.1926059Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:40:07.1926633Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [16], order = [0]}>
2026-02-21T08:40:07.1927093Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 8], order = [1, 0]}>
2026-02-21T08:40:07.1927455Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 16], order = [0, 1]}>
2026-02-21T08:40:07.1927795Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T08:40:07.1928162Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [16, 1, 1], order = [0, 1, 2]}>
2026-02-21T08:40:07.1928569Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [16, 1], order = [0, 1]}>
2026-02-21T08:40:07.1928922Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [16, 1, 1], order = [0, 1, 2]}>
2026-02-21T08:40:07.1929292Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:40:07.1930116Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:40:07.1930714Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:40:07.1931178Z     %cst = arith.constant dense<7168> : tensor<1x16xi64, #blocked>
2026-02-21T08:40:07.1931393Z     %cst_0 = arith.constant dense<0> : tensor<1x16xi64, #blocked>
2026-02-21T08:40:07.1931594Z     %cst_1 = arith.constant dense<16> : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.1931800Z     %cst_2 = arith.constant dense<0> : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.1932000Z     %cst_3 = arith.constant dense<7168> : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.1932215Z     %cst_4 = arith.constant dense<0> : tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.1932434Z     %c14336_i32 = arith.constant 14336 : i32
2026-02-21T08:40:07.1932639Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:40:07.1932841Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:40:07.1932982Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:40:07.1949294Z     %c304_i32 = arith.constant 304 : i32
2026-02-21T08:40:07.1949601Z     %cst_5 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:40:07.1949848Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:40:07.1950088Z     %cst_7 = arith.constant dense<4> : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.1950429Z     %cst_8 = arith.constant dense<7168> : tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.1950712Z     %cst_9 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T08:40:07.1950918Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:40:07.1951190Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.1951418Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:40:07.1951560Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:40:07.1952100Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:40:07.1952242Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T08:40:07.1952388Z     %0 = tt.get_program_id x : i32
2026-02-21T08:40:07.1952589Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked4>
2026-02-21T08:40:07.1952844Z     %2 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked4>
2026-02-21T08:40:07.1953084Z     %3 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked4>
2026-02-21T08:40:07.1953335Z     %4 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x512x!tt.ptr<bf16>, #blocked5>
2026-02-21T08:40:07.1953595Z     %5 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<256x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:40:07.1953835Z     %6 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked4>
2026-02-21T08:40:07.1954143Z     %7 = ttg.convert_layout %6 : tensor<2xi32, #blocked4> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.1954673Z     %8 = tt.expand_dims %7 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi32, #blocked6>
2026-02-21T08:40:07.1955116Z     %9 = ttg.convert_layout %8 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #blocked7>
2026-02-21T08:40:07.1955554Z     %10 = ttg.convert_layout %9 : tensor<1x2xi32, #blocked7> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T08:40:07.1955976Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x2x1xi32, #blocked8>
2026-02-21T08:40:07.1956340Z     %12 = ttg.convert_layout %11 : tensor<1x2x1xi32, #blocked8> -> tensor<1x2x1xi32, #blocked3>
2026-02-21T08:40:07.1956590Z     %13 = arith.cmpi eq, %12, %cst_6 : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:40:07.1957135Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked3> -> tensor<256x2x16xi1, #blocked3>
2026-02-21T08:40:07.1957418Z     %15 = ttg.convert_layout %14 : tensor<256x2x16xi1, #blocked3> -> tensor<256x2x16xi1, #blocked2>
2026-02-21T08:40:07.1957680Z     %16 = arith.cmpi eq, %12, %cst_5 : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:40:07.1957917Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked3> -> tensor<256x2x16xi1, #blocked3>
2026-02-21T08:40:07.1958192Z     %18 = ttg.convert_layout %17 : tensor<256x2x16xi1, #blocked3> -> tensor<256x2x16xi1, #blocked2>
2026-02-21T08:40:07.1958476Z     %19 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x16x!tt.ptr<bf16>, #blocked>
2026-02-21T08:40:07.1958763Z     %20 = arith.extsi %1 : tensor<16xi32, #blocked4> to tensor<16xi64, #blocked4>
2026-02-21T08:40:07.1959041Z     %21 = arith.subi %c448_i32, %0 : i32
2026-02-21T08:40:07.1959207Z     %c1_i32_11 = arith.constant 1 : i32
2026-02-21T08:40:07.1959361Z     %22 = arith.subi %c304_i32, %c1_i32_11 : i32
2026-02-21T08:40:07.1959535Z     %23 = arith.addi %21, %22 : i32
2026-02-21T08:40:07.1959709Z     %24 = arith.divui %23, %c304_i32 : i32
2026-02-21T08:40:07.1959858Z     %c2_i32_12 = arith.constant 2 : i32
2026-02-21T08:40:07.1960003Z     %25 = arith.remsi %24, %c2_i32_12 : i32
2026-02-21T08:40:07.1960145Z     %26 = arith.subi %24, %25 : i32
2026-02-21T08:40:07.1960276Z     %27 = arith.muli %26, %c304_i32 : i32
2026-02-21T08:40:07.1960465Z     %28 = arith.addi %0, %27 : i32
2026-02-21T08:40:07.1960665Z     %29 = arith.muli %c304_i32, %c2_i32_12 : i32
2026-02-21T08:40:07.1960846Z     scf.for %arg3 = %0 to %28 step %29  : i32 {
2026-02-21T08:40:07.1961007Z       %30 = arith.divsi %arg3, %c14336_i32 : i32
2026-02-21T08:40:07.1961153Z       %31 = arith.muli %30, %c32_i32 : i32
2026-02-21T08:40:07.1961297Z       %32 = arith.subi %c1_i32, %31 : i32
2026-02-21T08:40:07.1961434Z       %33 = arith.minsi %32, %c32_i32 : i32
2026-02-21T08:40:07.1961583Z       %34 = arith.remsi %arg3, %c14336_i32 : i32
2026-02-21T08:40:07.1961726Z       %35 = arith.remsi %34, %33 : i32
2026-02-21T08:40:07.1961871Z       %36 = arith.addi %31, %35 : i32
2026-02-21T08:40:07.1962010Z       %37 = arith.divsi %34, %33 : i32
2026-02-21T08:40:07.1962295Z       %38 = arith.muli %36, %c16_i32 : i32
2026-02-21T08:40:07.1962462Z       %39 = tt.splat %38 : i32 -> tensor<16xi32, #blocked4>
2026-02-21T08:40:07.1962773Z       %40 = arith.addi %39, %1 : tensor<16xi32, #blocked4>
2026-02-21T08:40:07.1962939Z       %41 = arith.muli %37, %c16_i32 : i32
2026-02-21T08:40:07.1963095Z       %42 = tt.splat %41 : i32 -> tensor<16xi32, #blocked4>
2026-02-21T08:40:07.1963274Z       %43 = arith.addi %42, %1 : tensor<16xi32, #blocked4>
2026-02-21T08:40:07.1963544Z       %44 = ttg.convert_layout %40 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.1963951Z       %45 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi32, #blocked9>
2026-02-21T08:40:07.1964401Z       %46 = ttg.convert_layout %45 : tensor<16x1xi32, #blocked9> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:40:07.1964644Z       %47 = arith.muli %46, %cst_9 : tensor<16x1xi32, #blocked1>
2026-02-21T08:40:07.1964884Z       %48 = tt.broadcast %47 : tensor<16x1xi32, #blocked1> -> tensor<16x512xi32, #blocked1>
2026-02-21T08:40:07.1965245Z       %49 = ttg.convert_layout %48 : tensor<16x512xi32, #blocked1> -> tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.1965791Z       %50 = ttg.convert_layout %43 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.1966375Z       %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi32, #blocked6>
2026-02-21T08:40:07.1966836Z       %52 = ttg.convert_layout %51 : tensor<1x16xi32, #blocked6> -> tensor<1x16xi32, #blocked>
2026-02-21T08:40:07.1967234Z       %53 = tt.broadcast %52 : tensor<1x16xi32, #blocked> -> tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.1967462Z       %c512_i32 = arith.constant 512 : i32
2026-02-21T08:40:07.1967947Z       %54 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c512_i32 iter_args(%arg5 = %cst_10) -> (tensor<16x16xf32, #blocked>)  : i32 {
2026-02-21T08:40:07.1968377Z         %140 = tt.splat %arg4 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T08:40:07.1968682Z         %141 = arith.addi %140, %2 : tensor<256xi32, #blocked4>
2026-02-21T08:40:07.1968947Z         %142 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:40:07.1969212Z         %143 = tt.splat %142 : i32 -> tensor<512xi32, #blocked4>
2026-02-21T08:40:07.1969480Z         %144 = arith.addi %143, %3 : tensor<512xi32, #blocked4>
2026-02-21T08:40:07.1969808Z         %145 = ttg.convert_layout %144 : tensor<512xi32, #blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.1970209Z         %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6>
2026-02-21T08:40:07.1970573Z         %147 = ttg.convert_layout %146 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5>
2026-02-21T08:40:07.1970868Z         %148 = tt.broadcast %147 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.1971116Z         %149 = arith.addi %49, %148 : tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.1971377Z         %150 = tt.addptr %4, %149 : tensor<16x512x!tt.ptr<bf16>, #blocked5>, tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.1971635Z         %151 = tt.load %150 : tensor<16x512x!tt.ptr<bf16>, #blocked5>
2026-02-21T08:40:07.1971870Z         %152 = arith.extf %151 : tensor<16x512xbf16, #blocked5> to tensor<16x512xf32, #blocked5>
2026-02-21T08:40:07.1972209Z         %153 = ttg.convert_layout %141 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.1972622Z         %154 = tt.expand_dims %153 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9>
2026-02-21T08:40:07.1972973Z         %155 = ttg.convert_layout %154 : tensor<256x1xi32, #blocked9> -> tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.1973227Z         %156 = arith.muli %155, %cst_8 : tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.1973560Z         %157 = tt.broadcast %156 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1>
2026-02-21T08:40:07.1973847Z         %158 = ttg.convert_layout %157 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.1974092Z         %159 = arith.addi %158, %53 : tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.1974332Z         %160 = tt.addptr %5, %159 : tensor<256x16x!tt.ptr<i8>, #blocked>, tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.1974576Z         %161 = tt.load %160 : tensor<256x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:40:07.1974772Z         %162 = arith.shli %161, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.1974973Z         %163 = arith.shrsi %162, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.1975169Z         %164 = arith.shrsi %161, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.1975470Z         %165 = ttg.convert_layout %163 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.1975893Z         %166 = tt.expand_dims %165 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.1976269Z         %167 = ttg.convert_layout %166 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.1976631Z         %168 = ttg.convert_layout %164 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.1977045Z         %169 = tt.expand_dims %168 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.1977567Z         %170 = ttg.convert_layout %169 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.1977942Z         %171 = tt.broadcast %167 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.1978242Z         %172 = ttg.convert_layout %171 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.1978668Z         %173 = arith.select %15, %172, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.1979051Z         %174 = tt.broadcast %170 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.1979422Z         %175 = ttg.convert_layout %174 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.1979737Z         %176 = arith.select %18, %175, %173 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.1980031Z         %177 = tt.reshape %176 : tensor<256x2x16xi8, #blocked2> -> tensor<512x16xi8, #blocked>
2026-02-21T08:40:07.1980309Z         %178 = arith.sitofp %177 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked>
2026-02-21T08:40:07.1980668Z         %179 = ttg.convert_layout %152 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
2026-02-21T08:40:07.1981083Z         %180 = ttg.convert_layout %178 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>>
2026-02-21T08:40:07.1981435Z         %181 = ttg.convert_layout %arg5 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.1981912Z         %182 = tt.dot %179, %180, %181, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.1982366Z         %c1_i32_15 = arith.constant 1 : i32
2026-02-21T08:40:07.1982528Z         %183 = arith.muli %c256_i32, %c1_i32_15 : i32
2026-02-21T08:40:07.1982678Z         %184 = arith.addi %arg4, %183 : i32
2026-02-21T08:40:07.1982844Z         %185 = tt.splat %184 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T08:40:07.1983030Z         %186 = arith.addi %185, %2 : tensor<256xi32, #blocked4>
2026-02-21T08:40:07.1983202Z         %187 = arith.muli %184, %c2_i32 : i32
2026-02-21T08:40:07.1983434Z         %188 = tt.splat %187 : i32 -> tensor<512xi32, #blocked4>
2026-02-21T08:40:07.1983622Z         %189 = arith.addi %188, %3 : tensor<512xi32, #blocked4>
2026-02-21T08:40:07.1983909Z         %190 = ttg.convert_layout %189 : tensor<512xi32, #blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.1984303Z         %191 = tt.expand_dims %190 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6>
2026-02-21T08:40:07.1984657Z         %192 = ttg.convert_layout %191 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5>
2026-02-21T08:40:07.1984990Z         %193 = tt.broadcast %192 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.1985291Z         %194 = arith.addi %49, %193 : tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.1985548Z         %195 = tt.addptr %4, %194 : tensor<16x512x!tt.ptr<bf16>, #blocked5>, tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.1985798Z         %196 = tt.load %195 : tensor<16x512x!tt.ptr<bf16>, #blocked5>
2026-02-21T08:40:07.1986038Z         %197 = arith.extf %196 : tensor<16x512xbf16, #blocked5> to tensor<16x512xf32, #blocked5>
2026-02-21T08:40:07.1986373Z         %198 = ttg.convert_layout %186 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.1986776Z         %199 = tt.expand_dims %198 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9>
2026-02-21T08:40:07.1987136Z         %200 = ttg.convert_layout %199 : tensor<256x1xi32, #blocked9> -> tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.1987388Z         %201 = arith.muli %200, %cst_8 : tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.1987674Z         %202 = tt.broadcast %201 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1>
2026-02-21T08:40:07.1987958Z         %203 = ttg.convert_layout %202 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.1988208Z         %204 = arith.addi %203, %53 : tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.1988499Z         %205 = tt.addptr %5, %204 : tensor<256x16x!tt.ptr<i8>, #blocked>, tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.1988763Z         %206 = tt.load %205 : tensor<256x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:40:07.1988971Z         %207 = arith.shli %206, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.1989301Z         %208 = arith.shrsi %207, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.1989668Z         %209 = arith.shrsi %206, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.1990155Z         %210 = ttg.convert_layout %208 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.1990627Z         %211 = tt.expand_dims %210 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.1991043Z         %212 = ttg.convert_layout %211 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.1991447Z         %213 = ttg.convert_layout %209 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.1991873Z         %214 = tt.expand_dims %213 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.1992253Z         %215 = ttg.convert_layout %214 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.1992561Z         %216 = tt.broadcast %212 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.1992970Z         %217 = ttg.convert_layout %216 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.1993290Z         %218 = arith.select %15, %217, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.1993724Z         %219 = tt.broadcast %215 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.1994051Z         %220 = ttg.convert_layout %219 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.1994534Z         %221 = arith.select %18, %220, %218 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.1995010Z         %222 = tt.reshape %221 : tensor<256x2x16xi8, #blocked2> -> tensor<512x16xi8, #blocked>
2026-02-21T08:40:07.1995440Z         %223 = arith.sitofp %222 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked>
2026-02-21T08:40:07.1995911Z         %224 = ttg.convert_layout %197 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
2026-02-21T08:40:07.1996330Z         %225 = ttg.convert_layout %223 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>>
2026-02-21T08:40:07.1996921Z         %226 = ttg.convert_layout %182 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.1997582Z         %227 = tt.dot %224, %225, %226, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.1998254Z         scf.yield %227 : tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.1998471Z       } {tt.flatten}
2026-02-21T08:40:07.1998660Z       %55 = arith.truncf %54 : tensor<16x16xf32, #blocked> to tensor<16x16xbf16, #blocked>
2026-02-21T08:40:07.1998874Z       %56 = arith.extsi %38 : i32 to i64
2026-02-21T08:40:07.1999044Z       %57 = arith.extsi %41 : i32 to i64
2026-02-21T08:40:07.1999207Z       %58 = tt.splat %56 : i64 -> tensor<16xi64, #blocked4>
2026-02-21T08:40:07.1999394Z       %59 = arith.addi %58, %20 : tensor<16xi64, #blocked4>
2026-02-21T08:40:07.2035106Z       %60 = ttg.convert_layout %59 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.2035752Z       %61 = tt.expand_dims %60 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi64, #blocked9>
2026-02-21T08:40:07.2036097Z       %62 = ttg.convert_layout %61 : tensor<16x1xi64, #blocked9> -> tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2036337Z       %63 = arith.muli %62, %cst_3 : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2036568Z       %64 = tt.broadcast %63 : tensor<16x1xi64, #blocked1> -> tensor<16x16xi64, #blocked1>
2026-02-21T08:40:07.2036843Z       %65 = ttg.convert_layout %64 : tensor<16x16xi64, #blocked1> -> tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2037086Z       %66 = tt.splat %57 : i64 -> tensor<16xi64, #blocked4>
2026-02-21T08:40:07.2037322Z       %67 = arith.addi %66, %20 : tensor<16xi64, #blocked4>
2026-02-21T08:40:07.2037599Z       %68 = ttg.convert_layout %67 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.2037984Z       %69 = tt.expand_dims %68 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi64, #blocked6>
2026-02-21T08:40:07.2038316Z       %70 = ttg.convert_layout %69 : tensor<1x16xi64, #blocked6> -> tensor<1x16xi64, #blocked>
2026-02-21T08:40:07.2038587Z       %71 = tt.broadcast %70 : tensor<1x16xi64, #blocked> -> tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2038815Z       %72 = arith.addi %65, %71 : tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2039047Z       %73 = tt.addptr %19, %72 : tensor<16x16x!tt.ptr<bf16>, #blocked>, tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2039298Z       %74 = arith.cmpi sge, %62, %cst_2 : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2039498Z       %75 = arith.cmpi slt, %62, %cst_1 : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2039730Z       %76 = arith.andi %74, %75 : tensor<16x1xi1, #blocked1>
2026-02-21T08:40:07.2040075Z       %77 = tt.broadcast %76 : tensor<16x1xi1, #blocked1> -> tensor<16x16xi1, #blocked1>
2026-02-21T08:40:07.2040535Z       %78 = ttg.convert_layout %77 : tensor<16x16xi1, #blocked1> -> tensor<16x16xi1, #blocked>
2026-02-21T08:40:07.2040777Z       %79 = arith.cmpi sge, %70, %cst_0 : tensor<1x16xi64, #blocked>
2026-02-21T08:40:07.2040973Z       %80 = arith.cmpi slt, %70, %cst : tensor<1x16xi64, #blocked>
2026-02-21T08:40:07.2041236Z       %81 = arith.andi %79, %80 : tensor<1x16xi1, #blocked>
2026-02-21T08:40:07.2041509Z       %82 = tt.broadcast %81 : tensor<1x16xi1, #blocked> -> tensor<16x16xi1, #blocked>
2026-02-21T08:40:07.2041769Z       %83 = arith.andi %78, %82 : tensor<16x16xi1, #blocked>
2026-02-21T08:40:07.2042031Z       tt.store %73, %55, %83 : tensor<16x16x!tt.ptr<bf16>, #blocked>
2026-02-21T08:40:07.2042208Z       %c1_i32_13 = arith.constant 1 : i32
2026-02-21T08:40:07.2042367Z       %84 = arith.muli %c304_i32, %c1_i32_13 : i32
2026-02-21T08:40:07.2042550Z       %85 = arith.addi %arg3, %84 : i32
2026-02-21T08:40:07.2042825Z       %86 = arith.divsi %85, %c14336_i32 : i32
2026-02-21T08:40:07.2042994Z       %87 = arith.muli %86, %c32_i32 : i32
2026-02-21T08:40:07.2043180Z       %88 = arith.subi %c1_i32, %87 : i32
2026-02-21T08:40:07.2043323Z       %89 = arith.minsi %88, %c32_i32 : i32
2026-02-21T08:40:07.2043472Z       %90 = arith.remsi %85, %c14336_i32 : i32
2026-02-21T08:40:07.2043615Z       %91 = arith.remsi %90, %89 : i32
2026-02-21T08:40:07.2043749Z       %92 = arith.addi %87, %91 : i32
2026-02-21T08:40:07.2043888Z       %93 = arith.divsi %90, %89 : i32
2026-02-21T08:40:07.2044022Z       %94 = arith.muli %92, %c16_i32 : i32
2026-02-21T08:40:07.2044182Z       %95 = tt.splat %94 : i32 -> tensor<16xi32, #blocked4>
2026-02-21T08:40:07.2044358Z       %96 = arith.addi %95, %1 : tensor<16xi32, #blocked4>
2026-02-21T08:40:07.2044519Z       %97 = arith.muli %93, %c16_i32 : i32
2026-02-21T08:40:07.2044676Z       %98 = tt.splat %97 : i32 -> tensor<16xi32, #blocked4>
2026-02-21T08:40:07.2044913Z       %99 = arith.addi %98, %1 : tensor<16xi32, #blocked4>
2026-02-21T08:40:07.2045201Z       %100 = ttg.convert_layout %96 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.2045592Z       %101 = tt.expand_dims %100 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi32, #blocked9>
2026-02-21T08:40:07.2045941Z       %102 = ttg.convert_layout %101 : tensor<16x1xi32, #blocked9> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:40:07.2046186Z       %103 = arith.muli %102, %cst_9 : tensor<16x1xi32, #blocked1>
2026-02-21T08:40:07.2046424Z       %104 = tt.broadcast %103 : tensor<16x1xi32, #blocked1> -> tensor<16x512xi32, #blocked1>
2026-02-21T08:40:07.2046713Z       %105 = ttg.convert_layout %104 : tensor<16x512xi32, #blocked1> -> tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2047050Z       %106 = ttg.convert_layout %99 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.2047439Z       %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi32, #blocked6>
2026-02-21T08:40:07.2047788Z       %108 = ttg.convert_layout %107 : tensor<1x16xi32, #blocked6> -> tensor<1x16xi32, #blocked>
2026-02-21T08:40:07.2048061Z       %109 = tt.broadcast %108 : tensor<1x16xi32, #blocked> -> tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2048278Z       %c512_i32_14 = arith.constant 512 : i32
2026-02-21T08:40:07.2048546Z       %110 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c512_i32_14 iter_args(%arg5 = %cst_10) -> (tensor<16x16xf32, #blocked>)  : i32 {
2026-02-21T08:40:07.2048843Z         %140 = tt.splat %arg4 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T08:40:07.2049073Z         %141 = arith.addi %140, %2 : tensor<256xi32, #blocked4>
2026-02-21T08:40:07.2049263Z         %142 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:40:07.2049430Z         %143 = tt.splat %142 : i32 -> tensor<512xi32, #blocked4>
2026-02-21T08:40:07.2049616Z         %144 = arith.addi %143, %3 : tensor<512xi32, #blocked4>
2026-02-21T08:40:07.2049902Z         %145 = ttg.convert_layout %144 : tensor<512xi32, #blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.2050367Z         %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6>
2026-02-21T08:40:07.2050928Z         %147 = ttg.convert_layout %146 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5>
2026-02-21T08:40:07.2051276Z         %148 = tt.broadcast %147 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2051518Z         %149 = arith.addi %105, %148 : tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2051773Z         %150 = tt.addptr %4, %149 : tensor<16x512x!tt.ptr<bf16>, #blocked5>, tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2052162Z         %151 = tt.load %150 : tensor<16x512x!tt.ptr<bf16>, #blocked5>
2026-02-21T08:40:07.2052553Z         %152 = arith.extf %151 : tensor<16x512xbf16, #blocked5> to tensor<16x512xf32, #blocked5>
2026-02-21T08:40:07.2053042Z         %153 = ttg.convert_layout %141 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.2053450Z         %154 = tt.expand_dims %153 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9>
2026-02-21T08:40:07.2053888Z         %155 = ttg.convert_layout %154 : tensor<256x1xi32, #blocked9> -> tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.2054142Z         %156 = arith.muli %155, %cst_8 : tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.2054425Z         %157 = tt.broadcast %156 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1>
2026-02-21T08:40:07.2054917Z         %158 = ttg.convert_layout %157 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2055171Z         %159 = arith.addi %158, %109 : tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2055496Z         %160 = tt.addptr %5, %159 : tensor<256x16x!tt.ptr<i8>, #blocked>, tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2055744Z         %161 = tt.load %160 : tensor<256x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:40:07.2055997Z         %162 = arith.shli %161, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2056193Z         %163 = arith.shrsi %162, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2056390Z         %164 = arith.shrsi %161, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2056826Z         %165 = ttg.convert_layout %163 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.2057453Z         %166 = tt.expand_dims %165 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.2057832Z         %167 = ttg.convert_layout %166 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.2058325Z         %168 = ttg.convert_layout %164 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.2058964Z         %169 = tt.expand_dims %168 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.2059342Z         %170 = ttg.convert_layout %169 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.2059726Z         %171 = tt.broadcast %167 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.2060041Z         %172 = ttg.convert_layout %171 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2060399Z         %173 = arith.select %15, %172, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2060765Z         %174 = tt.broadcast %170 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.2061083Z         %175 = ttg.convert_layout %174 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2061537Z         %176 = arith.select %18, %175, %173 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2061835Z         %177 = tt.reshape %176 : tensor<256x2x16xi8, #blocked2> -> tensor<512x16xi8, #blocked>
2026-02-21T08:40:07.2062189Z         %178 = arith.sitofp %177 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked>
2026-02-21T08:40:07.2062540Z         %179 = ttg.convert_layout %152 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
2026-02-21T08:40:07.2062998Z         %180 = ttg.convert_layout %178 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>>
2026-02-21T08:40:07.2063385Z         %181 = ttg.convert_layout %arg5 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.2063889Z         %182 = tt.dot %179, %180, %181, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.2064316Z         %c1_i32_15 = arith.constant 1 : i32
2026-02-21T08:40:07.2064496Z         %183 = arith.muli %c256_i32, %c1_i32_15 : i32
2026-02-21T08:40:07.2064666Z         %184 = arith.addi %arg4, %183 : i32
2026-02-21T08:40:07.2064832Z         %185 = tt.splat %184 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T08:40:07.2065049Z         %186 = arith.addi %185, %2 : tensor<256xi32, #blocked4>
2026-02-21T08:40:07.2065240Z         %187 = arith.muli %184, %c2_i32 : i32
2026-02-21T08:40:07.2065404Z         %188 = tt.splat %187 : i32 -> tensor<512xi32, #blocked4>
2026-02-21T08:40:07.2065594Z         %189 = arith.addi %188, %3 : tensor<512xi32, #blocked4>
2026-02-21T08:40:07.2065961Z         %190 = ttg.convert_layout %189 : tensor<512xi32, #blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.2066446Z         %191 = tt.expand_dims %190 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6>
2026-02-21T08:40:07.2066866Z         %192 = ttg.convert_layout %191 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5>
2026-02-21T08:40:07.2067151Z         %193 = tt.broadcast %192 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2067463Z         %194 = arith.addi %105, %193 : tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2067867Z         %195 = tt.addptr %4, %194 : tensor<16x512x!tt.ptr<bf16>, #blocked5>, tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2068148Z         %196 = tt.load %195 : tensor<16x512x!tt.ptr<bf16>, #blocked5>
2026-02-21T08:40:07.2068386Z         %197 = arith.extf %196 : tensor<16x512xbf16, #blocked5> to tensor<16x512xf32, #blocked5>
2026-02-21T08:40:07.2068804Z         %198 = ttg.convert_layout %186 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.2069203Z         %199 = tt.expand_dims %198 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9>
2026-02-21T08:40:07.2069553Z         %200 = ttg.convert_layout %199 : tensor<256x1xi32, #blocked9> -> tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.2069804Z         %201 = arith.muli %200, %cst_8 : tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.2070043Z         %202 = tt.broadcast %201 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1>
2026-02-21T08:40:07.2070327Z         %203 = ttg.convert_layout %202 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2070622Z         %204 = arith.addi %203, %109 : tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2070894Z         %205 = tt.addptr %5, %204 : tensor<256x16x!tt.ptr<i8>, #blocked>, tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2071140Z         %206 = tt.load %205 : tensor<256x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:40:07.2071386Z         %207 = arith.shli %206, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2071640Z         %208 = arith.shrsi %207, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2071921Z         %209 = arith.shrsi %206, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2072216Z         %210 = ttg.convert_layout %208 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.2072780Z         %211 = tt.expand_dims %210 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.2073151Z         %212 = ttg.convert_layout %211 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.2073514Z         %213 = ttg.convert_layout %209 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.2073931Z         %214 = tt.expand_dims %213 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.2074385Z         %215 = ttg.convert_layout %214 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.2074894Z         %216 = tt.broadcast %212 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.2075413Z         %217 = ttg.convert_layout %216 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2075914Z         %218 = arith.select %15, %217, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2076389Z         %219 = tt.broadcast %215 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.2076872Z         %220 = ttg.convert_layout %219 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2077458Z         %221 = arith.select %18, %220, %218 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2077905Z         %222 = tt.reshape %221 : tensor<256x2x16xi8, #blocked2> -> tensor<512x16xi8, #blocked>
2026-02-21T08:40:07.2078310Z         %223 = arith.sitofp %222 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked>
2026-02-21T08:40:07.2078854Z         %224 = ttg.convert_layout %197 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
2026-02-21T08:40:07.2079524Z         %225 = ttg.convert_layout %223 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>>
2026-02-21T08:40:07.2080106Z         %226 = ttg.convert_layout %182 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.2080874Z         %227 = tt.dot %224, %225, %226, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.2081516Z         scf.yield %227 : tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.2081746Z       } {tt.flatten}
2026-02-21T08:40:07.2082025Z       %111 = arith.truncf %110 : tensor<16x16xf32, #blocked> to tensor<16x16xbf16, #blocked>
2026-02-21T08:40:07.2082372Z       %112 = arith.extsi %94 : i32 to i64
2026-02-21T08:40:07.2082640Z       %113 = arith.extsi %97 : i32 to i64
2026-02-21T08:40:07.2082906Z       %114 = tt.splat %112 : i64 -> tensor<16xi64, #blocked4>
2026-02-21T08:40:07.2083166Z       %115 = arith.addi %114, %20 : tensor<16xi64, #blocked4>
2026-02-21T08:40:07.2083453Z       %116 = ttg.convert_layout %115 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.2083850Z       %117 = tt.expand_dims %116 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi64, #blocked9>
2026-02-21T08:40:07.2084266Z       %118 = ttg.convert_layout %117 : tensor<16x1xi64, #blocked9> -> tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2084575Z       %119 = arith.muli %118, %cst_3 : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2084811Z       %120 = tt.broadcast %119 : tensor<16x1xi64, #blocked1> -> tensor<16x16xi64, #blocked1>
2026-02-21T08:40:07.2085143Z       %121 = ttg.convert_layout %120 : tensor<16x16xi64, #blocked1> -> tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2085485Z       %122 = tt.splat %113 : i64 -> tensor<16xi64, #blocked4>
2026-02-21T08:40:07.2085715Z       %123 = arith.addi %122, %20 : tensor<16xi64, #blocked4>
2026-02-21T08:40:07.2085998Z       %124 = ttg.convert_layout %123 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.2086393Z       %125 = tt.expand_dims %124 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi64, #blocked6>
2026-02-21T08:40:07.2086834Z       %126 = ttg.convert_layout %125 : tensor<1x16xi64, #blocked6> -> tensor<1x16xi64, #blocked>
2026-02-21T08:40:07.2087250Z       %127 = tt.broadcast %126 : tensor<1x16xi64, #blocked> -> tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2087626Z       %128 = arith.addi %121, %127 : tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2088023Z       %129 = tt.addptr %19, %128 : tensor<16x16x!tt.ptr<bf16>, #blocked>, tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2088410Z       %130 = arith.cmpi sge, %118, %cst_2 : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2088738Z       %131 = arith.cmpi slt, %118, %cst_1 : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2089042Z       %132 = arith.andi %130, %131 : tensor<16x1xi1, #blocked1>
2026-02-21T08:40:07.2089380Z       %133 = tt.broadcast %132 : tensor<16x1xi1, #blocked1> -> tensor<16x16xi1, #blocked1>
2026-02-21T08:40:07.2089817Z       %134 = ttg.convert_layout %133 : tensor<16x16xi1, #blocked1> -> tensor<16x16xi1, #blocked>
2026-02-21T08:40:07.2090195Z       %135 = arith.cmpi sge, %126, %cst_0 : tensor<1x16xi64, #blocked>
2026-02-21T08:40:07.2090505Z       %136 = arith.cmpi slt, %126, %cst : tensor<1x16xi64, #blocked>
2026-02-21T08:40:07.2090867Z       %137 = arith.andi %135, %136 : tensor<1x16xi1, #blocked>
2026-02-21T08:40:07.2091204Z       %138 = tt.broadcast %137 : tensor<1x16xi1, #blocked> -> tensor<16x16xi1, #blocked>
2026-02-21T08:40:07.2091541Z       %139 = arith.andi %134, %138 : tensor<16x16xi1, #blocked>
2026-02-21T08:40:07.2091839Z       tt.store %129, %111, %139 : tensor<16x16x!tt.ptr<bf16>, #blocked>
2026-02-21T08:40:07.2092096Z     }
2026-02-21T08:40:07.2092285Z     scf.for %arg3 = %28 to %c448_i32 step %c304_i32  : i32 {
2026-02-21T08:40:07.2092540Z       %30 = arith.divsi %arg3, %c14336_i32 : i32
2026-02-21T08:40:07.2092771Z       %31 = arith.muli %30, %c32_i32 : i32
2026-02-21T08:40:07.2092984Z       %32 = arith.subi %c1_i32, %31 : i32
2026-02-21T08:40:07.2093192Z       %33 = arith.minsi %32, %c32_i32 : i32
2026-02-21T08:40:07.2093403Z       %34 = arith.remsi %arg3, %c14336_i32 : i32
2026-02-21T08:40:07.2093628Z       %35 = arith.remsi %34, %33 : i32
2026-02-21T08:40:07.2093836Z       %36 = arith.addi %31, %35 : i32
2026-02-21T08:40:07.2094045Z       %37 = arith.divsi %34, %33 : i32
2026-02-21T08:40:07.2094255Z       %38 = arith.muli %36, %c16_i32 : i32
2026-02-21T08:40:07.2094504Z       %39 = tt.splat %38 : i32 -> tensor<16xi32, #blocked4>
2026-02-21T08:40:07.2094765Z       %40 = arith.addi %39, %1 : tensor<16xi32, #blocked4>
2026-02-21T08:40:07.2094997Z       %41 = arith.muli %37, %c16_i32 : i32
2026-02-21T08:40:07.2095241Z       %42 = tt.splat %41 : i32 -> tensor<16xi32, #blocked4>
2026-02-21T08:40:07.2095491Z       %43 = arith.addi %42, %1 : tensor<16xi32, #blocked4>
2026-02-21T08:40:07.2095937Z       %44 = ttg.convert_layout %40 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.2096549Z       %45 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi32, #blocked9>
2026-02-21T08:40:07.2097068Z       %46 = ttg.convert_layout %45 : tensor<16x1xi32, #blocked9> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:40:07.2097436Z       %47 = arith.muli %46, %cst_9 : tensor<16x1xi32, #blocked1>
2026-02-21T08:40:07.2097775Z       %48 = tt.broadcast %47 : tensor<16x1xi32, #blocked1> -> tensor<16x512xi32, #blocked1>
2026-02-21T08:40:07.2098269Z       %49 = ttg.convert_layout %48 : tensor<16x512xi32, #blocked1> -> tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2098788Z       %50 = ttg.convert_layout %43 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.2099393Z       %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi32, #blocked6>
2026-02-21T08:40:07.2099940Z       %52 = ttg.convert_layout %51 : tensor<1x16xi32, #blocked6> -> tensor<1x16xi32, #blocked>
2026-02-21T08:40:07.2100301Z       %53 = tt.broadcast %52 : tensor<1x16xi32, #blocked> -> tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2100517Z       %c512_i32 = arith.constant 512 : i32
2026-02-21T08:40:07.2100864Z       %54 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c512_i32 iter_args(%arg5 = %cst_10) -> (tensor<16x16xf32, #blocked>)  : i32 {
2026-02-21T08:40:07.2101164Z         %84 = tt.splat %arg4 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T08:40:07.2101357Z         %85 = arith.addi %84, %2 : tensor<256xi32, #blocked4>
2026-02-21T08:40:07.2101525Z         %86 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:40:07.2101693Z         %87 = tt.splat %86 : i32 -> tensor<512xi32, #blocked4>
2026-02-21T08:40:07.2101873Z         %88 = arith.addi %87, %3 : tensor<512xi32, #blocked4>
2026-02-21T08:40:07.2102160Z         %89 = ttg.convert_layout %88 : tensor<512xi32, #blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.2102553Z         %90 = tt.expand_dims %89 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6>
2026-02-21T08:40:07.2102899Z         %91 = ttg.convert_layout %90 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5>
2026-02-21T08:40:07.2103248Z         %92 = tt.broadcast %91 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2103484Z         %93 = arith.addi %49, %92 : tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2103733Z         %94 = tt.addptr %4, %93 : tensor<16x512x!tt.ptr<bf16>, #blocked5>, tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2103982Z         %95 = tt.load %94 : tensor<16x512x!tt.ptr<bf16>, #blocked5>
2026-02-21T08:40:07.2104211Z         %96 = arith.extf %95 : tensor<16x512xbf16, #blocked5> to tensor<16x512xf32, #blocked5>
2026-02-21T08:40:07.2104542Z         %97 = ttg.convert_layout %85 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.2104936Z         %98 = tt.expand_dims %97 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9>
2026-02-21T08:40:07.2105284Z         %99 = ttg.convert_layout %98 : tensor<256x1xi32, #blocked9> -> tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.2105538Z         %100 = arith.muli %99, %cst_8 : tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.2105775Z         %101 = tt.broadcast %100 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1>
2026-02-21T08:40:07.2106075Z         %102 = ttg.convert_layout %101 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2106322Z         %103 = arith.addi %102, %53 : tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2106565Z         %104 = tt.addptr %5, %103 : tensor<256x16x!tt.ptr<i8>, #blocked>, tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2106814Z         %105 = tt.load %104 : tensor<256x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:40:07.2107010Z         %106 = arith.shli %105, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2107212Z         %107 = arith.shrsi %106, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2107415Z         %108 = arith.shrsi %105, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2107721Z         %109 = ttg.convert_layout %107 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.2108143Z         %110 = tt.expand_dims %109 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.2108561Z         %111 = ttg.convert_layout %110 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.2108922Z         %112 = ttg.convert_layout %108 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.2109333Z         %113 = tt.expand_dims %112 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.2109713Z         %114 = ttg.convert_layout %113 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.2110020Z         %115 = tt.broadcast %111 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.2110323Z         %116 = ttg.convert_layout %115 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2110640Z         %117 = arith.select %15, %116, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2110944Z         %118 = tt.broadcast %114 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.2111254Z         %119 = ttg.convert_layout %118 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2111564Z         %120 = arith.select %18, %119, %117 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2111862Z         %121 = tt.reshape %120 : tensor<256x2x16xi8, #blocked2> -> tensor<512x16xi8, #blocked>
2026-02-21T08:40:07.2112136Z         %122 = arith.sitofp %121 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked>
2026-02-21T08:40:07.2112520Z         %123 = ttg.convert_layout %96 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
2026-02-21T08:40:07.2112932Z         %124 = ttg.convert_layout %122 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>>
2026-02-21T08:40:07.2113291Z         %125 = ttg.convert_layout %arg5 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.2113763Z         %126 = tt.dot %123, %124, %125, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.2114171Z         %c1_i32_13 = arith.constant 1 : i32
2026-02-21T08:40:07.2114335Z         %127 = arith.muli %c256_i32, %c1_i32_13 : i32
2026-02-21T08:40:07.2114488Z         %128 = arith.addi %arg4, %127 : i32
2026-02-21T08:40:07.2114656Z         %129 = tt.splat %128 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T08:40:07.2114844Z         %130 = arith.addi %129, %2 : tensor<256xi32, #blocked4>
2026-02-21T08:40:07.2115018Z         %131 = arith.muli %128, %c2_i32 : i32
2026-02-21T08:40:07.2115184Z         %132 = tt.splat %131 : i32 -> tensor<512xi32, #blocked4>
2026-02-21T08:40:07.2115372Z         %133 = arith.addi %132, %3 : tensor<512xi32, #blocked4>
2026-02-21T08:40:07.2115659Z         %134 = ttg.convert_layout %133 : tensor<512xi32, #blocked4> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.2116052Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6>
2026-02-21T08:40:07.2116407Z         %136 = ttg.convert_layout %135 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked5>
2026-02-21T08:40:07.2116697Z         %137 = tt.broadcast %136 : tensor<1x512xi32, #blocked5> -> tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2116938Z         %138 = arith.addi %49, %137 : tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2117190Z         %139 = tt.addptr %4, %138 : tensor<16x512x!tt.ptr<bf16>, #blocked5>, tensor<16x512xi32, #blocked5>
2026-02-21T08:40:07.2117438Z         %140 = tt.load %139 : tensor<16x512x!tt.ptr<bf16>, #blocked5>
2026-02-21T08:40:07.2117729Z         %141 = arith.extf %140 : tensor<16x512xbf16, #blocked5> to tensor<16x512xf32, #blocked5>
2026-02-21T08:40:07.2118057Z         %142 = ttg.convert_layout %130 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.2118453Z         %143 = tt.expand_dims %142 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<256x1xi32, #blocked9>
2026-02-21T08:40:07.2118802Z         %144 = ttg.convert_layout %143 : tensor<256x1xi32, #blocked9> -> tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.2119050Z         %145 = arith.muli %144, %cst_8 : tensor<256x1xi32, #blocked1>
2026-02-21T08:40:07.2119286Z         %146 = tt.broadcast %145 : tensor<256x1xi32, #blocked1> -> tensor<256x16xi32, #blocked1>
2026-02-21T08:40:07.2119570Z         %147 = ttg.convert_layout %146 : tensor<256x16xi32, #blocked1> -> tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2119818Z         %148 = arith.addi %147, %53 : tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2120057Z         %149 = tt.addptr %5, %148 : tensor<256x16x!tt.ptr<i8>, #blocked>, tensor<256x16xi32, #blocked>
2026-02-21T08:40:07.2120296Z         %150 = tt.load %149 : tensor<256x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:40:07.2120493Z         %151 = arith.shli %150, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2120691Z         %152 = arith.shrsi %151, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2120888Z         %153 = arith.shrsi %150, %cst_7 : tensor<256x16xi8, #blocked>
2026-02-21T08:40:07.2121217Z         %154 = ttg.convert_layout %152 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.2121669Z         %155 = tt.expand_dims %154 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.2122045Z         %156 = ttg.convert_layout %155 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.2122411Z         %157 = ttg.convert_layout %153 : tensor<256x16xi8, #blocked> -> tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:40:07.2122923Z         %158 = tt.expand_dims %157 {axis = 1 : i32} : tensor<256x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<256x1x16xi8, #blocked10>
2026-02-21T08:40:07.2123296Z         %159 = ttg.convert_layout %158 : tensor<256x1x16xi8, #blocked10> -> tensor<256x1x16xi8, #blocked11>
2026-02-21T08:40:07.2123604Z         %160 = tt.broadcast %156 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.2123912Z         %161 = ttg.convert_layout %160 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2124231Z         %162 = arith.select %15, %161, %cst_4 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2124540Z         %163 = tt.broadcast %159 : tensor<256x1x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked11>
2026-02-21T08:40:07.2124847Z         %164 = ttg.convert_layout %163 : tensor<256x2x16xi8, #blocked11> -> tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2125211Z         %165 = arith.select %18, %164, %162 : tensor<256x2x16xi1, #blocked2>, tensor<256x2x16xi8, #blocked2>
2026-02-21T08:40:07.2125509Z         %166 = tt.reshape %165 : tensor<256x2x16xi8, #blocked2> -> tensor<512x16xi8, #blocked>
2026-02-21T08:40:07.2125784Z         %167 = arith.sitofp %166 : tensor<512x16xi8, #blocked> to tensor<512x16xf32, #blocked>
2026-02-21T08:40:07.2126136Z         %168 = ttg.convert_layout %141 : tensor<16x512xf32, #blocked5> -> tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
2026-02-21T08:40:07.2126553Z         %169 = ttg.convert_layout %167 : tensor<512x16xf32, #blocked> -> tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>>
2026-02-21T08:40:07.2126907Z         %170 = ttg.convert_layout %126 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.2127435Z         %171 = tt.dot %168, %169, %170, inputPrecision = tf32 : tensor<16x512xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<512x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.2127861Z         scf.yield %171 : tensor<16x16xf32, #blocked>
2026-02-21T08:40:07.2128099Z       } {tt.flatten}
2026-02-21T08:40:07.2128302Z       %55 = arith.truncf %54 : tensor<16x16xf32, #blocked> to tensor<16x16xbf16, #blocked>
2026-02-21T08:40:07.2128559Z       %56 = arith.extsi %38 : i32 to i64
2026-02-21T08:40:07.2128731Z       %57 = arith.extsi %41 : i32 to i64
2026-02-21T08:40:07.2128892Z       %58 = tt.splat %56 : i64 -> tensor<16xi64, #blocked4>
2026-02-21T08:40:07.2129074Z       %59 = arith.addi %58, %20 : tensor<16xi64, #blocked4>
2026-02-21T08:40:07.2129362Z       %60 = ttg.convert_layout %59 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T08:40:07.2129746Z       %61 = tt.expand_dims %60 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<16x1xi64, #blocked9>
2026-02-21T08:40:07.2130086Z       %62 = ttg.convert_layout %61 : tensor<16x1xi64, #blocked9> -> tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2130327Z       %63 = arith.muli %62, %cst_3 : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2130554Z       %64 = tt.broadcast %63 : tensor<16x1xi64, #blocked1> -> tensor<16x16xi64, #blocked1>
2026-02-21T08:40:07.2130831Z       %65 = ttg.convert_layout %64 : tensor<16x16xi64, #blocked1> -> tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2131061Z       %66 = tt.splat %57 : i64 -> tensor<16xi64, #blocked4>
2026-02-21T08:40:07.2131241Z       %67 = arith.addi %66, %20 : tensor<16xi64, #blocked4>
2026-02-21T08:40:07.2131577Z       %68 = ttg.convert_layout %67 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T08:40:07.2131958Z       %69 = tt.expand_dims %68 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x16xi64, #blocked6>
2026-02-21T08:40:07.2132288Z       %70 = ttg.convert_layout %69 : tensor<1x16xi64, #blocked6> -> tensor<1x16xi64, #blocked>
2026-02-21T08:40:07.2132557Z       %71 = tt.broadcast %70 : tensor<1x16xi64, #blocked> -> tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2132782Z       %72 = arith.addi %65, %71 : tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2133014Z       %73 = tt.addptr %19, %72 : tensor<16x16x!tt.ptr<bf16>, #blocked>, tensor<16x16xi64, #blocked>
2026-02-21T08:40:07.2133263Z       %74 = arith.cmpi sge, %62, %cst_2 : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2133468Z       %75 = arith.cmpi slt, %62, %cst_1 : tensor<16x1xi64, #blocked1>
2026-02-21T08:40:07.2133659Z       %76 = arith.andi %74, %75 : tensor<16x1xi1, #blocked1>
2026-02-21T08:40:07.2133879Z       %77 = tt.broadcast %76 : tensor<16x1xi1, #blocked1> -> tensor<16x16xi1, #blocked1>
2026-02-21T08:40:07.2134143Z       %78 = ttg.convert_layout %77 : tensor<16x16xi1, #blocked1> -> tensor<16x16xi1, #blocked>
2026-02-21T08:40:07.2134395Z       %79 = arith.cmpi sge, %70, %cst_0 : tensor<1x16xi64, #blocked>
2026-02-21T08:40:07.2134592Z       %80 = arith.cmpi slt, %70, %cst : tensor<1x16xi64, #blocked>
2026-02-21T08:40:07.2134783Z       %81 = arith.andi %79, %80 : tensor<1x16xi1, #blocked>
2026-02-21T08:40:07.2134990Z       %82 = tt.broadcast %81 : tensor<1x16xi1, #blocked> -> tensor<16x16xi1, #blocked>
2026-02-21T08:40:07.2135209Z       %83 = arith.andi %78, %82 : tensor<16x16xi1, #blocked>
2026-02-21T08:40:07.2135399Z       tt.store %73, %55, %83 : tensor<16x16x!tt.ptr<bf16>, #blocked>
2026-02-21T08:40:07.2135570Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:40:07.2135698Z     tt.return
2026-02-21T08:40:07.2135794Z   }
2026-02-21T08:40:07.2135893Z }
2026-02-21T08:40:07.2135949Z 
2026-02-21T08:40:07.2135986Z {-#
2026-02-21T08:40:07.2136087Z   external_resources: {
2026-02-21T08:40:07.2136225Z     mlir_reproducer: {
2026-02-21T08:40:07.2138946Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T08:40:07.2141811Z       disable_threading: false,
2026-02-21T08:40:07.2141938Z       verify_each: true
2026-02-21T08:40:07.2142073Z     }
2026-02-21T08:40:07.2142166Z   }
2026-02-21T08:40:07.2142255Z #-}
2026-02-21T08:40:07.2142593Z /tmp/torchinductor_root/ui/cuidcmuoourevkgc4xs7zqtizjfdsxlfzqlgpapqy2ekhhslavhx.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:40:07.2143515Z /tmp/torchinductor_root/ui/cuidcmuoourevkgc4xs7zqtizjfdsxlfzqlgpapqy2ekhhslavhx.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:40:07.2144224Z [93s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:40:07.2145151Z Config: @helion.kernel(config=helion.Config(block_sizes=[256, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[0, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T08:40:07.2145991Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:40:07.2146194Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:40:13.5446189Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 105/105 15.6 configs/s
2026-02-21T08:40:16.3077922Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 327.6 configs/s
2026-02-21T08:40:16.7388574Z [102s] Generation 1 complete: 
2026-02-21T08:40:16.7388742Z error=5
2026-02-21T08:40:16.7388837Z timeout=3
2026-02-21T08:40:16.7388920Z ok=100
2026-02-21T08:40:16.7389002Z min=0.0853
2026-02-21T08:40:16.7389090Z mid=0.2193
2026-02-21T08:40:16.7389169Z max=3.5782
2026-02-21T08:40:16.7389288Z best={'block_sizes': [128, 16, 16],
2026-02-21T08:40:16.7389432Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:40:16.7389572Z  'l2_groupings': [64],
2026-02-21T08:40:16.7389680Z  'load_eviction_policies': ['', ''],
2026-02-21T08:40:16.7389805Z  'loop_orders': [[1, 0]],
2026-02-21T08:40:16.7389911Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:40:16.7390046Z  'num_sm_multiplier': 128,
2026-02-21T08:40:16.7390150Z  'num_stages': 4,
2026-02-21T08:40:16.7390244Z  'num_warps': 2,
2026-02-21T08:40:16.7390368Z  'pid_type': 'persistent_blocked',
2026-02-21T08:40:16.7390487Z  'range_flattens': [False, True],
2026-02-21T08:40:16.7391014Z  'range_multi_buffers': [True, False],
2026-02-21T08:40:16.7391132Z  'range_num_stages': [4, 3],
2026-02-21T08:40:16.7391242Z  'range_unroll_factors': [0, 2],
2026-02-21T08:40:16.7391355Z  'range_warp_specializes': [],
2026-02-21T08:40:16.7391466Z  'waves_per_eu': 1}
2026-02-21T08:40:16.7672278Z [102s] Fitting surrogate: 208 points, 208 targets
2026-02-21T08:40:18.1504711Z [104s] Generation 2 starting: 101 neighbors, 5 active search path(s)
2026-02-21T08:41:00.2619553Z [146s] Timeout after 30s compiling Config(block_sizes=[256, 8, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:41:00.2640044Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 104/104 0.5 configs/s
2026-02-21T08:41:05.9106972Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 104/104 18.6 configs/s
2026-02-21T08:41:15.1900819Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 113.3 configs/s
2026-02-21T08:41:15.8671313Z [161s] Generation 2 complete: 
2026-02-21T08:41:15.8671759Z error=12
2026-02-21T08:41:15.8671974Z timeout=1
2026-02-21T08:41:15.8672168Z ok=94
2026-02-21T08:41:15.8672365Z min=0.0854
2026-02-21T08:41:15.8672561Z mid=0.1110
2026-02-21T08:41:15.8672762Z max=1.0793
2026-02-21T08:41:15.8672989Z best={'block_sizes': [128, 16, 16],
2026-02-21T08:41:15.8673368Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:41:15.8673729Z  'l2_groupings': [64],
2026-02-21T08:41:15.8674539Z  'load_eviction_policies': ['', ''],
2026-02-21T08:41:15.8674858Z  'loop_orders': [[1, 0]],
2026-02-21T08:41:15.8675133Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:41:15.8675443Z  'num_sm_multiplier': 128,
2026-02-21T08:41:15.8675708Z  'num_stages': 4,
2026-02-21T08:41:15.8675938Z  'num_warps': 2,
2026-02-21T08:41:15.8676193Z  'pid_type': 'persistent_blocked',
2026-02-21T08:41:15.8676505Z  'range_flattens': [False, True],
2026-02-21T08:41:15.8676821Z  'range_multi_buffers': [True, False],
2026-02-21T08:41:15.8677128Z  'range_num_stages': [4, 3],
2026-02-21T08:41:15.8677411Z  'range_unroll_factors': [0, 2],
2026-02-21T08:41:15.8677700Z  'range_warp_specializes': [],
2026-02-21T08:41:15.8677975Z  'waves_per_eu': 1}
2026-02-21T08:41:15.9890909Z [161s] Fitting surrogate: 315 points, 315 targets
2026-02-21T08:41:17.0526749Z [162s] Generation 3 starting: 97 neighbors, 5 active search path(s)
2026-02-21T08:41:39.2361112Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 0.7 configs/s
2026-02-21T08:41:44.9920536Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 17.5 configs/s
2026-02-21T08:41:51.6694433Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 145.0 configs/s
2026-02-21T08:41:52.2767297Z [198s] Generation 3 complete: 
2026-02-21T08:41:52.2767525Z error=4
2026-02-21T08:41:52.2767659Z ok=99
2026-02-21T08:41:52.2767771Z min=0.0855
2026-02-21T08:41:52.2767880Z mid=0.1279
2026-02-21T08:41:52.2767990Z max=2.7258
2026-02-21T08:41:52.2768104Z best={'block_sizes': [128, 16, 16],
2026-02-21T08:41:52.2768304Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:41:52.2768483Z  'l2_groupings': [64],
2026-02-21T08:41:52.2768633Z  'load_eviction_policies': ['', ''],
2026-02-21T08:41:52.2768793Z  'loop_orders': [[1, 0]],
2026-02-21T08:41:52.2768938Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:41:52.2769079Z  'num_stages': 3,
2026-02-21T08:41:52.2769195Z  'num_warps': 2,
2026-02-21T08:41:52.2769346Z  'pid_type': 'flat',
2026-02-21T08:41:52.2769488Z  'range_flattens': [None, True],
2026-02-21T08:41:52.2769957Z  'range_multi_buffers': [None, False],
2026-02-21T08:41:52.2770114Z  'range_num_stages': [0, 3],
2026-02-21T08:41:52.2770260Z  'range_unroll_factors': [0, 2],
2026-02-21T08:41:52.2770409Z  'range_warp_specializes': [],
2026-02-21T08:41:52.2770553Z  'waves_per_eu': 1}
2026-02-21T08:41:52.3934193Z [198s] Fitting surrogate: 418 points, 418 targets
2026-02-21T08:41:53.2677160Z [199s] Generation 4 starting: 79 neighbors, 4 active search path(s)
2026-02-21T08:42:30.0103013Z [235s] Timeout after 30s compiling Config(block_sizes=[128, 8, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:42:30.0122448Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.2 configs/s
2026-02-21T08:42:35.4203669Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 14.7 configs/s
2026-02-21T08:42:42.5181111Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 136.9 configs/s
2026-02-21T08:42:43.1133195Z [249s] Generation 4 complete: 
2026-02-21T08:42:43.1133544Z error=1
2026-02-21T08:42:43.1133739Z timeout=1
2026-02-21T08:42:43.1133927Z ok=81
2026-02-21T08:42:43.1134107Z min=0.0773
2026-02-21T08:42:43.1134297Z mid=0.0937
2026-02-21T08:42:43.1134469Z max=2.2467
2026-02-21T08:42:43.1134676Z best={'block_sizes': [256, 16, 16],
2026-02-21T08:42:43.1135017Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:42:43.1135349Z  'l2_groupings': [4],
2026-02-21T08:42:43.1135922Z  'load_eviction_policies': ['', ''],
2026-02-21T08:42:43.1136204Z  'loop_orders': [[1, 0]],
2026-02-21T08:42:43.1136473Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:42:43.1136753Z  'num_sm_multiplier': 8,
2026-02-21T08:42:43.1136991Z  'num_stages': 2,
2026-02-21T08:42:43.1137196Z  'num_warps': 2,
2026-02-21T08:42:43.1137432Z  'pid_type': 'persistent_blocked',
2026-02-21T08:42:43.1137713Z  'range_flattens': [False, True],
2026-02-21T08:42:43.1137991Z  'range_multi_buffers': [None, None],
2026-02-21T08:42:43.1138267Z  'range_num_stages': [0, 4],
2026-02-21T08:42:43.1138521Z  'range_unroll_factors': [2, 2],
2026-02-21T08:42:43.1138789Z  'range_warp_specializes': [],
2026-02-21T08:42:43.1139036Z  'waves_per_eu': 1}
2026-02-21T08:42:43.2237333Z [249s] Fitting surrogate: 501 points, 501 targets
2026-02-21T08:42:43.9873725Z [249s] Generation 5 starting: 71 neighbors, 4 active search path(s)
2026-02-21T08:43:20.6098488Z [286s] Timeout after 30s compiling Config(block_sizes=[256, 16, 8], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:43:21.4187354Z [287s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[1, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:43:21.4202152Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.5 configs/s
2026-02-21T08:43:25.3356374Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 18.4 configs/s
2026-02-21T08:43:30.2941324Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 193.6 configs/s
2026-02-21T08:43:30.7904353Z [296s] Generation 5 complete: 
2026-02-21T08:43:30.7904776Z error=5
2026-02-21T08:43:30.7904988Z timeout=2
2026-02-21T08:43:30.7905187Z ok=68
2026-02-21T08:43:30.7905388Z min=0.0773
2026-02-21T08:43:30.7905592Z mid=0.0975
2026-02-21T08:43:30.7905784Z max=1.3049
2026-02-21T08:43:30.7906008Z best={'block_sizes': [256, 16, 16],
2026-02-21T08:43:30.7906336Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:43:30.7906597Z  'l2_groupings': [4],
2026-02-21T08:43:30.7906798Z  'load_eviction_policies': ['', ''],
2026-02-21T08:43:30.7907052Z  'loop_orders': [[1, 0]],
2026-02-21T08:43:30.7907253Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:43:30.7907451Z  'num_sm_multiplier': 8,
2026-02-21T08:43:30.7907639Z  'num_stages': 2,
2026-02-21T08:43:30.7907841Z  'num_warps': 2,
2026-02-21T08:43:30.7908041Z  'pid_type': 'persistent_blocked',
2026-02-21T08:43:30.7908262Z  'range_flattens': [False, True],
2026-02-21T08:43:30.7908496Z  'range_multi_buffers': [None, None],
2026-02-21T08:43:30.7908711Z  'range_num_stages': [0, 3],
2026-02-21T08:43:30.7908912Z  'range_unroll_factors': [2, 2],
2026-02-21T08:43:30.7909121Z  'range_warp_specializes': [],
2026-02-21T08:43:30.7909315Z  'waves_per_eu': 1}
2026-02-21T08:43:30.8651440Z [296s] Fitting surrogate: 576 points, 576 targets
2026-02-21T08:43:31.4261978Z [297s] Generation 6 starting: 51 neighbors, 3 active search path(s)
2026-02-21T08:44:07.8130598Z [333s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[0, 3], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:44:07.8152376Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 0.5 configs/s
2026-02-21T08:44:10.7085870Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 17.9 configs/s
2026-02-21T08:44:14.6560421Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 240.9 configs/s
2026-02-21T08:44:15.1036922Z [341s] Generation 6 complete: 
2026-02-21T08:44:15.1048570Z error=3
2026-02-21T08:44:15.1048670Z timeout=1
2026-02-21T08:44:15.1048749Z ok=50
2026-02-21T08:44:15.1048832Z min=0.0756
2026-02-21T08:44:15.1048917Z mid=0.0912
2026-02-21T08:44:15.1049000Z max=0.9757
2026-02-21T08:44:15.1049092Z best={'block_sizes': [256, 16, 16],
2026-02-21T08:44:15.1049245Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:44:15.1049383Z  'l2_groupings': [64],
2026-02-21T08:44:15.1049524Z  'load_eviction_policies': ['', ''],
2026-02-21T08:44:15.1049666Z  'loop_orders': [[0, 1]],
2026-02-21T08:44:15.1049784Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:44:15.1049894Z  'num_sm_multiplier': 128,
2026-02-21T08:44:15.1049996Z  'num_stages': 3,
2026-02-21T08:44:15.1050085Z  'num_warps': 2,
2026-02-21T08:44:15.1050184Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:44:15.1050310Z  'range_flattens': [False, True],
2026-02-21T08:44:15.1050428Z  'range_multi_buffers': [True, False],
2026-02-21T08:44:15.1050542Z  'range_num_stages': [4, 3],
2026-02-21T08:44:15.1050651Z  'range_unroll_factors': [1, 2],
2026-02-21T08:44:15.1050759Z  'range_warp_specializes': [],
2026-02-21T08:44:15.1050864Z  'waves_per_eu': 1}
2026-02-21T08:44:15.1643396Z [341s] Fitting surrogate: 630 points, 630 targets
2026-02-21T08:44:15.6282353Z [341s] Generation 7 starting: 43 neighbors, 2 active search path(s)
2026-02-21T08:44:51.2519870Z [377s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[4, 3], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:44:51.2537454Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43/43 0.4 configs/s
2026-02-21T08:44:53.7176107Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 43/43 17.8 configs/s
2026-02-21T08:44:56.6464352Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 317.3 configs/s
2026-02-21T08:44:57.0868876Z [383s] Generation 7 complete: 
2026-02-21T08:44:57.0870725Z error=2
2026-02-21T08:44:57.0871273Z timeout=1
2026-02-21T08:44:57.0871547Z ok=42
2026-02-21T08:44:57.0871766Z min=0.0750
2026-02-21T08:44:57.0872036Z mid=0.1085
2026-02-21T08:44:57.0872237Z max=0.6665
2026-02-21T08:44:57.0872479Z best={'block_sizes': [256, 16, 16],
2026-02-21T08:44:57.0872907Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T08:44:57.0873292Z  'l2_groupings': [64],
2026-02-21T08:44:57.0873573Z  'load_eviction_policies': ['', ''],
2026-02-21T08:44:57.0873899Z  'loop_orders': [[0, 1]],
2026-02-21T08:44:57.0874178Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:44:57.0874464Z  'num_sm_multiplier': 2,
2026-02-21T08:44:57.0874733Z  'num_stages': 3,
2026-02-21T08:44:57.0874959Z  'num_warps': 2,
2026-02-21T08:44:57.0875223Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:44:57.0875552Z  'range_flattens': [None, True],
2026-02-21T08:44:57.0875859Z  'range_multi_buffers': [True, False],
2026-02-21T08:44:57.0876160Z  'range_num_stages': [0, 2],
2026-02-21T08:44:57.0876439Z  'range_unroll_factors': [2, 2],
2026-02-21T08:44:57.0876729Z  'range_warp_specializes': [],
2026-02-21T08:44:57.0876988Z  'waves_per_eu': 1}
2026-02-21T08:44:57.1273348Z [383s] Fitting surrogate: 675 points, 675 targets
2026-02-21T08:44:57.4194181Z [383s] Generation 8 starting: 21 neighbors, 1 active search path(s)
2026-02-21T08:45:06.8082848Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 1.9 configs/s
2026-02-21T08:45:08.1518577Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.5 configs/s
2026-02-21T08:45:09.9793407Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 490.8 configs/s
2026-02-21T08:45:10.3296931Z [396s] Generation 8 complete: 
2026-02-21T08:45:10.3297142Z error=1
2026-02-21T08:45:10.3297224Z ok=21
2026-02-21T08:45:10.3297304Z min=0.0756
2026-02-21T08:45:10.3297396Z mid=0.0866
2026-02-21T08:45:10.3297481Z max=0.2555
2026-02-21T08:45:10.3297569Z best={'block_sizes': [256, 16, 16],
2026-02-21T08:45:10.3297714Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T08:45:10.3297889Z  'l2_groupings': [64],
2026-02-21T08:45:10.3297995Z  'load_eviction_policies': ['', ''],
2026-02-21T08:45:10.3298562Z  'loop_orders': [[0, 1]],
2026-02-21T08:45:10.3298670Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:45:10.3298774Z  'num_sm_multiplier': 2,
2026-02-21T08:45:10.3298876Z  'num_stages': 3,
2026-02-21T08:45:10.3298963Z  'num_warps': 2,
2026-02-21T08:45:10.3299060Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:45:10.3299184Z  'range_flattens': [None, True],
2026-02-21T08:45:10.3299299Z  'range_multi_buffers': [True, False],
2026-02-21T08:45:10.3299414Z  'range_num_stages': [0, 2],
2026-02-21T08:45:10.3299516Z  'range_unroll_factors': [1, 2],
2026-02-21T08:45:10.3299626Z  'range_warp_specializes': [],
2026-02-21T08:45:10.3299727Z  'waves_per_eu': 1}
2026-02-21T08:45:10.3543433Z [396s] Fitting surrogate: 697 points, 697 targets
2026-02-21T08:45:10.6500432Z [396s] Generation 9 starting: 21 neighbors, 1 active search path(s)
2026-02-21T08:45:25.0135532Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 0.5 configs/s
2026-02-21T08:45:26.3528045Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 21/21 17.6 configs/s
2026-02-21T08:45:28.4642244Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 624.5 configs/s
2026-02-21T08:45:28.8174244Z [414s] Generation 9 complete: 
2026-02-21T08:45:28.8174406Z error=1
2026-02-21T08:45:28.8174503Z ok=21
2026-02-21T08:45:28.8174599Z min=0.0755
2026-02-21T08:45:28.8174716Z mid=0.0810
2026-02-21T08:45:28.8174812Z max=0.5877
2026-02-21T08:45:28.8174917Z best={'block_sizes': [256, 16, 16],
2026-02-21T08:45:28.8175097Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T08:45:28.8175267Z  'l2_groupings': [64],
2026-02-21T08:45:28.8175397Z  'load_eviction_policies': ['', ''],
2026-02-21T08:45:28.8175550Z  'loop_orders': [[0, 1]],
2026-02-21T08:45:28.8175685Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:45:28.8176191Z  'num_sm_multiplier': 2,
2026-02-21T08:45:28.8176312Z  'num_stages': 3,
2026-02-21T08:45:28.8176419Z  'num_warps': 2,
2026-02-21T08:45:28.8176559Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:45:28.8176715Z  'range_flattens': [None, True],
2026-02-21T08:45:28.8176855Z  'range_multi_buffers': [True, True],
2026-02-21T08:45:28.8176992Z  'range_num_stages': [0, 2],
2026-02-21T08:45:28.8177122Z  'range_unroll_factors': [1, 2],
2026-02-21T08:45:28.8177257Z  'range_warp_specializes': [],
2026-02-21T08:45:28.8177385Z  'waves_per_eu': 1}
2026-02-21T08:45:28.8372929Z [414s] Fitting surrogate: 719 points, 719 targets
2026-02-21T08:45:29.1331236Z [415s] Generation 10 starting: 21 neighbors, 1 active search path(s)
2026-02-21T08:45:43.6403648Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 0.7 configs/s
2026-02-21T08:45:44.8265475Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 18.5 configs/s
2026-02-21T08:45:45.8218030Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 843.4 configs/s
2026-02-21T08:45:46.2063133Z [432s] Generation 10 complete: 
2026-02-21T08:45:46.2063529Z error=2
2026-02-21T08:45:46.2063768Z ok=20
2026-02-21T08:45:46.2063976Z min=0.0756
2026-02-21T08:45:46.2064185Z mid=0.1217
2026-02-21T08:45:46.2064385Z max=0.7168
2026-02-21T08:45:46.2064615Z best={'block_sizes': [256, 16, 16],
2026-02-21T08:45:46.2064985Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T08:45:46.2065377Z  'l2_groupings': [64],
2026-02-21T08:45:46.2065657Z  'load_eviction_policies': ['', ''],
2026-02-21T08:45:46.2065965Z  'loop_orders': [[0, 1]],
2026-02-21T08:45:46.2066252Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:45:46.2066536Z  'num_sm_multiplier': 2,
2026-02-21T08:45:46.2066791Z  'num_stages': 3,
2026-02-21T08:45:46.2067017Z  'num_warps': 2,
2026-02-21T08:45:46.2067274Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:45:46.2067628Z  'range_flattens': [None, True],
2026-02-21T08:45:46.2067923Z  'range_multi_buffers': [True, True],
2026-02-21T08:45:46.2068772Z  'range_num_stages': [0, 2],
2026-02-21T08:45:46.2069046Z  'range_unroll_factors': [1, 2],
2026-02-21T08:45:46.2069339Z  'range_warp_specializes': [],
2026-02-21T08:45:46.2069614Z  'waves_per_eu': 1}
2026-02-21T08:45:46.2245888Z [432s] Fitting surrogate: 741 points, 741 targets
2026-02-21T08:45:46.4611756Z [432s] Generation 11 starting: 17 neighbors, 1 active search path(s)
2026-02-21T08:45:54.6278105Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 1.2 configs/s
2026-02-21T08:45:55.6305270Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 20.0 configs/s
2026-02-21T08:45:56.9024731Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 675.9 configs/s
2026-02-21T08:45:57.2723852Z [443s] Generation 11 complete: 
2026-02-21T08:45:57.2724213Z error=3
2026-02-21T08:45:57.2724434Z ok=15
2026-02-21T08:45:57.2724639Z min=0.0755
2026-02-21T08:45:57.2724872Z mid=0.0789
2026-02-21T08:45:57.2725069Z max=0.2450
2026-02-21T08:45:57.2725292Z best={'block_sizes': [256, 16, 16],
2026-02-21T08:45:57.2725662Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:45:57.2726010Z  'l2_groupings': [64],
2026-02-21T08:45:57.2726289Z  'load_eviction_policies': ['', ''],
2026-02-21T08:45:57.2726600Z  'loop_orders': [[0, 1]],
2026-02-21T08:45:57.2726875Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:45:57.2727155Z  'num_sm_multiplier': 2,
2026-02-21T08:45:57.2727408Z  'num_stages': 4,
2026-02-21T08:45:57.2727633Z  'num_warps': 2,
2026-02-21T08:45:57.2727889Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:45:57.2728215Z  'range_flattens': [None, True],
2026-02-21T08:45:57.2728510Z  'range_multi_buffers': [True, True],
2026-02-21T08:45:57.2728816Z  'range_num_stages': [0, 2],
2026-02-21T08:45:57.2729094Z  'range_unroll_factors': [1, 2],
2026-02-21T08:45:57.2729811Z  'range_warp_specializes': [],
2026-02-21T08:45:57.2730088Z  'waves_per_eu': 1}
2026-02-21T08:45:57.2994801Z [443s] Fitting surrogate: 759 points, 759 targets
2026-02-21T08:45:57.3921944Z [443s] Autotuning complete in 443.3s after searching 721 configs.
2026-02-21T08:45:57.3922480Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:45:57.3924571Z     @helion.kernel(config=helion.Config(block_sizes=[256, 16, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[0, 2], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:45:57.3926399Z 
2026-02-21T08:45:57.3926858Z [443s] Code of selected kernel: /tmp/torchinductor_root/6j/c6j4pcigu52vdmxsdp26z6z6tjsrk6hrno6jgs5d7xmqfx4qjwen.py
2026-02-21T08:45:58.3559473Z WARNING:tritonbench.utils.triton_op:Completed input ID 10:
2026-02-21T08:45:58.3559638Z x_val
2026-02-21T08:45:58.3559721Z -------------------
2026-02-21T08:45:58.3559814Z (16, 1, 7168, 8192)
2026-02-21T08:45:58.3559865Z 
2026-02-21T08:45:58.3588488Z  40%|████      | 4/10 [36:37<53:10, 531.68s/it]  WARNING:tritonbench.utils.triton_op:Running input ID 14:
2026-02-21T08:45:58.3588685Z x_val
2026-02-21T08:45:58.3588765Z -------------------
2026-02-21T08:45:58.3588853Z (64, 1, 7168, 8192)
2026-02-21T08:45:58.3591720Z INFO:tritonbench.utils.triton_op:Took 0.15ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T08:45:59.3402078Z INFO:tritonbench.utils.triton_op:Took 5.25ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T08:46:00.2092341Z Autotune Choices Stats:
2026-02-21T08:46:00.2094301Z {"num_choices": 31, "num_triton_choices": 30, "best_kernel": "mm", "best_time": 0.05023900046944618, "best_triton_pos": 1, "best_triton_time": 0.060398999601602554, "best_triton_kernel": "triton_mm_67", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8"}
2026-02-21T08:46:00.2101288Z AUTOTUNE mm(64x8192, 8192x7168)
2026-02-21T08:46:00.2101591Z strides: [8192, 1], [7168, 1]
2026-02-21T08:46:00.2101769Z dtypes: torch.bfloat16, torch.bfloat16
2026-02-21T08:46:00.2101903Z   mm 0.0502 ms 100.0% 
2026-02-21T08:46:00.2102298Z   triton_mm_67 0.0604 ms 83.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:46:00.2102992Z   triton_mm_62 0.0687 ms 73.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:46:00.2103638Z   triton_mm_66 0.0688 ms 73.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:46:00.2104288Z   triton_mm_74 0.0694 ms 72.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:46:00.2104929Z   triton_mm_63 0.0770 ms 65.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:46:00.2105817Z   triton_mm_61 0.0772 ms 65.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:46:00.2106455Z   triton_mm_83 0.0775 ms 64.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:46:00.2107096Z   triton_mm_58 0.0812 ms 61.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:46:00.2107731Z   triton_mm_57 0.0878 ms 57.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:46:00.2108232Z SingleProcess AUTOTUNE benchmarking takes 0.5805 seconds and 0.1320 seconds precompiling for 31 choices
2026-02-21T08:46:01.5598348Z INFO:tritonbench.utils.triton_op:Took 0.15ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T08:46:01.5625750Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:46:01.5626063Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:46:01.5626320Z               'dtype': 'torch.bfloat16',
2026-02-21T08:46:01.5626559Z               'shape': (64, 1, 8192),
2026-02-21T08:46:01.5626781Z               'stride': (8192, 8192, 1)},
2026-02-21T08:46:01.5627013Z             { 'device': 'cuda:0',
2026-02-21T08:46:01.5627226Z               'dtype': 'torch.int32',
2026-02-21T08:46:01.5627442Z               'shape': (8192, 7168),
2026-02-21T08:46:01.5627658Z               'stride': (7168, 1)}),
2026-02-21T08:46:01.5627861Z   'kwargs': {}}
2026-02-21T08:46:01.5657842Z INFO:tritonbench.utils.triton_op:Took 3.41ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T08:46:01.7487768Z [0s] Autotune random seed: 2134834638
2026-02-21T08:46:01.8436356Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:46:38.9483837Z [37s] Timeout after 30s compiling Config(block_sizes=[128, 1, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:46:42.1619387Z [40s] Timeout after 30s compiling Config(block_sizes=[1024, 32, 8], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T08:46:44.4572538Z [42s] Timeout after 30s compiling Config(block_sizes=[512, 4, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[3, 0], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T08:46:44.4591877Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T08:46:46.1443460Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:46:46.1448283Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}>
2026-02-21T08:46:46.1451443Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T08:46:46.1451816Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}>
2026-02-21T08:46:46.1452128Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:46:46.1452385Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:46:46.1452572Z #smem = #ttg.shared_memory
2026-02-21T08:46:46.1452812Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:46:46.1453286Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:46:46.1453689Z     %cst = arith.constant dense<64> : tensor<16x1xi64, #mma>
2026-02-21T08:46:46.1453861Z     %cst_0 = arith.constant dense<0> : tensor<16x1xi64, #mma>
2026-02-21T08:46:46.1454028Z     %cst_1 = arith.constant dense<7168> : tensor<16x1xi64, #mma>
2026-02-21T08:46:46.1454199Z     %cst_2 = arith.constant dense<7168> : tensor<1x256xi64, #mma>
2026-02-21T08:46:46.1454363Z     %cst_3 = arith.constant dense<0> : tensor<1x256xi64, #mma>
2026-02-21T08:46:46.1454531Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:46.1454699Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:46.1454872Z     %cst_6 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T08:46:46.1455021Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:46:46.1455170Z     %cst_7 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T08:46:46.1455331Z     %c7168_i64 = arith.constant 7168 : i64
2026-02-21T08:46:46.1455447Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:46:46.1455587Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:46:46.1455707Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:46:46.1455975Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:46:46.1456085Z     %c112_i32 = arith.constant 112 : i32
2026-02-21T08:46:46.1456227Z     %cst_8 = arith.constant dense<0> : tensor<1x256xi8, #blocked2>
2026-02-21T08:46:46.1456399Z     %cst_9 = arith.constant dense<7168> : tensor<1x256xi64, #blocked2>
2026-02-21T08:46:46.1456574Z     %cst_10 = arith.constant dense<0> : tensor<1x256xi64, #blocked2>
2026-02-21T08:46:46.1456748Z     %cst_11 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked>
2026-02-21T08:46:46.1456892Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:46:46.1457009Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:46:46.1457122Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:46:46.1457234Z     %c4864_i32 = arith.constant 4864 : i32
2026-02-21T08:46:46.1457418Z     %cst_12 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:46.1457611Z     %0 = tt.get_program_id x : i32
2026-02-21T08:46:46.1457807Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:46.1458080Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:46.1458340Z     %3 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:46.1458584Z     %4 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:46.1458782Z     %5 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:46.1459018Z     %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:46.1459384Z     %7 = arith.extsi %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:46.1459740Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:46:46.1460151Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:46:46.1460548Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:46.1460801Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:46.1460995Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T08:46:46.1461189Z     %13 = arith.cmpi eq, %10, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:46.1461377Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T08:46:46.1461583Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:46:46.1461848Z     %16 = arith.extsi %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:46.1462146Z     %17 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:46.1462448Z     %18 = arith.extsi %17 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:46.1462701Z     scf.for %arg3 = %0 to %c112_i32 step %c4864_i32  : i32 {
2026-02-21T08:46:46.1462848Z       %19 = arith.divsi %arg3, %c112_i32 : i32
2026-02-21T08:46:46.1462971Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T08:46:46.1463087Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T08:46:46.1463201Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T08:46:46.1463318Z       %23 = arith.remsi %arg3, %c112_i32 : i32
2026-02-21T08:46:46.1463438Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T08:46:46.1463550Z       %25 = arith.addi %20, %24 : i32
2026-02-21T08:46:46.1463693Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T08:46:46.1463806Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T08:46:46.1463973Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:46.1464195Z       %29 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:46.1464363Z       %30 = arith.muli %26, %c256_i32 : i32
2026-02-21T08:46:46.1464588Z       %31 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:46:46.1464837Z       %32 = arith.muli %31, %cst_6 : tensor<16x1xi32, #blocked1>
2026-02-21T08:46:46.1465025Z       %33 = tt.broadcast %32 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:46.1465200Z       %34 = arith.extsi %30 : i32 to i64
2026-02-21T08:46:46.1465375Z       %35 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:46.1465619Z       %36 = arith.addi %35, %7 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:46.1465900Z       %37 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi64, #blocked2>
2026-02-21T08:46:46.1466161Z       %38 = arith.cmpi sge, %37, %cst_10 : tensor<1x256xi64, #blocked2>
2026-02-21T08:46:46.1466332Z       %39 = arith.cmpi slt, %37, %cst_9 : tensor<1x256xi64, #blocked2>
2026-02-21T08:46:46.1466497Z       %40 = arith.andi %38, %39 : tensor<1x256xi1, #blocked2>
2026-02-21T08:46:46.1466728Z       %41 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c2_i32 iter_args(%arg5 = %cst_7) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T08:46:46.1466946Z         %64 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:46:46.1467146Z         %65 = tt.splat %64 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:46.1467361Z         %66 = arith.addi %65, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:46.1467638Z         %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:46:46.1467910Z         %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:46.1468101Z         %69 = arith.addi %33, %68 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:46.1468299Z         %70 = tt.addptr %4, %69 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:46.1468502Z         %71 = tt.load %70 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:46.1468769Z         %72 = ttg.convert_layout %71 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:46.1469170Z         %73 = arith.extf %72 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:46.1469454Z         %74 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:46:46.1469575Z         %75 = arith.muli %74, %c7168_i64 : i64
2026-02-21T08:46:46.1469715Z         %76 = tt.splat %75 : i64 -> tensor<1x256xi64, #blocked2>
2026-02-21T08:46:46.1469872Z         %77 = arith.addi %76, %37 : tensor<1x256xi64, #blocked2>
2026-02-21T08:46:46.1470062Z         %78 = tt.addptr %5, %77 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi64, #blocked2>
2026-02-21T08:46:46.1470270Z         %79 = tt.load %78, %40, %cst_8 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:46.1470536Z         %80 = ttg.convert_layout %79 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:46.1470819Z         %81 = arith.shli %80, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:46.1471056Z         %82 = arith.shrsi %81, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:46.1471326Z         %83 = arith.shrsi %80, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:46.1471615Z         %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:46:46.1471947Z         %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:46:46.1472229Z         %86 = tt.broadcast %84 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:46:46.1472467Z         %87 = arith.select %12, %86, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:46:46.1472703Z         %88 = tt.broadcast %85 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:46:46.1472931Z         %89 = arith.select %14, %88, %87 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:46:46.1473154Z         %90 = tt.reshape %89 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T08:46:46.1473376Z         %91 = arith.sitofp %90 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T08:46:46.1473626Z         %92 = ttg.local_alloc %91 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T08:46:46.1473948Z         %93 = ttg.local_load %92 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:46.1474421Z         %94 = tt.dot %73, %93, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T08:46:46.1474771Z         %95 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T08:46:46.1474890Z         %96 = arith.muli %95, %c2_i32 : i32
2026-02-21T08:46:46.1499071Z         %97 = tt.splat %96 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:46.1499383Z         %98 = arith.addi %97, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:46.1499656Z         %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:46:46.1499931Z         %100 = tt.broadcast %99 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:46.1500126Z         %101 = arith.addi %33, %100 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:46.1500328Z         %102 = tt.addptr %4, %101 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:46.1500532Z         %103 = tt.load %102 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:46.1500800Z         %104 = ttg.convert_layout %103 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:46.1501201Z         %105 = arith.extf %104 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:46.1501484Z         %106 = arith.extsi %95 : i32 to i64
2026-02-21T08:46:46.1501608Z         %107 = arith.muli %106, %c7168_i64 : i64
2026-02-21T08:46:46.1501750Z         %108 = tt.splat %107 : i64 -> tensor<1x256xi64, #blocked2>
2026-02-21T08:46:46.1501911Z         %109 = arith.addi %108, %37 : tensor<1x256xi64, #blocked2>
2026-02-21T08:46:46.1502108Z         %110 = tt.addptr %5, %109 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi64, #blocked2>
2026-02-21T08:46:46.1502304Z         %111 = arith.cmpi slt, %106, %c4096_i64 : i64
2026-02-21T08:46:46.1502450Z         %112 = tt.splat %111 : i1 -> tensor<1x256xi1, #blocked2>
2026-02-21T08:46:46.1502604Z         %113 = arith.andi %112, %40 : tensor<1x256xi1, #blocked2>
2026-02-21T08:46:46.1502776Z         %114 = tt.load %110, %113, %cst_8 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:46.1503034Z         %115 = ttg.convert_layout %114 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:46.1503401Z         %116 = arith.shli %115, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:46.1503639Z         %117 = arith.shrsi %116, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:46.1503878Z         %118 = arith.shrsi %115, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:46.1504175Z         %119 = tt.expand_dims %117 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:46:46.1504512Z         %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:46:46.1504802Z         %121 = tt.broadcast %119 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:46:46.1505050Z         %122 = arith.select %12, %121, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:46:46.1505294Z         %123 = tt.broadcast %120 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:46:46.1505533Z         %124 = arith.select %14, %123, %122 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:46:46.1505767Z         %125 = tt.reshape %124 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T08:46:46.1506000Z         %126 = arith.sitofp %125 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T08:46:46.1506255Z         %127 = ttg.local_alloc %126 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T08:46:46.1506578Z         %128 = ttg.local_load %127 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:46.1507100Z         %129 = tt.dot %105, %128, %94, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T08:46:46.1507450Z         scf.yield %129 : tensor<16x256xf32, #mma>
2026-02-21T08:46:46.1507604Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:46:46.1507790Z       %42 = arith.truncf %41 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T08:46:46.1507955Z       %43 = arith.extsi %27 : i32 to i64
2026-02-21T08:46:46.1508116Z       %44 = tt.splat %43 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:46.1508318Z       %45 = arith.addi %44, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:46.1508582Z       %46 = tt.expand_dims %45 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma>
2026-02-21T08:46:46.1508821Z       %47 = arith.muli %46, %cst_1 : tensor<16x1xi64, #mma>
2026-02-21T08:46:46.1508994Z       %48 = tt.broadcast %47 : tensor<16x1xi64, #mma> -> tensor<16x256xi64, #mma>
2026-02-21T08:46:46.1509199Z       %49 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:46.1509404Z       %50 = arith.addi %49, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:46.1509669Z       %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T08:46:46.1509930Z       %52 = tt.broadcast %51 : tensor<1x256xi64, #mma> -> tensor<16x256xi64, #mma>
2026-02-21T08:46:46.1510103Z       %53 = arith.addi %48, %52 : tensor<16x256xi64, #mma>
2026-02-21T08:46:46.1510284Z       %54 = tt.addptr %15, %53 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi64, #mma>
2026-02-21T08:46:46.1510475Z       %55 = arith.cmpi sge, %46, %cst_0 : tensor<16x1xi64, #mma>
2026-02-21T08:46:46.1510635Z       %56 = arith.cmpi slt, %46, %cst : tensor<16x1xi64, #mma>
2026-02-21T08:46:46.1510784Z       %57 = arith.andi %55, %56 : tensor<16x1xi1, #mma>
2026-02-21T08:46:46.1510994Z       %58 = tt.broadcast %57 : tensor<16x1xi1, #mma> -> tensor<16x256xi1, #mma>
2026-02-21T08:46:46.1511173Z       %59 = arith.cmpi sge, %51, %cst_3 : tensor<1x256xi64, #mma>
2026-02-21T08:46:46.1511332Z       %60 = arith.cmpi slt, %51, %cst_2 : tensor<1x256xi64, #mma>
2026-02-21T08:46:46.1511483Z       %61 = arith.andi %59, %60 : tensor<1x256xi1, #mma>
2026-02-21T08:46:46.1511645Z       %62 = tt.broadcast %61 : tensor<1x256xi1, #mma> -> tensor<16x256xi1, #mma>
2026-02-21T08:46:46.1511817Z       %63 = arith.andi %58, %62 : tensor<16x256xi1, #mma>
2026-02-21T08:46:46.1511970Z       tt.store %54, %42, %63 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:46:46.1512172Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T08:46:46.1512354Z     tt.return
2026-02-21T08:46:46.1512432Z   }
2026-02-21T08:46:46.1512514Z }
2026-02-21T08:46:46.1512558Z 
2026-02-21T08:46:46.1512589Z {-#
2026-02-21T08:46:46.1512673Z   external_resources: {
2026-02-21T08:46:46.1512772Z     mlir_reproducer: {
2026-02-21T08:46:46.1513777Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:46:46.1514778Z       disable_threading: false,
2026-02-21T08:46:46.1514887Z       verify_each: true
2026-02-21T08:46:46.1514975Z     }
2026-02-21T08:46:46.1515048Z   }
2026-02-21T08:46:46.1515152Z #-}
2026-02-21T08:46:46.1515433Z /tmp/torchinductor_root/yy/cyyoukks6lw5y4r3bbdrispjowrlw7qo2wiar4qx74wz4k3k6ofn.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:46:46.1516176Z /tmp/torchinductor_root/yy/cyyoukks6lw5y4r3bbdrispjowrlw7qo2wiar4qx74wz4k3k6ofn.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:46:46.1516722Z [44s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:46:46.1517508Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[2, 4], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T08:46:46.1518223Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:46:46.1518389Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:46:50.4596240Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:46:50.4601181Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}>
2026-02-21T08:46:50.4601859Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T08:46:50.4602527Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}>
2026-02-21T08:46:50.4603271Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:46:50.4604395Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 32, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:46:50.4604881Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:46:50.4605204Z #smem = #ttg.shared_memory
2026-02-21T08:46:50.4605621Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:46:50.4606497Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:46:50.4607198Z     %cst = arith.constant dense<64> : tensor<32x1xi64, #mma>
2026-02-21T08:46:50.4607499Z     %cst_0 = arith.constant dense<0> : tensor<32x1xi64, #mma>
2026-02-21T08:46:50.4607803Z     %cst_1 = arith.constant dense<7168> : tensor<32x1xi64, #mma>
2026-02-21T08:46:50.4608103Z     %cst_2 = arith.constant dense<7168> : tensor<1x512xi64, #mma>
2026-02-21T08:46:50.4608407Z     %cst_3 = arith.constant dense<0> : tensor<1x512xi64, #mma>
2026-02-21T08:46:50.4608708Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:50.4609008Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:50.4609320Z     %cst_6 = arith.constant dense<8192> : tensor<32x1xi32, #blocked1>
2026-02-21T08:46:50.4609589Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:46:50.4609861Z     %cst_7 = arith.constant dense<0.000000e+00> : tensor<32x512xf32, #mma>
2026-02-21T08:46:50.4610143Z     %c7168_i64 = arith.constant 7168 : i64
2026-02-21T08:46:50.4610353Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:46:50.4610561Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:46:50.4610759Z     %c28_i32 = arith.constant 28 : i32
2026-02-21T08:46:50.4611226Z     %cst_8 = arith.constant dense<0> : tensor<1x512xi8, #blocked2>
2026-02-21T08:46:50.4611540Z     %cst_9 = arith.constant dense<7168> : tensor<1x512xi64, #blocked2>
2026-02-21T08:46:50.4611862Z     %cst_10 = arith.constant dense<0> : tensor<1x512xi64, #blocked2>
2026-02-21T08:46:50.4612171Z     %cst_11 = arith.constant dense<0> : tensor<1x2x512xi8, #blocked>
2026-02-21T08:46:50.4612465Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:46:50.4612675Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:46:50.4612872Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:46:50.4613195Z     %cst_12 = arith.constant dense<4> : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:50.4613539Z     %0 = tt.get_program_id x : i32
2026-02-21T08:46:50.4613738Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T08:46:50.4613941Z     %2 = arith.minsi %1, %c28_i32 : i32
2026-02-21T08:46:50.4614300Z     %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:50.4614793Z     %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:50.4615246Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:50.4615568Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:50.4615836Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:50.4616159Z     %8 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:50.4616593Z     %9 = arith.extsi %8 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:50.4617081Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:46:50.4617631Z     %11 = tt.expand_dims %10 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:46:50.4618217Z     %12 = tt.expand_dims %11 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:50.4618550Z     %13 = arith.cmpi eq, %12, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:50.4618815Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked>
2026-02-21T08:46:50.4619079Z     %15 = arith.cmpi eq, %12, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:50.4619324Z     %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked>
2026-02-21T08:46:50.4619601Z     %17 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x512x!tt.ptr<bf16>, #mma>
2026-02-21T08:46:50.4619958Z     %18 = arith.extsi %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:50.4620359Z     %19 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:50.4620773Z     %20 = arith.extsi %19 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:50.4621090Z     scf.for %arg3 = %0 to %2 step %c1_i32  : i32 {
2026-02-21T08:46:50.4621269Z       %21 = arith.remsi %arg3, %c2_i32 : i32
2026-02-21T08:46:50.4621436Z       %22 = arith.divsi %arg3, %c2_i32 : i32
2026-02-21T08:46:50.4621591Z       %23 = arith.muli %21, %c32_i32 : i32
2026-02-21T08:46:50.4621810Z       %24 = tt.splat %23 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:50.4622110Z       %25 = arith.addi %24, %3 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:50.4622331Z       %26 = arith.muli %22, %c512_i32 : i32
2026-02-21T08:46:50.4622677Z       %27 = tt.expand_dims %25 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T08:46:50.4623013Z       %28 = arith.muli %27, %cst_6 : tensor<32x1xi32, #blocked1>
2026-02-21T08:46:50.4623264Z       %29 = tt.broadcast %28 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:46:50.4623494Z       %30 = arith.extsi %26 : i32 to i64
2026-02-21T08:46:50.4623709Z       %31 = tt.splat %30 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:50.4624003Z       %32 = arith.addi %31, %9 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:50.4624368Z       %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x512xi64, #blocked2>
2026-02-21T08:46:50.4624745Z       %34 = arith.cmpi sge, %33, %cst_10 : tensor<1x512xi64, #blocked2>
2026-02-21T08:46:50.4624936Z       %35 = arith.cmpi slt, %33, %cst_9 : tensor<1x512xi64, #blocked2>
2026-02-21T08:46:50.4625108Z       %36 = arith.andi %34, %35 : tensor<1x512xi1, #blocked2>
2026-02-21T08:46:50.4625357Z       %37 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c1_i32 iter_args(%arg5 = %cst_7) -> (tensor<32x512xf32, #mma>)  : i32 {
2026-02-21T08:46:50.4625592Z         %60 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:46:50.4625767Z         %61 = tt.splat %60 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:50.4625999Z         %62 = arith.addi %61, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:50.4626283Z         %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:46:50.4626571Z         %64 = tt.broadcast %63 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:46:50.4626773Z         %65 = arith.addi %29, %64 : tensor<32x2xi32, #blocked1>
2026-02-21T08:46:50.4626981Z         %66 = tt.addptr %6, %65 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:46:50.4627194Z         %67 = tt.load %66 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:50.4627461Z         %68 = ttg.local_alloc %67 : (tensor<32x2xbf16, #blocked1>) -> !ttg.memdesc<32x2xbf16, #shared, #smem>
2026-02-21T08:46:50.4627807Z         %69 = ttg.local_load %68 : !ttg.memdesc<32x2xbf16, #shared, #smem> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:50.4628237Z         %70 = arith.extf %69 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:50.4628533Z         %71 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:46:50.4628665Z         %72 = arith.muli %71, %c7168_i64 : i64
2026-02-21T08:46:50.4628812Z         %73 = tt.splat %72 : i64 -> tensor<1x512xi64, #blocked2>
2026-02-21T08:46:50.4628973Z         %74 = arith.addi %73, %33 : tensor<1x512xi64, #blocked2>
2026-02-21T08:46:50.4629180Z         %75 = tt.addptr %7, %74 : tensor<1x512x!tt.ptr<i8>, #blocked2>, tensor<1x512xi64, #blocked2>
2026-02-21T08:46:50.4629402Z         %76 = tt.load %75, %36, %cst_8 : tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:50.4629680Z         %77 = ttg.convert_layout %76 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:50.4629978Z         %78 = arith.shli %77, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:50.4630228Z         %79 = arith.shrsi %78, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:50.4630480Z         %80 = arith.shrsi %77, %cst_12 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:50.4630788Z         %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T08:46:50.4631186Z         %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T08:46:50.4631487Z         %83 = tt.broadcast %81 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T08:46:50.4631743Z         %84 = arith.select %14, %83, %cst_11 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T08:46:50.4631996Z         %85 = tt.broadcast %82 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T08:46:50.4632240Z         %86 = arith.select %16, %85, %84 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T08:46:50.4632481Z         %87 = tt.reshape %86 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked2>
2026-02-21T08:46:50.4632716Z         %88 = arith.sitofp %87 : tensor<2x512xi8, #blocked2> to tensor<2x512xf32, #blocked2>
2026-02-21T08:46:50.4632981Z         %89 = ttg.local_alloc %88 : (tensor<2x512xf32, #blocked2>) -> !ttg.memdesc<2x512xf32, #shared1, #smem>
2026-02-21T08:46:50.4633333Z         %90 = ttg.local_load %89 : !ttg.memdesc<2x512xf32, #shared1, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:50.4633838Z         %91 = tt.dot %70, %90, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma>
2026-02-21T08:46:50.4634212Z         scf.yield %91 : tensor<32x512xf32, #mma>
2026-02-21T08:46:50.4634390Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T08:46:50.4634605Z       %38 = arith.truncf %37 : tensor<32x512xf32, #mma> to tensor<32x512xbf16, #mma>
2026-02-21T08:46:50.4634774Z       %39 = arith.extsi %23 : i32 to i64
2026-02-21T08:46:50.4634930Z       %40 = tt.splat %39 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:50.4635133Z       %41 = arith.addi %40, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:50.4635391Z       %42 = tt.expand_dims %41 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T08:46:50.4635652Z       %43 = arith.muli %42, %cst_1 : tensor<32x1xi64, #mma>
2026-02-21T08:46:50.4635826Z       %44 = tt.broadcast %43 : tensor<32x1xi64, #mma> -> tensor<32x512xi64, #mma>
2026-02-21T08:46:50.4636025Z       %45 = tt.splat %30 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:50.4636230Z       %46 = arith.addi %45, %20 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:50.4636491Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T08:46:50.4636748Z       %48 = tt.broadcast %47 : tensor<1x512xi64, #mma> -> tensor<32x512xi64, #mma>
2026-02-21T08:46:50.4636923Z       %49 = arith.addi %44, %48 : tensor<32x512xi64, #mma>
2026-02-21T08:46:50.4637105Z       %50 = tt.addptr %17, %49 : tensor<32x512x!tt.ptr<bf16>, #mma>, tensor<32x512xi64, #mma>
2026-02-21T08:46:50.4637298Z       %51 = arith.cmpi sge, %42, %cst_0 : tensor<32x1xi64, #mma>
2026-02-21T08:46:50.4637459Z       %52 = arith.cmpi slt, %42, %cst : tensor<32x1xi64, #mma>
2026-02-21T08:46:50.4637607Z       %53 = arith.andi %51, %52 : tensor<32x1xi1, #mma>
2026-02-21T08:46:50.4637780Z       %54 = tt.broadcast %53 : tensor<32x1xi1, #mma> -> tensor<32x512xi1, #mma>
2026-02-21T08:46:50.4637957Z       %55 = arith.cmpi sge, %47, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T08:46:50.4638119Z       %56 = arith.cmpi slt, %47, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T08:46:50.4638273Z       %57 = arith.andi %55, %56 : tensor<1x512xi1, #mma>
2026-02-21T08:46:50.4638438Z       %58 = tt.broadcast %57 : tensor<1x512xi1, #mma> -> tensor<32x512xi1, #mma>
2026-02-21T08:46:50.4638613Z       %59 = arith.andi %54, %58 : tensor<32x512xi1, #mma>
2026-02-21T08:46:50.4638763Z       tt.store %50, %38, %59 : tensor<32x512x!tt.ptr<bf16>, #mma>
2026-02-21T08:46:50.4639019Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32}
2026-02-21T08:46:50.4639198Z     tt.return
2026-02-21T08:46:50.4639282Z   }
2026-02-21T08:46:50.4639359Z }
2026-02-21T08:46:50.4639404Z 
2026-02-21T08:46:50.4639435Z {-#
2026-02-21T08:46:50.4639516Z   external_resources: {
2026-02-21T08:46:50.4639612Z     mlir_reproducer: {
2026-02-21T08:46:50.4640611Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:46:50.4641614Z       disable_threading: false,
2026-02-21T08:46:50.4641719Z       verify_each: true
2026-02-21T08:46:50.4641814Z     }
2026-02-21T08:46:50.4641886Z   }
2026-02-21T08:46:50.4641956Z #-}
2026-02-21T08:46:50.4642231Z /tmp/torchinductor_root/ov/cov4uajfvx5mko2n2x5v26enht6si7crrlbd46iulyg2niaxw46u.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:46:50.4642959Z /tmp/torchinductor_root/ov/cov4uajfvx5mko2n2x5v26enht6si7crrlbd46iulyg2niaxw46u.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:46:50.4643511Z [48s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:46:50.4644290Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:46:50.4645035Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:46:50.4645202Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:46:55.5810419Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:46:55.5833502Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:46:55.5834181Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:46:55.5834829Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 16], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:46:55.5835445Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:46:55.5836005Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:46:55.5836504Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:46:55.5836864Z #smem = #ttg.shared_memory
2026-02-21T08:46:55.5837324Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:46:55.5838266Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:46:55.5839789Z     %cst = arith.constant dense<64> : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5840129Z     %cst_0 = arith.constant dense<0> : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5840462Z     %cst_1 = arith.constant dense<7168> : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5840800Z     %cst_2 = arith.constant dense<7168> : tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5841130Z     %cst_3 = arith.constant dense<0> : tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5841463Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:55.5841795Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:55.5842148Z     %cst_6 = arith.constant dense<0.000000e+00> : tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5842469Z     %c331_i32 = arith.constant 331 : i32
2026-02-21T08:46:55.5842998Z     %c-1_i32 = arith.constant -1 : i32
2026-02-21T08:46:55.5843281Z     %cst_7 = arith.constant dense<0> : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5843635Z     %cst_8 = arith.constant dense<false> : tensor<1x1024xi1, #blocked2>
2026-02-21T08:46:55.5843944Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T08:46:55.5844182Z     %c4094_i32 = arith.constant 4094 : i32
2026-02-21T08:46:55.5844499Z     %cst_9 = arith.constant dense<8188> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5844901Z     %cst_10 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5845247Z     %cst_11 = arith.constant dense<0> : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5845515Z     %cst_12 = arith.constant dense<7168> : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5845793Z     %cst_13 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T08:46:55.5846018Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:46:55.5846188Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:46:55.5846371Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:46:55.5846544Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:46:55.5846716Z     %c28_i32 = arith.constant 28 : i32
2026-02-21T08:46:55.5846886Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T08:46:55.5847204Z     %c912_i32 = arith.constant 912 : i32
2026-02-21T08:46:55.5847422Z     %cst_14 = arith.constant dense<0> : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5847653Z     %c608_i32 = arith.constant 608 : i32
2026-02-21T08:46:55.5847821Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T08:46:55.5847999Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T08:46:55.5848165Z     %c7168_i64 = arith.constant 7168 : i64
2026-02-21T08:46:55.5848381Z     %cst_15 = arith.constant dense<0> : tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5848605Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:46:55.5848775Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:46:55.5848941Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:46:55.5849108Z     %c304_i32 = arith.constant 304 : i32
2026-02-21T08:46:55.5849327Z     %cst_16 = arith.constant dense<4> : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5849654Z     %cst_17 = arith.constant dense<4> : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5849956Z     %0 = tt.get_program_id x : i32
2026-02-21T08:46:55.5850251Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:55.5850663Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:55.5851059Z     %3 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5851419Z     %4 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:55.5851723Z     %5 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x1024x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:55.5852158Z     %6 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:55.5852665Z     %7 = arith.extsi %6 : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:55.5853215Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:46:55.5853841Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:46:55.5854396Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:55.5854690Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:55.5854925Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x1024xi1, #blocked>
2026-02-21T08:46:55.5855185Z     %13 = arith.cmpi eq, %10, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:46:55.5855403Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x1024xi1, #blocked>
2026-02-21T08:46:55.5855655Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x1024x!tt.ptr<bf16>, #mma>
2026-02-21T08:46:55.5855966Z     %16 = arith.extsi %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:55.5856319Z     %17 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:55.5856683Z     %18 = arith.extsi %17 : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:55.5856955Z     %19 = arith.subi %c331_i32, %0 : i32
2026-02-21T08:46:55.5857095Z     %20 = arith.divui %19, %c304_i32 : i32
2026-02-21T08:46:55.5857232Z     %21 = arith.remsi %20, %c3_i32 : i32
2026-02-21T08:46:55.5857361Z     %22 = arith.subi %20, %21 : i32
2026-02-21T08:46:55.5857501Z     %23 = arith.muli %22, %c304_i32 : i32
2026-02-21T08:46:55.5857637Z     %24 = arith.addi %0, %23 : i32
2026-02-21T08:46:55.5857819Z     scf.for %arg3 = %0 to %24 step %c912_i32  : i32 {
2026-02-21T08:46:55.5857975Z       %30 = arith.divsi %arg3, %c28_i32 : i32
2026-02-21T08:46:55.5858118Z       %31 = arith.muli %30, %c4_i32 : i32
2026-02-21T08:46:55.5858255Z       %32 = arith.subi %c4_i32, %31 : i32
2026-02-21T08:46:55.5858384Z       %33 = arith.minsi %32, %c4_i32 : i32
2026-02-21T08:46:55.5858525Z       %34 = arith.remsi %arg3, %c28_i32 : i32
2026-02-21T08:46:55.5858658Z       %35 = arith.remsi %34, %33 : i32
2026-02-21T08:46:55.5858787Z       %36 = arith.addi %31, %35 : i32
2026-02-21T08:46:55.5858918Z       %37 = arith.divsi %34, %33 : i32
2026-02-21T08:46:55.5859047Z       %38 = arith.muli %36, %c16_i32 : i32
2026-02-21T08:46:55.5859243Z       %39 = tt.splat %38 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:55.5859503Z       %40 = arith.addi %39, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:55.5859703Z       %41 = arith.muli %37, %c1024_i32 : i32
2026-02-21T08:46:55.5859960Z       %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:46:55.5860253Z       %43 = arith.muli %42, %cst_13 : tensor<16x1xi32, #blocked1>
2026-02-21T08:46:55.5860470Z       %44 = tt.broadcast %43 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5860669Z       %45 = arith.extsi %41 : i32 to i64
2026-02-21T08:46:55.5860866Z       %46 = tt.splat %45 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:55.5861122Z       %47 = arith.addi %46, %7 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:55.5861446Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5861787Z       %49 = arith.cmpi sge, %48, %cst_11 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5861990Z       %50 = arith.cmpi slt, %48, %cst_12 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5862184Z       %51 = arith.andi %49, %50 : tensor<1x1024xi1, #blocked2>
2026-02-21T08:46:55.5862413Z       %52 = tt.addptr %5, %48 : tensor<1x1024x!tt.ptr<i8>, #blocked2>, tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5862718Z       %53 = tt.load %52, %51, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:55.5862973Z       %54 = arith.addi %48, %cst_12 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5863200Z       %55 = tt.addptr %5, %54 : tensor<1x1024x!tt.ptr<i8>, #blocked2>, tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5863492Z       %56 = tt.load %55, %51, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:55.5863969Z       %57:3 = scf.for %arg4 = %c0_i32 to %c4094_i32 step %c1_i32 iter_args(%arg5 = %cst_6, %arg6 = %53, %arg7 = %56) -> (tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2>)  : i32 {
2026-02-21T08:46:55.5864320Z         %314 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:46:55.5864493Z         %315 = tt.splat %314 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5864719Z         %316 = arith.addi %315, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5864997Z         %317 = tt.expand_dims %316 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:46:55.5865272Z         %318 = tt.broadcast %317 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5865467Z         %319 = arith.addi %44, %318 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5865691Z         %320 = tt.addptr %4, %319 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5865903Z         %321 = tt.load %320 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:55.5866168Z         %322 = ttg.convert_layout %321 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5866606Z         %323 = arith.extf %322 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5866892Z         %324 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T08:46:55.5867013Z         %325 = arith.extsi %324 : i32 to i64
2026-02-21T08:46:55.5867136Z         %326 = arith.muli %325, %c7168_i64 : i64
2026-02-21T08:46:55.5867276Z         %327 = tt.splat %326 : i64 -> tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5867439Z         %328 = arith.addi %327, %48 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5867639Z         %329 = tt.addptr %5, %328 : tensor<1x1024x!tt.ptr<i8>, #blocked2>, tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5867853Z         %330 = tt.load %329, %51, %cst_14 : tensor<1x1024x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:55.5868037Z         %331 = arith.shli %arg6, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5868208Z         %332 = arith.shrsi %331, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5868463Z         %333 = ttg.convert_layout %332 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5868716Z         %334 = arith.shrsi %arg6, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5868966Z         %335 = ttg.convert_layout %334 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5869308Z         %336 = tt.expand_dims %333 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5869687Z         %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5869980Z         %338 = tt.broadcast %336 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5870236Z         %339 = arith.select %12, %338, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5870483Z         %340 = tt.broadcast %337 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5870732Z         %341 = arith.select %14, %340, %339 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5870974Z         %342 = tt.reshape %341 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3>
2026-02-21T08:46:55.5871210Z         %343 = arith.sitofp %342 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3>
2026-02-21T08:46:55.5871471Z         %344 = ttg.local_alloc %343 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem>
2026-02-21T08:46:55.5871806Z         %345 = ttg.local_load %344 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5872293Z         %346 = tt.dot %323, %345, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5872737Z         scf.yield %346, %arg7, %330 : tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5872980Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T08:46:55.5873178Z       %58 = arith.addi %3, %cst_9 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5873452Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:46:55.5873731Z       %60 = tt.broadcast %59 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5873921Z       %61 = arith.addi %44, %60 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5874157Z       %62 = tt.addptr %4, %61 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5886029Z       %63 = tt.load %62 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:55.5886308Z       %64 = ttg.convert_layout %63 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5886715Z       %65 = arith.extf %64 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5887025Z       %66 = arith.shli %57#1, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5887200Z       %67 = arith.shrsi %66, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5887458Z       %68 = ttg.convert_layout %67 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5887711Z       %69 = arith.shrsi %57#1, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5887960Z       %70 = ttg.convert_layout %69 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5888302Z       %71 = tt.expand_dims %68 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5888658Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5888947Z       %73 = tt.broadcast %71 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5889190Z       %74 = arith.select %12, %73, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5889433Z       %75 = tt.broadcast %72 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5889745Z       %76 = arith.select %14, %75, %74 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5889980Z       %77 = tt.reshape %76 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3>
2026-02-21T08:46:55.5890208Z       %78 = arith.sitofp %77 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3>
2026-02-21T08:46:55.5890459Z       %79 = ttg.local_alloc %78 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem>
2026-02-21T08:46:55.5890792Z       %80 = ttg.local_load %79 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5891267Z       %81 = tt.dot %65, %80, %57#0, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5891668Z       %82 = arith.addi %3, %cst_10 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5891947Z       %83 = tt.expand_dims %82 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:46:55.5892220Z       %84 = tt.broadcast %83 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5892415Z       %85 = arith.addi %44, %84 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5892611Z       %86 = tt.addptr %4, %85 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5892814Z       %87 = tt.load %86 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:55.5893076Z       %88 = ttg.convert_layout %87 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5893468Z       %89 = arith.extf %88 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5893776Z       %90 = arith.shli %57#2, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5893942Z       %91 = arith.shrsi %90, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5894233Z       %92 = ttg.convert_layout %91 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5894484Z       %93 = arith.shrsi %57#2, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5894730Z       %94 = ttg.convert_layout %93 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5895067Z       %95 = tt.expand_dims %92 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5895410Z       %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5895692Z       %97 = tt.broadcast %95 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5895937Z       %98 = arith.select %12, %97, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5896177Z       %99 = tt.broadcast %96 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5896413Z       %100 = arith.select %14, %99, %98 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5896651Z       %101 = tt.reshape %100 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3>
2026-02-21T08:46:55.5896887Z       %102 = arith.sitofp %101 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3>
2026-02-21T08:46:55.5897149Z       %103 = ttg.local_alloc %102 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem>
2026-02-21T08:46:55.5897481Z       %104 = ttg.local_load %103 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5897992Z       %105 = tt.dot %89, %104, %81, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5898387Z       %106 = arith.truncf %105 : tensor<16x1024xf32, #mma> to tensor<16x1024xbf16, #mma>
2026-02-21T08:46:55.5898563Z       %107 = arith.extsi %38 : i32 to i64
2026-02-21T08:46:55.5898733Z       %108 = tt.splat %107 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:55.5898943Z       %109 = arith.addi %108, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:55.5899210Z       %110 = tt.expand_dims %109 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5899456Z       %111 = arith.muli %110, %cst_1 : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5899636Z       %112 = tt.broadcast %111 : tensor<16x1xi64, #mma> -> tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5899853Z       %113 = tt.splat %45 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:55.5900071Z       %114 = arith.addi %113, %18 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:55.5900349Z       %115 = tt.expand_dims %114 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5900625Z       %116 = tt.broadcast %115 : tensor<1x1024xi64, #mma> -> tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5900813Z       %117 = arith.addi %112, %116 : tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5901014Z       %118 = tt.addptr %15, %117 : tensor<16x1024x!tt.ptr<bf16>, #mma>, tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5901217Z       %119 = arith.cmpi sge, %110, %cst_0 : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5901387Z       %120 = arith.cmpi slt, %110, %cst : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5901543Z       %121 = arith.andi %119, %120 : tensor<16x1xi1, #mma>
2026-02-21T08:46:55.5901722Z       %122 = tt.broadcast %121 : tensor<16x1xi1, #mma> -> tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5901945Z       %123 = arith.cmpi sge, %115, %cst_3 : tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5902111Z       %124 = arith.cmpi slt, %115, %cst_2 : tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5902276Z       %125 = arith.andi %123, %124 : tensor<1x1024xi1, #mma>
2026-02-21T08:46:55.5902451Z       %126 = tt.broadcast %125 : tensor<1x1024xi1, #mma> -> tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5902636Z       %127 = arith.andi %122, %126 : tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5902804Z       tt.store %118, %106, %127 : tensor<16x1024x!tt.ptr<bf16>, #mma>
2026-02-21T08:46:55.5902957Z       %128 = arith.addi %arg3, %c304_i32 : i32
2026-02-21T08:46:55.5903085Z       %129 = arith.divsi %128, %c28_i32 : i32
2026-02-21T08:46:55.5903205Z       %130 = arith.muli %129, %c4_i32 : i32
2026-02-21T08:46:55.5903326Z       %131 = arith.subi %c4_i32, %130 : i32
2026-02-21T08:46:55.5903444Z       %132 = arith.minsi %131, %c4_i32 : i32
2026-02-21T08:46:55.5903567Z       %133 = arith.remsi %128, %c28_i32 : i32
2026-02-21T08:46:55.5903686Z       %134 = arith.remsi %133, %132 : i32
2026-02-21T08:46:55.5903806Z       %135 = arith.addi %130, %134 : i32
2026-02-21T08:46:55.5903922Z       %136 = arith.divsi %133, %132 : i32
2026-02-21T08:46:55.5904036Z       %137 = arith.muli %135, %c16_i32 : i32
2026-02-21T08:46:55.5904211Z       %138 = tt.splat %137 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:55.5904440Z       %139 = arith.addi %138, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:55.5904620Z       %140 = arith.muli %136, %c1024_i32 : i32
2026-02-21T08:46:55.5904847Z       %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:46:55.5905107Z       %142 = arith.muli %141, %cst_13 : tensor<16x1xi32, #blocked1>
2026-02-21T08:46:55.5905350Z       %143 = tt.broadcast %142 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5905525Z       %144 = arith.extsi %140 : i32 to i64
2026-02-21T08:46:55.5905702Z       %145 = tt.splat %144 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:55.5905934Z       %146 = arith.addi %145, %7 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:55.5906219Z       %147 = tt.expand_dims %146 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5906487Z       %148 = arith.cmpi sge, %147, %cst_11 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5906665Z       %149 = arith.cmpi slt, %147, %cst_12 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5906835Z       %150 = arith.andi %148, %149 : tensor<1x1024xi1, #blocked2>
2026-02-21T08:46:55.5907039Z       %151 = tt.addptr %5, %147 : tensor<1x1024x!tt.ptr<i8>, #blocked2>, tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5907305Z       %152 = tt.load %151, %150, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:55.5907535Z       %153 = arith.addi %147, %cst_12 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5907741Z       %154 = tt.addptr %5, %153 : tensor<1x1024x!tt.ptr<i8>, #blocked2>, tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5908000Z       %155 = tt.load %154, %150, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:55.5908412Z       %156:3 = scf.for %arg4 = %c0_i32 to %c4094_i32 step %c1_i32 iter_args(%arg5 = %cst_6, %arg6 = %152, %arg7 = %155) -> (tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2>)  : i32 {
2026-02-21T08:46:55.5908739Z         %314 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:46:55.5908916Z         %315 = tt.splat %314 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5909143Z         %316 = arith.addi %315, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5909421Z         %317 = tt.expand_dims %316 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:46:55.5909776Z         %318 = tt.broadcast %317 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5909970Z         %319 = arith.addi %143, %318 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5910170Z         %320 = tt.addptr %4, %319 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5910377Z         %321 = tt.load %320 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:55.5910644Z         %322 = ttg.convert_layout %321 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5911055Z         %323 = arith.extf %322 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5911336Z         %324 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T08:46:55.5911464Z         %325 = arith.extsi %324 : i32 to i64
2026-02-21T08:46:55.5911587Z         %326 = arith.muli %325, %c7168_i64 : i64
2026-02-21T08:46:55.5911730Z         %327 = tt.splat %326 : i64 -> tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5911894Z         %328 = arith.addi %327, %147 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5912094Z         %329 = tt.addptr %5, %328 : tensor<1x1024x!tt.ptr<i8>, #blocked2>, tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5912312Z         %330 = tt.load %329, %150, %cst_14 : tensor<1x1024x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:55.5912495Z         %331 = arith.shli %arg6, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5912666Z         %332 = arith.shrsi %331, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5912951Z         %333 = ttg.convert_layout %332 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5913206Z         %334 = arith.shrsi %arg6, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5913458Z         %335 = ttg.convert_layout %334 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5913798Z         %336 = tt.expand_dims %333 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5914168Z         %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5914460Z         %338 = tt.broadcast %336 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5914710Z         %339 = arith.select %12, %338, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5914959Z         %340 = tt.broadcast %337 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5915202Z         %341 = arith.select %14, %340, %339 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5915444Z         %342 = tt.reshape %341 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3>
2026-02-21T08:46:55.5915677Z         %343 = arith.sitofp %342 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3>
2026-02-21T08:46:55.5915935Z         %344 = ttg.local_alloc %343 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem>
2026-02-21T08:46:55.5916265Z         %345 = ttg.local_load %344 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5916742Z         %346 = tt.dot %323, %345, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5917184Z         scf.yield %346, %arg7, %330 : tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5922415Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T08:46:55.5922644Z       %157 = arith.addi %143, %60 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5922844Z       %158 = tt.addptr %4, %157 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5923049Z       %159 = tt.load %158 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:55.5923315Z       %160 = ttg.convert_layout %159 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5923713Z       %161 = arith.extf %160 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5924017Z       %162 = arith.shli %156#1, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5924190Z       %163 = arith.shrsi %162, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5924451Z       %164 = ttg.convert_layout %163 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5924706Z       %165 = arith.shrsi %156#1, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5924955Z       %166 = ttg.convert_layout %165 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5925300Z       %167 = tt.expand_dims %164 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5925638Z       %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5925930Z       %169 = tt.broadcast %167 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5926225Z       %170 = arith.select %12, %169, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5926470Z       %171 = tt.broadcast %168 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5926712Z       %172 = arith.select %14, %171, %170 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5926947Z       %173 = tt.reshape %172 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3>
2026-02-21T08:46:55.5927181Z       %174 = arith.sitofp %173 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3>
2026-02-21T08:46:55.5927440Z       %175 = ttg.local_alloc %174 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem>
2026-02-21T08:46:55.5927768Z       %176 = ttg.local_load %175 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5928253Z       %177 = tt.dot %161, %176, %156#0, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5928620Z       %178 = arith.addi %143, %84 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5928816Z       %179 = tt.addptr %4, %178 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5929021Z       %180 = tt.load %179 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:55.5929285Z       %181 = ttg.convert_layout %180 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5929680Z       %182 = arith.extf %181 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5929988Z       %183 = arith.shli %156#2, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5930159Z       %184 = arith.shrsi %183, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5930411Z       %185 = ttg.convert_layout %184 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5930707Z       %186 = arith.shrsi %156#2, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5930959Z       %187 = ttg.convert_layout %186 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5931300Z       %188 = tt.expand_dims %185 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5931640Z       %189 = tt.expand_dims %187 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5931929Z       %190 = tt.broadcast %188 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5932177Z       %191 = arith.select %12, %190, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5932417Z       %192 = tt.broadcast %189 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5932658Z       %193 = arith.select %14, %192, %191 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5932893Z       %194 = tt.reshape %193 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3>
2026-02-21T08:46:55.5933122Z       %195 = arith.sitofp %194 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3>
2026-02-21T08:46:55.5933380Z       %196 = ttg.local_alloc %195 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem>
2026-02-21T08:46:55.5933706Z       %197 = ttg.local_load %196 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5934209Z       %198 = tt.dot %182, %197, %177, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5934600Z       %199 = arith.truncf %198 : tensor<16x1024xf32, #mma> to tensor<16x1024xbf16, #mma>
2026-02-21T08:46:55.5934773Z       %200 = arith.extsi %137 : i32 to i64
2026-02-21T08:46:55.5934936Z       %201 = tt.splat %200 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:55.5935144Z       %202 = arith.addi %201, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:55.5935404Z       %203 = tt.expand_dims %202 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5935644Z       %204 = arith.muli %203, %cst_1 : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5935820Z       %205 = tt.broadcast %204 : tensor<16x1xi64, #mma> -> tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5936031Z       %206 = tt.splat %144 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:55.5936245Z       %207 = arith.addi %206, %18 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:55.5936517Z       %208 = tt.expand_dims %207 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5936782Z       %209 = tt.broadcast %208 : tensor<1x1024xi64, #mma> -> tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5936966Z       %210 = arith.addi %205, %209 : tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5937161Z       %211 = tt.addptr %15, %210 : tensor<16x1024x!tt.ptr<bf16>, #mma>, tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5937360Z       %212 = arith.cmpi sge, %203, %cst_0 : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5937526Z       %213 = arith.cmpi slt, %203, %cst : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5937678Z       %214 = arith.andi %212, %213 : tensor<16x1xi1, #mma>
2026-02-21T08:46:55.5937851Z       %215 = tt.broadcast %214 : tensor<16x1xi1, #mma> -> tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5938042Z       %216 = arith.cmpi sge, %208, %cst_3 : tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5938207Z       %217 = arith.cmpi slt, %208, %cst_2 : tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5938404Z       %218 = arith.andi %216, %217 : tensor<1x1024xi1, #mma>
2026-02-21T08:46:55.5938578Z       %219 = tt.broadcast %218 : tensor<1x1024xi1, #mma> -> tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5938761Z       %220 = arith.andi %215, %219 : tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5938921Z       tt.store %211, %199, %220 : tensor<16x1024x!tt.ptr<bf16>, #mma>
2026-02-21T08:46:55.5939075Z       %221 = arith.addi %arg3, %c608_i32 : i32
2026-02-21T08:46:55.5939201Z       %222 = arith.divsi %221, %c28_i32 : i32
2026-02-21T08:46:55.5939319Z       %223 = arith.muli %222, %c4_i32 : i32
2026-02-21T08:46:55.5939437Z       %224 = arith.subi %c4_i32, %223 : i32
2026-02-21T08:46:55.5939553Z       %225 = arith.minsi %224, %c4_i32 : i32
2026-02-21T08:46:55.5939673Z       %226 = arith.remsi %221, %c28_i32 : i32
2026-02-21T08:46:55.5939789Z       %227 = arith.remsi %226, %225 : i32
2026-02-21T08:46:55.5939904Z       %228 = arith.addi %223, %227 : i32
2026-02-21T08:46:55.5940021Z       %229 = arith.divsi %226, %225 : i32
2026-02-21T08:46:55.5940133Z       %230 = arith.muli %228, %c16_i32 : i32
2026-02-21T08:46:55.5940303Z       %231 = tt.splat %230 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:55.5940524Z       %232 = arith.addi %231, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:55.5940699Z       %233 = arith.muli %229, %c1024_i32 : i32
2026-02-21T08:46:55.5940921Z       %234 = tt.expand_dims %232 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:46:55.5941174Z       %235 = arith.muli %234, %cst_13 : tensor<16x1xi32, #blocked1>
2026-02-21T08:46:55.5941392Z       %236 = tt.broadcast %235 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5941598Z       %237 = arith.extsi %233 : i32 to i64
2026-02-21T08:46:55.5941771Z       %238 = tt.splat %237 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:55.5942003Z       %239 = arith.addi %238, %7 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:55.5942286Z       %240 = tt.expand_dims %239 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5942548Z       %241 = arith.cmpi sge, %240, %cst_11 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5942725Z       %242 = arith.cmpi slt, %240, %cst_12 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5942896Z       %243 = arith.andi %241, %242 : tensor<1x1024xi1, #blocked2>
2026-02-21T08:46:55.5943098Z       %244 = tt.addptr %5, %240 : tensor<1x1024x!tt.ptr<i8>, #blocked2>, tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5943365Z       %245 = tt.load %244, %243, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:55.5943591Z       %246 = arith.addi %240, %cst_12 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5943796Z       %247 = tt.addptr %5, %246 : tensor<1x1024x!tt.ptr<i8>, #blocked2>, tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5944057Z       %248 = tt.load %247, %243, %cst_14 {amd.pipeliner_part = "prologue"} : tensor<1x1024x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:55.5944468Z       %249:3 = scf.for %arg4 = %c0_i32 to %c4094_i32 step %c1_i32 iter_args(%arg5 = %cst_6, %arg6 = %245, %arg7 = %248) -> (tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2>)  : i32 {
2026-02-21T08:46:55.5944794Z         %314 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:46:55.5944971Z         %315 = tt.splat %314 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5945195Z         %316 = arith.addi %315, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5945478Z         %317 = tt.expand_dims %316 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:46:55.5945790Z         %318 = tt.broadcast %317 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5945984Z         %319 = arith.addi %236, %318 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5946181Z         %320 = tt.addptr %4, %319 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5946381Z         %321 = tt.load %320 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:55.5946648Z         %322 = ttg.convert_layout %321 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5947053Z         %323 = arith.extf %322 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5947332Z         %324 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T08:46:55.5947459Z         %325 = arith.extsi %324 : i32 to i64
2026-02-21T08:46:55.5947579Z         %326 = arith.muli %325, %c7168_i64 : i64
2026-02-21T08:46:55.5947725Z         %327 = tt.splat %326 : i64 -> tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5947889Z         %328 = arith.addi %327, %240 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5948089Z         %329 = tt.addptr %5, %328 : tensor<1x1024x!tt.ptr<i8>, #blocked2>, tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5948306Z         %330 = tt.load %329, %243, %cst_14 : tensor<1x1024x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:55.5948486Z         %331 = arith.shli %arg6, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5948656Z         %332 = arith.shrsi %331, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5948910Z         %333 = ttg.convert_layout %332 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5949197Z         %334 = arith.shrsi %arg6, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5949450Z         %335 = ttg.convert_layout %334 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5949790Z         %336 = tt.expand_dims %333 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5950140Z         %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5950433Z         %338 = tt.broadcast %336 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5950684Z         %339 = arith.select %12, %338, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5950931Z         %340 = tt.broadcast %337 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5951174Z         %341 = arith.select %14, %340, %339 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5951414Z         %342 = tt.reshape %341 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3>
2026-02-21T08:46:55.5951648Z         %343 = arith.sitofp %342 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3>
2026-02-21T08:46:55.5951906Z         %344 = ttg.local_alloc %343 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem>
2026-02-21T08:46:55.5952241Z         %345 = ttg.local_load %344 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5952725Z         %346 = tt.dot %323, %345, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5953180Z         scf.yield %346, %arg7, %330 : tensor<16x1024xf32, #mma>, tensor<1x1024xi8, #blocked2>, tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5953423Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T08:46:55.5953616Z       %250 = arith.addi %236, %60 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5953818Z       %251 = tt.addptr %4, %250 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5954023Z       %252 = tt.load %251 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:55.5954285Z       %253 = ttg.convert_layout %252 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5954691Z       %254 = arith.extf %253 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5954995Z       %255 = arith.shli %249#1, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5955166Z       %256 = arith.shrsi %255, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5955419Z       %257 = ttg.convert_layout %256 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5955671Z       %258 = arith.shrsi %249#1, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5955920Z       %259 = ttg.convert_layout %258 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5956257Z       %260 = tt.expand_dims %257 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5956598Z       %261 = tt.expand_dims %259 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5956887Z       %262 = tt.broadcast %260 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5957131Z       %263 = arith.select %12, %262, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5957410Z       %264 = tt.broadcast %261 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5957648Z       %265 = arith.select %14, %264, %263 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5957884Z       %266 = tt.reshape %265 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3>
2026-02-21T08:46:55.5958115Z       %267 = arith.sitofp %266 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3>
2026-02-21T08:46:55.5958369Z       %268 = ttg.local_alloc %267 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem>
2026-02-21T08:46:55.5958696Z       %269 = ttg.local_load %268 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5959175Z       %270 = tt.dot %254, %269, %249#0, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5959537Z       %271 = arith.addi %236, %84 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5959738Z       %272 = tt.addptr %4, %271 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5959937Z       %273 = tt.load %272 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:55.5960199Z       %274 = ttg.convert_layout %273 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5960593Z       %275 = arith.extf %274 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5960893Z       %276 = arith.shli %249#2, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5961062Z       %277 = arith.shrsi %276, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5961311Z       %278 = ttg.convert_layout %277 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5961562Z       %279 = arith.shrsi %249#2, %cst_16 : tensor<1x1024xi8, #blocked2>
2026-02-21T08:46:55.5961837Z       %280 = ttg.convert_layout %279 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5962172Z       %281 = tt.expand_dims %278 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5962513Z       %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5962834Z       %283 = tt.broadcast %281 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5963081Z       %284 = arith.select %12, %283, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5963329Z       %285 = tt.broadcast %282 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5963565Z       %286 = arith.select %14, %285, %284 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5963803Z       %287 = tt.reshape %286 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3>
2026-02-21T08:46:55.5964030Z       %288 = arith.sitofp %287 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3>
2026-02-21T08:46:55.5964287Z       %289 = ttg.local_alloc %288 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem>
2026-02-21T08:46:55.5964615Z       %290 = ttg.local_load %289 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5965082Z       %291 = tt.dot %275, %290, %270, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5965506Z       %292 = arith.truncf %291 : tensor<16x1024xf32, #mma> to tensor<16x1024xbf16, #mma>
2026-02-21T08:46:55.5965684Z       %293 = arith.extsi %230 : i32 to i64
2026-02-21T08:46:55.5965844Z       %294 = tt.splat %293 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:55.5966051Z       %295 = arith.addi %294, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:55.5966310Z       %296 = tt.expand_dims %295 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5966546Z       %297 = arith.muli %296, %cst_1 : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5966722Z       %298 = tt.broadcast %297 : tensor<16x1xi64, #mma> -> tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5966932Z       %299 = tt.splat %237 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:55.5967144Z       %300 = arith.addi %299, %18 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:55.5967413Z       %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5967680Z       %302 = tt.broadcast %301 : tensor<1x1024xi64, #mma> -> tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5967864Z       %303 = arith.addi %298, %302 : tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5968055Z       %304 = tt.addptr %15, %303 : tensor<16x1024x!tt.ptr<bf16>, #mma>, tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5968257Z       %305 = arith.cmpi sge, %296, %cst_0 : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5968418Z       %306 = arith.cmpi slt, %296, %cst : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5968573Z       %307 = arith.andi %305, %306 : tensor<16x1xi1, #mma>
2026-02-21T08:46:55.5968765Z       %308 = tt.broadcast %307 : tensor<16x1xi1, #mma> -> tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5968949Z       %309 = arith.cmpi sge, %301, %cst_3 : tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5969120Z       %310 = arith.cmpi slt, %301, %cst_2 : tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5969276Z       %311 = arith.andi %309, %310 : tensor<1x1024xi1, #mma>
2026-02-21T08:46:55.5969497Z       %312 = tt.broadcast %311 : tensor<1x1024xi1, #mma> -> tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5969678Z       %313 = arith.andi %308, %312 : tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5969837Z       tt.store %304, %292, %313 : tensor<16x1024x!tt.ptr<bf16>, #mma>
2026-02-21T08:46:55.5969990Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:46:55.5970112Z     %25 = arith.subi %c28_i32, %24 : i32
2026-02-21T08:46:55.5970234Z     %26 = arith.ceildivsi %25, %c304_i32 : i32
2026-02-21T08:46:55.5970355Z     %27 = arith.muli %26, %c4096_i32 : i32
2026-02-21T08:46:55.5970468Z     %28 = arith.subi %24, %c304_i32 : i32
2026-02-21T08:46:55.5970970Z     %29:9 = scf.for %arg3 = %c0_i32 to %27 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %28, %arg6 = %c0_i32, %arg7 = %cst_6, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_7, %arg11 = %cst_11, %arg12 = %cst_8) -> (i32, i32, i32, tensor<16x1024xf32, #mma>, i32, i32, tensor<16x2xi32, #blocked1>, tensor<1x1024xi64, #blocked2>, tensor<1x1024xi1, #blocked2>)  : i32 {
2026-02-21T08:46:55.5971473Z       %30 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T08:46:55.5971599Z       %31 = arith.cmpi eq, %arg4, %c4095_i32 : i32
2026-02-21T08:46:55.5971726Z       %32 = arith.select %31, %c0_i32, %30 : i32
2026-02-21T08:46:55.5971852Z       %33 = arith.cmpi eq, %32, %c0_i32 : i32
2026-02-21T08:46:55.5971974Z       %34 = arith.select %33, %c0_i32, %arg6 : i32
2026-02-21T08:46:55.5972204Z       %35:6 = scf.if %33 -> (i32, i32, tensor<16x2xi32, #blocked1>, tensor<1x1024xi64, #blocked2>, tensor<1x1024xi1, #blocked2>, i32) {
2026-02-21T08:46:55.5972438Z         %75 = arith.addi %arg5, %c304_i32 : i32
2026-02-21T08:46:55.5972560Z         %76 = arith.divsi %75, %c28_i32 : i32
2026-02-21T08:46:55.5972676Z         %77 = arith.muli %76, %c4_i32 : i32
2026-02-21T08:46:55.5972828Z         %78 = arith.subi %c4_i32, %77 : i32
2026-02-21T08:46:55.5972942Z         %79 = arith.minsi %78, %c4_i32 : i32
2026-02-21T08:46:55.5973055Z         %80 = arith.remsi %75, %c28_i32 : i32
2026-02-21T08:46:55.5973171Z         %81 = arith.remsi %80, %79 : i32
2026-02-21T08:46:55.5973284Z         %82 = arith.addi %77, %81 : i32
2026-02-21T08:46:55.5973392Z         %83 = arith.divsi %80, %79 : i32
2026-02-21T08:46:55.5973504Z         %84 = arith.muli %82, %c16_i32 : i32
2026-02-21T08:46:55.5973672Z         %85 = tt.splat %84 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:55.5973890Z         %86 = arith.addi %85, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:46:55.5974060Z         %87 = arith.muli %83, %c1024_i32 : i32
2026-02-21T08:46:55.5974283Z         %88 = tt.expand_dims %86 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:46:55.5974540Z         %89 = arith.muli %88, %cst_13 : tensor<16x1xi32, #blocked1>
2026-02-21T08:46:55.5974730Z         %90 = tt.broadcast %89 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5974906Z         %91 = arith.extsi %87 : i32 to i64
2026-02-21T08:46:55.5975073Z         %92 = tt.splat %91 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:55.5975297Z         %93 = arith.addi %92, %7 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:46:55.5975583Z         %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5975844Z         %95 = arith.cmpi sge, %94, %cst_11 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5976021Z         %96 = arith.cmpi slt, %94, %cst_12 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5976185Z         %97 = arith.andi %95, %96 : tensor<1x1024xi1, #blocked2>
2026-02-21T08:46:55.5976446Z         scf.yield %84, %87, %90, %94, %97, %75 : i32, i32, tensor<16x2xi32, #blocked1>, tensor<1x1024xi64, #blocked2>, tensor<1x1024xi1, #blocked2>, i32
2026-02-21T08:46:55.5976682Z       } else {
2026-02-21T08:46:55.5976954Z         scf.yield %arg8, %arg9, %arg10, %arg11, %arg12, %arg5 : i32, i32, tensor<16x2xi32, #blocked1>, tensor<1x1024xi64, #blocked2>, tensor<1x1024xi1, #blocked2>, i32
2026-02-21T08:46:55.5977215Z       }
2026-02-21T08:46:55.5977298Z       %36 = arith.muli %34, %c2_i32 : i32
2026-02-21T08:46:55.5977466Z       %37 = tt.splat %36 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5977679Z       %38 = arith.addi %37, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:46:55.5977948Z       %39 = tt.expand_dims %38 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:46:55.5978217Z       %40 = tt.broadcast %39 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5978405Z       %41 = arith.addi %35#2, %40 : tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5978601Z       %42 = tt.addptr %4, %41 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T08:46:55.5978799Z       %43 = tt.load %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:46:55.5979061Z       %44 = ttg.convert_layout %43 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5979455Z       %45 = arith.extf %44 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5979729Z       %46 = arith.extsi %34 : i32 to i64
2026-02-21T08:46:55.5979849Z       %47 = arith.muli %46, %c7168_i64 : i64
2026-02-21T08:46:55.5979982Z       %48 = tt.splat %47 : i64 -> tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5980144Z       %49 = arith.addi %48, %35#3 : tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5980377Z       %50 = tt.addptr %5, %49 : tensor<1x1024x!tt.ptr<i8>, #blocked2>, tensor<1x1024xi64, #blocked2>
2026-02-21T08:46:55.5980561Z       %51 = arith.cmpi sge, %46, %c0_i64 : i64
2026-02-21T08:46:55.5980691Z       %52 = arith.cmpi slt, %46, %c4096_i64 : i64
2026-02-21T08:46:55.5980812Z       %53 = arith.andi %51, %52 : i1
2026-02-21T08:46:55.5980945Z       %54 = tt.splat %53 : i1 -> tensor<1x1024xi1, #blocked2>
2026-02-21T08:46:55.5981099Z       %55 = arith.andi %54, %35#4 : tensor<1x1024xi1, #blocked2>
2026-02-21T08:46:55.5981265Z       %56 = tt.load %50, %55, %cst_14 : tensor<1x1024x!tt.ptr<i8>, #blocked2>
2026-02-21T08:46:55.5981527Z       %57 = ttg.convert_layout %56 : tensor<1x1024xi8, #blocked2> -> tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5981808Z       %58 = arith.shli %57, %cst_17 : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5982046Z       %59 = arith.shrsi %58, %cst_17 : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5982283Z       %60 = arith.shrsi %57, %cst_17 : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:46:55.5982577Z       %61 = tt.expand_dims %59 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5982917Z       %62 = tt.expand_dims %60 {axis = 1 : i32} : tensor<1x1024xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x1024xi8, #blocked>
2026-02-21T08:46:55.5983200Z       %63 = tt.broadcast %61 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5983443Z       %64 = arith.select %12, %63, %cst_15 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5983681Z       %65 = tt.broadcast %62 : tensor<1x1x1024xi8, #blocked> -> tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5983917Z       %66 = arith.select %14, %65, %64 : tensor<1x2x1024xi1, #blocked>, tensor<1x2x1024xi8, #blocked>
2026-02-21T08:46:55.5984150Z       %67 = tt.reshape %66 : tensor<1x2x1024xi8, #blocked> -> tensor<2x1024xi8, #blocked3>
2026-02-21T08:46:55.5984373Z       %68 = arith.sitofp %67 : tensor<2x1024xi8, #blocked3> to tensor<2x1024xf32, #blocked3>
2026-02-21T08:46:55.5984663Z       %69 = ttg.local_alloc %68 : (tensor<2x1024xf32, #blocked3>) -> !ttg.memdesc<2x1024xf32, #shared, #smem>
2026-02-21T08:46:55.5984988Z       %70 = ttg.local_load %69 : !ttg.memdesc<2x1024xf32, #shared, #smem> -> tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:46:55.5985465Z       %71 = tt.dot %45, %70, %arg7, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x1024xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5985814Z       %72 = arith.addi %34, %c1_i32 : i32
2026-02-21T08:46:55.5985935Z       %73 = arith.cmpi eq, %32, %c4095_i32 : i32
2026-02-21T08:46:55.5986083Z       %74 = arith.select %73, %cst_6, %71 : tensor<16x1024xf32, #mma>
2026-02-21T08:46:55.5986223Z       scf.if %73 {
2026-02-21T08:46:55.5986363Z         %75 = arith.truncf %71 : tensor<16x1024xf32, #mma> to tensor<16x1024xbf16, #mma>
2026-02-21T08:46:55.5986538Z         %76 = arith.extsi %35#0 : i32 to i64
2026-02-21T08:46:55.5986655Z         %77 = arith.extsi %35#1 : i32 to i64
2026-02-21T08:46:55.5986817Z         %78 = tt.splat %76 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:55.5987022Z         %79 = arith.addi %78, %16 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:46:55.5987283Z         %80 = tt.expand_dims %79 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5987518Z         %81 = arith.muli %80, %cst_1 : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5987693Z         %82 = tt.broadcast %81 : tensor<16x1xi64, #mma> -> tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5987904Z         %83 = tt.splat %77 : i64 -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:55.5988148Z         %84 = arith.addi %83, %18 : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:46:55.5988428Z         %85 = tt.expand_dims %84 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5988692Z         %86 = tt.broadcast %85 : tensor<1x1024xi64, #mma> -> tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5988871Z         %87 = arith.addi %82, %86 : tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5989059Z         %88 = tt.addptr %15, %87 : tensor<16x1024x!tt.ptr<bf16>, #mma>, tensor<16x1024xi64, #mma>
2026-02-21T08:46:55.5989256Z         %89 = arith.cmpi sge, %80, %cst_0 : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5989416Z         %90 = arith.cmpi slt, %80, %cst : tensor<16x1xi64, #mma>
2026-02-21T08:46:55.5989566Z         %91 = arith.andi %89, %90 : tensor<16x1xi1, #mma>
2026-02-21T08:46:55.5989732Z         %92 = tt.broadcast %91 : tensor<16x1xi1, #mma> -> tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5989918Z         %93 = arith.cmpi sge, %85, %cst_3 : tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5990080Z         %94 = arith.cmpi slt, %85, %cst_2 : tensor<1x1024xi64, #mma>
2026-02-21T08:46:55.5990240Z         %95 = arith.andi %93, %94 : tensor<1x1024xi1, #mma>
2026-02-21T08:46:55.5990409Z         %96 = tt.broadcast %95 : tensor<1x1024xi1, #mma> -> tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5990586Z         %97 = arith.andi %92, %96 : tensor<16x1024xi1, #mma>
2026-02-21T08:46:55.5990741Z         tt.store %88, %75, %97 : tensor<16x1024x!tt.ptr<bf16>, #mma>
2026-02-21T08:46:55.5990873Z       }
2026-02-21T08:46:55.5991150Z       scf.yield %32, %35#5, %72, %74, %35#0, %35#1, %35#2, %35#3, %35#4 : i32, i32, i32, tensor<16x1024xf32, #mma>, i32, i32, tensor<16x2xi32, #blocked1>, tensor<1x1024xi64, #blocked2>, tensor<1x1024xi1, #blocked2>
2026-02-21T08:46:55.5991452Z     } {tt.num_stages = 3 : i32}
2026-02-21T08:46:55.5991556Z     tt.return
2026-02-21T08:46:55.5991637Z   }
2026-02-21T08:46:55.5991717Z }
2026-02-21T08:46:55.5991760Z 
2026-02-21T08:46:55.5991797Z {-#
2026-02-21T08:46:55.5991875Z   external_resources: {
2026-02-21T08:46:55.5991976Z     mlir_reproducer: {
2026-02-21T08:46:55.5993009Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:46:55.5994014Z       disable_threading: false,
2026-02-21T08:46:55.5994121Z       verify_each: true
2026-02-21T08:46:55.5994209Z     }
2026-02-21T08:46:55.5994283Z   }
2026-02-21T08:46:55.5994352Z #-}
2026-02-21T08:46:55.5994633Z /tmp/torchinductor_root/jv/cjvgrlgzvuwhzkuafdg5p4pazwxtkgt72fjgnqr47zio66tszde6.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:46:55.5995346Z /tmp/torchinductor_root/jv/cjvgrlgzvuwhzkuafdg5p4pazwxtkgt72fjgnqr47zio66tszde6.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:46:55.5995892Z [53s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:46:55.5996728Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 1024], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T08:46:55.5997446Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:46:55.5997610Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:46:55.6803766Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 8.9 configs/s
2026-02-21T08:46:55.6815673Z [53s] Adaptive compile timeout: 30s (90% percentile=14.0s, bounds=[30.0s, 30s])
2026-02-21T08:46:55.6818410Z [53s] Initial random population of 100, 5 starting points: 
2026-02-21T08:46:55.6818600Z error=3
2026-02-21T08:46:55.6818693Z timeout=3
2026-02-21T08:46:55.6818781Z ok=94
2026-02-21T08:46:55.6818865Z min=0.2657
2026-02-21T08:46:55.6818946Z mid=2.9382
2026-02-21T08:46:55.6819029Z max=85.7581
2026-02-21T08:46:55.6819128Z best={'block_sizes': [16, 16, 16],
2026-02-21T08:46:55.6819271Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:46:55.6819413Z  'l2_groupings': [1],
2026-02-21T08:46:55.6819542Z  'load_eviction_policies': ['', ''],
2026-02-21T08:46:55.6819665Z  'loop_orders': [[0, 1]],
2026-02-21T08:46:55.6819790Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:46:55.6819897Z  'num_stages': 1,
2026-02-21T08:46:55.6819992Z  'num_warps': 4,
2026-02-21T08:46:55.6820081Z  'pid_type': 'flat',
2026-02-21T08:46:55.6820192Z  'range_flattens': [None, None],
2026-02-21T08:46:55.6820314Z  'range_multi_buffers': [None, None],
2026-02-21T08:46:55.6820439Z  'range_num_stages': [0, 0],
2026-02-21T08:46:55.6820557Z  'range_unroll_factors': [0, 0],
2026-02-21T08:46:55.6820679Z  'range_warp_specializes': [],
2026-02-21T08:46:55.6820794Z  'waves_per_eu': 1}
2026-02-21T08:46:55.6834099Z [53s] Fitting surrogate: 100 points, 100 targets
2026-02-21T08:46:56.6474352Z [54s] Generation 1 starting: 98 neighbors, 5 active search path(s)
2026-02-21T08:47:17.0937567Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 3.4 configs/s
2026-02-21T08:47:24.1370816Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 102/102 14.8 configs/s
2026-02-21T08:47:26.5906090Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 340.5 configs/s
2026-02-21T08:47:27.2087894Z [85s] Generation 1 complete: 
2026-02-21T08:47:27.2088332Z error=3
2026-02-21T08:47:27.2088535Z ok=101
2026-02-21T08:47:27.2088744Z min=0.1526
2026-02-21T08:47:27.2088956Z mid=0.6797
2026-02-21T08:47:27.2089159Z max=20.1061
2026-02-21T08:47:27.2089398Z best={'block_sizes': [32, 16, 16],
2026-02-21T08:47:27.2089766Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:47:27.2090127Z  'l2_groupings': [1],
2026-02-21T08:47:27.2090411Z  'load_eviction_policies': ['', ''],
2026-02-21T08:47:27.2090724Z  'loop_orders': [[0, 1]],
2026-02-21T08:47:27.2090982Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:47:27.2091225Z  'num_stages': 1,
2026-02-21T08:47:27.2091431Z  'num_warps': 2,
2026-02-21T08:47:27.2091643Z  'pid_type': 'flat',
2026-02-21T08:47:27.2091938Z  'range_flattens': [None, False],
2026-02-21T08:47:27.2092223Z  'range_multi_buffers': [None, None],
2026-02-21T08:47:27.2092529Z  'range_num_stages': [0, 0],
2026-02-21T08:47:27.2092786Z  'range_unroll_factors': [0, 0],
2026-02-21T08:47:27.2093071Z  'range_warp_specializes': [],
2026-02-21T08:47:27.2093322Z  'waves_per_eu': 1}
2026-02-21T08:47:27.2426248Z [85s] Fitting surrogate: 204 points, 204 targets
2026-02-21T08:47:28.3060212Z [86s] Generation 2 starting: 106 neighbors, 5 active search path(s)
2026-02-21T08:47:50.4586098Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 109/109 1.2 configs/s
2026-02-21T08:47:56.2577615Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:47:56.2580686Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:47:56.2581143Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:47:56.2581667Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:47:56.2582148Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:47:56.2582557Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:47:56.2582912Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:47:56.2583174Z #smem = #ttg.shared_memory
2026-02-21T08:47:56.2583510Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:47:56.2584180Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:47:56.2584733Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T08:47:56.2584984Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:56.2585219Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:56.2585470Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T08:47:56.2585727Z     %cst_3 = arith.constant dense<7168> : tensor<2x1xi32, #blocked1>
2026-02-21T08:47:56.2585977Z     %cst_4 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2>
2026-02-21T08:47:56.2586184Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:47:56.2586344Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:47:56.2586511Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:47:56.2586668Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:47:56.2586831Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T08:47:56.2587025Z     %cst_5 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.2587403Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T08:47:56.2587562Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:47:56.2587706Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:47:56.2587838Z     %c38912_i32 = arith.constant 38912 : i32
2026-02-21T08:47:56.2588061Z     %cst_6 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.2588282Z     %0 = tt.get_program_id x : i32
2026-02-21T08:47:56.2588511Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:56.2588842Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:47:56.2589156Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:56.2589459Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:47:56.2589766Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:47:56.2590073Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:56.2590347Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:47:56.2590578Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T08:47:56.2590886Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:47:56.2591407Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:47:56.2591872Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:56.2592168Z     %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:56.2592392Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T08:47:56.2592619Z     %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:56.2592838Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T08:47:56.2593082Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T08:47:56.2593292Z     scf.for %arg3 = %0 to %c448_i32 step %c38912_i32  : i32 {
2026-02-21T08:47:56.2593463Z       %17 = arith.divsi %arg3, %c7168_i32 : i32
2026-02-21T08:47:56.2593608Z       %18 = arith.muli %17, %c64_i32 : i32
2026-02-21T08:47:56.2593750Z       %19 = arith.subi %c4_i32, %18 : i32
2026-02-21T08:47:56.2593881Z       %20 = arith.minsi %19, %c64_i32 : i32
2026-02-21T08:47:56.2594027Z       %21 = arith.remsi %arg3, %c7168_i32 : i32
2026-02-21T08:47:56.2594166Z       %22 = arith.remsi %21, %20 : i32
2026-02-21T08:47:56.2594294Z       %23 = arith.addi %18, %22 : i32
2026-02-21T08:47:56.2594424Z       %24 = arith.divsi %21, %20 : i32
2026-02-21T08:47:56.2594548Z       %25 = arith.muli %23, %c16_i32 : i32
2026-02-21T08:47:56.2594744Z       %26 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:56.2594990Z       %27 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:47:56.2595237Z       %28 = arith.addi %26, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:56.2595473Z       %29 = arith.addi %27, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:47:56.2595661Z       %30 = arith.muli %24, %c64_i32 : i32
2026-02-21T08:47:56.2595858Z       %31 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:56.2596100Z       %32 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:47:56.2596380Z       %33 = arith.addi %31, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:56.2596623Z       %34 = arith.addi %32, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:47:56.2596934Z       %35 = tt.expand_dims %28 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T08:47:56.2597228Z       %36 = arith.muli %35, %cst_4 : tensor<16x1xi32, #blocked2>
2026-02-21T08:47:56.2597444Z       %37 = tt.broadcast %36 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:47:56.2597784Z       %38 = tt.expand_dims %33 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1>
2026-02-21T08:47:56.2598088Z       %39 = tt.broadcast %38 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T08:47:56.2598352Z       %40 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T08:47:56.2598625Z         %49 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:47:56.2598845Z         %50 = arith.addi %49, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:47:56.2599019Z         %51 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:47:56.2599188Z         %52 = tt.splat %51 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:56.2599402Z         %53 = arith.addi %52, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:56.2599677Z         %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:47:56.2599995Z         %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:47:56.2600190Z         %56 = arith.addi %37, %55 : tensor<16x4xi32, #blocked2>
2026-02-21T08:47:56.2600394Z         %57 = tt.addptr %7, %56 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T08:47:56.2600597Z         %58 = tt.load %57 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:47:56.2600868Z         %59 = ttg.convert_layout %58 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.2601268Z         %60 = arith.extf %59 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.2601658Z         %61 = tt.expand_dims %50 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T08:47:56.2601903Z         %62 = arith.muli %61, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T08:47:56.2602094Z         %63 = tt.broadcast %62 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T08:47:56.2602284Z         %64 = arith.addi %63, %39 : tensor<2x64xi32, #blocked1>
2026-02-21T08:47:56.2602476Z         %65 = tt.addptr %8, %64 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T08:47:56.2602729Z         %66 = tt.load %65 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T08:47:56.2602969Z         %67 = ttg.convert_layout %66 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.2603249Z         %68 = arith.shli %67, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.2603482Z         %69 = arith.shrsi %68, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.2603714Z         %70 = arith.shrsi %67, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.2604002Z         %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:56.2604381Z         %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:56.2604654Z         %73 = tt.broadcast %71 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.2604888Z         %74 = arith.select %13, %73, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.2605116Z         %75 = tt.broadcast %72 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.2605341Z         %76 = arith.select %15, %75, %74 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.2605564Z         %77 = tt.reshape %76 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T08:47:56.2605777Z         %78 = arith.sitofp %77 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T08:47:56.2606021Z         %79 = ttg.local_alloc %78 : (tensor<4x64xf32, #blocked3>) -> !ttg.memdesc<4x64xf32, #shared, #smem>
2026-02-21T08:47:56.2606338Z         %80 = ttg.local_load %79 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.2606808Z         %81 = tt.dot %60, %80, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T08:47:56.2607155Z         %82 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T08:47:56.2607338Z         %83 = tt.splat %82 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:47:56.2607558Z         %84 = arith.addi %83, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:47:56.2607727Z         %85 = arith.muli %82, %c2_i32 : i32
2026-02-21T08:47:56.2607929Z         %86 = tt.splat %85 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:56.2608141Z         %87 = arith.addi %86, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:56.2608411Z         %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:47:56.2608683Z         %89 = tt.broadcast %88 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:47:56.2608875Z         %90 = arith.addi %37, %89 : tensor<16x4xi32, #blocked2>
2026-02-21T08:47:56.2609073Z         %91 = tt.addptr %7, %90 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T08:47:56.2609276Z         %92 = tt.load %91 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:47:56.2609535Z         %93 = ttg.convert_layout %92 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.2609933Z         %94 = arith.extf %93 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.2610311Z         %95 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T08:47:56.2610551Z         %96 = arith.muli %95, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T08:47:56.2610741Z         %97 = tt.broadcast %96 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T08:47:56.2610925Z         %98 = arith.addi %97, %39 : tensor<2x64xi32, #blocked1>
2026-02-21T08:47:56.2611121Z         %99 = tt.addptr %8, %98 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T08:47:56.2611317Z         %100 = tt.load %99 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T08:47:56.2611556Z         %101 = ttg.convert_layout %100 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.2611844Z         %102 = arith.shli %101, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.2621016Z         %103 = arith.shrsi %102, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.2621253Z         %104 = arith.shrsi %101, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.2621544Z         %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:56.2621879Z         %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:56.2622164Z         %107 = tt.broadcast %105 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.2622405Z         %108 = arith.select %13, %107, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.2622645Z         %109 = tt.broadcast %106 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.2622880Z         %110 = arith.select %15, %109, %108 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.2623112Z         %111 = tt.reshape %110 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T08:47:56.2623334Z         %112 = arith.sitofp %111 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T08:47:56.2623583Z         %113 = ttg.local_alloc %112 : (tensor<4x64xf32, #blocked3>) -> !ttg.memdesc<4x64xf32, #shared, #smem>
2026-02-21T08:47:56.2623910Z         %114 = ttg.local_load %113 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.2624380Z         %115 = tt.dot %94, %114, %81, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T08:47:56.2624764Z         scf.yield %115 : tensor<16x64xf32, #mma>
2026-02-21T08:47:56.2624897Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:47:56.2625061Z       %41 = arith.truncf %40 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T08:47:56.2625319Z       %42 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T08:47:56.2625562Z       %43 = arith.muli %42, %cst : tensor<16x1xi32, #mma>
2026-02-21T08:47:56.2625782Z       %44 = tt.expand_dims %34 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T08:47:56.2626034Z       %45 = tt.broadcast %43 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T08:47:56.2626229Z       %46 = tt.broadcast %44 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T08:47:56.2626399Z       %47 = arith.addi %45, %46 : tensor<16x64xi32, #mma>
2026-02-21T08:47:56.2626582Z       %48 = tt.addptr %16, %47 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T08:47:56.2626769Z       tt.store %48, %41 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T08:47:56.2626931Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T08:47:56.2627066Z     tt.return
2026-02-21T08:47:56.2627148Z   }
2026-02-21T08:47:56.2627226Z }
2026-02-21T08:47:56.2627272Z 
2026-02-21T08:47:56.2627303Z {-#
2026-02-21T08:47:56.2627387Z   external_resources: {
2026-02-21T08:47:56.2627486Z     mlir_reproducer: {
2026-02-21T08:47:56.2628493Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:47:56.2629532Z       disable_threading: false,
2026-02-21T08:47:56.2629639Z       verify_each: true
2026-02-21T08:47:56.2629732Z     }
2026-02-21T08:47:56.2629806Z   }
2026-02-21T08:47:56.2629876Z #-}
2026-02-21T08:47:56.2630157Z /tmp/torchinductor_root/dz/cdzhwkiurocicad57kyzlaveorf27dlzeuloa2mas23wvubxwhkl.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:47:56.2630845Z /tmp/torchinductor_root/dz/cdzhwkiurocicad57kyzlaveorf27dlzeuloa2mas23wvubxwhkl.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:47:56.2631405Z [114s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:47:56.2632194Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 64], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:47:56.2632906Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:47:56.2633079Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:47:56.5123000Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:47:56.5125871Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}>
2026-02-21T08:47:56.5127102Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:47:56.5128032Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:47:56.5128496Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:47:56.5128954Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:47:56.5129302Z #smem = #ttg.shared_memory
2026-02-21T08:47:56.5129774Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:47:56.5130803Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:47:56.5131605Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T08:47:56.5131831Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:56.5132057Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:56.5132288Z     %cst_2 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T08:47:56.5132484Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:47:56.5132682Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T08:47:56.5132909Z     %cst_4 = arith.constant dense<7168> : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:56.5133140Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:56.5133362Z     %cst_6 = arith.constant dense<4096> : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:56.5133663Z     %cst_7 = arith.constant dense<0> : tensor<1x64xi64, #blocked2>
2026-02-21T08:47:56.5133884Z     %cst_8 = arith.constant dense<7168> : tensor<1x64xi64, #blocked2>
2026-02-21T08:47:56.5134107Z     %cst_9 = arith.constant dense<0> : tensor<2x64xi8, #blocked2>
2026-02-21T08:47:56.5134473Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:47:56.5134642Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:47:56.5134790Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:47:56.5134967Z     %cst_10 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.5135160Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T08:47:56.5135312Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:47:56.5135458Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:47:56.5135691Z     %cst_11 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.5135936Z     %0 = tt.get_program_id x : i32
2026-02-21T08:47:56.5136083Z     %1 = arith.divsi %0, %c7168_i32 : i32
2026-02-21T08:47:56.5136228Z     %2 = arith.muli %1, %c64_i32 : i32
2026-02-21T08:47:56.5136376Z     %3 = arith.subi %c4_i32, %2 : i32
2026-02-21T08:47:56.5136523Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T08:47:56.5136663Z     %5 = arith.remsi %0, %c7168_i32 : i32
2026-02-21T08:47:56.5136809Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T08:47:56.5136950Z     %7 = arith.addi %2, %6 : i32
2026-02-21T08:47:56.5137091Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T08:47:56.5137221Z     %9 = arith.muli %7, %c16_i32 : i32
2026-02-21T08:47:56.5137493Z     %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:47:56.5137843Z     %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:47:56.5138100Z     %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:47:56.5138319Z     %13 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:47:56.5138574Z     %14 = arith.addi %12, %10 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:47:56.5138790Z     %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:47:56.5138959Z     %16 = arith.muli %8, %c64_i32 : i32
2026-02-21T08:47:56.5139154Z     %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:47:56.5139429Z     %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:56.5139671Z     %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:47:56.5139876Z     %20 = arith.addi %19, %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:47:56.5140117Z     %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:56.5140430Z     %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:47:56.5140689Z     %23 = arith.muli %22, %cst_2 : tensor<16x1xi32, #blocked1>
2026-02-21T08:47:56.5140882Z     %24 = tt.broadcast %23 : tensor<16x1xi32, #blocked1> -> tensor<16x4xi32, #blocked1>
2026-02-21T08:47:56.5141120Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:47:56.5141286Z     %26 = arith.extsi %16 : i32 to i64
2026-02-21T08:47:56.5141440Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T08:47:56.5141679Z     %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:56.5142014Z     %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:56.5153318Z     %30 = tt.splat %26 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:56.5153650Z     %31 = arith.extsi %18 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:56.5153956Z     %32 = arith.addi %30, %31 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:56.5154300Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T08:47:56.5154586Z     %34 = tt.broadcast %33 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T08:47:56.5154790Z     %35 = arith.cmpi sge, %33, %cst_7 : tensor<1x64xi64, #blocked2>
2026-02-21T08:47:56.5154958Z     %36 = arith.cmpi slt, %33, %cst_8 : tensor<1x64xi64, #blocked2>
2026-02-21T08:47:56.5155125Z     %37 = arith.andi %35, %36 : tensor<1x64xi1, #blocked2>
2026-02-21T08:47:56.5155304Z     %38 = tt.broadcast %37 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T08:47:56.5155592Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:47:56.5156006Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:47:56.5156407Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:56.5156663Z     %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:56.5156855Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T08:47:56.5157049Z     %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:56.5157239Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T08:47:56.5157501Z     %46 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg4 = %cst_3) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T08:47:56.5157781Z       %56 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T08:47:56.5157950Z       %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:56.5158173Z       %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:56.5158443Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:47:56.5158713Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<16x4xi32, #blocked1>
2026-02-21T08:47:56.5158906Z       %61 = arith.addi %24, %60 : tensor<16x4xi32, #blocked1>
2026-02-21T08:47:56.5159101Z       %62 = tt.addptr %25, %61 : tensor<16x4x!tt.ptr<bf16>, #blocked1>, tensor<16x4xi32, #blocked1>
2026-02-21T08:47:56.5159305Z       %63 = tt.load %62 : tensor<16x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:47:56.5159569Z       %64 = ttg.convert_layout %63 : tensor<16x4xbf16, #blocked1> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.5159971Z       %65 = arith.extf %64 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.5160260Z       %66 = arith.extsi %arg3 : i32 to i64
2026-02-21T08:47:56.5160427Z       %67 = tt.splat %66 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:56.5160647Z       %68 = arith.addi %67, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:56.5160914Z       %69 = tt.expand_dims %68 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T08:47:56.5161157Z       %70 = arith.muli %69, %cst_4 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:56.5161342Z       %71 = tt.broadcast %70 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T08:47:56.5161527Z       %72 = arith.addi %71, %34 : tensor<2x64xi64, #blocked2>
2026-02-21T08:47:56.5161722Z       %73 = tt.addptr %27, %72 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T08:47:56.5161956Z       %74 = arith.cmpi sge, %69, %cst_5 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:56.5162126Z       %75 = arith.cmpi slt, %69, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:56.5162289Z       %76 = arith.andi %74, %75 : tensor<2x1xi1, #blocked2>
2026-02-21T08:47:56.5162468Z       %77 = tt.broadcast %76 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T08:47:56.5162729Z       %78 = arith.andi %77, %38 : tensor<2x64xi1, #blocked2>
2026-02-21T08:47:56.5162889Z       %79 = tt.load %73, %78, %cst_9 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T08:47:56.5163141Z       %80 = ttg.convert_layout %79 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.5163418Z       %81 = arith.shli %80, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.5163655Z       %82 = arith.shrsi %81, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.5163890Z       %83 = arith.shrsi %80, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.5164170Z       %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:56.5164498Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:56.5164777Z       %86 = tt.broadcast %84 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.5165008Z       %87 = arith.select %43, %86, %cst_10 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.5165244Z       %88 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.5165512Z       %89 = arith.select %45, %88, %87 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.5165733Z       %90 = tt.reshape %89 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T08:47:56.5165951Z       %91 = arith.sitofp %90 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T08:47:56.5166189Z       %92 = ttg.local_alloc %91 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared, #smem>
2026-02-21T08:47:56.5166509Z       %93 = ttg.local_load %92 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.5166972Z       %94 = tt.dot %65, %93, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T08:47:56.5167315Z       %95 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T08:47:56.5167438Z       %96 = arith.muli %95, %c2_i32 : i32
2026-02-21T08:47:56.5167607Z       %97 = tt.splat %96 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:56.5167824Z       %98 = arith.addi %97, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:56.5168092Z       %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:47:56.5168363Z       %100 = tt.broadcast %99 : tensor<1x4xi32, #blocked1> -> tensor<16x4xi32, #blocked1>
2026-02-21T08:47:56.5168558Z       %101 = arith.addi %24, %100 : tensor<16x4xi32, #blocked1>
2026-02-21T08:47:56.5168757Z       %102 = tt.addptr %25, %101 : tensor<16x4x!tt.ptr<bf16>, #blocked1>, tensor<16x4xi32, #blocked1>
2026-02-21T08:47:56.5168966Z       %103 = tt.load %102 : tensor<16x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:47:56.5169230Z       %104 = ttg.convert_layout %103 : tensor<16x4xbf16, #blocked1> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.5169632Z       %105 = arith.extf %104 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.5169955Z       %106 = arith.extsi %95 : i32 to i64
2026-02-21T08:47:56.5170124Z       %107 = tt.splat %106 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:56.5170348Z       %108 = arith.addi %107, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:56.5170625Z       %109 = tt.expand_dims %108 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T08:47:56.5170877Z       %110 = arith.muli %109, %cst_4 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:56.5171070Z       %111 = tt.broadcast %110 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T08:47:56.5171260Z       %112 = arith.addi %111, %34 : tensor<2x64xi64, #blocked2>
2026-02-21T08:47:56.5171463Z       %113 = tt.addptr %27, %112 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T08:47:56.5171671Z       %114 = arith.cmpi sge, %109, %cst_5 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:56.5171849Z       %115 = arith.cmpi slt, %109, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:56.5172012Z       %116 = arith.andi %114, %115 : tensor<2x1xi1, #blocked2>
2026-02-21T08:47:56.5172199Z       %117 = tt.broadcast %116 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T08:47:56.5172391Z       %118 = arith.andi %117, %38 : tensor<2x64xi1, #blocked2>
2026-02-21T08:47:56.5172557Z       %119 = tt.load %113, %118, %cst_9 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T08:47:56.5172817Z       %120 = ttg.convert_layout %119 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.5173095Z       %121 = arith.shli %120, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.5173367Z       %122 = arith.shrsi %121, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.5173606Z       %123 = arith.shrsi %120, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:56.5173895Z       %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:56.5174226Z       %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:56.5174501Z       %126 = tt.broadcast %124 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.5174739Z       %127 = arith.select %43, %126, %cst_10 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.5174973Z       %128 = tt.broadcast %125 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.5175200Z       %129 = arith.select %45, %128, %127 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:56.5175431Z       %130 = tt.reshape %129 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T08:47:56.5175650Z       %131 = arith.sitofp %130 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T08:47:56.5175896Z       %132 = ttg.local_alloc %131 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared, #smem>
2026-02-21T08:47:56.5176214Z       %133 = ttg.local_load %132 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:56.5176672Z       %134 = tt.dot %105, %133, %94, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T08:47:56.5177016Z       scf.yield %134 : tensor<16x64xf32, #mma>
2026-02-21T08:47:56.5177140Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:47:56.5177294Z     %47 = arith.truncf %46 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T08:47:56.5177549Z     %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T08:47:56.5177812Z     %49 = arith.muli %48, %cst : tensor<16x1xi32, #mma>
2026-02-21T08:47:56.5178033Z     %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T08:47:56.5178278Z     %51 = tt.broadcast %49 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T08:47:56.5178471Z     %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T08:47:56.5178638Z     %53 = arith.addi %51, %52 : tensor<16x64xi32, #mma>
2026-02-21T08:47:56.5178806Z     %54 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T08:47:56.5179010Z     %55 = tt.addptr %54, %53 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T08:47:56.5179195Z     tt.store %55, %47 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T08:47:56.5179324Z     tt.return
2026-02-21T08:47:56.5179404Z   }
2026-02-21T08:47:56.5179480Z }
2026-02-21T08:47:56.5179526Z 
2026-02-21T08:47:56.5179556Z {-#
2026-02-21T08:47:56.5179640Z   external_resources: {
2026-02-21T08:47:56.5179736Z     mlir_reproducer: {
2026-02-21T08:47:56.5180747Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:47:56.5181758Z       disable_threading: false,
2026-02-21T08:47:56.5181899Z       verify_each: true
2026-02-21T08:47:56.5181986Z     }
2026-02-21T08:47:56.5182060Z   }
2026-02-21T08:47:56.5182127Z #-}
2026-02-21T08:47:56.5182407Z /tmp/torchinductor_root/2d/c2d2fvrviaycdfpxcopjx3txsy2pc2p2sbbhp3w33hnxcsxbupf7.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:47:56.5183088Z /tmp/torchinductor_root/2d/c2d2fvrviaycdfpxcopjx3txsy2pc2p2sbbhp3w33hnxcsxbupf7.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:47:56.5183633Z [114s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:47:56.5184356Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:47:56.5185009Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:47:56.5185173Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:47:57.1794942Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:47:57.1797083Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}>
2026-02-21T08:47:57.1798488Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:47:57.1799607Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:47:57.1800438Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:47:57.1801486Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:47:57.1801993Z #smem = #ttg.shared_memory
2026-02-21T08:47:57.1803021Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:47:57.1804536Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:47:57.1805458Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T08:47:57.1806044Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:57.1806469Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:57.1806890Z     %cst_2 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T08:47:57.1807287Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:47:57.1807625Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T08:47:57.1807938Z     %cst_4 = arith.constant dense<7168> : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:57.1808221Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:57.1808506Z     %cst_6 = arith.constant dense<4096> : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:57.1808792Z     %cst_7 = arith.constant dense<0> : tensor<1x64xi64, #blocked2>
2026-02-21T08:47:57.1809078Z     %cst_8 = arith.constant dense<7168> : tensor<1x64xi64, #blocked2>
2026-02-21T08:47:57.1809377Z     %cst_9 = arith.constant dense<0> : tensor<2x64xi8, #blocked2>
2026-02-21T08:47:57.1809619Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:47:57.1809803Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:47:57.1810183Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:47:57.1810410Z     %cst_10 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:57.1810662Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T08:47:57.1810863Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:47:57.1811055Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:47:57.1811353Z     %cst_11 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:57.1811674Z     %0 = tt.get_program_id x : i32
2026-02-21T08:47:57.1811860Z     %1 = arith.divsi %0, %c7168_i32 : i32
2026-02-21T08:47:57.1812047Z     %2 = arith.muli %1, %c64_i32 : i32
2026-02-21T08:47:57.1812234Z     %3 = arith.subi %c4_i32, %2 : i32
2026-02-21T08:47:57.1812411Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T08:47:57.1812607Z     %5 = arith.remsi %0, %c7168_i32 : i32
2026-02-21T08:47:57.1812790Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T08:47:57.1812968Z     %7 = arith.addi %2, %6 : i32
2026-02-21T08:47:57.1813145Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T08:47:57.1813321Z     %9 = arith.muli %7, %c16_i32 : i32
2026-02-21T08:47:57.1813656Z     %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:47:57.1814111Z     %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:47:57.1814515Z     %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:47:57.1814865Z     %13 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:47:57.1815212Z     %14 = arith.addi %12, %10 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:47:57.1815554Z     %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:47:57.1815818Z     %16 = arith.muli %8, %c64_i32 : i32
2026-02-21T08:47:57.1816142Z     %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:47:57.1816580Z     %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:57.1817048Z     %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:47:57.1817377Z     %20 = arith.addi %19, %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:47:57.1817707Z     %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:57.1818086Z     %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T08:47:57.1818392Z     %23 = arith.muli %22, %cst_2 : tensor<16x1xi32, #blocked1>
2026-02-21T08:47:57.1818623Z     %24 = tt.broadcast %23 : tensor<16x1xi32, #blocked1> -> tensor<16x4xi32, #blocked1>
2026-02-21T08:47:57.1818897Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:47:57.1819092Z     %26 = arith.extsi %16 : i32 to i64
2026-02-21T08:47:57.1819282Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T08:47:57.1819568Z     %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:57.1819960Z     %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:57.1820323Z     %30 = tt.splat %26 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:57.1820704Z     %31 = arith.extsi %18 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:57.1821063Z     %32 = arith.addi %30, %31 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:47:57.1821443Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T08:47:57.1821781Z     %34 = tt.broadcast %33 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T08:47:57.1822029Z     %35 = arith.cmpi sge, %33, %cst_7 : tensor<1x64xi64, #blocked2>
2026-02-21T08:47:57.1822240Z     %36 = arith.cmpi slt, %33, %cst_8 : tensor<1x64xi64, #blocked2>
2026-02-21T08:47:57.1822442Z     %37 = arith.andi %35, %36 : tensor<1x64xi1, #blocked2>
2026-02-21T08:47:57.1822663Z     %38 = tt.broadcast %37 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T08:47:57.1823012Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:47:57.1823520Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:47:57.1824016Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:57.1824332Z     %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:57.1824571Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T08:47:57.1824809Z     %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:47:57.1825047Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T08:47:57.1825376Z     %46 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg4 = %cst_3) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T08:47:57.1825647Z       %56 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T08:47:57.1825866Z       %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:57.1826136Z       %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:57.1826480Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:47:57.1844349Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<16x4xi32, #blocked1>
2026-02-21T08:47:57.1844542Z       %61 = arith.addi %24, %60 : tensor<16x4xi32, #blocked1>
2026-02-21T08:47:57.1844743Z       %62 = tt.addptr %25, %61 : tensor<16x4x!tt.ptr<bf16>, #blocked1>, tensor<16x4xi32, #blocked1>
2026-02-21T08:47:57.1844945Z       %63 = tt.load %62 : tensor<16x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:47:57.1845213Z       %64 = ttg.convert_layout %63 : tensor<16x4xbf16, #blocked1> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:57.1845611Z       %65 = arith.extf %64 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:57.1845895Z       %66 = arith.extsi %arg3 : i32 to i64
2026-02-21T08:47:57.1846065Z       %67 = tt.splat %66 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:57.1846283Z       %68 = arith.addi %67, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:57.1846553Z       %69 = tt.expand_dims %68 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T08:47:57.1846794Z       %70 = arith.muli %69, %cst_4 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:57.1846984Z       %71 = tt.broadcast %70 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T08:47:57.1847169Z       %72 = arith.addi %71, %34 : tensor<2x64xi64, #blocked2>
2026-02-21T08:47:57.1847359Z       %73 = tt.addptr %27, %72 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T08:47:57.1847563Z       %74 = arith.cmpi sge, %69, %cst_5 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:57.1847778Z       %75 = arith.cmpi slt, %69, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:57.1847941Z       %76 = arith.andi %74, %75 : tensor<2x1xi1, #blocked2>
2026-02-21T08:47:57.1848124Z       %77 = tt.broadcast %76 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T08:47:57.1848307Z       %78 = arith.andi %77, %38 : tensor<2x64xi1, #blocked2>
2026-02-21T08:47:57.1848472Z       %79 = tt.load %73, %78, %cst_9 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T08:47:57.1848719Z       %80 = ttg.convert_layout %79 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:57.1849000Z       %81 = arith.shli %80, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:57.1849230Z       %82 = arith.shrsi %81, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:57.1849460Z       %83 = arith.shrsi %80, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:57.1849743Z       %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:57.1850072Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:57.1850350Z       %86 = tt.broadcast %84 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:57.1850585Z       %87 = arith.select %43, %86, %cst_10 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:57.1850821Z       %88 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:57.1851049Z       %89 = arith.select %45, %88, %87 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:57.1851265Z       %90 = tt.reshape %89 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T08:47:57.1851485Z       %91 = arith.sitofp %90 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T08:47:57.1851728Z       %92 = ttg.local_alloc %91 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared, #smem>
2026-02-21T08:47:57.1852083Z       %93 = ttg.local_load %92 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:57.1852548Z       %94 = tt.dot %65, %93, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T08:47:57.1852898Z       %95 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T08:47:57.1853021Z       %96 = arith.muli %95, %c2_i32 : i32
2026-02-21T08:47:57.1853188Z       %97 = tt.splat %96 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:57.1853402Z       %98 = arith.addi %97, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:47:57.1853677Z       %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:47:57.1853946Z       %100 = tt.broadcast %99 : tensor<1x4xi32, #blocked1> -> tensor<16x4xi32, #blocked1>
2026-02-21T08:47:57.1854144Z       %101 = arith.addi %24, %100 : tensor<16x4xi32, #blocked1>
2026-02-21T08:47:57.1854349Z       %102 = tt.addptr %25, %101 : tensor<16x4x!tt.ptr<bf16>, #blocked1>, tensor<16x4xi32, #blocked1>
2026-02-21T08:47:57.1854552Z       %103 = tt.load %102 : tensor<16x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:47:57.1854818Z       %104 = ttg.convert_layout %103 : tensor<16x4xbf16, #blocked1> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:57.1855214Z       %105 = arith.extf %104 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:57.1855495Z       %106 = arith.extsi %95 : i32 to i64
2026-02-21T08:47:57.1855710Z       %107 = tt.splat %106 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:57.1855931Z       %108 = arith.addi %107, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:47:57.1856214Z       %109 = tt.expand_dims %108 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T08:47:57.1856462Z       %110 = arith.muli %109, %cst_4 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:57.1856656Z       %111 = tt.broadcast %110 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T08:47:57.1856846Z       %112 = arith.addi %111, %34 : tensor<2x64xi64, #blocked2>
2026-02-21T08:47:57.1857044Z       %113 = tt.addptr %27, %112 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T08:47:57.1857258Z       %114 = arith.cmpi sge, %109, %cst_5 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:57.1857431Z       %115 = arith.cmpi slt, %109, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T08:47:57.1857602Z       %116 = arith.andi %114, %115 : tensor<2x1xi1, #blocked2>
2026-02-21T08:47:57.1857784Z       %117 = tt.broadcast %116 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T08:47:57.1857976Z       %118 = arith.andi %117, %38 : tensor<2x64xi1, #blocked2>
2026-02-21T08:47:57.1858146Z       %119 = tt.load %113, %118, %cst_9 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T08:47:57.1858399Z       %120 = ttg.convert_layout %119 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:57.1858683Z       %121 = arith.shli %120, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:57.1858924Z       %122 = arith.shrsi %121, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:57.1859166Z       %123 = arith.shrsi %120, %cst_11 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:47:57.1859461Z       %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:57.1859795Z       %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T08:47:57.1860111Z       %126 = tt.broadcast %124 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:57.1860349Z       %127 = arith.select %43, %126, %cst_10 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:57.1860590Z       %128 = tt.broadcast %125 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:57.1860823Z       %129 = arith.select %45, %128, %127 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T08:47:57.1861049Z       %130 = tt.reshape %129 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T08:47:57.1861270Z       %131 = arith.sitofp %130 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T08:47:57.1861517Z       %132 = ttg.local_alloc %131 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared, #smem>
2026-02-21T08:47:57.1861843Z       %133 = ttg.local_load %132 : !ttg.memdesc<4x64xf32, #shared, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:47:57.1862316Z       %134 = tt.dot %105, %133, %94, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T08:47:57.1862660Z       scf.yield %134 : tensor<16x64xf32, #mma>
2026-02-21T08:47:57.1862793Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:47:57.1862955Z     %47 = arith.truncf %46 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T08:47:57.1863214Z     %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T08:47:57.1863451Z     %49 = arith.muli %48, %cst : tensor<16x1xi32, #mma>
2026-02-21T08:47:57.1863701Z     %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T08:47:57.1863954Z     %51 = tt.broadcast %49 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T08:47:57.1864148Z     %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T08:47:57.1864325Z     %53 = arith.addi %51, %52 : tensor<16x64xi32, #mma>
2026-02-21T08:47:57.1864496Z     %54 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T08:47:57.1864699Z     %55 = tt.addptr %54, %53 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T08:47:57.1864888Z     tt.store %55, %47 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T08:47:57.1865017Z     tt.return
2026-02-21T08:47:57.1865103Z   }
2026-02-21T08:47:57.1865180Z }
2026-02-21T08:47:57.1865229Z 
2026-02-21T08:47:57.1865261Z {-#
2026-02-21T08:47:57.1865342Z   external_resources: {
2026-02-21T08:47:57.1865447Z     mlir_reproducer: {
2026-02-21T08:47:57.1866446Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:47:57.1867445Z       disable_threading: false,
2026-02-21T08:47:57.1867552Z       verify_each: true
2026-02-21T08:47:57.1867646Z     }
2026-02-21T08:47:57.1867720Z   }
2026-02-21T08:47:57.1867794Z #-}
2026-02-21T08:47:57.1868073Z /tmp/torchinductor_root/45/c453xlk5xasye6ijooj4jwaykdpsr4scc47yhy2o7agxwzgm2aq3.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:47:57.1868758Z /tmp/torchinductor_root/45/c453xlk5xasye6ijooj4jwaykdpsr4scc47yhy2o7agxwzgm2aq3.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:47:57.1869342Z [115s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:47:57.1870066Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:47:57.1870726Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:47:57.1870901Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:47:57.4770312Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 109/109 15.8 configs/s
2026-02-21T08:47:59.3998588Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 431.4 configs/s
2026-02-21T08:47:59.9489531Z [118s] Generation 2 complete: 
2026-02-21T08:47:59.9489762Z error=6
2026-02-21T08:47:59.9489903Z ok=105
2026-02-21T08:47:59.9490039Z min=0.1224
2026-02-21T08:47:59.9490181Z mid=0.3420
2026-02-21T08:47:59.9490316Z max=14.9881
2026-02-21T08:47:59.9490475Z best={'block_sizes': [64, 32, 32],
2026-02-21T08:47:59.9490732Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:47:59.9490976Z  'l2_groupings': [8],
2026-02-21T08:47:59.9491163Z  'load_eviction_policies': ['', ''],
2026-02-21T08:47:59.9491401Z  'loop_orders': [[1, 0]],
2026-02-21T08:47:59.9491945Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:47:59.9492156Z  'num_sm_multiplier': 8,
2026-02-21T08:47:59.9492335Z  'num_stages': 2,
2026-02-21T08:47:59.9492515Z  'num_warps': 4,
2026-02-21T08:47:59.9492694Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:47:59.9492909Z  'range_flattens': [None, True],
2026-02-21T08:47:59.9493114Z  'range_multi_buffers': [None, None],
2026-02-21T08:47:59.9493317Z  'range_num_stages': [0, 3],
2026-02-21T08:47:59.9493506Z  'range_unroll_factors': [1, 4],
2026-02-21T08:47:59.9493704Z  'range_warp_specializes': [],
2026-02-21T08:47:59.9493888Z  'waves_per_eu': 3}
2026-02-21T08:47:59.9644479Z [118s] Fitting surrogate: 315 points, 315 targets
2026-02-21T08:48:00.9612186Z [119s] Generation 3 starting: 92 neighbors, 5 active search path(s)
2026-02-21T08:48:19.2611418Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 49.3 configs/s
2026-02-21T08:48:25.1523471Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 16.1 configs/s
2026-02-21T08:48:29.8853087Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 196.8 configs/s
2026-02-21T08:48:30.4247871Z [148s] Generation 3 complete: 
2026-02-21T08:48:30.4248283Z ok=98
2026-02-21T08:48:30.4248547Z min=0.1082
2026-02-21T08:48:30.4248763Z mid=0.2042
2026-02-21T08:48:30.4248974Z max=10.4375
2026-02-21T08:48:30.4249213Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:48:30.4249585Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:48:30.4249936Z  'l2_groupings': [1],
2026-02-21T08:48:30.4250212Z  'load_eviction_policies': ['', ''],
2026-02-21T08:48:30.4250525Z  'loop_orders': [[0, 1]],
2026-02-21T08:48:30.4250800Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:48:30.4251065Z  'num_stages': 1,
2026-02-21T08:48:30.4251296Z  'num_warps': 2,
2026-02-21T08:48:30.4251523Z  'pid_type': 'flat',
2026-02-21T08:48:30.4251790Z  'range_flattens': [None, None],
2026-02-21T08:48:30.4252135Z  'range_multi_buffers': [None, None],
2026-02-21T08:48:30.4252446Z  'range_num_stages': [0, 0],
2026-02-21T08:48:30.4253306Z  'range_unroll_factors': [0, 0],
2026-02-21T08:48:30.4253598Z  'range_warp_specializes': [],
2026-02-21T08:48:30.4253872Z  'waves_per_eu': 1}
2026-02-21T08:48:30.4871268Z [148s] Fitting surrogate: 413 points, 413 targets
2026-02-21T08:48:31.4260467Z [149s] Generation 4 starting: 88 neighbors, 5 active search path(s)
2026-02-21T08:48:49.5628951Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 1.4 configs/s
2026-02-21T08:48:54.7889464Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 17.7 configs/s
2026-02-21T08:49:03.6276752Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 120.6 configs/s
2026-02-21T08:49:04.3557720Z [182s] Generation 4 complete: 
2026-02-21T08:49:04.3558161Z error=7
2026-02-21T08:49:04.3558364Z ok=86
2026-02-21T08:49:04.3558634Z min=0.1070
2026-02-21T08:49:04.3558841Z mid=0.1488
2026-02-21T08:49:04.3559043Z max=10.2584
2026-02-21T08:49:04.3559308Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:49:04.3559682Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T08:49:04.3560030Z  'l2_groupings': [1],
2026-02-21T08:49:04.3560309Z  'load_eviction_policies': ['', ''],
2026-02-21T08:49:04.3560607Z  'loop_orders': [[0, 1]],
2026-02-21T08:49:04.3560850Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:49:04.3561083Z  'num_stages': 1,
2026-02-21T08:49:04.3561294Z  'num_warps': 2,
2026-02-21T08:49:04.3561496Z  'pid_type': 'flat',
2026-02-21T08:49:04.3561719Z  'range_flattens': [None, None],
2026-02-21T08:49:04.3561986Z  'range_multi_buffers': [None, None],
2026-02-21T08:49:04.3562256Z  'range_num_stages': [0, 0],
2026-02-21T08:49:04.3562497Z  'range_unroll_factors': [0, 0],
2026-02-21T08:49:04.3562845Z  'range_warp_specializes': [],
2026-02-21T08:49:04.3563091Z  'waves_per_eu': 1}
2026-02-21T08:49:04.4526449Z [182s] Fitting surrogate: 506 points, 506 targets
2026-02-21T08:49:05.3038930Z [183s] Generation 5 starting: 81 neighbors, 5 active search path(s)
2026-02-21T08:49:34.5568580Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 0.2 configs/s
2026-02-21T08:49:38.8590302Z /tmp/torchinductor_root/65/c65wjfoaegxg4l3qhkglwuwaoieefsi6tdfg3pxozbvljcnd7vhm.py:47:25: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T08:49:38.8591656Z         b_tile = tl.load(tl.make_block_ptr(B, [4096, 7168], [7168, 1], [offset_3, offset_2], [_BLOCK_SIZE_0, _BLOCK_SIZE_2], [1, 0]), boundary_check=[0, 1], padding_option='zero')
2026-02-21T08:49:38.8592430Z                         ^
2026-02-21T08:49:38.8594317Z /tmp/torchinductor_root/65/c65wjfoaegxg4l3qhkglwuwaoieefsi6tdfg3pxozbvljcnd7vhm.py:60:41: note: - use: %112 = "ttg.convert_layout"(<<UNKNOWN SSA VALUE>>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>}>>
2026-02-21T08:49:38.8596462Z 
2026-02-21T08:49:38.8596605Z         expanded_1 = tl.expand_dims(v_6, 1)
2026-02-21T08:49:38.8596868Z                                         ^
2026-02-21T08:49:38.8598241Z /tmp/torchinductor_root/65/c65wjfoaegxg4l3qhkglwuwaoieefsi6tdfg3pxozbvljcnd7vhm.py:59:41: note: - use: %113 = "ttg.convert_layout"(<<UNKNOWN SSA VALUE>>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>}>>
2026-02-21T08:49:38.8599541Z 
2026-02-21T08:49:38.8599632Z         expanded_0 = tl.expand_dims(v_4, 1)
2026-02-21T08:49:38.8599877Z                                         ^
2026-02-21T08:49:38.8600156Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T08:49:38.8600652Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:49:38.8601302Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:49:38.8601962Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:49:38.8602746Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:49:38.8603368Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}>
2026-02-21T08:49:38.8603980Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}>
2026-02-21T08:49:38.8604613Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T08:49:38.8605230Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}>
2026-02-21T08:49:38.8605855Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:49:38.8606492Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}>
2026-02-21T08:49:38.8607003Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}>
2026-02-21T08:49:38.8607481Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:49:38.8607939Z #blocked12 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [8, 8], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:49:38.8608638Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:49:38.8609341Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:49:38.8609908Z     %cst = arith.constant dense<0> : tensor<64x16xi8, #blocked>
2026-02-21T08:49:38.8610186Z     %cst_0 = arith.constant dense<7168> : tensor<1x16xi64, #blocked>
2026-02-21T08:49:38.8610446Z     %cst_1 = arith.constant dense<0> : tensor<1x16xi64, #blocked>
2026-02-21T08:49:38.8610705Z     %cst_2 = arith.constant dense<4096> : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:38.8610970Z     %cst_3 = arith.constant dense<0> : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:38.8611228Z     %cst_4 = arith.constant dense<7168> : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:38.8611494Z     %cst_5 = arith.constant dense<0> : tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:38.8611726Z     %c28672_i32 = arith.constant 28672 : i32
2026-02-21T08:49:38.8611910Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:49:38.8612140Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:49:38.8612353Z     %cst_6 = arith.constant dense<7168> : tensor<32x1xi32, #blocked1>
2026-02-21T08:49:38.8612617Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:49:38.8612874Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:49:38.8613129Z     %cst_9 = arith.constant dense<4> : tensor<64x16xi8, #blocked>
2026-02-21T08:49:38.8613393Z     %cst_10 = arith.constant dense<8192> : tensor<32x1xi32, #blocked1>
2026-02-21T08:49:38.8613617Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:49:38.8613835Z     %cst_11 = arith.constant dense<0.000000e+00> : tensor<32x16xf32, #blocked>
2026-02-21T08:49:38.8614081Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:49:38.8614246Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:49:38.8614416Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:49:38.8614593Z     %0 = tt.get_program_id x : i32
2026-02-21T08:49:38.8614755Z     %1 = arith.divsi %0, %c28672_i32 : i32
2026-02-21T08:49:38.8614930Z     %2 = arith.muli %1, %c64_i32 : i32
2026-02-21T08:49:38.8615099Z     %3 = arith.subi %c2_i32, %2 : i32
2026-02-21T08:49:38.8615262Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T08:49:38.8615426Z     %5 = arith.remsi %0, %c28672_i32 : i32
2026-02-21T08:49:38.8615603Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T08:49:38.8615770Z     %7 = arith.addi %2, %6 : i32
2026-02-21T08:49:38.8615925Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T08:49:38.8616085Z     %9 = arith.muli %7, %c32_i32 : i32
2026-02-21T08:49:38.8616318Z     %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked4>
2026-02-21T08:49:38.8616635Z     %11 = tt.splat %9 : i32 -> tensor<32xi32, #blocked4>
2026-02-21T08:49:38.8616824Z     %12 = arith.addi %11, %10 : tensor<32xi32, #blocked4>
2026-02-21T08:49:38.8616978Z     %13 = arith.muli %8, %c16_i32 : i32
2026-02-21T08:49:38.8617159Z     %14 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked4>
2026-02-21T08:49:38.8617367Z     %15 = tt.splat %13 : i32 -> tensor<16xi32, #blocked4>
2026-02-21T08:49:38.8617566Z     %16 = arith.addi %15, %14 : tensor<16xi32, #blocked4>
2026-02-21T08:49:38.8617771Z     %17 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked4>
2026-02-21T08:49:38.8618081Z     %18 = ttg.convert_layout %12 : tensor<32xi32, #blocked4> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:49:38.8618461Z     %19 = tt.expand_dims %18 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<32x1xi32, #blocked5>
2026-02-21T08:49:38.8618799Z     %20 = ttg.convert_layout %19 : tensor<32x1xi32, #blocked5> -> tensor<32x1xi32, #blocked1>
2026-02-21T08:49:38.8619030Z     %21 = arith.muli %20, %cst_10 : tensor<32x1xi32, #blocked1>
2026-02-21T08:49:38.8619299Z     %22 = tt.broadcast %21 : tensor<32x1xi32, #blocked1> -> tensor<32x128xi32, #blocked1>
2026-02-21T08:49:38.8619568Z     %23 = ttg.convert_layout %22 : tensor<32x128xi32, #blocked1> -> tensor<32x128xi32, #blocked6>
2026-02-21T08:49:38.8619841Z     %24 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T08:49:38.8620037Z     %25 = arith.extsi %13 : i32 to i64
2026-02-21T08:49:38.8620205Z     %26 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<64x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:49:38.8620440Z     %27 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #blocked4>
2026-02-21T08:49:38.8620679Z     %28 = arith.extsi %27 : tensor<64xi32, #blocked4> to tensor<64xi64, #blocked4>
2026-02-21T08:49:38.8620888Z     %29 = tt.splat %25 : i64 -> tensor<16xi64, #blocked4>
2026-02-21T08:49:38.8621081Z     %30 = arith.extsi %14 : tensor<16xi32, #blocked4> to tensor<16xi64, #blocked4>
2026-02-21T08:49:38.8621283Z     %31 = arith.addi %29, %30 : tensor<16xi64, #blocked4>
2026-02-21T08:49:38.8621555Z     %32 = ttg.convert_layout %31 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:49:38.8621971Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x16xi64, #blocked7>
2026-02-21T08:49:38.8622292Z     %34 = ttg.convert_layout %33 : tensor<1x16xi64, #blocked7> -> tensor<1x16xi64, #blocked>
2026-02-21T08:49:38.8622549Z     %35 = tt.broadcast %34 : tensor<1x16xi64, #blocked> -> tensor<64x16xi64, #blocked>
2026-02-21T08:49:38.8622777Z     %36 = arith.cmpi sge, %34, %cst_1 : tensor<1x16xi64, #blocked>
2026-02-21T08:49:38.8622968Z     %37 = arith.cmpi slt, %34, %cst_0 : tensor<1x16xi64, #blocked>
2026-02-21T08:49:38.8623151Z     %38 = arith.andi %36, %37 : tensor<1x16xi1, #blocked>
2026-02-21T08:49:38.8623354Z     %39 = tt.broadcast %38 : tensor<1x16xi1, #blocked> -> tensor<64x16xi1, #blocked>
2026-02-21T08:49:38.8623587Z     %40 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked4>
2026-02-21T08:49:38.8623915Z     %41 = ttg.convert_layout %40 : tensor<2xi32, #blocked4> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:49:38.8624285Z     %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x2xi32, #blocked7>
2026-02-21T08:49:38.8624607Z     %43 = ttg.convert_layout %42 : tensor<1x2xi32, #blocked7> -> tensor<1x2xi32, #blocked8>
2026-02-21T08:49:38.8624934Z     %44 = ttg.convert_layout %43 : tensor<1x2xi32, #blocked8> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T08:49:38.8625309Z     %45 = tt.expand_dims %44 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x2x1xi32, #blocked9>
2026-02-21T08:49:38.8625655Z     %46 = ttg.convert_layout %45 : tensor<1x2x1xi32, #blocked9> -> tensor<1x2x1xi32, #blocked3>
2026-02-21T08:49:38.8625894Z     %47 = arith.cmpi eq, %46, %cst_8 : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:49:38.8626124Z     %48 = tt.broadcast %47 : tensor<1x2x1xi1, #blocked3> -> tensor<64x2x16xi1, #blocked3>
2026-02-21T08:49:38.8626395Z     %49 = ttg.convert_layout %48 : tensor<64x2x16xi1, #blocked3> -> tensor<64x2x16xi1, #blocked2>
2026-02-21T08:49:38.8626620Z     %50 = arith.cmpi eq, %46, %cst_7 : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:49:38.8626813Z     %51 = tt.broadcast %50 : tensor<1x2x1xi1, #blocked3> -> tensor<64x2x16xi1, #blocked3>
2026-02-21T08:49:38.8627040Z     %52 = ttg.convert_layout %51 : tensor<64x2x16xi1, #blocked3> -> tensor<64x2x16xi1, #blocked2>
2026-02-21T08:49:38.8627224Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:49:38.8627446Z     %53 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg4 = %cst_11) -> (tensor<32x16xf32, #blocked>)  : i32 {
2026-02-21T08:49:38.8627665Z       %68 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T08:49:38.8627802Z       %69 = tt.splat %68 : i32 -> tensor<128xi32, #blocked4>
2026-02-21T08:49:38.8627951Z       %70 = arith.addi %69, %17 : tensor<128xi32, #blocked4>
2026-02-21T08:49:38.8628219Z       %71 = ttg.convert_layout %70 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:49:38.8628545Z       %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7>
2026-02-21T08:49:38.8628830Z       %73 = ttg.convert_layout %72 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6>
2026-02-21T08:49:38.8629063Z       %74 = tt.broadcast %73 : tensor<1x128xi32, #blocked6> -> tensor<32x128xi32, #blocked6>
2026-02-21T08:49:38.8629253Z       %75 = arith.addi %23, %74 : tensor<32x128xi32, #blocked6>
2026-02-21T08:49:38.8629453Z       %76 = tt.addptr %24, %75 : tensor<32x128x!tt.ptr<bf16>, #blocked6>, tensor<32x128xi32, #blocked6>
2026-02-21T08:49:38.8629659Z       %77 = tt.load %76 : tensor<32x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T08:49:38.8629848Z       %78 = arith.extf %77 : tensor<32x128xbf16, #blocked6> to tensor<32x128xf32, #blocked6>
2026-02-21T08:49:38.8630025Z       %79 = arith.extsi %arg3 : i32 to i64
2026-02-21T08:49:38.8630156Z       %80 = tt.splat %79 : i64 -> tensor<64xi64, #blocked4>
2026-02-21T08:49:38.8630340Z       %81 = arith.addi %80, %28 : tensor<64xi64, #blocked4>
2026-02-21T08:49:38.8630569Z       %82 = ttg.convert_layout %81 : tensor<64xi64, #blocked4> -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:49:38.8630889Z       %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi64, #blocked5>
2026-02-21T08:49:38.8631167Z       %84 = ttg.convert_layout %83 : tensor<64x1xi64, #blocked5> -> tensor<64x1xi64, #blocked1>
2026-02-21T08:49:38.8631364Z       %85 = arith.muli %84, %cst_4 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:38.8631551Z       %86 = tt.broadcast %85 : tensor<64x1xi64, #blocked1> -> tensor<64x16xi64, #blocked1>
2026-02-21T08:49:38.8631776Z       %87 = ttg.convert_layout %86 : tensor<64x16xi64, #blocked1> -> tensor<64x16xi64, #blocked>
2026-02-21T08:49:38.8631975Z       %88 = arith.addi %87, %35 : tensor<64x16xi64, #blocked>
2026-02-21T08:49:38.8632166Z       %89 = tt.addptr %26, %88 : tensor<64x16x!tt.ptr<i8>, #blocked>, tensor<64x16xi64, #blocked>
2026-02-21T08:49:38.8632368Z       %90 = arith.cmpi sge, %84, %cst_3 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:38.8632536Z       %91 = arith.cmpi slt, %84, %cst_2 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:38.8632691Z       %92 = arith.andi %90, %91 : tensor<64x1xi1, #blocked1>
2026-02-21T08:49:38.8632870Z       %93 = tt.broadcast %92 : tensor<64x1xi1, #blocked1> -> tensor<64x16xi1, #blocked1>
2026-02-21T08:49:38.8633093Z       %94 = ttg.convert_layout %93 : tensor<64x16xi1, #blocked1> -> tensor<64x16xi1, #blocked>
2026-02-21T08:49:38.8633282Z       %95 = arith.andi %94, %39 : tensor<64x16xi1, #blocked>
2026-02-21T08:49:38.8633439Z       %96 = tt.load %89, %95, %cst : tensor<64x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:49:38.8633603Z       %97 = arith.shli %96, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:38.8633758Z       %98 = arith.shrsi %97, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:38.8633915Z       %99 = arith.shrsi %96, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:38.8634157Z       %100 = ttg.convert_layout %98 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:49:38.8634503Z       %101 = tt.expand_dims %100 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:49:38.8634811Z       %102 = ttg.convert_layout %101 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:49:38.8635104Z       %103 = ttg.convert_layout %99 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:49:38.8635442Z       %104 = tt.expand_dims %103 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:49:38.8635795Z       %105 = ttg.convert_layout %104 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:49:38.8636048Z       %106 = tt.broadcast %102 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:49:38.8636292Z       %107 = ttg.convert_layout %106 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:38.8636547Z       %108 = arith.select %49, %107, %cst_5 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:38.8636793Z       %109 = tt.broadcast %105 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:49:38.8637037Z       %110 = ttg.convert_layout %109 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:38.8637292Z       %111 = arith.select %52, %110, %108 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:38.8637527Z       %112 = tt.reshape %111 : tensor<64x2x16xi8, #blocked2> -> tensor<128x16xi8, #blocked>
2026-02-21T08:49:38.8637759Z       %113 = arith.sitofp %112 : tensor<128x16xi8, #blocked> to tensor<128x16xf32, #blocked>
2026-02-21T08:49:38.8638084Z       %114 = ttg.convert_layout %78 : tensor<32x128xf32, #blocked6> -> tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>>
2026-02-21T08:49:38.8638431Z       %115 = ttg.convert_layout %113 : tensor<128x16xf32, #blocked> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>>
2026-02-21T08:49:38.8638731Z       %116 = ttg.convert_layout %arg4 : tensor<32x16xf32, #blocked> -> tensor<32x16xf32, #blocked12>
2026-02-21T08:49:38.8639140Z       %117 = tt.dot %114, %115, %116, inputPrecision = tf32 : tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> -> tensor<32x16xf32, #blocked12>
2026-02-21T08:49:38.8639547Z       %118 = ttg.convert_layout %117 : tensor<32x16xf32, #blocked12> -> tensor<32x16xf32, #blocked>
2026-02-21T08:49:38.8639740Z       %c1_i32 = arith.constant 1 : i32
2026-02-21T08:49:38.8639862Z       %119 = arith.muli %c64_i32, %c1_i32 : i32
2026-02-21T08:49:38.8639984Z       %120 = arith.addi %arg3, %119 : i32
2026-02-21T08:49:38.8640101Z       %121 = arith.muli %120, %c2_i32 : i32
2026-02-21T08:49:38.8640240Z       %122 = tt.splat %121 : i32 -> tensor<128xi32, #blocked4>
2026-02-21T08:49:38.8640395Z       %123 = arith.addi %122, %17 : tensor<128xi32, #blocked4>
2026-02-21T08:49:38.8640635Z       %124 = ttg.convert_layout %123 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:49:38.8640972Z       %125 = tt.expand_dims %124 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7>
2026-02-21T08:49:38.8641267Z       %126 = ttg.convert_layout %125 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6>
2026-02-21T08:49:38.8641506Z       %127 = tt.broadcast %126 : tensor<1x128xi32, #blocked6> -> tensor<32x128xi32, #blocked6>
2026-02-21T08:49:38.8641705Z       %128 = arith.addi %23, %127 : tensor<32x128xi32, #blocked6>
2026-02-21T08:49:38.8641912Z       %129 = tt.addptr %24, %128 : tensor<32x128x!tt.ptr<bf16>, #blocked6>, tensor<32x128xi32, #blocked6>
2026-02-21T08:49:38.8642125Z       %130 = tt.load %129 : tensor<32x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T08:49:38.8642319Z       %131 = arith.extf %130 : tensor<32x128xbf16, #blocked6> to tensor<32x128xf32, #blocked6>
2026-02-21T08:49:38.8642496Z       %132 = arith.extsi %120 : i32 to i64
2026-02-21T08:49:38.8642663Z       %133 = tt.splat %132 : i64 -> tensor<64xi64, #blocked4>
2026-02-21T08:49:38.8642815Z       %134 = arith.addi %133, %28 : tensor<64xi64, #blocked4>
2026-02-21T08:49:38.8643049Z       %135 = ttg.convert_layout %134 : tensor<64xi64, #blocked4> -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:49:38.8643370Z       %136 = tt.expand_dims %135 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi64, #blocked5>
2026-02-21T08:49:38.8643693Z       %137 = ttg.convert_layout %136 : tensor<64x1xi64, #blocked5> -> tensor<64x1xi64, #blocked1>
2026-02-21T08:49:38.8643897Z       %138 = arith.muli %137, %cst_4 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:38.8644093Z       %139 = tt.broadcast %138 : tensor<64x1xi64, #blocked1> -> tensor<64x16xi64, #blocked1>
2026-02-21T08:49:38.8644326Z       %140 = ttg.convert_layout %139 : tensor<64x16xi64, #blocked1> -> tensor<64x16xi64, #blocked>
2026-02-21T08:49:38.8644525Z       %141 = arith.addi %140, %35 : tensor<64x16xi64, #blocked>
2026-02-21T08:49:38.8644721Z       %142 = tt.addptr %26, %141 : tensor<64x16x!tt.ptr<i8>, #blocked>, tensor<64x16xi64, #blocked>
2026-02-21T08:49:38.8644927Z       %143 = arith.cmpi sge, %137, %cst_3 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:38.8645104Z       %144 = arith.cmpi slt, %137, %cst_2 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:38.8645267Z       %145 = arith.andi %143, %144 : tensor<64x1xi1, #blocked1>
2026-02-21T08:49:38.8645459Z       %146 = tt.broadcast %145 : tensor<64x1xi1, #blocked1> -> tensor<64x16xi1, #blocked1>
2026-02-21T08:49:38.8645691Z       %147 = ttg.convert_layout %146 : tensor<64x16xi1, #blocked1> -> tensor<64x16xi1, #blocked>
2026-02-21T08:49:38.8645923Z       %148 = arith.andi %147, %39 : tensor<64x16xi1, #blocked>
2026-02-21T08:49:38.8646091Z       %149 = tt.load %142, %148, %cst : tensor<64x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:49:38.8646263Z       %150 = arith.shli %149, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:38.8646428Z       %151 = arith.shrsi %150, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:38.8646591Z       %152 = arith.shrsi %149, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:38.8646840Z       %153 = ttg.convert_layout %151 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:49:38.8647188Z       %154 = tt.expand_dims %153 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:49:38.8647495Z       %155 = ttg.convert_layout %154 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:49:38.8647791Z       %156 = ttg.convert_layout %152 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:49:38.8648131Z       %157 = tt.expand_dims %156 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:49:38.8648435Z       %158 = ttg.convert_layout %157 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:49:38.8648684Z       %159 = tt.broadcast %155 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:49:38.8648928Z       %160 = ttg.convert_layout %159 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:38.8649182Z       %161 = arith.select %49, %160, %cst_5 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:38.8649432Z       %162 = tt.broadcast %158 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:49:38.8649679Z       %163 = ttg.convert_layout %162 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:38.8649929Z       %164 = arith.select %52, %163, %161 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:38.8650172Z       %165 = tt.reshape %164 : tensor<64x2x16xi8, #blocked2> -> tensor<128x16xi8, #blocked>
2026-02-21T08:49:38.8650403Z       %166 = arith.sitofp %165 : tensor<128x16xi8, #blocked> to tensor<128x16xf32, #blocked>
2026-02-21T08:49:38.8650695Z       %167 = ttg.convert_layout %131 : tensor<32x128xf32, #blocked6> -> tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>>
2026-02-21T08:49:38.8651046Z       %168 = ttg.convert_layout %166 : tensor<128x16xf32, #blocked> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>>
2026-02-21T08:49:38.8651377Z       %169 = ttg.convert_layout %118 : tensor<32x16xf32, #blocked> -> tensor<32x16xf32, #blocked12>
2026-02-21T08:49:38.8651779Z       %170 = tt.dot %167, %168, %169, inputPrecision = tf32 : tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> -> tensor<32x16xf32, #blocked12>
2026-02-21T08:49:38.8652178Z       %171 = ttg.convert_layout %170 : tensor<32x16xf32, #blocked12> -> tensor<32x16xf32, #blocked>
2026-02-21T08:49:38.8652374Z       scf.yield %171 : tensor<32x16xf32, #blocked>
2026-02-21T08:49:38.8652495Z     } {tt.flatten}
2026-02-21T08:49:38.8652637Z     %54 = arith.truncf %53 : tensor<32x16xf32, #blocked> to tensor<32x16xbf16, #blocked>
2026-02-21T08:49:38.8652905Z     %55 = ttg.convert_layout %12 : tensor<32xi32, #blocked4> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:49:38.8653227Z     %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<32x1xi32, #blocked5>
2026-02-21T08:49:38.8653512Z     %57 = ttg.convert_layout %56 : tensor<32x1xi32, #blocked5> -> tensor<32x1xi32, #blocked1>
2026-02-21T08:49:38.8653711Z     %58 = arith.muli %57, %cst_6 : tensor<32x1xi32, #blocked1>
2026-02-21T08:49:38.8653974Z     %59 = ttg.convert_layout %16 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:49:38.8654290Z     %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x16xi32, #blocked7>
2026-02-21T08:49:38.8654566Z     %61 = ttg.convert_layout %60 : tensor<1x16xi32, #blocked7> -> tensor<1x16xi32, #blocked>
2026-02-21T08:49:38.8654789Z     %62 = tt.broadcast %58 : tensor<32x1xi32, #blocked1> -> tensor<32x16xi32, #blocked1>
2026-02-21T08:49:38.8655011Z     %63 = ttg.convert_layout %62 : tensor<32x16xi32, #blocked1> -> tensor<32x16xi32, #blocked>
2026-02-21T08:49:38.8655233Z     %64 = tt.broadcast %61 : tensor<1x16xi32, #blocked> -> tensor<32x16xi32, #blocked>
2026-02-21T08:49:38.8655418Z     %65 = arith.addi %63, %64 : tensor<32x16xi32, #blocked>
2026-02-21T08:49:38.8655593Z     %66 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x16x!tt.ptr<bf16>, #blocked>
2026-02-21T08:49:38.8655812Z     %67 = tt.addptr %66, %65 : tensor<32x16x!tt.ptr<bf16>, #blocked>, tensor<32x16xi32, #blocked>
2026-02-21T08:49:38.8656008Z     tt.store %67, %54 : tensor<32x16x!tt.ptr<bf16>, #blocked>
2026-02-21T08:49:38.8656140Z     tt.return
2026-02-21T08:49:38.8656219Z   }
2026-02-21T08:49:38.8656296Z }
2026-02-21T08:49:38.8656336Z 
2026-02-21T08:49:38.8656366Z {-#
2026-02-21T08:49:38.8656448Z   external_resources: {
2026-02-21T08:49:38.8656544Z     mlir_reproducer: {
2026-02-21T08:49:38.8658752Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=16}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T08:49:38.8661073Z       disable_threading: false,
2026-02-21T08:49:38.8661182Z       verify_each: true
2026-02-21T08:49:38.8661278Z     }
2026-02-21T08:49:38.8661349Z   }
2026-02-21T08:49:38.8661420Z #-}
2026-02-21T08:49:38.8661698Z /tmp/torchinductor_root/65/c65wjfoaegxg4l3qhkglwuwaoieefsi6tdfg3pxozbvljcnd7vhm.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:49:38.8662431Z /tmp/torchinductor_root/65/c65wjfoaegxg4l3qhkglwuwaoieefsi6tdfg3pxozbvljcnd7vhm.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:49:38.8662989Z [217s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:49:38.8663717Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 32, 16], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:49:38.8671087Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:49:38.8671278Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:49:39.2539140Z /tmp/torchinductor_root/tw/ctwf45jgblprssxbx6digyhxum3ybqilmizfwxub35zw4vh7cfja.py:47:25: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T08:49:39.2539778Z         b_tile = tl.load(tl.make_block_ptr(B, [4096, 7168], [7168, 1], [offset_3, offset_2], [_BLOCK_SIZE_0, _BLOCK_SIZE_2], [1, 0]), boundary_check=[0, 1], padding_option='zero')
2026-02-21T08:49:39.2540146Z                         ^
2026-02-21T08:49:39.2541034Z /tmp/torchinductor_root/tw/ctwf45jgblprssxbx6digyhxum3ybqilmizfwxub35zw4vh7cfja.py:60:41: note: - use: %112 = "ttg.convert_layout"(<<UNKNOWN SSA VALUE>>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>}>>
2026-02-21T08:49:39.2541856Z 
2026-02-21T08:49:39.2541921Z         expanded_1 = tl.expand_dims(v_6, 1)
2026-02-21T08:49:39.2542081Z                                         ^
2026-02-21T08:49:39.2542912Z /tmp/torchinductor_root/tw/ctwf45jgblprssxbx6digyhxum3ybqilmizfwxub35zw4vh7cfja.py:59:41: note: - use: %113 = "ttg.convert_layout"(<<UNKNOWN SSA VALUE>>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>}>>
2026-02-21T08:49:39.2543703Z 
2026-02-21T08:49:39.2543764Z         expanded_0 = tl.expand_dims(v_4, 1)
2026-02-21T08:49:39.2543915Z                                         ^
2026-02-21T08:49:39.2544084Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T08:49:39.2547717Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T08:49:39.2548120Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T08:49:39.2548500Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:49:39.2548874Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:49:39.2549220Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [4], order = [0]}>
2026-02-21T08:49:39.2549694Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [0, 1]}>
2026-02-21T08:49:39.2550035Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T08:49:39.2550385Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [0, 1]}>
2026-02-21T08:49:39.2550727Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T08:49:39.2551086Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T08:49:39.2551457Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T08:49:39.2551837Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:49:39.2552254Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:49:39.2552876Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:49:39.2553318Z     %cst = arith.constant dense<0> : tensor<64x16xi8, #blocked>
2026-02-21T08:49:39.2553535Z     %cst_0 = arith.constant dense<7168> : tensor<1x16xi64, #blocked>
2026-02-21T08:49:39.2553740Z     %cst_1 = arith.constant dense<0> : tensor<1x16xi64, #blocked>
2026-02-21T08:49:39.2553950Z     %cst_2 = arith.constant dense<4096> : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:39.2554158Z     %cst_3 = arith.constant dense<0> : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:39.2554383Z     %cst_4 = arith.constant dense<7168> : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:39.2554598Z     %cst_5 = arith.constant dense<0> : tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:39.2554777Z     %c28672_i32 = arith.constant 28672 : i32
2026-02-21T08:49:39.2554924Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:49:39.2555072Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:49:39.2555236Z     %cst_6 = arith.constant dense<7168> : tensor<32x1xi32, #blocked1>
2026-02-21T08:49:39.2555446Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:49:39.2555645Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:49:39.2555847Z     %cst_9 = arith.constant dense<4> : tensor<64x16xi8, #blocked>
2026-02-21T08:49:39.2556054Z     %cst_10 = arith.constant dense<8192> : tensor<32x1xi32, #blocked1>
2026-02-21T08:49:39.2556225Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:49:39.2556410Z     %cst_11 = arith.constant dense<0.000000e+00> : tensor<32x16xf32, #blocked>
2026-02-21T08:49:39.2556630Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:49:39.2556744Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:49:39.2556858Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:49:39.2556976Z     %0 = tt.get_program_id x : i32
2026-02-21T08:49:39.2557094Z     %1 = arith.divsi %0, %c28672_i32 : i32
2026-02-21T08:49:39.2557210Z     %2 = arith.muli %1, %c64_i32 : i32
2026-02-21T08:49:39.2557323Z     %3 = arith.subi %c2_i32, %2 : i32
2026-02-21T08:49:39.2557433Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T08:49:39.2557549Z     %5 = arith.remsi %0, %c28672_i32 : i32
2026-02-21T08:49:39.2557664Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T08:49:39.2557774Z     %7 = arith.addi %2, %6 : i32
2026-02-21T08:49:39.2557883Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T08:49:39.2557990Z     %9 = arith.muli %7, %c32_i32 : i32
2026-02-21T08:49:39.2558147Z     %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked4>
2026-02-21T08:49:39.2558334Z     %11 = tt.splat %9 : i32 -> tensor<32xi32, #blocked4>
2026-02-21T08:49:39.2558485Z     %12 = arith.addi %11, %10 : tensor<32xi32, #blocked4>
2026-02-21T08:49:39.2558663Z     %13 = arith.muli %8, %c16_i32 : i32
2026-02-21T08:49:39.2558827Z     %14 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked4>
2026-02-21T08:49:39.2559010Z     %15 = tt.splat %13 : i32 -> tensor<16xi32, #blocked4>
2026-02-21T08:49:39.2559154Z     %16 = arith.addi %15, %14 : tensor<16xi32, #blocked4>
2026-02-21T08:49:39.2559326Z     %17 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked4>
2026-02-21T08:49:39.2559602Z     %18 = ttg.convert_layout %12 : tensor<32xi32, #blocked4> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:49:39.2559936Z     %19 = tt.expand_dims %18 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<32x1xi32, #blocked5>
2026-02-21T08:49:39.2560254Z     %20 = ttg.convert_layout %19 : tensor<32x1xi32, #blocked5> -> tensor<32x1xi32, #blocked1>
2026-02-21T08:49:39.2560459Z     %21 = arith.muli %20, %cst_10 : tensor<32x1xi32, #blocked1>
2026-02-21T08:49:39.2571685Z     %22 = tt.broadcast %21 : tensor<32x1xi32, #blocked1> -> tensor<32x128xi32, #blocked1>
2026-02-21T08:49:39.2572033Z     %23 = ttg.convert_layout %22 : tensor<32x128xi32, #blocked1> -> tensor<32x128xi32, #blocked6>
2026-02-21T08:49:39.2572283Z     %24 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T08:49:39.2572452Z     %25 = arith.extsi %13 : i32 to i64
2026-02-21T08:49:39.2572609Z     %26 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<64x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:49:39.2572814Z     %27 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #blocked4>
2026-02-21T08:49:39.2573023Z     %28 = arith.extsi %27 : tensor<64xi32, #blocked4> to tensor<64xi64, #blocked4>
2026-02-21T08:49:39.2573207Z     %29 = tt.splat %25 : i64 -> tensor<16xi64, #blocked4>
2026-02-21T08:49:39.2573388Z     %30 = arith.extsi %14 : tensor<16xi32, #blocked4> to tensor<16xi64, #blocked4>
2026-02-21T08:49:39.2573569Z     %31 = arith.addi %29, %30 : tensor<16xi64, #blocked4>
2026-02-21T08:49:39.2573797Z     %32 = ttg.convert_layout %31 : tensor<16xi64, #blocked4> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:49:39.2574128Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x16xi64, #blocked7>
2026-02-21T08:49:39.2574415Z     %34 = ttg.convert_layout %33 : tensor<1x16xi64, #blocked7> -> tensor<1x16xi64, #blocked>
2026-02-21T08:49:39.2574636Z     %35 = tt.broadcast %34 : tensor<1x16xi64, #blocked> -> tensor<64x16xi64, #blocked>
2026-02-21T08:49:39.2574834Z     %36 = arith.cmpi sge, %34, %cst_1 : tensor<1x16xi64, #blocked>
2026-02-21T08:49:39.2575000Z     %37 = arith.cmpi slt, %34, %cst_0 : tensor<1x16xi64, #blocked>
2026-02-21T08:49:39.2575160Z     %38 = arith.andi %36, %37 : tensor<1x16xi1, #blocked>
2026-02-21T08:49:39.2575336Z     %39 = tt.broadcast %38 : tensor<1x16xi1, #blocked> -> tensor<64x16xi1, #blocked>
2026-02-21T08:49:39.2575542Z     %40 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked4>
2026-02-21T08:49:39.2575798Z     %41 = ttg.convert_layout %40 : tensor<2xi32, #blocked4> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:49:39.2576117Z     %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x2xi32, #blocked7>
2026-02-21T08:49:39.2576398Z     %43 = ttg.convert_layout %42 : tensor<1x2xi32, #blocked7> -> tensor<1x2xi32, #blocked8>
2026-02-21T08:49:39.2576674Z     %44 = ttg.convert_layout %43 : tensor<1x2xi32, #blocked8> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T08:49:39.2576996Z     %45 = tt.expand_dims %44 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x2x1xi32, #blocked9>
2026-02-21T08:49:39.2577289Z     %46 = ttg.convert_layout %45 : tensor<1x2x1xi32, #blocked9> -> tensor<1x2x1xi32, #blocked3>
2026-02-21T08:49:39.2577498Z     %47 = arith.cmpi eq, %46, %cst_8 : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:49:39.2577761Z     %48 = tt.broadcast %47 : tensor<1x2x1xi1, #blocked3> -> tensor<64x2x16xi1, #blocked3>
2026-02-21T08:49:39.2578003Z     %49 = ttg.convert_layout %48 : tensor<64x2x16xi1, #blocked3> -> tensor<64x2x16xi1, #blocked2>
2026-02-21T08:49:39.2578209Z     %50 = arith.cmpi eq, %46, %cst_7 : tensor<1x2x1xi32, #blocked3>
2026-02-21T08:49:39.2578404Z     %51 = tt.broadcast %50 : tensor<1x2x1xi1, #blocked3> -> tensor<64x2x16xi1, #blocked3>
2026-02-21T08:49:39.2578633Z     %52 = ttg.convert_layout %51 : tensor<64x2x16xi1, #blocked3> -> tensor<64x2x16xi1, #blocked2>
2026-02-21T08:49:39.2578823Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:49:39.2579048Z     %53 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg4 = %cst_11) -> (tensor<32x16xf32, #blocked>)  : i32 {
2026-02-21T08:49:39.2579272Z       %68 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T08:49:39.2579415Z       %69 = tt.splat %68 : i32 -> tensor<128xi32, #blocked4>
2026-02-21T08:49:39.2579571Z       %70 = arith.addi %69, %17 : tensor<128xi32, #blocked4>
2026-02-21T08:49:39.2579808Z       %71 = ttg.convert_layout %70 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:49:39.2580164Z       %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7>
2026-02-21T08:49:39.2580455Z       %73 = ttg.convert_layout %72 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6>
2026-02-21T08:49:39.2580688Z       %74 = tt.broadcast %73 : tensor<1x128xi32, #blocked6> -> tensor<32x128xi32, #blocked6>
2026-02-21T08:49:39.2580880Z       %75 = arith.addi %23, %74 : tensor<32x128xi32, #blocked6>
2026-02-21T08:49:39.2581084Z       %76 = tt.addptr %24, %75 : tensor<32x128x!tt.ptr<bf16>, #blocked6>, tensor<32x128xi32, #blocked6>
2026-02-21T08:49:39.2581288Z       %77 = tt.load %76 : tensor<32x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T08:49:39.2581485Z       %78 = arith.extf %77 : tensor<32x128xbf16, #blocked6> to tensor<32x128xf32, #blocked6>
2026-02-21T08:49:39.2581663Z       %79 = arith.extsi %arg3 : i32 to i64
2026-02-21T08:49:39.2581801Z       %80 = tt.splat %79 : i64 -> tensor<64xi64, #blocked4>
2026-02-21T08:49:39.2581952Z       %81 = arith.addi %80, %28 : tensor<64xi64, #blocked4>
2026-02-21T08:49:39.2582180Z       %82 = ttg.convert_layout %81 : tensor<64xi64, #blocked4> -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:49:39.2582505Z       %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi64, #blocked5>
2026-02-21T08:49:39.2582793Z       %84 = ttg.convert_layout %83 : tensor<64x1xi64, #blocked5> -> tensor<64x1xi64, #blocked1>
2026-02-21T08:49:39.2582994Z       %85 = arith.muli %84, %cst_4 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:39.2583186Z       %86 = tt.broadcast %85 : tensor<64x1xi64, #blocked1> -> tensor<64x16xi64, #blocked1>
2026-02-21T08:49:39.2583419Z       %87 = ttg.convert_layout %86 : tensor<64x16xi64, #blocked1> -> tensor<64x16xi64, #blocked>
2026-02-21T08:49:39.2583618Z       %88 = arith.addi %87, %35 : tensor<64x16xi64, #blocked>
2026-02-21T08:49:39.2583812Z       %89 = tt.addptr %26, %88 : tensor<64x16x!tt.ptr<i8>, #blocked>, tensor<64x16xi64, #blocked>
2026-02-21T08:49:39.2584019Z       %90 = arith.cmpi sge, %84, %cst_3 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:39.2584193Z       %91 = arith.cmpi slt, %84, %cst_2 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:39.2584352Z       %92 = arith.andi %90, %91 : tensor<64x1xi1, #blocked1>
2026-02-21T08:49:39.2584534Z       %93 = tt.broadcast %92 : tensor<64x1xi1, #blocked1> -> tensor<64x16xi1, #blocked1>
2026-02-21T08:49:39.2584759Z       %94 = ttg.convert_layout %93 : tensor<64x16xi1, #blocked1> -> tensor<64x16xi1, #blocked>
2026-02-21T08:49:39.2584954Z       %95 = arith.andi %94, %39 : tensor<64x16xi1, #blocked>
2026-02-21T08:49:39.2585117Z       %96 = tt.load %89, %95, %cst : tensor<64x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:49:39.2585317Z       %97 = arith.shli %96, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:39.2585477Z       %98 = arith.shrsi %97, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:39.2585634Z       %99 = arith.shrsi %96, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:39.2585879Z       %100 = ttg.convert_layout %98 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:49:39.2586224Z       %101 = tt.expand_dims %100 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:49:39.2586541Z       %102 = ttg.convert_layout %101 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:49:39.2586838Z       %103 = ttg.convert_layout %99 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:49:39.2587181Z       %104 = tt.expand_dims %103 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:49:39.2587491Z       %105 = ttg.convert_layout %104 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:49:39.2587780Z       %106 = tt.broadcast %102 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:49:39.2588024Z       %107 = ttg.convert_layout %106 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:39.2588284Z       %108 = arith.select %49, %107, %cst_5 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:39.2588530Z       %109 = tt.broadcast %105 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:49:39.2588779Z       %110 = ttg.convert_layout %109 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:39.2589032Z       %111 = arith.select %52, %110, %108 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:39.2589269Z       %112 = tt.reshape %111 : tensor<64x2x16xi8, #blocked2> -> tensor<128x16xi8, #blocked>
2026-02-21T08:49:39.2589500Z       %113 = arith.sitofp %112 : tensor<128x16xi8, #blocked> to tensor<128x16xf32, #blocked>
2026-02-21T08:49:39.2589790Z       %114 = ttg.convert_layout %78 : tensor<32x128xf32, #blocked6> -> tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
2026-02-21T08:49:39.2590137Z       %115 = ttg.convert_layout %113 : tensor<128x16xf32, #blocked> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>>
2026-02-21T08:49:39.2590432Z       %116 = ttg.convert_layout %arg4 : tensor<32x16xf32, #blocked> -> tensor<32x16xf32, #blocked>
2026-02-21T08:49:39.2590834Z       %117 = tt.dot %114, %115, %116, inputPrecision = tf32 : tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<32x16xf32, #blocked>
2026-02-21T08:49:39.2591169Z       %c1_i32 = arith.constant 1 : i32
2026-02-21T08:49:39.2591296Z       %118 = arith.muli %c64_i32, %c1_i32 : i32
2026-02-21T08:49:39.2591420Z       %119 = arith.addi %arg3, %118 : i32
2026-02-21T08:49:39.2591540Z       %120 = arith.muli %119, %c2_i32 : i32
2026-02-21T08:49:39.2591680Z       %121 = tt.splat %120 : i32 -> tensor<128xi32, #blocked4>
2026-02-21T08:49:39.2591836Z       %122 = arith.addi %121, %17 : tensor<128xi32, #blocked4>
2026-02-21T08:49:39.2592076Z       %123 = ttg.convert_layout %122 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:49:39.2592409Z       %124 = tt.expand_dims %123 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7>
2026-02-21T08:49:39.2592703Z       %125 = ttg.convert_layout %124 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6>
2026-02-21T08:49:39.2592940Z       %126 = tt.broadcast %125 : tensor<1x128xi32, #blocked6> -> tensor<32x128xi32, #blocked6>
2026-02-21T08:49:39.2593138Z       %127 = arith.addi %23, %126 : tensor<32x128xi32, #blocked6>
2026-02-21T08:49:39.2593379Z       %128 = tt.addptr %24, %127 : tensor<32x128x!tt.ptr<bf16>, #blocked6>, tensor<32x128xi32, #blocked6>
2026-02-21T08:49:39.2593588Z       %129 = tt.load %128 : tensor<32x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T08:49:39.2593785Z       %130 = arith.extf %129 : tensor<32x128xbf16, #blocked6> to tensor<32x128xf32, #blocked6>
2026-02-21T08:49:39.2593961Z       %131 = arith.extsi %119 : i32 to i64
2026-02-21T08:49:39.2594098Z       %132 = tt.splat %131 : i64 -> tensor<64xi64, #blocked4>
2026-02-21T08:49:39.2594250Z       %133 = arith.addi %132, %28 : tensor<64xi64, #blocked4>
2026-02-21T08:49:39.2594481Z       %134 = ttg.convert_layout %133 : tensor<64xi64, #blocked4> -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:49:39.2594809Z       %135 = tt.expand_dims %134 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi64, #blocked5>
2026-02-21T08:49:39.2595096Z       %136 = ttg.convert_layout %135 : tensor<64x1xi64, #blocked5> -> tensor<64x1xi64, #blocked1>
2026-02-21T08:49:39.2595301Z       %137 = arith.muli %136, %cst_4 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:39.2595522Z       %138 = tt.broadcast %137 : tensor<64x1xi64, #blocked1> -> tensor<64x16xi64, #blocked1>
2026-02-21T08:49:39.2595756Z       %139 = ttg.convert_layout %138 : tensor<64x16xi64, #blocked1> -> tensor<64x16xi64, #blocked>
2026-02-21T08:49:39.2595956Z       %140 = arith.addi %139, %35 : tensor<64x16xi64, #blocked>
2026-02-21T08:49:39.2596151Z       %141 = tt.addptr %26, %140 : tensor<64x16x!tt.ptr<i8>, #blocked>, tensor<64x16xi64, #blocked>
2026-02-21T08:49:39.2596359Z       %142 = arith.cmpi sge, %136, %cst_3 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:39.2596531Z       %143 = arith.cmpi slt, %136, %cst_2 : tensor<64x1xi64, #blocked1>
2026-02-21T08:49:39.2596696Z       %144 = arith.andi %142, %143 : tensor<64x1xi1, #blocked1>
2026-02-21T08:49:39.2596884Z       %145 = tt.broadcast %144 : tensor<64x1xi1, #blocked1> -> tensor<64x16xi1, #blocked1>
2026-02-21T08:49:39.2597115Z       %146 = ttg.convert_layout %145 : tensor<64x16xi1, #blocked1> -> tensor<64x16xi1, #blocked>
2026-02-21T08:49:39.2597312Z       %147 = arith.andi %146, %39 : tensor<64x16xi1, #blocked>
2026-02-21T08:49:39.2597474Z       %148 = tt.load %141, %147, %cst : tensor<64x16x!tt.ptr<i8>, #blocked>
2026-02-21T08:49:39.2597646Z       %149 = arith.shli %148, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:39.2597807Z       %150 = arith.shrsi %149, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:39.2597966Z       %151 = arith.shrsi %148, %cst_9 : tensor<64x16xi8, #blocked>
2026-02-21T08:49:39.2598212Z       %152 = ttg.convert_layout %150 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:49:39.2598552Z       %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:49:39.2598861Z       %154 = ttg.convert_layout %153 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:49:39.2599154Z       %155 = ttg.convert_layout %151 : tensor<64x16xi8, #blocked> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:49:39.2599492Z       %156 = tt.expand_dims %155 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:49:39.2599796Z       %157 = ttg.convert_layout %156 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:49:39.2600045Z       %158 = tt.broadcast %154 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:49:39.2600294Z       %159 = ttg.convert_layout %158 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:39.2600548Z       %160 = arith.select %49, %159, %cst_5 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:39.2600792Z       %161 = tt.broadcast %157 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:49:39.2601066Z       %162 = ttg.convert_layout %161 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:39.2601315Z       %163 = arith.select %52, %162, %160 : tensor<64x2x16xi1, #blocked2>, tensor<64x2x16xi8, #blocked2>
2026-02-21T08:49:39.2601554Z       %164 = tt.reshape %163 : tensor<64x2x16xi8, #blocked2> -> tensor<128x16xi8, #blocked>
2026-02-21T08:49:39.2601781Z       %165 = arith.sitofp %164 : tensor<128x16xi8, #blocked> to tensor<128x16xf32, #blocked>
2026-02-21T08:49:39.2602065Z       %166 = ttg.convert_layout %130 : tensor<32x128xf32, #blocked6> -> tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
2026-02-21T08:49:39.2602406Z       %167 = ttg.convert_layout %165 : tensor<128x16xf32, #blocked> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>>
2026-02-21T08:49:39.2602733Z       %168 = ttg.convert_layout %117 : tensor<32x16xf32, #blocked> -> tensor<32x16xf32, #blocked>
2026-02-21T08:49:39.2603131Z       %169 = tt.dot %166, %167, %168, inputPrecision = tf32 : tensor<32x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<32x16xf32, #blocked>
2026-02-21T08:49:39.2603515Z       scf.yield %169 : tensor<32x16xf32, #blocked>
2026-02-21T08:49:39.2603637Z     } {tt.flatten}
2026-02-21T08:49:39.2603783Z     %54 = arith.truncf %53 : tensor<32x16xf32, #blocked> to tensor<32x16xbf16, #blocked>
2026-02-21T08:49:39.2604056Z     %55 = ttg.convert_layout %12 : tensor<32xi32, #blocked4> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:49:39.2604381Z     %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<32x1xi32, #blocked5>
2026-02-21T08:49:39.2604664Z     %57 = ttg.convert_layout %56 : tensor<32x1xi32, #blocked5> -> tensor<32x1xi32, #blocked1>
2026-02-21T08:49:39.2604861Z     %58 = arith.muli %57, %cst_6 : tensor<32x1xi32, #blocked1>
2026-02-21T08:49:39.2605096Z     %59 = ttg.convert_layout %16 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:49:39.2605410Z     %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x16xi32, #blocked7>
2026-02-21T08:49:39.2605692Z     %61 = ttg.convert_layout %60 : tensor<1x16xi32, #blocked7> -> tensor<1x16xi32, #blocked>
2026-02-21T08:49:39.2605916Z     %62 = tt.broadcast %58 : tensor<32x1xi32, #blocked1> -> tensor<32x16xi32, #blocked1>
2026-02-21T08:49:39.2606140Z     %63 = ttg.convert_layout %62 : tensor<32x16xi32, #blocked1> -> tensor<32x16xi32, #blocked>
2026-02-21T08:49:39.2606363Z     %64 = tt.broadcast %61 : tensor<1x16xi32, #blocked> -> tensor<32x16xi32, #blocked>
2026-02-21T08:49:39.2606546Z     %65 = arith.addi %63, %64 : tensor<32x16xi32, #blocked>
2026-02-21T08:49:39.2606721Z     %66 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x16x!tt.ptr<bf16>, #blocked>
2026-02-21T08:49:39.2606941Z     %67 = tt.addptr %66, %65 : tensor<32x16x!tt.ptr<bf16>, #blocked>, tensor<32x16xi32, #blocked>
2026-02-21T08:49:39.2607137Z     tt.store %67, %54 : tensor<32x16x!tt.ptr<bf16>, #blocked>
2026-02-21T08:49:39.2607270Z     tt.return
2026-02-21T08:49:39.2607349Z   }
2026-02-21T08:49:39.2607423Z }
2026-02-21T08:49:39.2607465Z 
2026-02-21T08:49:39.2607502Z {-#
2026-02-21T08:49:39.2607581Z   external_resources: {
2026-02-21T08:49:39.2607681Z     mlir_reproducer: {
2026-02-21T08:49:39.2609979Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=16}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T08:49:39.2612253Z       disable_threading: false,
2026-02-21T08:49:39.2612358Z       verify_each: true
2026-02-21T08:49:39.2612449Z     }
2026-02-21T08:49:39.2612522Z   }
2026-02-21T08:49:39.2612592Z #-}
2026-02-21T08:49:39.2612871Z /tmp/torchinductor_root/tw/ctwf45jgblprssxbx6digyhxum3ybqilmizfwxub35zw4vh7cfja.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:49:39.2613639Z /tmp/torchinductor_root/tw/ctwf45jgblprssxbx6digyhxum3ybqilmizfwxub35zw4vh7cfja.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:49:39.2614201Z [217s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:49:39.2614921Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 32, 16], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:49:39.2615571Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:49:39.2615743Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:49:39.4436230Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 17.3 configs/s
2026-02-21T08:49:47.9059375Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 114.2 configs/s
2026-02-21T08:49:48.6047478Z [226s] Generation 5 complete: 
2026-02-21T08:49:48.6047630Z error=5
2026-02-21T08:49:48.6047710Z ok=81
2026-02-21T08:49:48.6047786Z min=0.1098
2026-02-21T08:49:48.6047868Z mid=0.1358
2026-02-21T08:49:48.6047942Z max=6.3263
2026-02-21T08:49:48.6048028Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:49:48.6048168Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T08:49:48.6048305Z  'l2_groupings': [64],
2026-02-21T08:49:48.6048777Z  'load_eviction_policies': ['', ''],
2026-02-21T08:49:48.6048892Z  'loop_orders': [[0, 1]],
2026-02-21T08:49:48.6048999Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:49:48.6049124Z  'num_stages': 4,
2026-02-21T08:49:48.6049209Z  'num_warps': 2,
2026-02-21T08:49:48.6049296Z  'pid_type': 'flat',
2026-02-21T08:49:48.6049397Z  'range_flattens': [None, True],
2026-02-21T08:49:48.6049509Z  'range_multi_buffers': [None, True],
2026-02-21T08:49:48.6049623Z  'range_num_stages': [0, 1],
2026-02-21T08:49:48.6049728Z  'range_unroll_factors': [0, 2],
2026-02-21T08:49:48.6049838Z  'range_warp_specializes': [],
2026-02-21T08:49:48.6049941Z  'waves_per_eu': 1}
2026-02-21T08:49:48.7048920Z [226s] Fitting surrogate: 592 points, 592 targets
2026-02-21T08:49:49.4145089Z [227s] Generation 6 starting: 69 neighbors, 4 active search path(s)
2026-02-21T08:50:10.0795198Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 0.5 configs/s
2026-02-21T08:50:14.1922001Z /tmp/torchinductor_root/6s/c6s6ohrgipzdk4omizjox23stwih4vsfrgk7j252il3i4a32r56e.py:48:83: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T08:50:14.1924648Z         b_tile = tl.load(B + (indices_3[:, None] * 7168 + indices_2[None, :] * 1), None)
2026-02-21T08:50:14.1925306Z                                                                                   ^
2026-02-21T08:50:14.1927240Z /tmp/torchinductor_root/6s/c6s6ohrgipzdk4omizjox23stwih4vsfrgk7j252il3i4a32r56e.py:61:41: note: - use: %89 = "ttg.convert_layout"(<<UNKNOWN SSA VALUE>>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>}>>
2026-02-21T08:50:14.1928975Z 
2026-02-21T08:50:14.1929114Z         expanded_1 = tl.expand_dims(v_6, 1)
2026-02-21T08:50:14.1929453Z                                         ^
2026-02-21T08:50:14.1931208Z /tmp/torchinductor_root/6s/c6s6ohrgipzdk4omizjox23stwih4vsfrgk7j252il3i4a32r56e.py:60:41: note: - use: %90 = "ttg.convert_layout"(<<UNKNOWN SSA VALUE>>) : (tensor<64x16xi8, #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>>) -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>}>>
2026-02-21T08:50:14.1933037Z 
2026-02-21T08:50:14.1933086Z         expanded_0 = tl.expand_dims(v_4, 1)
2026-02-21T08:50:14.1933223Z                                         ^
2026-02-21T08:50:14.1933364Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T08:50:14.1933655Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:50:14.1934017Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:50:14.1934371Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:50:14.1934711Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:50:14.1935028Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}>
2026-02-21T08:50:14.1935347Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}>
2026-02-21T08:50:14.1935673Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T08:50:14.1935999Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}>
2026-02-21T08:50:14.1936319Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:50:14.1936834Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}>
2026-02-21T08:50:14.1937192Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}>
2026-02-21T08:50:14.1937546Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:50:14.1937889Z #blocked12 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [8, 8], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T08:50:14.1938265Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:50:14.1938806Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:50:14.1939236Z     %cst = arith.constant dense<0> : tensor<64x2x16xi8, #blocked>
2026-02-21T08:50:14.1939408Z     %c28672_i32 = arith.constant 28672 : i32
2026-02-21T08:50:14.1939548Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:50:14.1939686Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:50:14.1939851Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T08:50:14.1940050Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T08:50:14.1940243Z     %cst_2 = arith.constant dense<4> : tensor<64x16xi8, #blocked2>
2026-02-21T08:50:14.1940443Z     %cst_3 = arith.constant dense<7168> : tensor<64x1xi32, #blocked3>
2026-02-21T08:50:14.1940634Z     %cst_4 = arith.constant dense<8192> : tensor<64x1xi32, #blocked3>
2026-02-21T08:50:14.1940802Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:50:14.1940974Z     %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x16xf32, #blocked2>
2026-02-21T08:50:14.1941154Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:50:14.1941286Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:50:14.1941417Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:50:14.1941543Z     %0 = tt.get_program_id x : i32
2026-02-21T08:50:14.1941706Z     %1 = arith.divsi %0, %c28672_i32 : i32
2026-02-21T08:50:14.1941839Z     %2 = arith.muli %1, %c64_i32 : i32
2026-02-21T08:50:14.1941965Z     %3 = arith.subi %c1_i32, %2 : i32
2026-02-21T08:50:14.1942090Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T08:50:14.1942217Z     %5 = arith.remsi %0, %c28672_i32 : i32
2026-02-21T08:50:14.1942345Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T08:50:14.1942465Z     %7 = arith.addi %2, %6 : i32
2026-02-21T08:50:14.1942579Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T08:50:14.1942699Z     %9 = arith.muli %7, %c64_i32 : i32
2026-02-21T08:50:14.1942853Z     %10 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #blocked4>
2026-02-21T08:50:14.1943043Z     %11 = tt.splat %9 : i32 -> tensor<64xi32, #blocked4>
2026-02-21T08:50:14.1943194Z     %12 = arith.addi %11, %10 : tensor<64xi32, #blocked4>
2026-02-21T08:50:14.1943325Z     %13 = arith.muli %8, %c16_i32 : i32
2026-02-21T08:50:14.1943480Z     %14 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked4>
2026-02-21T08:50:14.1943659Z     %15 = tt.splat %13 : i32 -> tensor<16xi32, #blocked4>
2026-02-21T08:50:14.1943803Z     %16 = arith.addi %15, %14 : tensor<16xi32, #blocked4>
2026-02-21T08:50:14.1943973Z     %17 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked4>
2026-02-21T08:50:14.1944239Z     %18 = ttg.convert_layout %12 : tensor<64xi32, #blocked4> -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:50:14.1944564Z     %19 = tt.expand_dims %18 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi32, #blocked5>
2026-02-21T08:50:14.1944847Z     %20 = ttg.convert_layout %19 : tensor<64x1xi32, #blocked5> -> tensor<64x1xi32, #blocked3>
2026-02-21T08:50:14.1945049Z     %21 = arith.muli %20, %cst_4 : tensor<64x1xi32, #blocked3>
2026-02-21T08:50:14.1945290Z     %22 = tt.broadcast %21 : tensor<64x1xi32, #blocked3> -> tensor<64x128xi32, #blocked3>
2026-02-21T08:50:14.1945530Z     %23 = ttg.convert_layout %22 : tensor<64x128xi32, #blocked3> -> tensor<64x128xi32, #blocked6>
2026-02-21T08:50:14.1945758Z     %24 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T08:50:14.1946012Z     %25 = ttg.convert_layout %16 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:50:14.1946328Z     %26 = tt.expand_dims %25 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x16xi32, #blocked7>
2026-02-21T08:50:14.1946607Z     %27 = ttg.convert_layout %26 : tensor<1x16xi32, #blocked7> -> tensor<1x16xi32, #blocked2>
2026-02-21T08:50:14.1946834Z     %28 = tt.broadcast %27 : tensor<1x16xi32, #blocked2> -> tensor<64x16xi32, #blocked2>
2026-02-21T08:50:14.1947042Z     %29 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<64x16x!tt.ptr<i8>, #blocked2>
2026-02-21T08:50:14.1947240Z     %30 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked4>
2026-02-21T08:50:14.1947493Z     %31 = ttg.convert_layout %30 : tensor<2xi32, #blocked4> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:50:14.1947804Z     %32 = tt.expand_dims %31 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x2xi32, #blocked7>
2026-02-21T08:50:14.1948080Z     %33 = ttg.convert_layout %32 : tensor<1x2xi32, #blocked7> -> tensor<1x2xi32, #blocked8>
2026-02-21T08:50:14.1948356Z     %34 = ttg.convert_layout %33 : tensor<1x2xi32, #blocked8> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T08:50:14.1948678Z     %35 = tt.expand_dims %34 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x2x1xi32, #blocked9>
2026-02-21T08:50:14.1948969Z     %36 = ttg.convert_layout %35 : tensor<1x2x1xi32, #blocked9> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T08:50:14.1949176Z     %37 = arith.cmpi eq, %36, %cst_1 : tensor<1x2x1xi32, #blocked1>
2026-02-21T08:50:14.1949373Z     %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked1> -> tensor<64x2x16xi1, #blocked1>
2026-02-21T08:50:14.1949640Z     %39 = ttg.convert_layout %38 : tensor<64x2x16xi1, #blocked1> -> tensor<64x2x16xi1, #blocked>
2026-02-21T08:50:14.1949843Z     %40 = arith.cmpi eq, %36, %cst_0 : tensor<1x2x1xi32, #blocked1>
2026-02-21T08:50:14.1950037Z     %41 = tt.broadcast %40 : tensor<1x2x1xi1, #blocked1> -> tensor<64x2x16xi1, #blocked1>
2026-02-21T08:50:14.1950265Z     %42 = ttg.convert_layout %41 : tensor<64x2x16xi1, #blocked1> -> tensor<64x2x16xi1, #blocked>
2026-02-21T08:50:14.1950455Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:50:14.1950678Z     %43 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c128_i32 iter_args(%arg4 = %cst_5) -> (tensor<64x16xf32, #blocked2>)  : i32 {
2026-02-21T08:50:14.1950919Z       %58 = tt.splat %arg3 : i32 -> tensor<64xi32, #blocked4>
2026-02-21T08:50:14.1951076Z       %59 = arith.addi %58, %10 : tensor<64xi32, #blocked4>
2026-02-21T08:50:14.1951214Z       %60 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T08:50:14.1951353Z       %61 = tt.splat %60 : i32 -> tensor<128xi32, #blocked4>
2026-02-21T08:50:14.1951499Z       %62 = arith.addi %61, %17 : tensor<128xi32, #blocked4>
2026-02-21T08:50:14.1951732Z       %63 = ttg.convert_layout %62 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:50:14.1952057Z       %64 = tt.expand_dims %63 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7>
2026-02-21T08:50:14.1952341Z       %65 = ttg.convert_layout %64 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6>
2026-02-21T08:50:14.1952573Z       %66 = tt.broadcast %65 : tensor<1x128xi32, #blocked6> -> tensor<64x128xi32, #blocked6>
2026-02-21T08:50:14.1952762Z       %67 = arith.addi %23, %66 : tensor<64x128xi32, #blocked6>
2026-02-21T08:50:14.1952999Z       %68 = tt.addptr %24, %67 : tensor<64x128x!tt.ptr<bf16>, #blocked6>, tensor<64x128xi32, #blocked6>
2026-02-21T08:50:14.1953205Z       %69 = tt.load %68 : tensor<64x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T08:50:14.1953396Z       %70 = arith.extf %69 : tensor<64x128xbf16, #blocked6> to tensor<64x128xf32, #blocked6>
2026-02-21T08:50:14.1953665Z       %71 = ttg.convert_layout %59 : tensor<64xi32, #blocked4> -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:50:14.1953982Z       %72 = tt.expand_dims %71 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi32, #blocked5>
2026-02-21T08:50:14.1954263Z       %73 = ttg.convert_layout %72 : tensor<64x1xi32, #blocked5> -> tensor<64x1xi32, #blocked3>
2026-02-21T08:50:14.1954462Z       %74 = arith.muli %73, %cst_3 : tensor<64x1xi32, #blocked3>
2026-02-21T08:50:14.1954646Z       %75 = tt.broadcast %74 : tensor<64x1xi32, #blocked3> -> tensor<64x16xi32, #blocked3>
2026-02-21T08:50:14.1954881Z       %76 = ttg.convert_layout %75 : tensor<64x16xi32, #blocked3> -> tensor<64x16xi32, #blocked2>
2026-02-21T08:50:14.1955079Z       %77 = arith.addi %76, %28 : tensor<64x16xi32, #blocked2>
2026-02-21T08:50:14.1955275Z       %78 = tt.addptr %29, %77 : tensor<64x16x!tt.ptr<i8>, #blocked2>, tensor<64x16xi32, #blocked2>
2026-02-21T08:50:14.1955473Z       %79 = tt.load %78 : tensor<64x16x!tt.ptr<i8>, #blocked2>
2026-02-21T08:50:14.1955627Z       %80 = arith.shli %79, %cst_2 : tensor<64x16xi8, #blocked2>
2026-02-21T08:50:14.1955787Z       %81 = arith.shrsi %80, %cst_2 : tensor<64x16xi8, #blocked2>
2026-02-21T08:50:14.1955944Z       %82 = arith.shrsi %79, %cst_2 : tensor<64x16xi8, #blocked2>
2026-02-21T08:50:14.1956186Z       %83 = ttg.convert_layout %81 : tensor<64x16xi8, #blocked2> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:50:14.1956524Z       %84 = tt.expand_dims %83 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:50:14.1956825Z       %85 = ttg.convert_layout %84 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:50:14.1957114Z       %86 = ttg.convert_layout %82 : tensor<64x16xi8, #blocked2> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:50:14.1966337Z       %87 = tt.expand_dims %86 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:50:14.1966638Z       %88 = ttg.convert_layout %87 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:50:14.1966883Z       %89 = tt.broadcast %85 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:50:14.1967121Z       %90 = ttg.convert_layout %89 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked>
2026-02-21T08:50:14.1967362Z       %91 = arith.select %39, %90, %cst : tensor<64x2x16xi1, #blocked>, tensor<64x2x16xi8, #blocked>
2026-02-21T08:50:14.1967600Z       %92 = tt.broadcast %88 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:50:14.1967835Z       %93 = ttg.convert_layout %92 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked>
2026-02-21T08:50:14.1968075Z       %94 = arith.select %42, %93, %91 : tensor<64x2x16xi1, #blocked>, tensor<64x2x16xi8, #blocked>
2026-02-21T08:50:14.1968298Z       %95 = tt.reshape %94 : tensor<64x2x16xi8, #blocked> -> tensor<128x16xi8, #blocked2>
2026-02-21T08:50:14.1968523Z       %96 = arith.sitofp %95 : tensor<128x16xi8, #blocked2> to tensor<128x16xf32, #blocked2>
2026-02-21T08:50:14.1968809Z       %97 = ttg.convert_layout %70 : tensor<64x128xf32, #blocked6> -> tensor<64x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>>
2026-02-21T08:50:14.1969156Z       %98 = ttg.convert_layout %96 : tensor<128x16xf32, #blocked2> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>>
2026-02-21T08:50:14.1969458Z       %99 = ttg.convert_layout %arg4 : tensor<64x16xf32, #blocked2> -> tensor<64x16xf32, #blocked12>
2026-02-21T08:50:14.1969935Z       %100 = tt.dot %97, %98, %99, inputPrecision = tf32 : tensor<64x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> -> tensor<64x16xf32, #blocked12>
2026-02-21T08:50:14.1970344Z       %101 = ttg.convert_layout %100 : tensor<64x16xf32, #blocked12> -> tensor<64x16xf32, #blocked2>
2026-02-21T08:50:14.1970557Z       %c1_i32_6 = arith.constant 1 : i32
2026-02-21T08:50:14.1970683Z       %102 = arith.muli %c64_i32, %c1_i32_6 : i32
2026-02-21T08:50:14.1970807Z       %103 = arith.addi %arg3, %102 : i32
2026-02-21T08:50:14.1970946Z       %104 = tt.splat %103 : i32 -> tensor<64xi32, #blocked4>
2026-02-21T08:50:14.1971097Z       %105 = arith.addi %104, %10 : tensor<64xi32, #blocked4>
2026-02-21T08:50:14.1971232Z       %106 = arith.muli %103, %c2_i32 : i32
2026-02-21T08:50:14.1971365Z       %107 = tt.splat %106 : i32 -> tensor<128xi32, #blocked4>
2026-02-21T08:50:14.1971520Z       %108 = arith.addi %107, %17 : tensor<128xi32, #blocked4>
2026-02-21T08:50:14.1971763Z       %109 = ttg.convert_layout %108 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:50:14.1972094Z       %110 = tt.expand_dims %109 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7>
2026-02-21T08:50:14.1972389Z       %111 = ttg.convert_layout %110 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked6>
2026-02-21T08:50:14.1972626Z       %112 = tt.broadcast %111 : tensor<1x128xi32, #blocked6> -> tensor<64x128xi32, #blocked6>
2026-02-21T08:50:14.1972824Z       %113 = arith.addi %23, %112 : tensor<64x128xi32, #blocked6>
2026-02-21T08:50:14.1973030Z       %114 = tt.addptr %24, %113 : tensor<64x128x!tt.ptr<bf16>, #blocked6>, tensor<64x128xi32, #blocked6>
2026-02-21T08:50:14.1973241Z       %115 = tt.load %114 : tensor<64x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T08:50:14.1973437Z       %116 = arith.extf %115 : tensor<64x128xbf16, #blocked6> to tensor<64x128xf32, #blocked6>
2026-02-21T08:50:14.1973711Z       %117 = ttg.convert_layout %105 : tensor<64xi32, #blocked4> -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:50:14.1974077Z       %118 = tt.expand_dims %117 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi32, #blocked5>
2026-02-21T08:50:14.1974366Z       %119 = ttg.convert_layout %118 : tensor<64x1xi32, #blocked5> -> tensor<64x1xi32, #blocked3>
2026-02-21T08:50:14.1974570Z       %120 = arith.muli %119, %cst_3 : tensor<64x1xi32, #blocked3>
2026-02-21T08:50:14.1974762Z       %121 = tt.broadcast %120 : tensor<64x1xi32, #blocked3> -> tensor<64x16xi32, #blocked3>
2026-02-21T08:50:14.1974993Z       %122 = ttg.convert_layout %121 : tensor<64x16xi32, #blocked3> -> tensor<64x16xi32, #blocked2>
2026-02-21T08:50:14.1975197Z       %123 = arith.addi %122, %28 : tensor<64x16xi32, #blocked2>
2026-02-21T08:50:14.1975393Z       %124 = tt.addptr %29, %123 : tensor<64x16x!tt.ptr<i8>, #blocked2>, tensor<64x16xi32, #blocked2>
2026-02-21T08:50:14.1975595Z       %125 = tt.load %124 : tensor<64x16x!tt.ptr<i8>, #blocked2>
2026-02-21T08:50:14.1975754Z       %126 = arith.shli %125, %cst_2 : tensor<64x16xi8, #blocked2>
2026-02-21T08:50:14.1975919Z       %127 = arith.shrsi %126, %cst_2 : tensor<64x16xi8, #blocked2>
2026-02-21T08:50:14.1976080Z       %128 = arith.shrsi %125, %cst_2 : tensor<64x16xi8, #blocked2>
2026-02-21T08:50:14.1976324Z       %129 = ttg.convert_layout %127 : tensor<64x16xi8, #blocked2> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:50:14.1976668Z       %130 = tt.expand_dims %129 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:50:14.1976976Z       %131 = ttg.convert_layout %130 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:50:14.1977268Z       %132 = ttg.convert_layout %128 : tensor<64x16xi8, #blocked2> -> tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T08:50:14.1977643Z       %133 = tt.expand_dims %132 {axis = 1 : i32} : tensor<64x16xi8, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<64x1x16xi8, #blocked10>
2026-02-21T08:50:14.1977952Z       %134 = ttg.convert_layout %133 : tensor<64x1x16xi8, #blocked10> -> tensor<64x1x16xi8, #blocked11>
2026-02-21T08:50:14.1978196Z       %135 = tt.broadcast %131 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:50:14.1978445Z       %136 = ttg.convert_layout %135 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked>
2026-02-21T08:50:14.1978689Z       %137 = arith.select %39, %136, %cst : tensor<64x2x16xi1, #blocked>, tensor<64x2x16xi8, #blocked>
2026-02-21T08:50:14.1978930Z       %138 = tt.broadcast %134 : tensor<64x1x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked11>
2026-02-21T08:50:14.1979172Z       %139 = ttg.convert_layout %138 : tensor<64x2x16xi8, #blocked11> -> tensor<64x2x16xi8, #blocked>
2026-02-21T08:50:14.1979416Z       %140 = arith.select %42, %139, %137 : tensor<64x2x16xi1, #blocked>, tensor<64x2x16xi8, #blocked>
2026-02-21T08:50:14.1979653Z       %141 = tt.reshape %140 : tensor<64x2x16xi8, #blocked> -> tensor<128x16xi8, #blocked2>
2026-02-21T08:50:14.1979884Z       %142 = arith.sitofp %141 : tensor<128x16xi8, #blocked2> to tensor<128x16xf32, #blocked2>
2026-02-21T08:50:14.1980178Z       %143 = ttg.convert_layout %116 : tensor<64x128xf32, #blocked6> -> tensor<64x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>>
2026-02-21T08:50:14.1980527Z       %144 = ttg.convert_layout %142 : tensor<128x16xf32, #blocked2> -> tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>>
2026-02-21T08:50:14.1980823Z       %145 = ttg.convert_layout %101 : tensor<64x16xf32, #blocked2> -> tensor<64x16xf32, #blocked12>
2026-02-21T08:50:14.1981234Z       %146 = tt.dot %143, %144, %145, inputPrecision = tf32 : tensor<64x128xf32, #ttg.dot_op<{opIdx = 0, parent = #blocked12}>> * tensor<128x16xf32, #ttg.dot_op<{opIdx = 1, parent = #blocked12}>> -> tensor<64x16xf32, #blocked12>
2026-02-21T08:50:14.1981636Z       %147 = ttg.convert_layout %146 : tensor<64x16xf32, #blocked12> -> tensor<64x16xf32, #blocked2>
2026-02-21T08:50:14.1981866Z       scf.yield %147 : tensor<64x16xf32, #blocked2>
2026-02-21T08:50:14.1981993Z     } {tt.flatten}
2026-02-21T08:50:14.1982137Z     %44 = arith.truncf %43 : tensor<64x16xf32, #blocked2> to tensor<64x16xbf16, #blocked2>
2026-02-21T08:50:14.1982408Z     %45 = ttg.convert_layout %12 : tensor<64xi32, #blocked4> -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>>
2026-02-21T08:50:14.1982726Z     %46 = tt.expand_dims %45 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<64x1xi32, #blocked5>
2026-02-21T08:50:14.1983011Z     %47 = ttg.convert_layout %46 : tensor<64x1xi32, #blocked5> -> tensor<64x1xi32, #blocked3>
2026-02-21T08:50:14.1983209Z     %48 = arith.muli %47, %cst_3 : tensor<64x1xi32, #blocked3>
2026-02-21T08:50:14.1983440Z     %49 = ttg.convert_layout %16 : tensor<16xi32, #blocked4> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T08:50:14.1983754Z     %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x16xi32, #blocked7>
2026-02-21T08:50:14.1984034Z     %51 = ttg.convert_layout %50 : tensor<1x16xi32, #blocked7> -> tensor<1x16xi32, #blocked2>
2026-02-21T08:50:14.1984262Z     %52 = tt.broadcast %48 : tensor<64x1xi32, #blocked3> -> tensor<64x16xi32, #blocked3>
2026-02-21T08:50:14.1984494Z     %53 = ttg.convert_layout %52 : tensor<64x16xi32, #blocked3> -> tensor<64x16xi32, #blocked2>
2026-02-21T08:50:14.1984719Z     %54 = tt.broadcast %51 : tensor<1x16xi32, #blocked2> -> tensor<64x16xi32, #blocked2>
2026-02-21T08:50:14.1984908Z     %55 = arith.addi %53, %54 : tensor<64x16xi32, #blocked2>
2026-02-21T08:50:14.1985083Z     %56 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x16x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:50:14.1985304Z     %57 = tt.addptr %56, %55 : tensor<64x16x!tt.ptr<bf16>, #blocked2>, tensor<64x16xi32, #blocked2>
2026-02-21T08:50:14.1985537Z     tt.store %57, %44 : tensor<64x16x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:50:14.1985670Z     tt.return
2026-02-21T08:50:14.1985753Z   }
2026-02-21T08:50:14.1985826Z }
2026-02-21T08:50:14.1985868Z 
2026-02-21T08:50:14.1985901Z {-#
2026-02-21T08:50:14.1985979Z   external_resources: {
2026-02-21T08:50:14.1986079Z     mlir_reproducer: {
2026-02-21T08:50:14.1988302Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=16}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T08:50:14.1990604Z       disable_threading: false,
2026-02-21T08:50:14.1990715Z       verify_each: true
2026-02-21T08:50:14.1990803Z     }
2026-02-21T08:50:14.1990877Z   }
2026-02-21T08:50:14.1990944Z #-}
2026-02-21T08:50:14.1991223Z /tmp/torchinductor_root/6s/c6s6ohrgipzdk4omizjox23stwih4vsfrgk7j252il3i4a32r56e.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:50:14.1991985Z /tmp/torchinductor_root/6s/c6s6ohrgipzdk4omizjox23stwih4vsfrgk7j252il3i4a32r56e.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:50:14.1992536Z [252s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:50:14.1993255Z Config: @helion.kernel(config=helion.Config(block_sizes=[64, 64, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:50:14.1993906Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:50:14.1994072Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:50:14.4577602Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 16.3 configs/s
2026-02-21T08:50:22.8339509Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 126.4 configs/s
2026-02-21T08:50:23.4685467Z [261s] Generation 6 complete: 
2026-02-21T08:50:23.4685861Z error=1
2026-02-21T08:50:23.4686068Z ok=72
2026-02-21T08:50:23.4686277Z min=0.1096
2026-02-21T08:50:23.4686480Z mid=0.1345
2026-02-21T08:50:23.4686679Z max=3.5852
2026-02-21T08:50:23.4686905Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:50:23.4687282Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T08:50:23.4687644Z  'l2_groupings': [64],
2026-02-21T08:50:23.4687920Z  'load_eviction_policies': ['', ''],
2026-02-21T08:50:23.4688802Z  'loop_orders': [[0, 1]],
2026-02-21T08:50:23.4689084Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:50:23.4689353Z  'num_stages': 4,
2026-02-21T08:50:23.4689607Z  'num_warps': 2,
2026-02-21T08:50:23.4689841Z  'pid_type': 'flat',
2026-02-21T08:50:23.4690099Z  'range_flattens': [None, None],
2026-02-21T08:50:23.4690402Z  'range_multi_buffers': [None, True],
2026-02-21T08:50:23.4690703Z  'range_num_stages': [0, 1],
2026-02-21T08:50:23.4690981Z  'range_unroll_factors': [0, 2],
2026-02-21T08:50:23.4691270Z  'range_warp_specializes': [],
2026-02-21T08:50:23.4691547Z  'waves_per_eu': 1}
2026-02-21T08:50:23.5585999Z [261s] Fitting surrogate: 665 points, 665 targets
2026-02-21T08:50:24.3158495Z [262s] Generation 7 starting: 74 neighbors, 4 active search path(s)
2026-02-21T08:50:37.0906402Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 6.4 configs/s
2026-02-21T08:50:41.8022291Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.2 configs/s
2026-02-21T08:50:51.1438298Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 104.3 configs/s
2026-02-21T08:50:51.9213975Z [290s] Generation 7 complete: 
2026-02-21T08:50:51.9214206Z ok=78
2026-02-21T08:50:51.9214322Z min=0.1065
2026-02-21T08:50:51.9214445Z mid=0.1277
2026-02-21T08:50:51.9214552Z max=9.2420
2026-02-21T08:50:51.9214673Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:50:51.9214878Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T08:50:51.9215080Z  'l2_groupings': [8],
2026-02-21T08:50:51.9215229Z  'load_eviction_policies': ['', ''],
2026-02-21T08:50:51.9215393Z  'loop_orders': [[1, 0]],
2026-02-21T08:50:51.9215540Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:50:51.9215689Z  'num_sm_multiplier': 16,
2026-02-21T08:50:51.9215832Z  'num_stages': 2,
2026-02-21T08:50:51.9215956Z  'num_warps': 2,
2026-02-21T08:50:51.9216102Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:50:51.9216292Z  'range_flattens': [None, True],
2026-02-21T08:50:51.9216454Z  'range_multi_buffers': [None, None],
2026-02-21T08:50:51.9216971Z  'range_num_stages': [0, 1],
2026-02-21T08:50:51.9217123Z  'range_unroll_factors': [1, 4],
2026-02-21T08:50:51.9217280Z  'range_warp_specializes': [],
2026-02-21T08:50:51.9217442Z  'waves_per_eu': 1}
2026-02-21T08:50:52.0558654Z [290s] Fitting surrogate: 743 points, 743 targets
2026-02-21T08:50:52.8536866Z [291s] Generation 8 starting: 74 neighbors, 4 active search path(s)
2026-02-21T08:51:06.4959831Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 5.7 configs/s
2026-02-21T08:51:11.0025832Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.8 configs/s
2026-02-21T08:51:21.5080026Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 102.0 configs/s
2026-02-21T08:51:22.3841243Z [320s] Generation 8 complete: 
2026-02-21T08:51:22.3841681Z error=2
2026-02-21T08:51:22.3841886Z ok=77
2026-02-21T08:51:22.3842092Z min=0.1041
2026-02-21T08:51:22.3842297Z mid=0.1346
2026-02-21T08:51:22.3842523Z max=3.5718
2026-02-21T08:51:22.3842967Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:51:22.3843349Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T08:51:22.3843710Z  'l2_groupings': [8],
2026-02-21T08:51:22.3843985Z  'load_eviction_policies': ['', ''],
2026-02-21T08:51:22.3844291Z  'loop_orders': [[1, 0]],
2026-02-21T08:51:22.3844564Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:51:22.3844841Z  'num_sm_multiplier': 16,
2026-02-21T08:51:22.3845100Z  'num_stages': 2,
2026-02-21T08:51:22.3845335Z  'num_warps': 2,
2026-02-21T08:51:22.3845569Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:51:22.3845838Z  'range_flattens': [None, None],
2026-02-21T08:51:22.3846086Z  'range_multi_buffers': [None, None],
2026-02-21T08:51:22.3846340Z  'range_num_stages': [0, 1],
2026-02-21T08:51:22.3846603Z  'range_unroll_factors': [1, 4],
2026-02-21T08:51:22.3847238Z  'range_warp_specializes': [],
2026-02-21T08:51:22.3847473Z  'waves_per_eu': 1}
2026-02-21T08:51:22.3926572Z [320s] Fitting surrogate: 822 points, 822 targets
2026-02-21T08:51:23.0313391Z [321s] Generation 9 starting: 57 neighbors, 3 active search path(s)
2026-02-21T08:51:39.1114758Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 0.3 configs/s
2026-02-21T08:51:42.7161422Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 16.5 configs/s
2026-02-21T08:51:48.7350061Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 158.1 configs/s
2026-02-21T08:51:49.3063776Z [347s] Generation 9 complete: 
2026-02-21T08:51:49.3063920Z ok=60
2026-02-21T08:51:49.3063999Z min=0.1047
2026-02-21T08:51:49.3064085Z mid=0.1326
2026-02-21T08:51:49.3064163Z max=2.1349
2026-02-21T08:51:49.3064253Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:51:49.3064416Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T08:51:49.3064555Z  'l2_groupings': [8],
2026-02-21T08:51:49.3064680Z  'load_eviction_policies': ['', ''],
2026-02-21T08:51:49.3064799Z  'loop_orders': [[1, 0]],
2026-02-21T08:51:49.3064902Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:51:49.3065008Z  'num_sm_multiplier': 16,
2026-02-21T08:51:49.3065120Z  'num_stages': 2,
2026-02-21T08:51:49.3065204Z  'num_warps': 2,
2026-02-21T08:51:49.3065302Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:51:49.3065422Z  'range_flattens': [None, None],
2026-02-21T08:51:49.3065535Z  'range_multi_buffers': [None, None],
2026-02-21T08:51:49.3065647Z  'range_num_stages': [0, 1],
2026-02-21T08:51:49.3065751Z  'range_unroll_factors': [1, 4],
2026-02-21T08:51:49.3065858Z  'range_warp_specializes': [],
2026-02-21T08:51:49.3065962Z  'waves_per_eu': 1}
2026-02-21T08:51:49.3894807Z [347s] Fitting surrogate: 882 points, 882 targets
2026-02-21T08:51:49.8227066Z [347s] Generation 10 starting: 35 neighbors, 2 active search path(s)
2026-02-21T08:51:56.6342520Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 7.7 configs/s
2026-02-21T08:51:58.8184743Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 16.3 configs/s
2026-02-21T08:52:03.3694296Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 205.6 configs/s
2026-02-21T08:52:03.8899804Z [362s] Generation 10 complete: 
2026-02-21T08:52:03.8900140Z ok=38
2026-02-21T08:52:03.8900357Z min=0.1038
2026-02-21T08:52:03.8900570Z mid=0.1269
2026-02-21T08:52:03.8900770Z max=3.5875
2026-02-21T08:52:03.8901003Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:52:03.8901355Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T08:52:03.8901750Z  'l2_groupings': [8],
2026-02-21T08:52:03.8902025Z  'load_eviction_policies': ['', ''],
2026-02-21T08:52:03.8902337Z  'loop_orders': [[1, 0]],
2026-02-21T08:52:03.8902613Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:52:03.8902929Z  'num_sm_multiplier': 16,
2026-02-21T08:52:03.8903202Z  'num_stages': 2,
2026-02-21T08:52:03.8903460Z  'num_warps': 2,
2026-02-21T08:52:03.8903723Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:52:03.8904055Z  'range_flattens': [None, None],
2026-02-21T08:52:03.8904362Z  'range_multi_buffers': [None, None],
2026-02-21T08:52:03.8904669Z  'range_num_stages': [0, 1],
2026-02-21T08:52:03.8904949Z  'range_unroll_factors': [1, 4],
2026-02-21T08:52:03.8905248Z  'range_warp_specializes': [],
2026-02-21T08:52:03.8905530Z  'waves_per_eu': 1}
2026-02-21T08:52:03.9513295Z [362s] Fitting surrogate: 920 points, 920 targets
2026-02-21T08:52:04.3783720Z [362s] Generation 11 starting: 30 neighbors, 2 active search path(s)
2026-02-21T08:52:11.9960995Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 2.5 configs/s
2026-02-21T08:52:13.9936956Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 30/30 16.2 configs/s
2026-02-21T08:52:17.4408167Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 265.3 configs/s
2026-02-21T08:52:17.9297161Z [376s] Generation 11 complete: 
2026-02-21T08:52:17.9297508Z ok=33
2026-02-21T08:52:17.9297690Z min=0.1046
2026-02-21T08:52:17.9297872Z mid=0.1367
2026-02-21T08:52:17.9298043Z max=8.0136
2026-02-21T08:52:17.9298242Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:52:17.9298563Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T08:52:17.9298887Z  'l2_groupings': [8],
2026-02-21T08:52:17.9299115Z  'load_eviction_policies': ['', ''],
2026-02-21T08:52:17.9299406Z  'loop_orders': [[1, 0]],
2026-02-21T08:52:17.9299626Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:52:17.9299849Z  'num_sm_multiplier': 16,
2026-02-21T08:52:17.9300053Z  'num_stages': 2,
2026-02-21T08:52:17.9300237Z  'num_warps': 2,
2026-02-21T08:52:17.9300446Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:52:17.9300752Z  'range_flattens': [None, None],
2026-02-21T08:52:17.9300996Z  'range_multi_buffers': [None, None],
2026-02-21T08:52:17.9301243Z  'range_num_stages': [0, 1],
2026-02-21T08:52:17.9301488Z  'range_unroll_factors': [1, 4],
2026-02-21T08:52:17.9301723Z  'range_warp_specializes': [],
2026-02-21T08:52:17.9301949Z  'waves_per_eu': 1}
2026-02-21T08:52:17.9786485Z [376s] Fitting surrogate: 953 points, 953 targets
2026-02-21T08:52:18.4570359Z [376s] Generation 12 starting: 36 neighbors, 2 active search path(s)
2026-02-21T08:52:25.5193328Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 7.0 configs/s
2026-02-21T08:52:27.8226037Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 36/36 16.7 configs/s
2026-02-21T08:52:31.4668187Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 252.6 configs/s
2026-02-21T08:52:32.0051473Z [390s] Generation 12 complete: 
2026-02-21T08:52:32.0051756Z ok=39
2026-02-21T08:52:32.0051963Z min=0.1038
2026-02-21T08:52:32.0052143Z mid=0.1439
2026-02-21T08:52:32.0052302Z max=1.6405
2026-02-21T08:52:32.0052864Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:52:32.0053172Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T08:52:32.0053477Z  'l2_groupings': [8],
2026-02-21T08:52:32.0053701Z  'load_eviction_policies': ['', ''],
2026-02-21T08:52:32.0053959Z  'loop_orders': [[1, 0]],
2026-02-21T08:52:32.0054187Z  'matrix_instr_nonkdim': 0,
2026-02-21T08:52:32.0054416Z  'num_sm_multiplier': 16,
2026-02-21T08:52:32.0054632Z  'num_stages': 2,
2026-02-21T08:52:32.0054823Z  'num_warps': 2,
2026-02-21T08:52:32.0055045Z  'pid_type': 'persistent_interleaved',
2026-02-21T08:52:32.0055310Z  'range_flattens': [None, None],
2026-02-21T08:52:32.0055562Z  'range_multi_buffers': [None, None],
2026-02-21T08:52:32.0055811Z  'range_num_stages': [0, 1],
2026-02-21T08:52:32.0056036Z  'range_unroll_factors': [1, 4],
2026-02-21T08:52:32.0056280Z  'range_warp_specializes': [],
2026-02-21T08:52:32.0056517Z  'waves_per_eu': 1}
2026-02-21T08:52:32.0566178Z [390s] Fitting surrogate: 992 points, 992 targets
2026-02-21T08:52:32.3493721Z [390s] Generation 13 starting: 18 neighbors, 1 active search path(s)
2026-02-21T08:52:36.5551460Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 3.8 configs/s
2026-02-21T08:52:37.7614226Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 16.9 configs/s
2026-02-21T08:52:39.5685196Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 470.4 configs/s
2026-02-21T08:52:40.0403076Z [398s] Generation 13 complete: 
2026-02-21T08:52:40.0403357Z ok=20
2026-02-21T08:52:40.0403530Z min=0.1060
2026-02-21T08:52:40.0403694Z mid=0.1515
2026-02-21T08:52:40.0403853Z max=0.2129
2026-02-21T08:52:40.0404033Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:52:40.0404320Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:52:40.0404968Z  'l2_groupings': [32],
2026-02-21T08:52:40.0405190Z  'load_eviction_policies': ['', ''],
2026-02-21T08:52:40.0405455Z  'loop_orders': [[0, 1]],
2026-02-21T08:52:40.0405683Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:52:40.0405891Z  'num_stages': 3,
2026-02-21T08:52:40.0406072Z  'num_warps': 2,
2026-02-21T08:52:40.0406258Z  'pid_type': 'flat',
2026-02-21T08:52:40.0406461Z  'range_flattens': [None, True],
2026-02-21T08:52:40.0406702Z  'range_multi_buffers': [None, False],
2026-02-21T08:52:40.0406941Z  'range_num_stages': [0, 2],
2026-02-21T08:52:40.0407169Z  'range_unroll_factors': [0, 4],
2026-02-21T08:52:40.0407391Z  'range_warp_specializes': [],
2026-02-21T08:52:40.0407607Z  'waves_per_eu': 1}
2026-02-21T08:52:40.0651815Z [398s] Fitting surrogate: 1012 points, 1012 targets
2026-02-21T08:52:40.3712577Z [398s] Generation 14 starting: 19 neighbors, 1 active search path(s)
2026-02-21T08:52:44.1339685Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 6.1 configs/s
2026-02-21T08:52:45.3941534Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.0 configs/s
2026-02-21T08:52:46.4892523Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 719.3 configs/s
2026-02-21T08:52:46.9252015Z [405s] Generation 14 complete: 
2026-02-21T08:52:46.9252288Z ok=20
2026-02-21T08:52:46.9252455Z min=0.1058
2026-02-21T08:52:46.9252622Z mid=0.1694
2026-02-21T08:52:46.9252770Z max=1.0152
2026-02-21T08:52:46.9252945Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:52:46.9253230Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:52:46.9253543Z  'l2_groupings': [16],
2026-02-21T08:52:46.9253754Z  'load_eviction_policies': ['', ''],
2026-02-21T08:52:46.9253995Z  'loop_orders': [[0, 1]],
2026-02-21T08:52:46.9254209Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:52:46.9254419Z  'num_stages': 3,
2026-02-21T08:52:46.9254595Z  'num_warps': 2,
2026-02-21T08:52:46.9254795Z  'pid_type': 'flat',
2026-02-21T08:52:46.9255000Z  'range_flattens': [None, True],
2026-02-21T08:52:46.9255248Z  'range_multi_buffers': [None, False],
2026-02-21T08:52:46.9255492Z  'range_num_stages': [0, 3],
2026-02-21T08:52:46.9255701Z  'range_unroll_factors': [0, 4],
2026-02-21T08:52:46.9255925Z  'range_warp_specializes': [],
2026-02-21T08:52:46.9256136Z  'waves_per_eu': 1}
2026-02-21T08:52:46.9424666Z [405s] Fitting surrogate: 1032 points, 1032 targets
2026-02-21T08:52:47.2413616Z [405s] Generation 15 starting: 19 neighbors, 1 active search path(s)
2026-02-21T08:53:20.0872868Z [438s] Timeout after 30s compiling Config(block_sizes=[128, 32, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:53:20.0894977Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 0.2 configs/s
2026-02-21T08:53:21.1965718Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.9 configs/s
2026-02-21T08:53:23.0765612Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 456.5 configs/s
2026-02-21T08:53:23.5744959Z [441s] Generation 15 complete: 
2026-02-21T08:53:23.5745353Z timeout=1
2026-02-21T08:53:23.5745561Z ok=19
2026-02-21T08:53:23.5745763Z min=0.1054
2026-02-21T08:53:23.5745966Z mid=0.1422
2026-02-21T08:53:23.5746166Z max=1.0147
2026-02-21T08:53:23.5759146Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:53:23.5759297Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:53:23.5759442Z  'l2_groupings': [16],
2026-02-21T08:53:23.5759557Z  'load_eviction_policies': ['', ''],
2026-02-21T08:53:23.5759691Z  'loop_orders': [[0, 1]],
2026-02-21T08:53:23.5759822Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:53:23.5759929Z  'num_stages': 3,
2026-02-21T08:53:23.5760022Z  'num_warps': 2,
2026-02-21T08:53:23.5760127Z  'pid_type': 'flat',
2026-02-21T08:53:23.5760233Z  'range_flattens': [None, None],
2026-02-21T08:53:23.5760352Z  'range_multi_buffers': [None, False],
2026-02-21T08:53:23.5760479Z  'range_num_stages': [0, 3],
2026-02-21T08:53:23.5760589Z  'range_unroll_factors': [0, 4],
2026-02-21T08:53:23.5760710Z  'range_warp_specializes': [],
2026-02-21T08:53:23.5760822Z  'waves_per_eu': 1}
2026-02-21T08:53:23.5971785Z [441s] Fitting surrogate: 1052 points, 1052 targets
2026-02-21T08:53:23.8741510Z [442s] Generation 16 starting: 17 neighbors, 1 active search path(s)
2026-02-21T08:53:27.7634671Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 4.8 configs/s
2026-02-21T08:53:28.8969161Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.2 configs/s
2026-02-21T08:53:31.7632532Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 458.8 configs/s
2026-02-21T08:53:32.2522007Z [450s] Generation 16 complete: 
2026-02-21T08:53:32.2522381Z ok=18
2026-02-21T08:53:32.2522700Z min=0.1046
2026-02-21T08:53:32.2522917Z mid=0.1480
2026-02-21T08:53:32.2523118Z max=0.9930
2026-02-21T08:53:32.2523345Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:53:32.2523734Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:53:32.2524092Z  'l2_groupings': [16],
2026-02-21T08:53:32.2524374Z  'load_eviction_policies': ['', ''],
2026-02-21T08:53:32.2524688Z  'loop_orders': [[0, 1]],
2026-02-21T08:53:32.2524968Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:53:32.2525238Z  'num_stages': 3,
2026-02-21T08:53:32.2525464Z  'num_warps': 2,
2026-02-21T08:53:32.2525701Z  'pid_type': 'flat',
2026-02-21T08:53:32.2525959Z  'range_flattens': [None, None],
2026-02-21T08:53:32.2526270Z  'range_multi_buffers': [None, False],
2026-02-21T08:53:32.2526597Z  'range_num_stages': [0, 4],
2026-02-21T08:53:32.2526879Z  'range_unroll_factors': [0, 4],
2026-02-21T08:53:32.2527181Z  'range_warp_specializes': [],
2026-02-21T08:53:32.2527465Z  'waves_per_eu': 1}
2026-02-21T08:53:32.2741882Z [450s] Fitting surrogate: 1070 points, 1070 targets
2026-02-21T08:53:32.5678015Z [450s] Generation 17 starting: 16 neighbors, 1 active search path(s)
2026-02-21T08:53:36.1508996Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.5 configs/s
2026-02-21T08:53:37.2360110Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.1 configs/s
2026-02-21T08:53:39.0606077Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 462.0 configs/s
2026-02-21T08:53:39.5134835Z [457s] Generation 17 complete: 
2026-02-21T08:53:39.5135127Z ok=17
2026-02-21T08:53:39.5135562Z min=0.1064
2026-02-21T08:53:39.5135795Z mid=0.1320
2026-02-21T08:53:39.5136456Z max=1.0106
2026-02-21T08:53:39.5136669Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:53:39.5137019Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:53:39.5137365Z  'l2_groupings': [8],
2026-02-21T08:53:39.5137607Z  'load_eviction_policies': ['', ''],
2026-02-21T08:53:39.5137885Z  'loop_orders': [[0, 1]],
2026-02-21T08:53:39.5138131Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:53:39.5138372Z  'num_stages': 3,
2026-02-21T08:53:39.5138570Z  'num_warps': 2,
2026-02-21T08:53:39.5138772Z  'pid_type': 'flat',
2026-02-21T08:53:39.5139000Z  'range_flattens': [None, None],
2026-02-21T08:53:39.5139269Z  'range_multi_buffers': [None, False],
2026-02-21T08:53:39.5139539Z  'range_num_stages': [0, 4],
2026-02-21T08:53:39.5139779Z  'range_unroll_factors': [0, 4],
2026-02-21T08:53:39.5140032Z  'range_warp_specializes': [],
2026-02-21T08:53:39.5140274Z  'waves_per_eu': 1}
2026-02-21T08:53:39.5328867Z [457s] Fitting surrogate: 1087 points, 1087 targets
2026-02-21T08:53:39.8288087Z [457s] Generation 18 starting: 18 neighbors, 1 active search path(s)
2026-02-21T08:53:43.2287313Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 15.0 configs/s
2026-02-21T08:53:44.4264304Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 17.1 configs/s
2026-02-21T08:53:46.0561434Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 513.1 configs/s
2026-02-21T08:53:46.5715120Z [464s] Generation 18 complete: 
2026-02-21T08:53:46.5715486Z ok=19
2026-02-21T08:53:46.5716796Z min=0.1052
2026-02-21T08:53:46.5717122Z mid=0.1496
2026-02-21T08:53:46.5717342Z max=0.9927
2026-02-21T08:53:46.5717582Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:53:46.5718024Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:53:46.5718397Z  'l2_groupings': [8],
2026-02-21T08:53:46.5718676Z  'load_eviction_policies': ['', ''],
2026-02-21T08:53:46.5718991Z  'loop_orders': [[1, 0]],
2026-02-21T08:53:46.5719335Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:53:46.5719613Z  'num_stages': 3,
2026-02-21T08:53:46.5720317Z  'num_warps': 2,
2026-02-21T08:53:46.5720449Z  'pid_type': 'flat',
2026-02-21T08:53:46.5720599Z  'range_flattens': [None, None],
2026-02-21T08:53:46.5720780Z  'range_multi_buffers': [None, False],
2026-02-21T08:53:46.5720953Z  'range_num_stages': [0, 4],
2026-02-21T08:53:46.5721124Z  'range_unroll_factors': [0, 4],
2026-02-21T08:53:46.5721299Z  'range_warp_specializes': [],
2026-02-21T08:53:46.5721456Z  'waves_per_eu': 1}
2026-02-21T08:53:46.5918845Z [464s] Fitting surrogate: 1106 points, 1106 targets
2026-02-21T08:53:46.8827236Z [465s] Generation 19 starting: 16 neighbors, 1 active search path(s)
2026-02-21T08:53:50.2426165Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 4.1 configs/s
2026-02-21T08:53:51.3177653Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.3 configs/s
2026-02-21T08:53:52.8203861Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 543.0 configs/s
2026-02-21T08:53:53.2699510Z [471s] Generation 19 complete: 
2026-02-21T08:53:53.2699778Z ok=17
2026-02-21T08:53:53.2700819Z min=0.1052
2026-02-21T08:53:53.2700981Z mid=0.1450
2026-02-21T08:53:53.2701067Z max=0.9915
2026-02-21T08:53:53.2701163Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:53:53.2701358Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:53:53.2701504Z  'l2_groupings': [8],
2026-02-21T08:53:53.2701615Z  'load_eviction_policies': ['', ''],
2026-02-21T08:53:53.2701735Z  'loop_orders': [[1, 0]],
2026-02-21T08:53:53.2701842Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:53:53.2701951Z  'num_stages': 3,
2026-02-21T08:53:53.2702040Z  'num_warps': 2,
2026-02-21T08:53:53.2702131Z  'pid_type': 'flat',
2026-02-21T08:53:53.2702231Z  'range_flattens': [None, None],
2026-02-21T08:53:53.2702348Z  'range_multi_buffers': [None, True],
2026-02-21T08:53:53.2702906Z  'range_num_stages': [0, 4],
2026-02-21T08:53:53.2703011Z  'range_unroll_factors': [0, 4],
2026-02-21T08:53:53.2703143Z  'range_warp_specializes': [],
2026-02-21T08:53:53.2703246Z  'waves_per_eu': 1}
2026-02-21T08:53:53.2908322Z [471s] Fitting surrogate: 1123 points, 1123 targets
2026-02-21T08:53:53.5588225Z [471s] Generation 20 starting: 17 neighbors, 1 active search path(s)
2026-02-21T08:54:25.7611975Z [503s] Timeout after 30s compiling Config(block_sizes=[128, 32, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:54:25.7631262Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.2 configs/s
2026-02-21T08:54:26.7450199Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 18.2 configs/s
2026-02-21T08:54:28.7073760Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 441.3 configs/s
2026-02-21T08:54:29.1516359Z [507s] Generation 20 complete: 
2026-02-21T08:54:29.1516740Z timeout=1
2026-02-21T08:54:29.1516972Z ok=17
2026-02-21T08:54:29.1517162Z min=0.1051
2026-02-21T08:54:29.1517363Z mid=0.1188
2026-02-21T08:54:29.1517548Z max=0.1970
2026-02-21T08:54:29.1517757Z best={'block_sizes': [64, 32, 16],
2026-02-21T08:54:29.1518106Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T08:54:29.1518449Z  'l2_groupings': [8],
2026-02-21T08:54:29.1518703Z  'load_eviction_policies': ['', ''],
2026-02-21T08:54:29.1518999Z  'loop_orders': [[1, 0]],
2026-02-21T08:54:29.1519259Z  'matrix_instr_nonkdim': 16,
2026-02-21T08:54:29.1519514Z  'num_stages': 2,
2026-02-21T08:54:29.1519723Z  'num_warps': 2,
2026-02-21T08:54:29.1519935Z  'pid_type': 'flat',
2026-02-21T08:54:29.1520226Z  'range_flattens': [None, None],
2026-02-21T08:54:29.1520508Z  'range_multi_buffers': [None, True],
2026-02-21T08:54:29.1521276Z  'range_num_stages': [0, 4],
2026-02-21T08:54:29.1521540Z  'range_unroll_factors': [0, 4],
2026-02-21T08:54:29.1521817Z  'range_warp_specializes': [],
2026-02-21T08:54:29.1522075Z  'waves_per_eu': 1}
2026-02-21T08:54:29.1745113Z [507s] Fitting surrogate: 1141 points, 1141 targets
2026-02-21T08:54:29.3091733Z [507s] Autotuning complete in 507.5s after searching 1088 configs.
2026-02-21T08:54:29.3092305Z One can hardcode the best config and skip autotuning with:
2026-02-21T08:54:29.3094185Z     @helion.kernel(config=helion.Config(block_sizes=[64, 32, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T08:54:29.3095350Z 
2026-02-21T08:54:29.3095525Z [507s] Code of selected kernel: /tmp/torchinductor_root/hp/chpk4kwpoinvb37woomythwzhrrcj56jbyn2fjphaldipw3fj6kj.py
2026-02-21T08:54:30.2901284Z WARNING:tritonbench.utils.triton_op:Completed input ID 14:
2026-02-21T08:54:30.2901759Z x_val
2026-02-21T08:54:30.2901987Z -------------------
2026-02-21T08:54:30.2902240Z (64, 1, 7168, 8192)
2026-02-21T08:54:30.2902383Z 
2026-02-21T08:54:30.2930878Z  50%|█████     | 5/10 [45:09<43:42, 524.56s/it]
2026-02-21T08:54:30.2930878Z WARNING:tritonbench.utils.triton_op:Running input ID 17:
2026-02-21T08:54:30.2931335Z x_val
2026-02-21T08:54:30.2931523Z ---------------------
2026-02-21T08:54:30.2931744Z (1, 4096, 8192, 1024)
2026-02-21T08:54:30.2935512Z INFO:tritonbench.utils.triton_op:Took 0.35ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T08:54:31.3312289Z INFO:tritonbench.utils.triton_op:Took 3.92ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T08:54:32.9403532Z Autotune Choices Stats:
2026-02-21T08:54:32.9404346Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "mm", "best_time": 0.10955700278282166, "best_triton_pos": 1, "best_triton_time": 0.11027800291776657, "best_triton_kernel": "triton_mm_119", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8"}
2026-02-21T08:54:32.9529966Z AUTOTUNE mm(4096x1024, 1024x8192)
2026-02-21T08:54:32.9530184Z strides: [1024, 1], [8192, 1]
2026-02-21T08:54:32.9530358Z dtypes: torch.bfloat16, torch.bfloat16
2026-02-21T08:54:32.9530526Z   mm 0.1096 ms 100.0% 
2026-02-21T08:54:32.9531077Z   triton_mm_119 0.1103 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:54:32.9531985Z   triton_mm_116 0.1117 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:54:32.9532864Z   triton_mm_113 0.1225 ms 89.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:54:32.9533733Z   triton_mm_114 0.1323 ms 82.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:54:32.9534599Z   triton_mm_118 0.1380 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:54:32.9535462Z   triton_mm_107 0.1392 ms 78.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:54:32.9536828Z   triton_mm_110 0.1423 ms 77.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:54:32.9562386Z   triton_mm_112 0.1536 ms 71.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:54:32.9563511Z   triton_mm_117 0.1559 ms 70.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:54:32.9564181Z SingleProcess AUTOTUNE benchmarking takes 1.1024 seconds and 0.3124 seconds precompiling for 37 choices
2026-02-21T08:54:35.4036396Z INFO:tritonbench.utils.triton_op:Took 0.16ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T08:54:35.4057833Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:54:35.4058185Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:54:35.4058465Z               'dtype': 'torch.bfloat16',
2026-02-21T08:54:35.4058726Z               'shape': (1, 4096, 1024),
2026-02-21T08:54:35.4058982Z               'stride': (4194304, 1024, 1)},
2026-02-21T08:54:35.4059236Z             { 'device': 'cuda:0',
2026-02-21T08:54:35.4059483Z               'dtype': 'torch.int32',
2026-02-21T08:54:35.4059732Z               'shape': (1024, 8192),
2026-02-21T08:54:35.4059974Z               'stride': (8192, 1)}),
2026-02-21T08:54:35.4060206Z   'kwargs': {}}
2026-02-21T08:54:35.4104280Z INFO:tritonbench.utils.triton_op:Took 4.64ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T08:54:35.6033281Z [0s] Autotune random seed: 2134834638
2026-02-21T08:54:35.6923862Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:55:12.8272948Z [37s] Timeout after 30s compiling Config(block_sizes=[128, 2, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:55:12.8288791Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s
2026-02-21T08:55:14.0892938Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:55:14.0921687Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:55:14.0925209Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T08:55:14.0925794Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T08:55:14.0926247Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T08:55:14.0926663Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:55:14.0926958Z #smem = #ttg.shared_memory
2026-02-21T08:55:14.0927321Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:55:14.0928098Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:55:14.0929196Z     %cst = arith.constant dense<0.000000e+00> : tensor<64x16xf32, #mma>
2026-02-21T08:55:14.0929448Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:55:14.0929625Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:55:14.0929793Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:55:14.0930012Z     %cst_0 = arith.constant dense<0> : tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.0930245Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:55:14.0930442Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:55:14.0930599Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T08:55:14.0930820Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T08:55:14.0930978Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T08:55:14.0931129Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:55:14.0931432Z     %cst_1 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0931834Z     %cst_2 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.0932185Z     %cst_3 = arith.constant dense<0> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0932524Z     %cst_4 = arith.constant dense<8192> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0932860Z     %cst_5 = arith.constant dense<0> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0933197Z     %cst_6 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0933532Z     %cst_7 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0934066Z     %cst_8 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0934358Z     %cst_9 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1>
2026-02-21T08:55:14.0934651Z     %cst_10 = arith.constant dense<4> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0934947Z     %cst_11 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:14.0935170Z     %cst_12 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:14.0935408Z     %cst_13 = arith.constant dense<8192> : tensor<64x1xi32, #mma>
2026-02-21T08:55:14.0935602Z     %0 = tt.get_program_id x : i32
2026-02-21T08:55:14.0935751Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T08:55:14.0935912Z     %2 = arith.minsi %1, %c32768_i32 : i32
2026-02-21T08:55:14.0936181Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.0936550Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:14.0936962Z     %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0937385Z     %6 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:14.0937736Z     %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.0938066Z     %8 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.0938385Z     %9 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0938800Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0939376Z     %11 = arith.extsi %10 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0940122Z     %12 = arith.extsi %5 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0940679Z     %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:55:14.0941106Z     %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:55:14.0941536Z     %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:14.0941806Z     %16 = arith.cmpi eq, %15, %cst_11 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:14.0942015Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked>
2026-02-21T08:55:14.0942225Z     %18 = arith.cmpi eq, %15, %cst_12 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:14.0942422Z     %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked>
2026-02-21T08:55:14.0942639Z     %20 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x16x!tt.ptr<bf16>, #mma>
2026-02-21T08:55:14.0942811Z     %21 = arith.subi %2, %0 : i32
2026-02-21T08:55:14.0942934Z     %22 = arith.remsi %21, %c2_i32 : i32
2026-02-21T08:55:14.0943055Z     %23 = arith.subi %21, %22 : i32
2026-02-21T08:55:14.0943168Z     %24 = arith.addi %0, %23 : i32
2026-02-21T08:55:14.0943342Z     %25 = arith.addi %7, %cst_2 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.0943629Z     %26 = tt.expand_dims %25 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:55:14.0943951Z     %27 = tt.broadcast %26 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.0944234Z     %28 = arith.addi %11, %cst_1 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0944649Z     %29 = tt.expand_dims %28 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0945019Z     %30 = arith.muli %29, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0945346Z     %31 = tt.broadcast %30 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0945677Z     %32 = arith.cmpi sge, %29, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0945931Z     %33 = arith.cmpi slt, %29, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0946173Z     %34 = arith.andi %32, %33 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0946488Z     %35 = tt.broadcast %34 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0946761Z     scf.for %arg3 = %0 to %24 step %c2_i32  : i32 {
2026-02-21T08:55:14.0946909Z       %36 = arith.remsi %arg3, %c64_i32 : i32
2026-02-21T08:55:14.0947042Z       %37 = arith.divsi %arg3, %c64_i32 : i32
2026-02-21T08:55:14.0947168Z       %38 = arith.muli %36, %c64_i32 : i32
2026-02-21T08:55:14.0947350Z       %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.0947570Z       %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:14.0947791Z       %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.0948018Z       %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:14.0948189Z       %43 = arith.muli %37, %c16_i32 : i32
2026-02-21T08:55:14.0948389Z       %44 = tt.splat %43 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:14.0948599Z       %45 = arith.addi %44, %6 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:14.0948891Z       %46 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T08:55:14.0949151Z       %47 = arith.muli %46, %cst_9 : tensor<64x1xi32, #blocked1>
2026-02-21T08:55:14.0949356Z       %48 = tt.broadcast %47 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.0949545Z       %49 = arith.extsi %43 : i32 to i64
2026-02-21T08:55:14.0949756Z       %50 = tt.splat %49 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0950072Z       %51 = arith.addi %50, %12 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0950461Z       %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0950889Z       %53 = tt.broadcast %52 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0951199Z       %54 = arith.cmpi sge, %52, %cst_5 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0951435Z       %55 = arith.cmpi slt, %52, %cst_4 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0951663Z       %56 = arith.andi %54, %55 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0951994Z       %57 = tt.broadcast %56 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0952333Z       %58 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<64x16xf32, #mma>)  : i32 {
2026-02-21T08:55:14.0952551Z         %145 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:55:14.0952749Z         %146 = tt.splat %145 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.0952971Z         %147 = arith.addi %146, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.0953249Z         %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:55:14.0953528Z         %149 = tt.broadcast %148 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.0953722Z         %150 = arith.addi %48, %149 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.0953934Z         %151 = tt.addptr %8, %150 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.0954140Z         %152 = tt.load %151 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.0954360Z         %153 = ttg.local_alloc %152 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.0954689Z         %154 = ttg.local_load %153 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.0955096Z         %155 = arith.extf %154 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.0955378Z         %156 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:55:14.0955588Z         %157 = tt.splat %156 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0955891Z         %158 = arith.addi %157, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0956283Z         %159 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0956686Z         %160 = arith.muli %159, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0956997Z         %161 = tt.broadcast %160 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0957303Z         %162 = arith.addi %161, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0957614Z         %163 = tt.addptr %9, %162 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0957932Z         %164 = arith.cmpi sge, %159, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0958175Z         %165 = arith.cmpi slt, %159, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0958411Z         %166 = arith.andi %164, %165 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0958713Z         %167 = tt.broadcast %166 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0959008Z         %168 = arith.andi %167, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0959253Z         %169 = tt.load %163, %168, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0959498Z         %170 = arith.shli %169, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0959734Z         %171 = arith.shrsi %170, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0959971Z         %172 = arith.shrsi %169, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0960292Z         %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.0960633Z         %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.0960914Z         %175 = tt.broadcast %173 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.0961155Z         %176 = arith.select %17, %175, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.0961392Z         %177 = tt.broadcast %174 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.0961622Z         %178 = arith.select %19, %177, %176 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.0961852Z         %179 = tt.reshape %178 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.0962076Z         %180 = arith.sitofp %179 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.0962370Z         %181 = ttg.convert_layout %180 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.0962925Z         %182 = tt.dot %155, %181, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.0963283Z         %183 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T08:55:14.0963411Z         %184 = arith.muli %183, %c2_i32 : i32
2026-02-21T08:55:14.0963582Z         %185 = tt.splat %184 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.0963811Z         %186 = arith.addi %185, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.0964094Z         %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:55:14.0964369Z         %188 = tt.broadcast %187 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.0964605Z         %189 = arith.addi %48, %188 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.0964800Z         %190 = tt.addptr %8, %189 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.0965007Z         %191 = tt.load %190 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.0965228Z         %192 = ttg.local_alloc %191 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.0965552Z         %193 = ttg.local_load %192 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.0965956Z         %194 = arith.extf %193 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.0966234Z         %195 = arith.extsi %183 : i32 to i64
2026-02-21T08:55:14.0966451Z         %196 = tt.splat %195 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0966756Z         %197 = arith.addi %196, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0967141Z         %198 = tt.expand_dims %197 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0967497Z         %199 = arith.muli %198, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0967803Z         %200 = tt.broadcast %199 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0968147Z         %201 = arith.addi %200, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0968454Z         %202 = tt.addptr %9, %201 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0968772Z         %203 = arith.cmpi sge, %198, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0969014Z         %204 = arith.cmpi slt, %198, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0969251Z         %205 = arith.andi %203, %204 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0969546Z         %206 = tt.broadcast %205 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0969843Z         %207 = arith.andi %206, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0970089Z         %208 = tt.load %202, %207, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0970334Z         %209 = arith.shli %208, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0970573Z         %210 = arith.shrsi %209, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0970808Z         %211 = arith.shrsi %208, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0971098Z         %212 = tt.expand_dims %210 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.0971433Z         %213 = tt.expand_dims %211 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.0971712Z         %214 = tt.broadcast %212 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.0971952Z         %215 = arith.select %17, %214, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.0972185Z         %216 = tt.broadcast %213 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.0972461Z         %217 = arith.select %19, %216, %215 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.0972688Z         %218 = tt.reshape %217 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.0972907Z         %219 = arith.sitofp %218 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.0973199Z         %220 = ttg.convert_layout %219 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.0973658Z         %221 = tt.dot %194, %220, %182, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.0974001Z         %222 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T08:55:14.0974126Z         %223 = arith.muli %222, %c2_i32 : i32
2026-02-21T08:55:14.0974295Z         %224 = tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.0974521Z         %225 = arith.addi %224, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.0974793Z         %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:55:14.0975071Z         %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.0975268Z         %228 = arith.addi %48, %227 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.0975467Z         %229 = tt.addptr %8, %228 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.0975674Z         %230 = tt.load %229 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.0975925Z         %231 = ttg.local_alloc %230 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.0976251Z         %232 = ttg.local_load %231 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.0976657Z         %233 = arith.extf %232 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.0976933Z         %234 = arith.extsi %222 : i32 to i64
2026-02-21T08:55:14.0977146Z         %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0977441Z         %236 = arith.addi %235, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.0977842Z         %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0978200Z         %238 = arith.muli %237, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0978506Z         %239 = tt.broadcast %238 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0978810Z         %240 = arith.addi %239, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0979117Z         %241 = tt.addptr %9, %240 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0979431Z         %242 = arith.cmpi sge, %237, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0979673Z         %243 = arith.cmpi slt, %237, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0979909Z         %244 = arith.andi %242, %243 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0980208Z         %245 = tt.broadcast %244 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0997869Z         %246 = arith.andi %245, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0998111Z         %247 = tt.load %241, %246, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0998364Z         %248 = arith.shli %247, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0998601Z         %249 = arith.shrsi %248, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0998840Z         %250 = arith.shrsi %247, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.0999132Z         %251 = tt.expand_dims %249 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.0999470Z         %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.0999758Z         %253 = tt.broadcast %251 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.0999996Z         %254 = arith.select %17, %253, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1000230Z         %255 = tt.broadcast %252 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1000463Z         %256 = arith.select %19, %255, %254 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1000691Z         %257 = tt.reshape %256 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.1000915Z         %258 = arith.sitofp %257 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.1001259Z         %259 = ttg.convert_layout %258 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1001720Z         %260 = tt.dot %233, %259, %221, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1002068Z         scf.yield %260 : tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1002197Z       } {tt.disallow_acc_multi_buffer}
2026-02-21T08:55:14.1002336Z       %59 = arith.addi %48, %27 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1002530Z       %60 = tt.addptr %8, %59 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1002777Z       %61 = tt.load %60 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.1002993Z       %62 = ttg.local_alloc %61 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.1003314Z       %63 = ttg.local_load %62 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1003713Z       %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1004043Z       %65 = arith.addi %31, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1004342Z       %66 = tt.addptr %9, %65 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1004643Z       %67 = arith.andi %35, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1004876Z       %68 = tt.load %66, %67, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1005114Z       %69 = arith.shli %68, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1005349Z       %70 = arith.shrsi %69, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1005578Z       %71 = arith.shrsi %68, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1005915Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1006307Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1006580Z       %74 = tt.broadcast %72 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1006812Z       %75 = arith.select %17, %74, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1007036Z       %76 = tt.broadcast %73 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1007260Z       %77 = arith.select %19, %76, %75 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1007475Z       %78 = tt.reshape %77 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.1007690Z       %79 = arith.sitofp %78 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.1007978Z       %80 = ttg.convert_layout %79 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1008426Z       %81 = tt.dot %64, %80, %58, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1008799Z       %82 = arith.truncf %81 : tensor<64x16xf32, #mma> to tensor<64x16xbf16, #mma>
2026-02-21T08:55:14.1009056Z       %83 = tt.expand_dims %42 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T08:55:14.1009319Z       %84 = arith.muli %83, %cst_13 : tensor<64x1xi32, #mma>
2026-02-21T08:55:14.1009542Z       %85 = tt.expand_dims %45 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma>
2026-02-21T08:55:14.1009788Z       %86 = tt.broadcast %84 : tensor<64x1xi32, #mma> -> tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1009981Z       %87 = tt.broadcast %85 : tensor<1x16xi32, #mma> -> tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1010152Z       %88 = arith.addi %86, %87 : tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1010330Z       %89 = tt.addptr %20, %88 : tensor<64x16x!tt.ptr<bf16>, #mma>, tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1010518Z       tt.store %89, %82 : tensor<64x16x!tt.ptr<bf16>, #mma>
2026-02-21T08:55:14.1010654Z       %90 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T08:55:14.1010775Z       %91 = arith.remsi %90, %c64_i32 : i32
2026-02-21T08:55:14.1010888Z       %92 = arith.divsi %90, %c64_i32 : i32
2026-02-21T08:55:14.1011005Z       %93 = arith.muli %91, %c64_i32 : i32
2026-02-21T08:55:14.1011175Z       %94 = tt.splat %93 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.1011386Z       %95 = tt.splat %93 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:14.1011598Z       %96 = arith.addi %94, %3 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.1011805Z       %97 = arith.addi %95, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:14.1011965Z       %98 = arith.muli %92, %c16_i32 : i32
2026-02-21T08:55:14.1012118Z       %99 = tt.splat %98 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:14.1012321Z       %100 = arith.addi %99, %6 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:14.1012592Z       %101 = tt.expand_dims %96 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T08:55:14.1012844Z       %102 = arith.muli %101, %cst_9 : tensor<64x1xi32, #blocked1>
2026-02-21T08:55:14.1013043Z       %103 = tt.broadcast %102 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1013215Z       %104 = arith.extsi %98 : i32 to i64
2026-02-21T08:55:14.1013459Z       %105 = tt.splat %104 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1013758Z       %106 = arith.addi %105, %12 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1014149Z       %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1014575Z       %108 = tt.broadcast %107 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1014885Z       %109 = arith.cmpi sge, %107, %cst_5 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1015131Z       %110 = arith.cmpi slt, %107, %cst_4 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1015370Z       %111 = arith.andi %109, %110 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1015669Z       %112 = tt.broadcast %111 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1016014Z       %113 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<64x16xf32, #mma>)  : i32 {
2026-02-21T08:55:14.1016228Z         %145 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:55:14.1016401Z         %146 = tt.splat %145 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1016626Z         %147 = arith.addi %146, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1016944Z         %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:55:14.1017219Z         %149 = tt.broadcast %148 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1017416Z         %150 = arith.addi %103, %149 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1017615Z         %151 = tt.addptr %8, %150 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1017819Z         %152 = tt.load %151 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.1018038Z         %153 = ttg.local_alloc %152 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.1018368Z         %154 = ttg.local_load %153 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1018774Z         %155 = arith.extf %154 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1019058Z         %156 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:55:14.1019268Z         %157 = tt.splat %156 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1019565Z         %158 = arith.addi %157, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1019953Z         %159 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1020311Z         %160 = arith.muli %159, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1020619Z         %161 = tt.broadcast %160 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1020927Z         %162 = arith.addi %161, %108 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1021233Z         %163 = tt.addptr %9, %162 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1021580Z         %164 = arith.cmpi sge, %159, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1021826Z         %165 = arith.cmpi slt, %159, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1022057Z         %166 = arith.andi %164, %165 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1022359Z         %167 = tt.broadcast %166 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1022659Z         %168 = arith.andi %167, %112 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1022900Z         %169 = tt.load %163, %168, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1023147Z         %170 = arith.shli %169, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1023384Z         %171 = arith.shrsi %170, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1023620Z         %172 = arith.shrsi %169, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1023912Z         %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1024248Z         %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1024532Z         %175 = tt.broadcast %173 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1024768Z         %176 = arith.select %17, %175, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1025037Z         %177 = tt.broadcast %174 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1025271Z         %178 = arith.select %19, %177, %176 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1025498Z         %179 = tt.reshape %178 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.1025719Z         %180 = arith.sitofp %179 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.1026011Z         %181 = ttg.convert_layout %180 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1026477Z         %182 = tt.dot %155, %181, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1026827Z         %183 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T08:55:14.1026952Z         %184 = arith.muli %183, %c2_i32 : i32
2026-02-21T08:55:14.1027123Z         %185 = tt.splat %184 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1027346Z         %186 = arith.addi %185, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1027621Z         %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:55:14.1027898Z         %188 = tt.broadcast %187 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1028089Z         %189 = arith.addi %103, %188 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1028287Z         %190 = tt.addptr %8, %189 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1028488Z         %191 = tt.load %190 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.1028711Z         %192 = ttg.local_alloc %191 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.1029038Z         %193 = ttg.local_load %192 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1029468Z         %194 = arith.extf %193 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1029747Z         %195 = arith.extsi %183 : i32 to i64
2026-02-21T08:55:14.1029953Z         %196 = tt.splat %195 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1030254Z         %197 = arith.addi %196, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1030656Z         %198 = tt.expand_dims %197 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1031009Z         %199 = arith.muli %198, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1031319Z         %200 = tt.broadcast %199 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1031623Z         %201 = arith.addi %200, %108 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1031930Z         %202 = tt.addptr %9, %201 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1032245Z         %203 = arith.cmpi sge, %198, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1032487Z         %204 = arith.cmpi slt, %198, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1032759Z         %205 = arith.andi %203, %204 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1033060Z         %206 = tt.broadcast %205 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1033357Z         %207 = arith.andi %206, %112 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1033604Z         %208 = tt.load %202, %207, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1033849Z         %209 = arith.shli %208, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1034083Z         %210 = arith.shrsi %209, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1034320Z         %211 = arith.shrsi %208, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1034604Z         %212 = tt.expand_dims %210 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1034939Z         %213 = tt.expand_dims %211 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1035224Z         %214 = tt.broadcast %212 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1035457Z         %215 = arith.select %17, %214, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1035691Z         %216 = tt.broadcast %213 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1035920Z         %217 = arith.select %19, %216, %215 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1036147Z         %218 = tt.reshape %217 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.1036367Z         %219 = arith.sitofp %218 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.1036658Z         %220 = ttg.convert_layout %219 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1037124Z         %221 = tt.dot %194, %220, %182, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1037503Z         %222 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T08:55:14.1037628Z         %223 = arith.muli %222, %c2_i32 : i32
2026-02-21T08:55:14.1037799Z         %224 = tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1038018Z         %225 = arith.addi %224, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1038296Z         %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:55:14.1038570Z         %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1038765Z         %228 = arith.addi %103, %227 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1038967Z         %229 = tt.addptr %8, %228 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1039169Z         %230 = tt.load %229 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.1039389Z         %231 = ttg.local_alloc %230 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.1039711Z         %232 = ttg.local_load %231 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1040118Z         %233 = arith.extf %232 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1040397Z         %234 = arith.extsi %222 : i32 to i64
2026-02-21T08:55:14.1040633Z         %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1040932Z         %236 = arith.addi %235, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1041319Z         %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1041678Z         %238 = arith.muli %237, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1041986Z         %239 = tt.broadcast %238 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1042287Z         %240 = arith.addi %239, %108 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1042648Z         %241 = tt.addptr %9, %240 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1042966Z         %242 = arith.cmpi sge, %237, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1043207Z         %243 = arith.cmpi slt, %237, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1043443Z         %244 = arith.andi %242, %243 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1043747Z         %245 = tt.broadcast %244 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1044050Z         %246 = arith.andi %245, %112 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1044298Z         %247 = tt.load %241, %246, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1044547Z         %248 = arith.shli %247, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1044788Z         %249 = arith.shrsi %248, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1045061Z         %250 = arith.shrsi %247, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1045349Z         %251 = tt.expand_dims %249 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1045682Z         %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1045965Z         %253 = tt.broadcast %251 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1046199Z         %254 = arith.select %17, %253, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1046433Z         %255 = tt.broadcast %252 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1046664Z         %256 = arith.select %19, %255, %254 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1046892Z         %257 = tt.reshape %256 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.1047115Z         %258 = arith.sitofp %257 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.1047411Z         %259 = ttg.convert_layout %258 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1047873Z         %260 = tt.dot %233, %259, %221, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1048214Z         scf.yield %260 : tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1048341Z       } {tt.disallow_acc_multi_buffer}
2026-02-21T08:55:14.1048477Z       %114 = arith.addi %103, %27 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1048709Z       %115 = tt.addptr %8, %114 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1048914Z       %116 = tt.load %115 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.1049141Z       %117 = ttg.local_alloc %116 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.1049465Z       %118 = ttg.local_load %117 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1049867Z       %119 = arith.extf %118 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1050199Z       %120 = arith.addi %31, %108 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1050511Z       %121 = tt.addptr %9, %120 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1050811Z       %122 = arith.andi %35, %112 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1051053Z       %123 = tt.load %121, %122, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1051296Z       %124 = arith.shli %123, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1051534Z       %125 = arith.shrsi %124, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1051770Z       %126 = arith.shrsi %123, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1052055Z       %127 = tt.expand_dims %125 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1052389Z       %128 = tt.expand_dims %126 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1052670Z       %129 = tt.broadcast %127 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1052940Z       %130 = arith.select %17, %129, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1053177Z       %131 = tt.broadcast %128 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1053404Z       %132 = arith.select %19, %131, %130 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1053634Z       %133 = tt.reshape %132 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.1053851Z       %134 = arith.sitofp %133 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.1054142Z       %135 = ttg.convert_layout %134 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1054598Z       %136 = tt.dot %119, %135, %113, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1054976Z       %137 = arith.truncf %136 : tensor<64x16xf32, #mma> to tensor<64x16xbf16, #mma>
2026-02-21T08:55:14.1055240Z       %138 = tt.expand_dims %97 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T08:55:14.1055478Z       %139 = arith.muli %138, %cst_13 : tensor<64x1xi32, #mma>
2026-02-21T08:55:14.1055705Z       %140 = tt.expand_dims %100 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma>
2026-02-21T08:55:14.1055960Z       %141 = tt.broadcast %139 : tensor<64x1xi32, #mma> -> tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1056157Z       %142 = tt.broadcast %140 : tensor<1x16xi32, #mma> -> tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1056337Z       %143 = arith.addi %141, %142 : tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1056556Z       %144 = tt.addptr %20, %143 : tensor<64x16x!tt.ptr<bf16>, #mma>, tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1056747Z       tt.store %144, %137 : tensor<64x16x!tt.ptr<bf16>, #mma>
2026-02-21T08:55:14.1056889Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:55:14.1057008Z     scf.for %arg3 = %24 to %2 step %c1_i32  : i32 {
2026-02-21T08:55:14.1057143Z       %36 = arith.remsi %arg3, %c64_i32 : i32
2026-02-21T08:55:14.1057265Z       %37 = arith.divsi %arg3, %c64_i32 : i32
2026-02-21T08:55:14.1057386Z       %38 = arith.muli %36, %c64_i32 : i32
2026-02-21T08:55:14.1057548Z       %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.1057758Z       %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:14.1057969Z       %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.1058174Z       %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:14.1058340Z       %43 = arith.muli %37, %c16_i32 : i32
2026-02-21T08:55:14.1058494Z       %44 = tt.splat %43 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:14.1058695Z       %45 = arith.addi %44, %6 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:14.1058959Z       %46 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T08:55:14.1059202Z       %47 = arith.muli %46, %cst_9 : tensor<64x1xi32, #blocked1>
2026-02-21T08:55:14.1059395Z       %48 = tt.broadcast %47 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1059567Z       %49 = arith.extsi %43 : i32 to i64
2026-02-21T08:55:14.1059775Z       %50 = tt.splat %49 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1060067Z       %51 = arith.addi %50, %12 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1060447Z       %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1060904Z       %53 = tt.broadcast %52 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1061209Z       %54 = arith.cmpi sge, %52, %cst_5 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1061448Z       %55 = arith.cmpi slt, %52, %cst_4 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1061675Z       %56 = arith.andi %54, %55 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1061966Z       %57 = tt.broadcast %56 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1062298Z       %58 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<64x16xf32, #mma>)  : i32 {
2026-02-21T08:55:14.1062511Z         %90 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:55:14.1062678Z         %91 = tt.splat %90 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1062894Z         %92 = arith.addi %91, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1063163Z         %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:55:14.1063434Z         %94 = tt.broadcast %93 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1063620Z         %95 = arith.addi %48, %94 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1063813Z         %96 = tt.addptr %8, %95 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1064011Z         %97 = tt.load %96 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.1064263Z         %98 = ttg.local_alloc %97 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.1064586Z         %99 = ttg.local_load %98 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1064983Z         %100 = arith.extf %99 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1065262Z         %101 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:55:14.1065472Z         %102 = tt.splat %101 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1065767Z         %103 = arith.addi %102, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1066159Z         %104 = tt.expand_dims %103 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1066519Z         %105 = arith.muli %104, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1066829Z         %106 = tt.broadcast %105 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1067138Z         %107 = arith.addi %106, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1067448Z         %108 = tt.addptr %9, %107 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1067769Z         %109 = arith.cmpi sge, %104, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1068016Z         %110 = arith.cmpi slt, %104, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1068254Z         %111 = arith.andi %109, %110 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1068587Z         %112 = tt.broadcast %111 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1068884Z         %113 = arith.andi %112, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1069124Z         %114 = tt.load %108, %113, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1069371Z         %115 = arith.shli %114, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1069608Z         %116 = arith.shrsi %115, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1069847Z         %117 = arith.shrsi %114, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1070137Z         %118 = tt.expand_dims %116 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1070474Z         %119 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1070757Z         %120 = tt.broadcast %118 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1070993Z         %121 = arith.select %17, %120, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1071229Z         %122 = tt.broadcast %119 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1071457Z         %123 = arith.select %19, %122, %121 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1071683Z         %124 = tt.reshape %123 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.1071903Z         %125 = arith.sitofp %124 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.1072224Z         %126 = ttg.convert_layout %125 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1072687Z         %127 = tt.dot %100, %126, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1073033Z         %128 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T08:55:14.1073153Z         %129 = arith.muli %128, %c2_i32 : i32
2026-02-21T08:55:14.1073323Z         %130 = tt.splat %129 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1073546Z         %131 = arith.addi %130, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1073823Z         %132 = tt.expand_dims %131 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:55:14.1074110Z         %133 = tt.broadcast %132 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1074305Z         %134 = arith.addi %48, %133 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1074502Z         %135 = tt.addptr %8, %134 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1074705Z         %136 = tt.load %135 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.1074925Z         %137 = ttg.local_alloc %136 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.1075248Z         %138 = ttg.local_load %137 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1075647Z         %139 = arith.extf %138 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1075929Z         %140 = arith.extsi %128 : i32 to i64
2026-02-21T08:55:14.1076138Z         %141 = tt.splat %140 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1076467Z         %142 = arith.addi %141, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1076855Z         %143 = tt.expand_dims %142 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1077206Z         %144 = arith.muli %143, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1077512Z         %145 = tt.broadcast %144 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1077814Z         %146 = arith.addi %145, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1078119Z         %147 = tt.addptr %9, %146 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1078435Z         %148 = arith.cmpi sge, %143, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1078677Z         %149 = arith.cmpi slt, %143, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1078913Z         %150 = arith.andi %148, %149 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1079215Z         %151 = tt.broadcast %150 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1079508Z         %152 = arith.andi %151, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1079753Z         %153 = tt.load %147, %152, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1080027Z         %154 = arith.shli %153, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1080261Z         %155 = arith.shrsi %154, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1080501Z         %156 = arith.shrsi %153, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1080792Z         %157 = tt.expand_dims %155 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1081129Z         %158 = tt.expand_dims %156 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1081412Z         %159 = tt.broadcast %157 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1081650Z         %160 = arith.select %17, %159, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1081892Z         %161 = tt.broadcast %158 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1082119Z         %162 = arith.select %19, %161, %160 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1082352Z         %163 = tt.reshape %162 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.1082623Z         %164 = arith.sitofp %163 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.1082923Z         %165 = ttg.convert_layout %164 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1083387Z         %166 = tt.dot %139, %165, %127, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1095532Z         %167 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T08:55:14.1095668Z         %168 = arith.muli %167, %c2_i32 : i32
2026-02-21T08:55:14.1095856Z         %169 = tt.splat %168 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1096175Z         %170 = arith.addi %169, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1096456Z         %171 = tt.expand_dims %170 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T08:55:14.1096739Z         %172 = tt.broadcast %171 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1096947Z         %173 = arith.addi %48, %172 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1097154Z         %174 = tt.addptr %8, %173 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1097367Z         %175 = tt.load %174 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.1097591Z         %176 = ttg.local_alloc %175 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.1097927Z         %177 = ttg.local_load %176 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1098336Z         %178 = arith.extf %177 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1098623Z         %179 = arith.extsi %167 : i32 to i64
2026-02-21T08:55:14.1098838Z         %180 = tt.splat %179 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1099137Z         %181 = arith.addi %180, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:55:14.1099534Z         %182 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1099934Z         %183 = arith.muli %182, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1100244Z         %184 = tt.broadcast %183 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1100554Z         %185 = arith.addi %184, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1100862Z         %186 = tt.addptr %9, %185 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1101182Z         %187 = arith.cmpi sge, %182, %cst_7 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1101430Z         %188 = arith.cmpi slt, %182, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1101666Z         %189 = arith.andi %187, %188 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1101977Z         %190 = tt.broadcast %189 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1102277Z         %191 = arith.andi %190, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1102522Z         %192 = tt.load %186, %191, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1102771Z         %193 = arith.shli %192, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1103006Z         %194 = arith.shrsi %193, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1103244Z         %195 = arith.shrsi %192, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1103532Z         %196 = tt.expand_dims %194 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1103871Z         %197 = tt.expand_dims %195 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1104189Z         %198 = tt.broadcast %196 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1104425Z         %199 = arith.select %17, %198, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1104665Z         %200 = tt.broadcast %197 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1104895Z         %201 = arith.select %19, %200, %199 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1105127Z         %202 = tt.reshape %201 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.1105353Z         %203 = arith.sitofp %202 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.1105654Z         %204 = ttg.convert_layout %203 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1106130Z         %205 = tt.dot %178, %204, %166, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1106485Z         scf.yield %205 : tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1106615Z       } {tt.disallow_acc_multi_buffer}
2026-02-21T08:55:14.1106758Z       %59 = arith.addi %48, %27 : tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1106953Z       %60 = tt.addptr %8, %59 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T08:55:14.1107156Z       %61 = tt.load %60 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:14.1107373Z       %62 = ttg.local_alloc %61 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T08:55:14.1107694Z       %63 = ttg.local_load %62 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1108127Z       %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1108453Z       %65 = arith.addi %31, %53 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1108757Z       %66 = tt.addptr %9, %65 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1109060Z       %67 = arith.andi %35, %57 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1109295Z       %68 = tt.load %66, %67, %cst_3 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1109538Z       %69 = arith.shli %68, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1109765Z       %70 = arith.shrsi %69, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1110004Z       %71 = arith.shrsi %68, %cst_10 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1110298Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1110624Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T08:55:14.1110898Z       %74 = tt.broadcast %72 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1111129Z       %75 = arith.select %17, %74, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1111362Z       %76 = tt.broadcast %73 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1111589Z       %77 = arith.select %19, %76, %75 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T08:55:14.1111811Z       %78 = tt.reshape %77 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T08:55:14.1112027Z       %79 = arith.sitofp %78 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T08:55:14.1112347Z       %80 = ttg.convert_layout %79 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1112804Z       %81 = tt.dot %64, %80, %58, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x16xf32, #mma>
2026-02-21T08:55:14.1113180Z       %82 = arith.truncf %81 : tensor<64x16xf32, #mma> to tensor<64x16xbf16, #mma>
2026-02-21T08:55:14.1113434Z       %83 = tt.expand_dims %42 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T08:55:14.1113667Z       %84 = arith.muli %83, %cst_13 : tensor<64x1xi32, #mma>
2026-02-21T08:55:14.1113893Z       %85 = tt.expand_dims %45 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma>
2026-02-21T08:55:14.1114143Z       %86 = tt.broadcast %84 : tensor<64x1xi32, #mma> -> tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1114343Z       %87 = tt.broadcast %85 : tensor<1x16xi32, #mma> -> tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1114516Z       %88 = arith.addi %86, %87 : tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1114697Z       %89 = tt.addptr %20, %88 : tensor<64x16x!tt.ptr<bf16>, #mma>, tensor<64x16xi32, #mma>
2026-02-21T08:55:14.1114885Z       tt.store %89, %82 : tensor<64x16x!tt.ptr<bf16>, #mma>
2026-02-21T08:55:14.1115024Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:55:14.1115131Z     tt.return
2026-02-21T08:55:14.1115212Z   }
2026-02-21T08:55:14.1115290Z }
2026-02-21T08:55:14.1115334Z 
2026-02-21T08:55:14.1115366Z {-#
2026-02-21T08:55:14.1115448Z   external_resources: {
2026-02-21T08:55:14.1115548Z     mlir_reproducer: {
2026-02-21T08:55:14.1116579Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:55:14.1117588Z       disable_threading: false,
2026-02-21T08:55:14.1117693Z       verify_each: true
2026-02-21T08:55:14.1117783Z     }
2026-02-21T08:55:14.1117853Z   }
2026-02-21T08:55:14.1117923Z #-}
2026-02-21T08:55:14.1118202Z /tmp/torchinductor_root/sk/cskbzeetrf2svkrdmuxrjrtonuici6bh4yb2dhzv47xc7dtiicyl.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:55:14.1118950Z /tmp/torchinductor_root/sk/cskbzeetrf2svkrdmuxrjrtonuici6bh4yb2dhzv47xc7dtiicyl.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:55:14.1119506Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:55:14.1120283Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 16], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, False], range_num_stages=[3, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T08:55:14.1120993Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:55:14.1121164Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:55:14.1461058Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:55:14.1465497Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:55:14.1465870Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:55:14.1466179Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:55:14.1466475Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:55:14.1466756Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:55:14.1467015Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:55:14.1467205Z #smem = #ttg.shared_memory
2026-02-21T08:55:14.1467444Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:55:14.1467939Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:55:14.1468320Z     %cst = arith.constant dense<8192> : tensor<16x1xi32, #mma>
2026-02-21T08:55:14.1468498Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:14.1468673Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:14.1468856Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T08:55:14.1469121Z     %cst_3 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1>
2026-02-21T08:55:14.1469299Z     %cst_4 = arith.constant dense<1024> : tensor<16x1xi32, #blocked2>
2026-02-21T08:55:14.1469452Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:55:14.1469570Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:55:14.1469684Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:55:14.1469802Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:55:14.1469916Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:55:14.1470058Z     %cst_5 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked>
2026-02-21T08:55:14.1470209Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:55:14.1470321Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:55:14.1470506Z     %cst_6 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1470693Z     %0 = tt.get_program_id x : i32
2026-02-21T08:55:14.1470846Z     %1 = arith.remsi %0, %c32_i32 : i32
2026-02-21T08:55:14.1470969Z     %2 = arith.divsi %0, %c32_i32 : i32
2026-02-21T08:55:14.1471080Z     %3 = arith.muli %1, %c256_i32 : i32
2026-02-21T08:55:14.1471270Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:14.1471546Z     %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1471797Z     %6 = tt.splat %3 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:14.1472006Z     %7 = tt.splat %3 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1472221Z     %8 = arith.addi %6, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:14.1472421Z     %9 = arith.addi %7, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:14.1472582Z     %10 = arith.muli %2, %c16_i32 : i32
2026-02-21T08:55:14.1472779Z     %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:55:14.1473050Z     %12 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:14.1473333Z     %13 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:55:14.1473542Z     %14 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:14.1473746Z     %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:55:14.1473951Z     %16 = arith.addi %14, %12 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:14.1474185Z     %17 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.1474452Z     %18 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:55:14.1474784Z     %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T08:55:14.1475033Z     %20 = arith.muli %19, %cst_4 : tensor<16x1xi32, #blocked2>
2026-02-21T08:55:14.1475225Z     %21 = tt.broadcast %20 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:55:14.1475436Z     %22 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:55:14.1475699Z     %23 = tt.expand_dims %9 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T08:55:14.1475978Z     %24 = tt.broadcast %23 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T08:55:14.1476186Z     %25 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T08:55:14.1476457Z     %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:55:14.1476912Z     %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:55:14.1477313Z     %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:14.1477563Z     %29 = arith.cmpi eq, %28, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:14.1477757Z     %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T08:55:14.1477954Z     %31 = arith.cmpi eq, %28, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:14.1478139Z     %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T08:55:14.1478406Z     %33 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst_2) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T08:55:14.1478670Z       %43 = tt.splat %arg3 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.1478894Z       %44 = arith.addi %43, %17 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.1479064Z       %45 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T08:55:14.1479226Z       %46 = tt.splat %45 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:55:14.1479435Z       %47 = arith.addi %46, %18 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:55:14.1479704Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:55:14.1479970Z       %49 = tt.broadcast %48 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:55:14.1480153Z       %50 = arith.addi %21, %49 : tensor<16x4xi32, #blocked2>
2026-02-21T08:55:14.1480346Z       %51 = tt.addptr %22, %50 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T08:55:14.1480545Z       %52 = tt.load %51 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:55:14.1480805Z       %53 = ttg.convert_layout %52 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1481235Z       %54 = arith.extf %53 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1481607Z       %55 = tt.expand_dims %44 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T08:55:14.1481844Z       %56 = arith.muli %55, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T08:55:14.1482029Z       %57 = tt.broadcast %56 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T08:55:14.1482214Z       %58 = arith.addi %57, %24 : tensor<2x256xi32, #blocked1>
2026-02-21T08:55:14.1482407Z       %59 = tt.addptr %25, %58 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T08:55:14.1482669Z       %60 = tt.load %59 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T08:55:14.1482911Z       %61 = ttg.convert_layout %60 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1483189Z       %62 = arith.shli %61, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1483417Z       %63 = arith.shrsi %62, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1483646Z       %64 = arith.shrsi %61, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1483927Z       %65 = tt.expand_dims %63 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T08:55:14.1484260Z       %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T08:55:14.1484581Z       %67 = tt.broadcast %65 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T08:55:14.1484815Z       %68 = arith.select %30, %67, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T08:55:14.1485054Z       %69 = tt.broadcast %66 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T08:55:14.1485278Z       %70 = arith.select %32, %69, %68 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T08:55:14.1485508Z       %71 = tt.reshape %70 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T08:55:14.1485725Z       %72 = arith.sitofp %71 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T08:55:14.1485969Z       %73 = ttg.local_alloc %72 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem>
2026-02-21T08:55:14.1486299Z       %74 = ttg.local_load %73 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1486775Z       %75 = tt.dot %54, %74, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T08:55:14.1487129Z       %76 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T08:55:14.1487297Z       %77 = tt.splat %76 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.1487511Z       %78 = arith.addi %77, %17 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:14.1487678Z       %79 = arith.muli %76, %c2_i32 : i32
2026-02-21T08:55:14.1487839Z       %80 = tt.splat %79 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:55:14.1488050Z       %81 = arith.addi %80, %18 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:55:14.1488317Z       %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:55:14.1488587Z       %83 = tt.broadcast %82 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:55:14.1488772Z       %84 = arith.addi %21, %83 : tensor<16x4xi32, #blocked2>
2026-02-21T08:55:14.1489002Z       %85 = tt.addptr %22, %84 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T08:55:14.1489199Z       %86 = tt.load %85 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:55:14.1489457Z       %87 = ttg.convert_layout %86 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1489851Z       %88 = arith.extf %87 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1490225Z       %89 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T08:55:14.1490463Z       %90 = arith.muli %89, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T08:55:14.1490648Z       %91 = tt.broadcast %90 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T08:55:14.1490837Z       %92 = arith.addi %91, %24 : tensor<2x256xi32, #blocked1>
2026-02-21T08:55:14.1491024Z       %93 = tt.addptr %25, %92 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T08:55:14.1491218Z       %94 = tt.load %93 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T08:55:14.1491450Z       %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1491730Z       %96 = arith.shli %95, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1491962Z       %97 = arith.shrsi %96, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1492187Z       %98 = arith.shrsi %95, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:14.1492502Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T08:55:14.1492832Z       %100 = tt.expand_dims %98 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T08:55:14.1493114Z       %101 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T08:55:14.1493354Z       %102 = arith.select %30, %101, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T08:55:14.1493594Z       %103 = tt.broadcast %100 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T08:55:14.1493828Z       %104 = arith.select %32, %103, %102 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T08:55:14.1494057Z       %105 = tt.reshape %104 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T08:55:14.1494280Z       %106 = arith.sitofp %105 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T08:55:14.1494531Z       %107 = ttg.local_alloc %106 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem>
2026-02-21T08:55:14.1494852Z       %108 = ttg.local_load %107 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:14.1495326Z       %109 = tt.dot %88, %108, %75, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T08:55:14.1495671Z       scf.yield %109 : tensor<16x256xf32, #mma>
2026-02-21T08:55:14.1495794Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:55:14.1495945Z     %34 = arith.truncf %33 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T08:55:14.1496199Z     %35 = tt.expand_dims %16 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T08:55:14.1496430Z     %36 = arith.muli %35, %cst : tensor<16x1xi32, #mma>
2026-02-21T08:55:14.1496655Z     %37 = tt.expand_dims %8 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T08:55:14.1496947Z     %38 = tt.broadcast %36 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T08:55:14.1497142Z     %39 = tt.broadcast %37 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T08:55:14.1497312Z     %40 = arith.addi %38, %39 : tensor<16x256xi32, #mma>
2026-02-21T08:55:14.1497479Z     %41 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:55:14.1497683Z     %42 = tt.addptr %41, %40 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi32, #mma>
2026-02-21T08:55:14.1497868Z     tt.store %42, %34 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:55:14.1497992Z     tt.return
2026-02-21T08:55:14.1498069Z   }
2026-02-21T08:55:14.1498138Z }
2026-02-21T08:55:14.1498179Z 
2026-02-21T08:55:14.1498207Z {-#
2026-02-21T08:55:14.1498285Z   external_resources: {
2026-02-21T08:55:14.1498381Z     mlir_reproducer: {
2026-02-21T08:55:14.1499378Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:55:14.1500378Z       disable_threading: false,
2026-02-21T08:55:14.1500479Z       verify_each: true
2026-02-21T08:55:14.1500566Z     }
2026-02-21T08:55:14.1500635Z   }
2026-02-21T08:55:14.1500701Z #-}
2026-02-21T08:55:14.1501011Z /tmp/torchinductor_root/5m/c5m2kcg5c6cotizhcpyzqmazzriuywnv35vebvsgepalqimlyqxy.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:55:14.1501696Z /tmp/torchinductor_root/5m/c5m2kcg5c6cotizhcpyzqmazzriuywnv35vebvsgepalqimlyqxy.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:55:14.1502241Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:55:14.1502959Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 256], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T08:55:14.1503609Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:55:14.1503774Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:55:15.3996627Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:55:15.3999318Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}>
2026-02-21T08:55:15.4000212Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T08:55:15.4000984Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}>
2026-02-21T08:55:15.4001663Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:55:15.4002293Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 32, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:55:15.4002932Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:55:15.4003655Z #smem = #ttg.shared_memory
2026-02-21T08:55:15.4004219Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:55:15.4005369Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:55:15.4006394Z     %cst = arith.constant dense<4096> : tensor<512x1xi64, #mma>
2026-02-21T08:55:15.4006812Z     %cst_0 = arith.constant dense<0> : tensor<512x1xi64, #mma>
2026-02-21T08:55:15.4007210Z     %cst_1 = arith.constant dense<8192> : tensor<512x1xi64, #mma>
2026-02-21T08:55:15.4007619Z     %cst_2 = arith.constant dense<8192> : tensor<1x256xi64, #mma>
2026-02-21T08:55:15.4008007Z     %cst_3 = arith.constant dense<0> : tensor<1x256xi64, #mma>
2026-02-21T08:55:15.4008421Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:15.4008835Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:15.4009270Z     %cst_6 = arith.constant dense<1024> : tensor<512x1xi32, #blocked1>
2026-02-21T08:55:15.4009637Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:55:15.4010004Z     %cst_7 = arith.constant dense<0.000000e+00> : tensor<512x256xf32, #mma>
2026-02-21T08:55:15.4010328Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:55:15.4010518Z     %c512_i64 = arith.constant 512 : i64
2026-02-21T08:55:15.4010710Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:55:15.4010899Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:55:15.4011087Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:55:15.4011272Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:55:15.4011657Z     %cst_8 = arith.constant dense<0> : tensor<1x256xi8, #blocked2>
2026-02-21T08:55:15.4011947Z     %cst_9 = arith.constant dense<8192> : tensor<1x256xi64, #blocked2>
2026-02-21T08:55:15.4012242Z     %cst_10 = arith.constant dense<0> : tensor<1x256xi64, #blocked2>
2026-02-21T08:55:15.4012524Z     %cst_11 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked>
2026-02-21T08:55:15.4012758Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:55:15.4012948Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:55:15.4013122Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:55:15.4013309Z     %c4864_i32 = arith.constant 4864 : i32
2026-02-21T08:55:15.4013611Z     %cst_12 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:15.4013923Z     %0 = tt.get_program_id x : i32
2026-02-21T08:55:15.4014253Z     %1 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:15.4014718Z     %2 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:15.4015164Z     %3 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:15.4015567Z     %4 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<512x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:15.4015888Z     %5 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T08:55:15.4016284Z     %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:55:15.4016815Z     %7 = arith.extsi %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:55:15.4017414Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:55:15.4018108Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:55:15.4018822Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:15.4019241Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:15.4019566Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T08:55:15.4019883Z     %13 = arith.cmpi eq, %10, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:55:15.4020213Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T08:55:15.4020507Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<512x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:55:15.4020847Z     %16 = arith.extsi %2 : tensor<512xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:15.4021221Z     %17 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:15.4021608Z     %18 = arith.extsi %17 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:15.4021921Z     scf.for %arg3 = %0 to %c256_i32 step %c4864_i32  : i32 {
2026-02-21T08:55:15.4022097Z       %19 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T08:55:15.4022250Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T08:55:15.4022395Z       %21 = arith.subi %c8_i32, %20 : i32
2026-02-21T08:55:15.4022540Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T08:55:15.4022692Z       %23 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T08:55:15.4022835Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T08:55:15.4022971Z       %25 = arith.addi %20, %24 : i32
2026-02-21T08:55:15.4023110Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T08:55:15.4023252Z       %27 = arith.muli %25, %c512_i32 : i32
2026-02-21T08:55:15.4023517Z       %28 = tt.splat %27 : i32 -> tensor<512xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:15.4023799Z       %29 = arith.addi %28, %1 : tensor<512xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:55:15.4024019Z       %30 = arith.muli %26, %c256_i32 : i32
2026-02-21T08:55:15.4024295Z       %31 = tt.expand_dims %29 {axis = 1 : i32} : tensor<512xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<512x1xi32, #blocked1>
2026-02-21T08:55:15.4024617Z       %32 = arith.muli %31, %cst_6 : tensor<512x1xi32, #blocked1>
2026-02-21T08:55:15.4024854Z       %33 = tt.broadcast %32 : tensor<512x1xi32, #blocked1> -> tensor<512x2xi32, #blocked1>
2026-02-21T08:55:15.4025071Z       %34 = arith.extsi %30 : i32 to i64
2026-02-21T08:55:15.4025276Z       %35 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:55:15.4025554Z       %36 = arith.addi %35, %7 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:55:15.4025891Z       %37 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi64, #blocked2>
2026-02-21T08:55:15.4026218Z       %38 = arith.cmpi sge, %37, %cst_10 : tensor<1x256xi64, #blocked2>
2026-02-21T08:55:15.4026431Z       %39 = arith.cmpi slt, %37, %cst_9 : tensor<1x256xi64, #blocked2>
2026-02-21T08:55:15.4026635Z       %40 = arith.andi %38, %39 : tensor<1x256xi1, #blocked2>
2026-02-21T08:55:15.4026921Z       %41 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst_7) -> (tensor<512x256xf32, #mma>)  : i32 {
2026-02-21T08:55:15.4027199Z         %64 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:55:15.4027405Z         %65 = tt.splat %64 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:15.4027671Z         %66 = arith.addi %65, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:15.4028012Z         %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:55:15.4028345Z         %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<512x2xi32, #blocked1>
2026-02-21T08:55:15.4028627Z         %69 = arith.addi %33, %68 : tensor<512x2xi32, #blocked1>
2026-02-21T08:55:15.4028865Z         %70 = tt.addptr %4, %69 : tensor<512x2x!tt.ptr<bf16>, #blocked1>, tensor<512x2xi32, #blocked1>
2026-02-21T08:55:15.4029118Z         %71 = tt.load %70 : tensor<512x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:15.4029396Z         %72 = ttg.local_alloc %71 : (tensor<512x2xbf16, #blocked1>) -> !ttg.memdesc<512x2xbf16, #shared, #smem>
2026-02-21T08:55:15.4029807Z         %73 = ttg.local_load %72 : !ttg.memdesc<512x2xbf16, #shared, #smem> -> tensor<512x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:15.4030328Z         %74 = arith.extf %73 : tensor<512x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<512x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:15.4030616Z         %75 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:55:15.4030747Z         %76 = arith.muli %75, %c8192_i64 : i64
2026-02-21T08:55:15.4030892Z         %77 = tt.splat %76 : i64 -> tensor<1x256xi64, #blocked2>
2026-02-21T08:55:15.4031048Z         %78 = arith.addi %77, %37 : tensor<1x256xi64, #blocked2>
2026-02-21T08:55:15.4031243Z         %79 = tt.addptr %5, %78 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi64, #blocked2>
2026-02-21T08:55:15.4031451Z         %80 = tt.load %79, %40, %cst_8 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T08:55:15.4031709Z         %81 = ttg.convert_layout %80 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:15.4031992Z         %82 = arith.shli %81, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:15.4032226Z         %83 = arith.shrsi %82, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:15.4032499Z         %84 = arith.shrsi %81, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:15.4032789Z         %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:55:15.4033128Z         %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:55:15.4033412Z         %87 = tt.broadcast %85 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:55:15.4033650Z         %88 = arith.select %12, %87, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:55:15.4033885Z         %89 = tt.broadcast %86 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:55:15.4034112Z         %90 = arith.select %14, %89, %88 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:55:15.4034343Z         %91 = tt.reshape %90 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T08:55:15.4034566Z         %92 = arith.sitofp %91 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T08:55:15.4034816Z         %93 = ttg.local_alloc %92 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared1, #smem>
2026-02-21T08:55:15.4035141Z         %94 = ttg.local_load %93 : !ttg.memdesc<2x256xf32, #shared1, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:15.4035617Z         %95 = tt.dot %74, %94, %arg5, inputPrecision = tf32 : tensor<512x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<512x256xf32, #mma>
2026-02-21T08:55:15.4035965Z         %96 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T08:55:15.4036089Z         %97 = arith.muli %96, %c2_i32 : i32
2026-02-21T08:55:15.4036256Z         %98 = tt.splat %97 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:15.4036477Z         %99 = arith.addi %98, %3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:55:15.4036788Z         %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:55:15.4037065Z         %101 = tt.broadcast %100 : tensor<1x2xi32, #blocked1> -> tensor<512x2xi32, #blocked1>
2026-02-21T08:55:15.4037265Z         %102 = arith.addi %33, %101 : tensor<512x2xi32, #blocked1>
2026-02-21T08:55:15.4037469Z         %103 = tt.addptr %4, %102 : tensor<512x2x!tt.ptr<bf16>, #blocked1>, tensor<512x2xi32, #blocked1>
2026-02-21T08:55:15.4037678Z         %104 = tt.load %103 : tensor<512x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:55:15.4037905Z         %105 = ttg.local_alloc %104 : (tensor<512x2xbf16, #blocked1>) -> !ttg.memdesc<512x2xbf16, #shared, #smem>
2026-02-21T08:55:15.4038241Z         %106 = ttg.local_load %105 : !ttg.memdesc<512x2xbf16, #shared, #smem> -> tensor<512x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:15.4038653Z         %107 = arith.extf %106 : tensor<512x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<512x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:15.4038940Z         %108 = arith.extsi %96 : i32 to i64
2026-02-21T08:55:15.4039065Z         %109 = arith.muli %108, %c8192_i64 : i64
2026-02-21T08:55:15.4039210Z         %110 = tt.splat %109 : i64 -> tensor<1x256xi64, #blocked2>
2026-02-21T08:55:15.4039373Z         %111 = arith.addi %110, %37 : tensor<1x256xi64, #blocked2>
2026-02-21T08:55:15.4039575Z         %112 = tt.addptr %5, %111 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi64, #blocked2>
2026-02-21T08:55:15.4039767Z         %113 = arith.cmpi slt, %108, %c512_i64 : i64
2026-02-21T08:55:15.4039914Z         %114 = tt.splat %113 : i1 -> tensor<1x256xi1, #blocked2>
2026-02-21T08:55:15.4040071Z         %115 = arith.andi %114, %40 : tensor<1x256xi1, #blocked2>
2026-02-21T08:55:15.4040294Z         %116 = tt.load %112, %115, %cst_8 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T08:55:15.4040561Z         %117 = ttg.convert_layout %116 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:15.4040842Z         %118 = arith.shli %117, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:15.4041079Z         %119 = arith.shrsi %118, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:15.4041314Z         %120 = arith.shrsi %117, %cst_12 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:55:15.4041606Z         %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:55:15.4041946Z         %122 = tt.expand_dims %120 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T08:55:15.4042235Z         %123 = tt.broadcast %121 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:55:15.4042480Z         %124 = arith.select %12, %123, %cst_11 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:55:15.4042769Z         %125 = tt.broadcast %122 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T08:55:15.4043004Z         %126 = arith.select %14, %125, %124 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T08:55:15.4043235Z         %127 = tt.reshape %126 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T08:55:15.4043460Z         %128 = arith.sitofp %127 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T08:55:15.4043718Z         %129 = ttg.local_alloc %128 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared1, #smem>
2026-02-21T08:55:15.4044044Z         %130 = ttg.local_load %129 : !ttg.memdesc<2x256xf32, #shared1, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:55:15.4044518Z         %131 = tt.dot %107, %130, %95, inputPrecision = tf32 : tensor<512x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<512x256xf32, #mma>
2026-02-21T08:55:15.4044909Z         scf.yield %131 : tensor<512x256xf32, #mma>
2026-02-21T08:55:15.4045059Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:55:15.4045247Z       %42 = arith.truncf %41 : tensor<512x256xf32, #mma> to tensor<512x256xbf16, #mma>
2026-02-21T08:55:15.4045414Z       %43 = arith.extsi %27 : i32 to i64
2026-02-21T08:55:15.4045581Z       %44 = tt.splat %43 : i64 -> tensor<512xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:15.4045789Z       %45 = arith.addi %44, %16 : tensor<512xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:55:15.4046052Z       %46 = tt.expand_dims %45 {axis = 1 : i32} : tensor<512xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<512x1xi64, #mma>
2026-02-21T08:55:15.4046287Z       %47 = arith.muli %46, %cst_1 : tensor<512x1xi64, #mma>
2026-02-21T08:55:15.4046464Z       %48 = tt.broadcast %47 : tensor<512x1xi64, #mma> -> tensor<512x256xi64, #mma>
2026-02-21T08:55:15.4046667Z       %49 = tt.splat %34 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:15.4046874Z       %50 = arith.addi %49, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:55:15.4047136Z       %51 = tt.expand_dims %50 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T08:55:15.4047395Z       %52 = tt.broadcast %51 : tensor<1x256xi64, #mma> -> tensor<512x256xi64, #mma>
2026-02-21T08:55:15.4047573Z       %53 = arith.addi %48, %52 : tensor<512x256xi64, #mma>
2026-02-21T08:55:15.4047760Z       %54 = tt.addptr %15, %53 : tensor<512x256x!tt.ptr<bf16>, #mma>, tensor<512x256xi64, #mma>
2026-02-21T08:55:15.4048000Z       %55 = arith.cmpi sge, %46, %cst_0 : tensor<512x1xi64, #mma>
2026-02-21T08:55:15.4048160Z       %56 = arith.cmpi slt, %46, %cst : tensor<512x1xi64, #mma>
2026-02-21T08:55:15.4048312Z       %57 = arith.andi %55, %56 : tensor<512x1xi1, #mma>
2026-02-21T08:55:15.4048477Z       %58 = tt.broadcast %57 : tensor<512x1xi1, #mma> -> tensor<512x256xi1, #mma>
2026-02-21T08:55:15.4048658Z       %59 = arith.cmpi sge, %51, %cst_3 : tensor<1x256xi64, #mma>
2026-02-21T08:55:15.4048815Z       %60 = arith.cmpi slt, %51, %cst_2 : tensor<1x256xi64, #mma>
2026-02-21T08:55:15.4048966Z       %61 = arith.andi %59, %60 : tensor<1x256xi1, #mma>
2026-02-21T08:55:15.4049132Z       %62 = tt.broadcast %61 : tensor<1x256xi1, #mma> -> tensor<512x256xi1, #mma>
2026-02-21T08:55:15.4049302Z       %63 = arith.andi %58, %62 : tensor<512x256xi1, #mma>
2026-02-21T08:55:15.4049458Z       tt.store %54, %42, %63 : tensor<512x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:55:15.4049663Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T08:55:15.4049848Z     tt.return
2026-02-21T08:55:15.4049927Z   }
2026-02-21T08:55:15.4050004Z }
2026-02-21T08:55:15.4050050Z 
2026-02-21T08:55:15.4050083Z {-#
2026-02-21T08:55:15.4050163Z   external_resources: {
2026-02-21T08:55:15.4050265Z     mlir_reproducer: {
2026-02-21T08:55:15.4051253Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:55:15.4052246Z       disable_threading: false,
2026-02-21T08:55:15.4052357Z       verify_each: true
2026-02-21T08:55:15.4052445Z     }
2026-02-21T08:55:15.4052518Z   }
2026-02-21T08:55:15.4052585Z #-}
2026-02-21T08:55:15.4052898Z /tmp/torchinductor_root/h5/ch546cwt6jmz5jakx6zpb6tj35nv6hr6ogcmrgeettbjvatzf4ob.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:55:15.4053571Z /tmp/torchinductor_root/h5/ch546cwt6jmz5jakx6zpb6tj35nv6hr6ogcmrgeettbjvatzf4ob.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:55:15.4054115Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:55:15.4054900Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 512, 256], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[2, 4], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T08:55:15.4055616Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:55:15.4055781Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:55:16.3142655Z Initial population exploring neighbors  12% ━╸              12/100 3.3 configs/s
2026-02-21T08:55:16.3145262Z WARNING:tritonbench.utils.triton_op:Completed input ID 17:
2026-02-21T08:55:16.3145681Z x_val
2026-02-21T08:55:16.3145920Z ---------------------
2026-02-21T08:55:16.3146181Z (1, 4096, 8192, 1024)
2026-02-21T08:55:16.3146337Z 
2026-02-21T08:55:16.3161607Z  60%|██████    | 6/10 [45:55<24:07, 361.86s/it]WARNING:tritonbench.utils.triton_op:Running input ID 21:
2026-02-21T08:55:16.3161909Z x_val
2026-02-21T08:55:16.3162036Z ---------------------
2026-02-21T08:55:16.3162790Z (4, 4096, 8192, 1024)
2026-02-21T08:55:16.3165170Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T08:55:17.3446484Z INFO:tritonbench.utils.triton_op:Took 4.36ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T08:55:19.3304995Z Autotune Choices Stats:
2026-02-21T08:55:19.3306667Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "triton_mm_155", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.4005120098590851, "best_triton_pos": 0}
2026-02-21T08:55:19.3313854Z AUTOTUNE mm(16384x1024, 1024x8192)
2026-02-21T08:55:19.3315691Z strides: [1024, 1], [8192, 1]
2026-02-21T08:55:19.3316162Z dtypes: torch.bfloat16, torch.bfloat16
2026-02-21T08:55:19.3317262Z   triton_mm_155 0.4005 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:55:19.3319039Z   triton_mm_152 0.4050 ms 98.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:55:19.3320550Z   triton_mm_149 0.4173 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:55:19.3321448Z   mm 0.4410 ms 90.8% 
2026-02-21T08:55:19.3322302Z   triton_mm_143 0.4791 ms 83.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:55:19.3323760Z   triton_mm_150 0.4793 ms 83.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:55:19.3325451Z   triton_mm_146 0.4929 ms 81.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:55:19.3326654Z   triton_mm_154 0.4969 ms 80.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:55:19.3327853Z   triton_mm_153 0.5374 ms 74.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:55:19.3329066Z   triton_mm_148 0.5606 ms 71.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:55:19.3329999Z SingleProcess AUTOTUNE benchmarking takes 1.4528 seconds and 0.3548 seconds precompiling for 37 choices
2026-02-21T08:55:20.7390495Z INFO:tritonbench.utils.triton_op:Took 0.14ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T08:55:20.7401424Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:55:20.7401831Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:55:20.7402192Z               'dtype': 'torch.bfloat16',
2026-02-21T08:55:20.7402521Z               'shape': (4, 4096, 1024),
2026-02-21T08:55:20.7402939Z               'stride': (4194304, 1024, 1)},
2026-02-21T08:55:20.7403219Z             { 'device': 'cuda:0',
2026-02-21T08:55:20.7403464Z               'dtype': 'torch.int32',
2026-02-21T08:55:20.7404220Z               'shape': (1024, 8192),
2026-02-21T08:55:20.7404469Z               'stride': (8192, 1)}),
2026-02-21T08:55:20.7404713Z   'kwargs': {}}
2026-02-21T08:55:20.7417853Z INFO:tritonbench.utils.triton_op:Took 1.87ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T08:55:20.9163230Z [0s] Autotune random seed: 2134834638
2026-02-21T08:55:21.0209622Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:55:58.8921394Z [37s] Timeout after 30s compiling Config(block_sizes=[128, 2, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:56:02.9408338Z [41s] Timeout after 30s compiling Config(block_sizes=[32, 2048, 1], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:56:02.9428940Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s
2026-02-21T08:56:07.0499481Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:56:07.0508226Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:56:07.0511274Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T08:56:07.0511835Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T08:56:07.0512804Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T08:56:07.0513329Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:56:07.0514092Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:56:07.0514755Z     %cst = arith.constant dense<0.000000e+00> : tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0515025Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:56:07.0515225Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:56:07.0515417Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:56:07.0515683Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:56:07.0515935Z     %cst_0 = arith.constant dense<0> : tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0516191Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T08:56:07.0516391Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T08:56:07.0516556Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:56:07.0516740Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T08:56:07.0516917Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:56:07.0517097Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T08:56:07.0517371Z     %cst_1 = arith.constant dense<0> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0517759Z     %cst_2 = arith.constant dense<8192> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0518131Z     %cst_3 = arith.constant dense<0> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0518638Z     %cst_4 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1>
2026-02-21T08:56:07.0518941Z     %cst_5 = arith.constant dense<4> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0519258Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:07.0519506Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:07.0519757Z     %cst_8 = arith.constant dense<8192> : tensor<32x1xi32, #mma>
2026-02-21T08:56:07.0519972Z     %0 = tt.get_program_id x : i32
2026-02-21T08:56:07.0520129Z     %1 = arith.muli %0, %c4_i32 : i32
2026-02-21T08:56:07.0520293Z     %2 = arith.addi %1, %c4_i32 : i32
2026-02-21T08:56:07.0520464Z     %3 = arith.minsi %2, %c131072_i32 : i32
2026-02-21T08:56:07.0520758Z     %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.0521159Z     %5 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:07.0521654Z     %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:56:07.0522102Z     %7 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:07.0522486Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0522944Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0523297Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0523873Z     %11 = arith.extsi %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:56:07.0524507Z     %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:56:07.0525160Z     %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:56:07.0525749Z     %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:07.0526118Z     %15 = arith.cmpi eq, %14, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:07.0526390Z     %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked>
2026-02-21T08:56:07.0526692Z     %17 = arith.cmpi eq, %14, %cst_7 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:07.0526901Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked>
2026-02-21T08:56:07.0527143Z     %19 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:07.0527324Z     %20 = arith.subi %3, %1 : i32
2026-02-21T08:56:07.0527453Z     %21 = arith.remsi %20, %c2_i32 : i32
2026-02-21T08:56:07.0527598Z     %22 = arith.subi %20, %21 : i32
2026-02-21T08:56:07.0527723Z     %23 = arith.addi %1, %22 : i32
2026-02-21T08:56:07.0527868Z     scf.for %arg3 = %1 to %23 step %c2_i32  : i32 {
2026-02-21T08:56:07.0528023Z       %24 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T08:56:07.0528169Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T08:56:07.0528307Z       %26 = arith.muli %24, %c32_i32 : i32
2026-02-21T08:56:07.0528499Z       %27 = tt.splat %26 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.0528748Z       %28 = tt.splat %26 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:07.0528987Z       %29 = arith.addi %27, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.0529274Z       %30 = arith.addi %28, %5 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:07.0529465Z       %31 = arith.muli %25, %c32_i32 : i32
2026-02-21T08:56:07.0529649Z       %32 = tt.splat %31 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:07.0529877Z       %33 = arith.addi %32, %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:07.0530174Z       %34 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T08:56:07.0530465Z       %35 = arith.muli %34, %cst_4 : tensor<32x1xi32, #blocked1>
2026-02-21T08:56:07.0530680Z       %36 = tt.broadcast %35 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0530884Z       %37 = arith.extsi %31 : i32 to i64
2026-02-21T08:56:07.0531117Z       %38 = tt.splat %37 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:56:07.0531459Z       %39 = arith.addi %38, %11 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:56:07.0531901Z       %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0532313Z       %41 = arith.cmpi sge, %40, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0532587Z       %42 = arith.cmpi slt, %40, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0532857Z       %43 = arith.andi %41, %42 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0533152Z       %44 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T08:56:07.0533400Z         %85 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:56:07.0533594Z         %86 = tt.splat %85 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0533852Z         %87 = arith.addi %86, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0534206Z         %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0534519Z         %89 = tt.broadcast %88 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0534740Z         %90 = arith.addi %36, %89 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0534967Z         %91 = tt.addptr %9, %90 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0535198Z         %92 = tt.load %91 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0535507Z         %93 = ttg.convert_layout %92 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0535965Z         %94 = arith.extf %93 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0536285Z         %95 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:56:07.0536428Z         %96 = arith.muli %95, %c8192_i64 : i64
2026-02-21T08:56:07.0536609Z         %97 = tt.splat %96 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0536838Z         %98 = arith.addi %97, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0537146Z         %99 = tt.addptr %10, %98 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0537474Z         %100 = tt.load %99, %43, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0537726Z         %101 = arith.shli %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0538012Z         %102 = arith.shrsi %101, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0538255Z         %103 = arith.shrsi %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0538553Z         %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0538894Z         %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0539181Z         %106 = tt.broadcast %104 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0539427Z         %107 = arith.select %16, %106, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0539672Z         %108 = tt.broadcast %105 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0539906Z         %109 = arith.select %18, %108, %107 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0540146Z         %110 = tt.reshape %109 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0540371Z         %111 = arith.sitofp %110 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0540671Z         %112 = ttg.convert_layout %111 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0541147Z         %113 = tt.dot %94, %112, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0541498Z         %114 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T08:56:07.0541629Z         %115 = arith.muli %114, %c2_i32 : i32
2026-02-21T08:56:07.0541806Z         %116 = tt.splat %115 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0542035Z         %117 = arith.addi %116, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0542316Z         %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0542637Z         %119 = tt.broadcast %118 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0542839Z         %120 = arith.addi %36, %119 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0543042Z         %121 = tt.addptr %9, %120 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0543257Z         %122 = tt.load %121 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0543532Z         %123 = ttg.convert_layout %122 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0543941Z         %124 = arith.extf %123 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0544229Z         %125 = arith.extsi %114 : i32 to i64
2026-02-21T08:56:07.0544356Z         %126 = arith.muli %125, %c8192_i64 : i64
2026-02-21T08:56:07.0544542Z         %127 = tt.splat %126 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0544782Z         %128 = arith.addi %127, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0545098Z         %129 = tt.addptr %10, %128 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0545432Z         %130 = tt.load %129, %43, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0545685Z         %131 = arith.shli %130, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0545923Z         %132 = arith.shrsi %131, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0546220Z         %133 = arith.shrsi %130, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0546518Z         %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0546859Z         %135 = tt.expand_dims %133 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0547147Z         %136 = tt.broadcast %134 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0547387Z         %137 = arith.select %16, %136, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0547627Z         %138 = tt.broadcast %135 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0547860Z         %139 = arith.select %18, %138, %137 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0548099Z         %140 = tt.reshape %139 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0548325Z         %141 = arith.sitofp %140 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0548626Z         %142 = ttg.convert_layout %141 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0549094Z         %143 = tt.dot %124, %142, %113, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0549441Z         %144 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T08:56:07.0549572Z         %145 = arith.muli %144, %c2_i32 : i32
2026-02-21T08:56:07.0549748Z         %146 = tt.splat %145 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0549974Z         %147 = arith.addi %146, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0550259Z         %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0550576Z         %149 = tt.broadcast %148 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0550780Z         %150 = arith.addi %36, %149 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0550986Z         %151 = tt.addptr %9, %150 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0551191Z         %152 = tt.load %151 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0551463Z         %153 = ttg.convert_layout %152 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0551864Z         %154 = arith.extf %153 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0552153Z         %155 = arith.extsi %144 : i32 to i64
2026-02-21T08:56:07.0552283Z         %156 = arith.muli %155, %c8192_i64 : i64
2026-02-21T08:56:07.0552464Z         %157 = tt.splat %156 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0552701Z         %158 = arith.addi %157, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0553013Z         %159 = tt.addptr %10, %158 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0553341Z         %160 = tt.load %159, %43, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0553593Z         %161 = arith.shli %160, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0553830Z         %162 = arith.shrsi %161, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0554101Z         %163 = arith.shrsi %160, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0554392Z         %164 = tt.expand_dims %162 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0554735Z         %165 = tt.expand_dims %163 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0555029Z         %166 = tt.broadcast %164 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0555268Z         %167 = arith.select %16, %166, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0555512Z         %168 = tt.broadcast %165 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0555744Z         %169 = arith.select %18, %168, %167 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0555983Z         %170 = tt.reshape %169 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0556209Z         %171 = arith.sitofp %170 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0556506Z         %172 = ttg.convert_layout %171 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0556972Z         %173 = tt.dot %154, %172, %143, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0557323Z         scf.yield %173 : tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0557456Z       } {tt.disallow_acc_multi_buffer}
2026-02-21T08:56:07.0557676Z       %45 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %44) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T08:56:07.0557895Z         %85 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:56:07.0558075Z         %86 = tt.splat %85 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0558297Z         %87 = arith.addi %86, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0558610Z         %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0558888Z         %89 = tt.broadcast %88 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0559081Z         %90 = arith.addi %36, %89 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0559282Z         %91 = tt.addptr %9, %90 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0559484Z         %92 = tt.load %91 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0559750Z         %93 = ttg.convert_layout %92 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0560153Z         %94 = arith.extf %93 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0560438Z         %95 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:56:07.0560564Z         %96 = arith.muli %95, %c8192_i64 : i64
2026-02-21T08:56:07.0560734Z         %97 = tt.splat %96 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0560964Z         %98 = arith.addi %97, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0561271Z         %99 = tt.addptr %10, %98 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0561592Z         %100 = tt.load %99, %43, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0561842Z         %101 = arith.shli %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0562117Z         %102 = arith.shrsi %101, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0562392Z         %103 = arith.shrsi %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0562732Z         %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0563072Z         %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0563361Z         %106 = tt.broadcast %104 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0563607Z         %107 = arith.select %16, %106, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0563846Z         %108 = tt.broadcast %105 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0564085Z         %109 = arith.select %18, %108, %107 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0564318Z         %110 = tt.reshape %109 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0564548Z         %111 = arith.sitofp %110 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0564848Z         %112 = ttg.convert_layout %111 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0565314Z         %113 = tt.dot %94, %112, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0565669Z         scf.yield %113 : tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0565821Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:56:07.0566013Z       %46 = arith.truncf %45 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T08:56:07.0566280Z       %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma>
2026-02-21T08:56:07.0566558Z       %48 = arith.muli %47, %cst_8 : tensor<32x1xi32, #mma>
2026-02-21T08:56:07.0566790Z       %49 = tt.expand_dims %33 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma>
2026-02-21T08:56:07.0567045Z       %50 = tt.broadcast %48 : tensor<32x1xi32, #mma> -> tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0567246Z       %51 = tt.broadcast %49 : tensor<1x32xi32, #mma> -> tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0567426Z       %52 = arith.addi %50, %51 : tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0567609Z       %53 = tt.addptr %19, %52 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0567802Z       tt.store %53, %46 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:07.0567941Z       %54 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T08:56:07.0568070Z       %55 = arith.remsi %54, %c512_i32 : i32
2026-02-21T08:56:07.0568191Z       %56 = arith.divsi %54, %c512_i32 : i32
2026-02-21T08:56:07.0568314Z       %57 = arith.muli %55, %c32_i32 : i32
2026-02-21T08:56:07.0568488Z       %58 = tt.splat %57 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.0568701Z       %59 = tt.splat %57 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:07.0568919Z       %60 = arith.addi %58, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.0569129Z       %61 = arith.addi %59, %5 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:07.0569300Z       %62 = arith.muli %56, %c32_i32 : i32
2026-02-21T08:56:07.0569457Z       %63 = tt.splat %62 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:07.0569663Z       %64 = arith.addi %63, %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:07.0569977Z       %65 = tt.expand_dims %60 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T08:56:07.0570227Z       %66 = arith.muli %65, %cst_4 : tensor<32x1xi32, #blocked1>
2026-02-21T08:56:07.0570426Z       %67 = tt.broadcast %66 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0570598Z       %68 = arith.extsi %62 : i32 to i64
2026-02-21T08:56:07.0570807Z       %69 = tt.splat %68 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:56:07.0571107Z       %70 = arith.addi %69, %11 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:56:07.0571493Z       %71 = tt.expand_dims %70 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0571856Z       %72 = arith.cmpi sge, %71, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0572098Z       %73 = arith.cmpi slt, %71, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0594337Z       %74 = arith.andi %72, %73 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0594634Z       %75 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T08:56:07.0594858Z         %85 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:56:07.0595045Z         %86 = tt.splat %85 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0595279Z         %87 = arith.addi %86, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0595561Z         %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0595849Z         %89 = tt.broadcast %88 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0596051Z         %90 = arith.addi %67, %89 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0596257Z         %91 = tt.addptr %9, %90 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0625171Z         %92 = tt.load %91 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0625454Z         %93 = ttg.convert_layout %92 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0625873Z         %94 = arith.extf %93 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0626163Z         %95 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:56:07.0626295Z         %96 = arith.muli %95, %c8192_i64 : i64
2026-02-21T08:56:07.0626474Z         %97 = tt.splat %96 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0626715Z         %98 = arith.addi %97, %71 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0627026Z         %99 = tt.addptr %10, %98 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0627355Z         %100 = tt.load %99, %74, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0627608Z         %101 = arith.shli %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0627845Z         %102 = arith.shrsi %101, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0628085Z         %103 = arith.shrsi %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0628384Z         %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0628843Z         %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0629129Z         %106 = tt.broadcast %104 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0629371Z         %107 = arith.select %16, %106, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0629608Z         %108 = tt.broadcast %105 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0629839Z         %109 = arith.select %18, %108, %107 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0630066Z         %110 = tt.reshape %109 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0630295Z         %111 = arith.sitofp %110 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0630591Z         %112 = ttg.convert_layout %111 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0631067Z         %113 = tt.dot %94, %112, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0631424Z         %114 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T08:56:07.0631549Z         %115 = arith.muli %114, %c2_i32 : i32
2026-02-21T08:56:07.0631724Z         %116 = tt.splat %115 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0631951Z         %117 = arith.addi %116, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0632222Z         %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0632496Z         %119 = tt.broadcast %118 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0632694Z         %120 = arith.addi %67, %119 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0632902Z         %121 = tt.addptr %9, %120 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0633161Z         %122 = tt.load %121 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0633425Z         %123 = ttg.convert_layout %122 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0633827Z         %124 = arith.extf %123 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0634107Z         %125 = arith.extsi %114 : i32 to i64
2026-02-21T08:56:07.0634233Z         %126 = arith.muli %125, %c8192_i64 : i64
2026-02-21T08:56:07.0634409Z         %127 = tt.splat %126 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0634646Z         %128 = arith.addi %127, %71 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0634960Z         %129 = tt.addptr %10, %128 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0635287Z         %130 = tt.load %129, %74, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0635536Z         %131 = arith.shli %130, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0635777Z         %132 = arith.shrsi %131, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0636014Z         %133 = arith.shrsi %130, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0636306Z         %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0636675Z         %135 = tt.expand_dims %133 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0636963Z         %136 = tt.broadcast %134 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0637207Z         %137 = arith.select %16, %136, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0637442Z         %138 = tt.broadcast %135 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0637676Z         %139 = arith.select %18, %138, %137 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0637907Z         %140 = tt.reshape %139 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0638133Z         %141 = arith.sitofp %140 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0638431Z         %142 = ttg.convert_layout %141 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0638895Z         %143 = tt.dot %124, %142, %113, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0639245Z         %144 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T08:56:07.0639368Z         %145 = arith.muli %144, %c2_i32 : i32
2026-02-21T08:56:07.0639543Z         %146 = tt.splat %145 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0639771Z         %147 = arith.addi %146, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0640045Z         %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0640323Z         %149 = tt.broadcast %148 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0640517Z         %150 = arith.addi %67, %149 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0640723Z         %151 = tt.addptr %9, %150 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0640968Z         %152 = tt.load %151 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0641231Z         %153 = ttg.convert_layout %152 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0641634Z         %154 = arith.extf %153 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0641911Z         %155 = arith.extsi %144 : i32 to i64
2026-02-21T08:56:07.0642040Z         %156 = arith.muli %155, %c8192_i64 : i64
2026-02-21T08:56:07.0642218Z         %157 = tt.splat %156 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0642443Z         %158 = arith.addi %157, %71 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0642825Z         %159 = tt.addptr %10, %158 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0643154Z         %160 = tt.load %159, %74, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0643401Z         %161 = arith.shli %160, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0643640Z         %162 = arith.shrsi %161, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0643879Z         %163 = arith.shrsi %160, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0644171Z         %164 = tt.expand_dims %162 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0644504Z         %165 = tt.expand_dims %163 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0644866Z         %166 = tt.broadcast %164 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0645105Z         %167 = arith.select %16, %166, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0645344Z         %168 = tt.broadcast %165 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0645573Z         %169 = arith.select %18, %168, %167 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0645805Z         %170 = tt.reshape %169 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0646028Z         %171 = arith.sitofp %170 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0646323Z         %172 = ttg.convert_layout %171 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0646787Z         %173 = tt.dot %154, %172, %143, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0647136Z         scf.yield %173 : tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0647266Z       } {tt.disallow_acc_multi_buffer}
2026-02-21T08:56:07.0647479Z       %76 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %75) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T08:56:07.0647693Z         %85 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:56:07.0647864Z         %86 = tt.splat %85 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0648081Z         %87 = arith.addi %86, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0648356Z         %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0648637Z         %89 = tt.broadcast %88 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0648828Z         %90 = arith.addi %67, %89 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0649067Z         %91 = tt.addptr %9, %90 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0649265Z         %92 = tt.load %91 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0649532Z         %93 = ttg.convert_layout %92 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0649932Z         %94 = arith.extf %93 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0650212Z         %95 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:56:07.0650339Z         %96 = arith.muli %95, %c8192_i64 : i64
2026-02-21T08:56:07.0650507Z         %97 = tt.splat %96 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0650737Z         %98 = arith.addi %97, %71 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0651046Z         %99 = tt.addptr %10, %98 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0651363Z         %100 = tt.load %99, %74, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0651612Z         %101 = arith.shli %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0651849Z         %102 = arith.shrsi %101, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0652087Z         %103 = arith.shrsi %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0652379Z         %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0652746Z         %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0653036Z         %106 = tt.broadcast %104 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0653273Z         %107 = arith.select %16, %106, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0653509Z         %108 = tt.broadcast %105 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0653743Z         %109 = arith.select %18, %108, %107 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0653973Z         %110 = tt.reshape %109 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0654200Z         %111 = arith.sitofp %110 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0654497Z         %112 = ttg.convert_layout %111 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0654961Z         %113 = tt.dot %94, %112, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0655313Z         scf.yield %113 : tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0655461Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:56:07.0655649Z       %77 = arith.truncf %76 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T08:56:07.0655911Z       %78 = tt.expand_dims %61 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma>
2026-02-21T08:56:07.0656145Z       %79 = arith.muli %78, %cst_8 : tensor<32x1xi32, #mma>
2026-02-21T08:56:07.0656373Z       %80 = tt.expand_dims %64 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma>
2026-02-21T08:56:07.0656625Z       %81 = tt.broadcast %79 : tensor<32x1xi32, #mma> -> tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0656824Z       %82 = tt.broadcast %80 : tensor<1x32xi32, #mma> -> tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0657036Z       %83 = arith.addi %81, %82 : tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0657219Z       %84 = tt.addptr %19, %83 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0657412Z       tt.store %84, %77 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:07.0657548Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:56:07.0657672Z     scf.for %arg3 = %23 to %3 step %c1_i32  : i32 {
2026-02-21T08:56:07.0657806Z       %24 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T08:56:07.0657933Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T08:56:07.0658054Z       %26 = arith.muli %24, %c32_i32 : i32
2026-02-21T08:56:07.0658221Z       %27 = tt.splat %26 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.0658435Z       %28 = tt.splat %26 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:07.0658644Z       %29 = arith.addi %27, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.0658859Z       %30 = arith.addi %28, %5 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:07.0659021Z       %31 = arith.muli %25, %c32_i32 : i32
2026-02-21T08:56:07.0659182Z       %32 = tt.splat %31 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:07.0659385Z       %33 = arith.addi %32, %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:07.0659649Z       %34 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T08:56:07.0659902Z       %35 = arith.muli %34, %cst_4 : tensor<32x1xi32, #blocked1>
2026-02-21T08:56:07.0660090Z       %36 = tt.broadcast %35 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0660323Z       %37 = arith.extsi %31 : i32 to i64
2026-02-21T08:56:07.0660526Z       %38 = tt.splat %37 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:56:07.0660823Z       %39 = arith.addi %38, %11 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:56:07.0661208Z       %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0661565Z       %41 = arith.cmpi sge, %40, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0661808Z       %42 = arith.cmpi slt, %40, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0662041Z       %43 = arith.andi %41, %42 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0662302Z       %44 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T08:56:07.0662516Z         %54 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:56:07.0662686Z         %55 = tt.splat %54 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0662910Z         %56 = arith.addi %55, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0663183Z         %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0663453Z         %58 = tt.broadcast %57 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0663644Z         %59 = arith.addi %36, %58 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0663838Z         %60 = tt.addptr %9, %59 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0664040Z         %61 = tt.load %60 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0664305Z         %62 = ttg.convert_layout %61 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0664732Z         %63 = arith.extf %62 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0665012Z         %64 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:56:07.0665135Z         %65 = arith.muli %64, %c8192_i64 : i64
2026-02-21T08:56:07.0665309Z         %66 = tt.splat %65 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0665537Z         %67 = arith.addi %66, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0665843Z         %68 = tt.addptr %10, %67 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0666166Z         %69 = tt.load %68, %43, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0666407Z         %70 = arith.shli %69, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0666648Z         %71 = arith.shrsi %70, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0666887Z         %72 = arith.shrsi %69, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0667171Z         %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0667508Z         %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0667784Z         %75 = tt.broadcast %73 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0668021Z         %76 = arith.select %16, %75, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0668311Z         %77 = tt.broadcast %74 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0668537Z         %78 = arith.select %18, %77, %76 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0668764Z         %79 = tt.reshape %78 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0668981Z         %80 = arith.sitofp %79 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0669277Z         %81 = ttg.convert_layout %80 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0669743Z         %82 = tt.dot %63, %81, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0670085Z         %83 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T08:56:07.0670219Z         %84 = arith.muli %83, %c2_i32 : i32
2026-02-21T08:56:07.0670385Z         %85 = tt.splat %84 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0670607Z         %86 = arith.addi %85, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0670887Z         %87 = tt.expand_dims %86 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0671158Z         %88 = tt.broadcast %87 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0671353Z         %89 = arith.addi %36, %88 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0671548Z         %90 = tt.addptr %9, %89 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0671755Z         %91 = tt.load %90 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0672029Z         %92 = ttg.convert_layout %91 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0672423Z         %93 = arith.extf %92 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0672739Z         %94 = arith.extsi %83 : i32 to i64
2026-02-21T08:56:07.0672860Z         %95 = arith.muli %94, %c8192_i64 : i64
2026-02-21T08:56:07.0673036Z         %96 = tt.splat %95 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0673263Z         %97 = arith.addi %96, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0673569Z         %98 = tt.addptr %10, %97 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0673893Z         %99 = tt.load %98, %43, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0674140Z         %100 = arith.shli %99, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0674383Z         %101 = arith.shrsi %100, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0674632Z         %102 = arith.shrsi %99, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0674925Z         %103 = tt.expand_dims %101 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0675264Z         %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0675547Z         %105 = tt.broadcast %103 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0675792Z         %106 = arith.select %16, %105, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0676072Z         %107 = tt.broadcast %104 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0676306Z         %108 = arith.select %18, %107, %106 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0676544Z         %109 = tt.reshape %108 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0676768Z         %110 = arith.sitofp %109 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0677070Z         %111 = ttg.convert_layout %110 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0677529Z         %112 = tt.dot %93, %111, %82, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0677871Z         %113 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T08:56:07.0678001Z         %114 = arith.muli %113, %c2_i32 : i32
2026-02-21T08:56:07.0678179Z         %115 = tt.splat %114 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0678404Z         %116 = arith.addi %115, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0678686Z         %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0678964Z         %118 = tt.broadcast %117 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0679164Z         %119 = arith.addi %36, %118 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0679364Z         %120 = tt.addptr %9, %119 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0679576Z         %121 = tt.load %120 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0679852Z         %122 = ttg.convert_layout %121 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0680264Z         %123 = arith.extf %122 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0680589Z         %124 = arith.extsi %113 : i32 to i64
2026-02-21T08:56:07.0680718Z         %125 = arith.muli %124, %c8192_i64 : i64
2026-02-21T08:56:07.0680897Z         %126 = tt.splat %125 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0681133Z         %127 = arith.addi %126, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0681445Z         %128 = tt.addptr %10, %127 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0681775Z         %129 = tt.load %128, %43, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0682029Z         %130 = arith.shli %129, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0682267Z         %131 = arith.shrsi %130, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0682514Z         %132 = arith.shrsi %129, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0682846Z         %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0683185Z         %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0683475Z         %135 = tt.broadcast %133 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0683714Z         %136 = arith.select %16, %135, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0683955Z         %137 = tt.broadcast %134 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0684237Z         %138 = arith.select %18, %137, %136 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0684475Z         %139 = tt.reshape %138 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0684703Z         %140 = arith.sitofp %139 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0684999Z         %141 = ttg.convert_layout %140 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0685462Z         %142 = tt.dot %123, %141, %112, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0685808Z         scf.yield %142 : tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0685941Z       } {tt.disallow_acc_multi_buffer}
2026-02-21T08:56:07.0686157Z       %45 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %44) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T08:56:07.0686372Z         %54 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:56:07.0686546Z         %55 = tt.splat %54 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0686764Z         %56 = arith.addi %55, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.0687041Z         %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:07.0687319Z         %58 = tt.broadcast %57 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0687509Z         %59 = arith.addi %36, %58 : tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0687709Z         %60 = tt.addptr %9, %59 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T08:56:07.0687909Z         %61 = tt.load %60 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:07.0688180Z         %62 = ttg.convert_layout %61 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0688620Z         %63 = arith.extf %62 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0688896Z         %64 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:56:07.0689023Z         %65 = arith.muli %64, %c8192_i64 : i64
2026-02-21T08:56:07.0689195Z         %66 = tt.splat %65 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0689421Z         %67 = arith.addi %66, %40 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0689727Z         %68 = tt.addptr %10, %67 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0690048Z         %69 = tt.load %68, %43, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0690295Z         %70 = arith.shli %69, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0690530Z         %71 = arith.shrsi %70, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0690766Z         %72 = arith.shrsi %69, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.0691052Z         %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0691379Z         %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T08:56:07.0691662Z         %75 = tt.broadcast %73 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0691894Z         %76 = arith.select %16, %75, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0692169Z         %77 = tt.broadcast %74 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0692402Z         %78 = arith.select %18, %77, %76 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T08:56:07.0692625Z         %79 = tt.reshape %78 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T08:56:07.0692846Z         %80 = arith.sitofp %79 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T08:56:07.0693135Z         %81 = ttg.convert_layout %80 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.0693594Z         %82 = tt.dot %63, %81, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0693945Z         scf.yield %82 : tensor<32x32xf32, #mma>
2026-02-21T08:56:07.0694094Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:56:07.0694282Z       %46 = arith.truncf %45 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T08:56:07.0694542Z       %47 = tt.expand_dims %30 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma>
2026-02-21T08:56:07.0694779Z       %48 = arith.muli %47, %cst_8 : tensor<32x1xi32, #mma>
2026-02-21T08:56:07.0695010Z       %49 = tt.expand_dims %33 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma>
2026-02-21T08:56:07.0695260Z       %50 = tt.broadcast %48 : tensor<32x1xi32, #mma> -> tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0695459Z       %51 = tt.broadcast %49 : tensor<1x32xi32, #mma> -> tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0695631Z       %52 = arith.addi %50, %51 : tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0695815Z       %53 = tt.addptr %19, %52 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi32, #mma>
2026-02-21T08:56:07.0696010Z       tt.store %53, %46 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:07.0696145Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:56:07.0696284Z     tt.return
2026-02-21T08:56:07.0696367Z   }
2026-02-21T08:56:07.0696453Z }
2026-02-21T08:56:07.0696497Z 
2026-02-21T08:56:07.0696530Z {-#
2026-02-21T08:56:07.0696619Z   external_resources: {
2026-02-21T08:56:07.0696720Z     mlir_reproducer: {
2026-02-21T08:56:07.0697723Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:56:07.0698726Z       disable_threading: false,
2026-02-21T08:56:07.0698838Z       verify_each: true
2026-02-21T08:56:07.0698933Z     }
2026-02-21T08:56:07.0699013Z   }
2026-02-21T08:56:07.0699089Z #-}
2026-02-21T08:56:07.0699375Z /tmp/torchinductor_root/ab/cab4wikgpuaqbr5wve3okzh76jhlc7ysmubfwuocu4fyavuwbuse.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:56:07.0700098Z /tmp/torchinductor_root/ab/cab4wikgpuaqbr5wve3okzh76jhlc7ysmubfwuocu4fyavuwbuse.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:56:07.0700648Z [46s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:56:07.0701466Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 32], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, False], range_num_stages=[3, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T08:56:07.0702181Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:56:07.0702350Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:56:07.1074981Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:56:07.1079093Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:56:07.1079419Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:56:07.1079730Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:56:07.1080037Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:56:07.1080323Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:56:07.1080586Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:56:07.1080770Z #smem = #ttg.shared_memory
2026-02-21T08:56:07.1081010Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:56:07.1081490Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:56:07.1081864Z     %cst = arith.constant dense<8192> : tensor<16x1xi32, #mma>
2026-02-21T08:56:07.1082155Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:07.1082330Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:07.1082519Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T08:56:07.1082790Z     %cst_3 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1>
2026-02-21T08:56:07.1082972Z     %cst_4 = arith.constant dense<1024> : tensor<16x1xi32, #blocked2>
2026-02-21T08:56:07.1083129Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:56:07.1083249Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:56:07.1083373Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:56:07.1083490Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:56:07.1083613Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:56:07.1083763Z     %cst_5 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked>
2026-02-21T08:56:07.1083912Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:56:07.1084031Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:56:07.1084218Z     %cst_6 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.1084415Z     %0 = tt.get_program_id x : i32
2026-02-21T08:56:07.1084529Z     %1 = arith.remsi %0, %c32_i32 : i32
2026-02-21T08:56:07.1084650Z     %2 = arith.divsi %0, %c32_i32 : i32
2026-02-21T08:56:07.1084763Z     %3 = arith.muli %1, %c256_i32 : i32
2026-02-21T08:56:07.1084960Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:07.1085244Z     %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.1085493Z     %6 = tt.splat %3 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:07.1085761Z     %7 = tt.splat %3 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.1085972Z     %8 = arith.addi %6, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:07.1086184Z     %9 = arith.addi %7, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:07.1086356Z     %10 = arith.muli %2, %c16_i32 : i32
2026-02-21T08:56:07.1086577Z     %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:07.1086852Z     %12 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:07.1087100Z     %13 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:07.1087312Z     %14 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:07.1087528Z     %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:07.1087747Z     %16 = arith.addi %14, %12 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:07.1087988Z     %17 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.1088267Z     %18 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:07.1088573Z     %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T08:56:07.1088830Z     %20 = arith.muli %19, %cst_4 : tensor<16x1xi32, #blocked2>
2026-02-21T08:56:07.1089021Z     %21 = tt.broadcast %20 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:56:07.1089239Z     %22 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:07.1089512Z     %23 = tt.expand_dims %9 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T08:56:07.1089798Z     %24 = tt.broadcast %23 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T08:56:07.1090052Z     %25 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T08:56:07.1090326Z     %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:56:07.1090740Z     %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:56:07.1091141Z     %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:07.1091393Z     %29 = arith.cmpi eq, %28, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:07.1091594Z     %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T08:56:07.1091794Z     %31 = arith.cmpi eq, %28, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:07.1091982Z     %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T08:56:07.1092252Z     %33 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst_2) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T08:56:07.1092529Z       %43 = tt.splat %arg3 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.1092756Z       %44 = arith.addi %43, %17 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.1092935Z       %45 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T08:56:07.1093104Z       %46 = tt.splat %45 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:07.1093324Z       %47 = arith.addi %46, %18 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:07.1093642Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:56:07.1093918Z       %49 = tt.broadcast %48 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:56:07.1094114Z       %50 = arith.addi %21, %49 : tensor<16x4xi32, #blocked2>
2026-02-21T08:56:07.1094315Z       %51 = tt.addptr %22, %50 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T08:56:07.1094527Z       %52 = tt.load %51 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:07.1094794Z       %53 = ttg.convert_layout %52 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.1095202Z       %54 = arith.extf %53 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.1095590Z       %55 = tt.expand_dims %44 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T08:56:07.1095840Z       %56 = arith.muli %55, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T08:56:07.1096034Z       %57 = tt.broadcast %56 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T08:56:07.1096225Z       %58 = arith.addi %57, %24 : tensor<2x256xi32, #blocked1>
2026-02-21T08:56:07.1096423Z       %59 = tt.addptr %25, %58 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T08:56:07.1096621Z       %60 = tt.load %59 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T08:56:07.1096864Z       %61 = ttg.convert_layout %60 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.1097145Z       %62 = arith.shli %61, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.1097376Z       %63 = arith.shrsi %62, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.1097616Z       %64 = arith.shrsi %61, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.1097904Z       %65 = tt.expand_dims %63 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T08:56:07.1098469Z       %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T08:56:07.1098755Z       %67 = tt.broadcast %65 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T08:56:07.1098992Z       %68 = arith.select %30, %67, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T08:56:07.1099232Z       %69 = tt.broadcast %66 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T08:56:07.1099464Z       %70 = arith.select %32, %69, %68 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T08:56:07.1099692Z       %71 = tt.reshape %70 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T08:56:07.1099918Z       %72 = arith.sitofp %71 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T08:56:07.1100165Z       %73 = ttg.local_alloc %72 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem>
2026-02-21T08:56:07.1100491Z       %74 = ttg.local_load %73 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.1100968Z       %75 = tt.dot %54, %74, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T08:56:07.1101312Z       %76 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T08:56:07.1101487Z       %77 = tt.splat %76 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.1101705Z       %78 = arith.addi %77, %17 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:07.1101918Z       %79 = arith.muli %76, %c2_i32 : i32
2026-02-21T08:56:07.1102087Z       %80 = tt.splat %79 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:07.1102300Z       %81 = arith.addi %80, %18 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:07.1102574Z       %82 = tt.expand_dims %81 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:56:07.1102845Z       %83 = tt.broadcast %82 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:56:07.1103039Z       %84 = arith.addi %21, %83 : tensor<16x4xi32, #blocked2>
2026-02-21T08:56:07.1103238Z       %85 = tt.addptr %22, %84 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T08:56:07.1103438Z       %86 = tt.load %85 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:07.1103702Z       %87 = ttg.convert_layout %86 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.1104097Z       %88 = arith.extf %87 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.1104477Z       %89 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T08:56:07.1104724Z       %90 = arith.muli %89, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T08:56:07.1104914Z       %91 = tt.broadcast %90 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T08:56:07.1105113Z       %92 = arith.addi %91, %24 : tensor<2x256xi32, #blocked1>
2026-02-21T08:56:07.1105306Z       %93 = tt.addptr %25, %92 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T08:56:07.1105509Z       %94 = tt.load %93 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T08:56:07.1105755Z       %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.1106036Z       %96 = arith.shli %95, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.1106306Z       %97 = arith.shrsi %96, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.1106541Z       %98 = arith.shrsi %95, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:07.1106829Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T08:56:07.1107168Z       %100 = tt.expand_dims %98 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T08:56:07.1107453Z       %101 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T08:56:07.1107702Z       %102 = arith.select %30, %101, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T08:56:07.1107949Z       %103 = tt.broadcast %100 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T08:56:07.1108189Z       %104 = arith.select %32, %103, %102 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T08:56:07.1108431Z       %105 = tt.reshape %104 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T08:56:07.1108658Z       %106 = arith.sitofp %105 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T08:56:07.1108916Z       %107 = ttg.local_alloc %106 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem>
2026-02-21T08:56:07.1109239Z       %108 = ttg.local_load %107 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:07.1109748Z       %109 = tt.dot %88, %108, %75, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T08:56:07.1110098Z       scf.yield %109 : tensor<16x256xf32, #mma>
2026-02-21T08:56:07.1110228Z     } {tt.num_stages = 1 : i32}
2026-02-21T08:56:07.1110388Z     %34 = arith.truncf %33 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T08:56:07.1110647Z     %35 = tt.expand_dims %16 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T08:56:07.1110883Z     %36 = arith.muli %35, %cst : tensor<16x1xi32, #mma>
2026-02-21T08:56:07.1111116Z     %37 = tt.expand_dims %8 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T08:56:07.1111370Z     %38 = tt.broadcast %36 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T08:56:07.1111574Z     %39 = tt.broadcast %37 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T08:56:07.1111750Z     %40 = arith.addi %38, %39 : tensor<16x256xi32, #mma>
2026-02-21T08:56:07.1111931Z     %41 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:07.1112142Z     %42 = tt.addptr %41, %40 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi32, #mma>
2026-02-21T08:56:07.1112340Z     tt.store %42, %34 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:07.1112475Z     tt.return
2026-02-21T08:56:07.1112560Z   }
2026-02-21T08:56:07.1112640Z }
2026-02-21T08:56:07.1112684Z 
2026-02-21T08:56:07.1112717Z {-#
2026-02-21T08:56:07.1112805Z   external_resources: {
2026-02-21T08:56:07.1112906Z     mlir_reproducer: {
2026-02-21T08:56:07.1113922Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:56:07.1114951Z       disable_threading: false,
2026-02-21T08:56:07.1115060Z       verify_each: true
2026-02-21T08:56:07.1115158Z     }
2026-02-21T08:56:07.1115237Z   }
2026-02-21T08:56:07.1115310Z #-}
2026-02-21T08:56:07.1115596Z /tmp/torchinductor_root/ws/cwsiwzhboqoddose42mt7vdu26zo3fsbcmzz47doftpr32jj3n5r.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:56:07.1116283Z /tmp/torchinductor_root/ws/cwsiwzhboqoddose42mt7vdu26zo3fsbcmzz47doftpr32jj3n5r.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:56:07.1116842Z [46s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:56:07.1117569Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 256], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T08:56:07.1118222Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:56:07.1118398Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:56:39.2865476Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:56:39.2870515Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 16], order = [2, 1, 0]}>
2026-02-21T08:56:39.2874649Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [1, 64], warpsPerCTA = [2, 8], order = [1, 0]}>
2026-02-21T08:56:39.2875583Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T08:56:39.2876374Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 16], order = [1, 0]}>
2026-02-21T08:56:39.2877097Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 16], instrShape = [32, 32], isTransposed = true}>
2026-02-21T08:56:39.2877749Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:56:39.2878220Z #smem = #ttg.shared_memory
2026-02-21T08:56:39.2878805Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:56:39.2880004Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:56:39.2880992Z     %cst = arith.constant dense<8192> : tensor<1x4096xi64, #mma>
2026-02-21T08:56:39.2881434Z     %cst_0 = arith.constant dense<0> : tensor<1x4096xi64, #mma>
2026-02-21T08:56:39.2881872Z     %cst_1 = arith.constant dense<16384> : tensor<16x1xi64, #mma>
2026-02-21T08:56:39.2882287Z     %cst_2 = arith.constant dense<0> : tensor<16x1xi64, #mma>
2026-02-21T08:56:39.2882837Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi64, #mma>
2026-02-21T08:56:39.2883150Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:39.2883457Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:39.2883782Z     %cst_6 = arith.constant dense<0.000000e+00> : tensor<16x4096xf32, #mma>
2026-02-21T08:56:39.2884119Z     %cst_7 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1>
2026-02-21T08:56:39.2884439Z     %cst_8 = arith.constant dense<1024> : tensor<16x1xi32, #blocked2>
2026-02-21T08:56:39.2884823Z     %cst_9 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:39.2885477Z     %cst_10 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:39.2885818Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:56:39.2886023Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T08:56:39.2886234Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T08:56:39.2886445Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:56:39.2886659Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:56:39.2886865Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:56:39.2887072Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T08:56:39.2887270Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:56:39.2887525Z     %cst_11 = arith.constant dense<0> : tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2887791Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:56:39.2887994Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:56:39.2888199Z     %c38912_i32 = arith.constant 38912 : i32
2026-02-21T08:56:39.2888533Z     %cst_12 = arith.constant dense<4> : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2888876Z     %0 = tt.get_program_id x : i32
2026-02-21T08:56:39.2889216Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:39.2889700Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:39.2890175Z     %3 = tt.make_range {end = 4096 : i32, start = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:39.2890671Z     %4 = tt.make_range {end = 4096 : i32, start = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:39.2891243Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:39.2891705Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:39.2892128Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:39.2892484Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x4096x!tt.ptr<i8>, #blocked1>
2026-02-21T08:56:39.2892959Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:56:39.2893505Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:56:39.2894022Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:39.2894367Z     %12 = arith.cmpi eq, %11, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:39.2894630Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x4096xi1, #blocked>
2026-02-21T08:56:39.2894889Z     %14 = arith.cmpi eq, %11, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:39.2895147Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x4096xi1, #blocked>
2026-02-21T08:56:39.2895422Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x4096x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:39.2895771Z     %17 = arith.extsi %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:39.2896215Z     %18 = arith.extsi %3 : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<4096xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:39.2896578Z     %19 = arith.addi %5, %cst_9 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:39.2896880Z     %20 = arith.addi %6, %cst_10 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:39.2897242Z     %21 = tt.expand_dims %20 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:56:39.2897653Z     %22 = tt.broadcast %21 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2898010Z     %23 = tt.expand_dims %19 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T08:56:39.2898326Z     %24 = arith.muli %23, %cst_7 : tensor<2x1xi32, #blocked1>
2026-02-21T08:56:39.2898579Z     %25 = tt.broadcast %24 : tensor<2x1xi32, #blocked1> -> tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2898834Z     scf.for %arg3 = %0 to %c2048_i32 step %c38912_i32  : i32 {
2026-02-21T08:56:39.2899031Z       %26 = arith.divsi %arg3, %c32_i32 : i32
2026-02-21T08:56:39.2899193Z       %27 = arith.muli %26, %c16_i32 : i32
2026-02-21T08:56:39.2899350Z       %28 = arith.subi %c1024_i32, %27 : i32
2026-02-21T08:56:39.2899509Z       %29 = arith.minsi %28, %c16_i32 : i32
2026-02-21T08:56:39.2899667Z       %30 = arith.remsi %arg3, %c32_i32 : i32
2026-02-21T08:56:39.2899831Z       %31 = arith.remsi %30, %29 : i32
2026-02-21T08:56:39.2899978Z       %32 = arith.addi %27, %31 : i32
2026-02-21T08:56:39.2900130Z       %33 = arith.divsi %30, %29 : i32
2026-02-21T08:56:39.2900277Z       %34 = arith.muli %32, %c16_i32 : i32
2026-02-21T08:56:39.2900505Z       %35 = tt.splat %34 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:39.2900788Z       %36 = arith.addi %35, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:39.2901019Z       %37 = arith.muli %33, %c4096_i32 : i32
2026-02-21T08:56:39.2901241Z       %38 = tt.splat %37 : i32 -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:39.2901538Z       %39 = arith.addi %38, %4 : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:39.2901953Z       %40 = tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T08:56:39.2902283Z       %41 = arith.muli %40, %cst_8 : tensor<16x1xi32, #blocked2>
2026-02-21T08:56:39.2902531Z       %42 = tt.broadcast %41 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2902934Z       %43 = tt.expand_dims %39 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4096xi32, #blocked1>
2026-02-21T08:56:39.2903279Z       %44 = tt.broadcast %43 : tensor<1x4096xi32, #blocked1> -> tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2903568Z       %45 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_6) -> (tensor<16x4096xf32, #mma>)  : i32 {
2026-02-21T08:56:39.2903850Z         %92 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:39.2904091Z         %93 = arith.addi %92, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:39.2904274Z         %94 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:56:39.2904456Z         %95 = tt.splat %94 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:39.2904681Z         %96 = arith.addi %95, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:39.2904966Z         %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:56:39.2905253Z         %98 = tt.broadcast %97 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2905451Z         %99 = arith.addi %42, %98 : tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2905665Z         %100 = tt.addptr %7, %99 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2905885Z         %101 = tt.load %100 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:39.2906174Z         %102 = ttg.convert_layout %101 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2906602Z         %103 = arith.extf %102 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2907048Z         %104 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T08:56:39.2907310Z         %105 = arith.muli %104, %cst_7 : tensor<2x1xi32, #blocked1>
2026-02-21T08:56:39.2907522Z         %106 = tt.broadcast %105 : tensor<2x1xi32, #blocked1> -> tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2907730Z         %107 = arith.addi %106, %44 : tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2907947Z         %108 = tt.addptr %8, %107 : tensor<2x4096x!tt.ptr<i8>, #blocked1>, tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2908161Z         %109 = tt.load %108 : tensor<2x4096x!tt.ptr<i8>, #blocked1>
2026-02-21T08:56:39.2908428Z         %110 = ttg.convert_layout %109 : tensor<2x4096xi8, #blocked1> -> tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2908739Z         %111 = arith.shli %110, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2908994Z         %112 = arith.shrsi %111, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2909255Z         %113 = arith.shrsi %110, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2909568Z         %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked>
2026-02-21T08:56:39.2909934Z         %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked>
2026-02-21T08:56:39.2910246Z         %116 = tt.broadcast %114 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2910547Z         %117 = arith.select %13, %116, %cst_11 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2910816Z         %118 = tt.broadcast %115 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2911072Z         %119 = arith.select %15, %118, %117 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2911330Z         %120 = tt.reshape %119 : tensor<2x2x4096xi8, #blocked> -> tensor<4x4096xi8, #blocked3>
2026-02-21T08:56:39.2911581Z         %121 = arith.sitofp %120 : tensor<4x4096xi8, #blocked3> to tensor<4x4096xf32, #blocked3>
2026-02-21T08:56:39.2911854Z         %122 = ttg.local_alloc %121 : (tensor<4x4096xf32, #blocked3>) -> !ttg.memdesc<4x4096xf32, #shared, #smem>
2026-02-21T08:56:39.2912211Z         %123 = ttg.local_load %122 : !ttg.memdesc<4x4096xf32, #shared, #smem> -> tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2912724Z         %124 = tt.dot %103, %123, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x4096xf32, #mma>
2026-02-21T08:56:39.2913106Z         %125 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T08:56:39.2913291Z         %126 = tt.splat %125 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:39.2913516Z         %127 = arith.addi %126, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:39.2913697Z         %128 = arith.muli %125, %c2_i32 : i32
2026-02-21T08:56:39.2913866Z         %129 = tt.splat %128 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:39.2914091Z         %130 = arith.addi %129, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:39.2914374Z         %131 = tt.expand_dims %130 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:56:39.2914835Z         %132 = tt.broadcast %131 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2915291Z         %133 = arith.addi %42, %132 : tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2915527Z         %134 = tt.addptr %7, %133 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2915769Z         %135 = tt.load %134 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:39.2916093Z         %136 = ttg.convert_layout %135 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2916524Z         %137 = arith.extf %136 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2916956Z         %138 = tt.expand_dims %127 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T08:56:39.2917467Z         %139 = arith.muli %138, %cst_7 : tensor<2x1xi32, #blocked1>
2026-02-21T08:56:39.2917693Z         %140 = tt.broadcast %139 : tensor<2x1xi32, #blocked1> -> tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2917922Z         %141 = arith.addi %140, %44 : tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2918154Z         %142 = tt.addptr %8, %141 : tensor<2x4096x!tt.ptr<i8>, #blocked1>, tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2918402Z         %143 = tt.load %142 : tensor<2x4096x!tt.ptr<i8>, #blocked1>
2026-02-21T08:56:39.2918767Z         %144 = ttg.convert_layout %143 : tensor<2x4096xi8, #blocked1> -> tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2919105Z         %145 = arith.shli %144, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2919392Z         %146 = arith.shrsi %145, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2919690Z         %147 = arith.shrsi %144, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2920036Z         %148 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked>
2026-02-21T08:56:39.2920421Z         %149 = tt.expand_dims %147 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked>
2026-02-21T08:56:39.2920736Z         %150 = tt.broadcast %148 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2921221Z         %151 = arith.select %13, %150, %cst_11 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2921491Z         %152 = tt.broadcast %149 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2921772Z         %153 = arith.select %15, %152, %151 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2922055Z         %154 = tt.reshape %153 : tensor<2x2x4096xi8, #blocked> -> tensor<4x4096xi8, #blocked3>
2026-02-21T08:56:39.2922308Z         %155 = arith.sitofp %154 : tensor<4x4096xi8, #blocked3> to tensor<4x4096xf32, #blocked3>
2026-02-21T08:56:39.2922645Z         %156 = ttg.local_alloc %155 : (tensor<4x4096xf32, #blocked3>) -> !ttg.memdesc<4x4096xf32, #shared, #smem>
2026-02-21T08:56:39.2923037Z         %157 = ttg.local_load %156 : !ttg.memdesc<4x4096xf32, #shared, #smem> -> tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2923543Z         %158 = tt.dot %137, %157, %124, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x4096xf32, #mma>
2026-02-21T08:56:39.2923932Z         %159 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T08:56:39.2924124Z         %160 = tt.splat %159 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:39.2924395Z         %161 = arith.addi %160, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:39.2924605Z         %162 = arith.muli %159, %c2_i32 : i32
2026-02-21T08:56:39.2925019Z         %163 = tt.splat %162 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:39.2925279Z         %164 = arith.addi %163, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:39.2925581Z         %165 = tt.expand_dims %164 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:56:39.2925894Z         %166 = tt.broadcast %165 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2926124Z         %167 = arith.addi %42, %166 : tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2926422Z         %168 = tt.addptr %7, %167 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2926658Z         %169 = tt.load %168 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:39.2926952Z         %170 = ttg.convert_layout %169 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2927400Z         %171 = arith.extf %170 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2927830Z         %172 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T08:56:39.2928125Z         %173 = arith.muli %172, %cst_7 : tensor<2x1xi32, #blocked1>
2026-02-21T08:56:39.2928337Z         %174 = tt.broadcast %173 : tensor<2x1xi32, #blocked1> -> tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2928766Z         %175 = arith.addi %174, %44 : tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2928998Z         %176 = tt.addptr %8, %175 : tensor<2x4096x!tt.ptr<i8>, #blocked1>, tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2929276Z         %177 = tt.load %176 : tensor<2x4096x!tt.ptr<i8>, #blocked1>
2026-02-21T08:56:39.2929556Z         %178 = ttg.convert_layout %177 : tensor<2x4096xi8, #blocked1> -> tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2929875Z         %179 = arith.shli %178, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2930150Z         %180 = arith.shrsi %179, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2930434Z         %181 = arith.shrsi %178, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2930761Z         %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked>
2026-02-21T08:56:39.2931140Z         %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked>
2026-02-21T08:56:39.2931446Z         %184 = tt.broadcast %182 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2931763Z         %185 = arith.select %13, %184, %cst_11 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2932232Z         %186 = tt.broadcast %183 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2932508Z         %187 = arith.select %15, %186, %185 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2932803Z         %188 = tt.reshape %187 : tensor<2x2x4096xi8, #blocked> -> tensor<4x4096xi8, #blocked3>
2026-02-21T08:56:39.2933050Z         %189 = arith.sitofp %188 : tensor<4x4096xi8, #blocked3> to tensor<4x4096xf32, #blocked3>
2026-02-21T08:56:39.2933345Z         %190 = ttg.local_alloc %189 : (tensor<4x4096xf32, #blocked3>) -> !ttg.memdesc<4x4096xf32, #shared, #smem>
2026-02-21T08:56:39.2933726Z         %191 = ttg.local_load %190 : !ttg.memdesc<4x4096xf32, #shared, #smem> -> tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2934222Z         %192 = tt.dot %171, %191, %158, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x4096xf32, #mma>
2026-02-21T08:56:39.2934643Z         scf.yield %192 : tensor<16x4096xf32, #mma>
2026-02-21T08:56:39.2934837Z       } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T08:56:39.2935018Z       %46 = arith.addi %42, %22 : tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2935245Z       %47 = tt.addptr %7, %46 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T08:56:39.2935477Z       %48 = tt.load %47 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:39.2935969Z       %49 = ttg.convert_layout %48 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2936393Z       %50 = arith.extf %49 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2936715Z       %51 = arith.addi %25, %44 : tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2936951Z       %52 = tt.addptr %8, %51 : tensor<2x4096x!tt.ptr<i8>, #blocked1>, tensor<2x4096xi32, #blocked1>
2026-02-21T08:56:39.2937173Z       %53 = tt.load %52 : tensor<2x4096x!tt.ptr<i8>, #blocked1>
2026-02-21T08:56:39.2937449Z       %54 = ttg.convert_layout %53 : tensor<2x4096xi8, #blocked1> -> tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2937770Z       %55 = arith.shli %54, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2938032Z       %56 = arith.shrsi %55, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2938306Z       %57 = arith.shrsi %54, %cst_12 : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:39.2938668Z       %58 = tt.expand_dims %56 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked>
2026-02-21T08:56:39.2939027Z       %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<2x4096xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x4096xi8, #blocked>
2026-02-21T08:56:39.2939374Z       %60 = tt.broadcast %58 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2939820Z       %61 = arith.select %13, %60, %cst_11 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2940096Z       %62 = tt.broadcast %59 : tensor<2x1x4096xi8, #blocked> -> tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2940371Z       %63 = arith.select %15, %62, %61 : tensor<2x2x4096xi1, #blocked>, tensor<2x2x4096xi8, #blocked>
2026-02-21T08:56:39.2940619Z       %64 = tt.reshape %63 : tensor<2x2x4096xi8, #blocked> -> tensor<4x4096xi8, #blocked3>
2026-02-21T08:56:39.2940879Z       %65 = arith.sitofp %64 : tensor<4x4096xi8, #blocked3> to tensor<4x4096xf32, #blocked3>
2026-02-21T08:56:39.2941146Z       %66 = ttg.local_alloc %65 : (tensor<4x4096xf32, #blocked3>) -> !ttg.memdesc<4x4096xf32, #shared, #smem>
2026-02-21T08:56:39.2941511Z       %67 = ttg.local_load %66 : !ttg.memdesc<4x4096xf32, #shared, #smem> -> tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:39.2942008Z       %68 = tt.dot %50, %67, %45, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x4096xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x4096xf32, #mma>
2026-02-21T08:56:39.2942413Z       %69 = arith.truncf %68 : tensor<16x4096xf32, #mma> to tensor<16x4096xbf16, #mma>
2026-02-21T08:56:39.2942625Z       %70 = arith.extsi %34 : i32 to i64
2026-02-21T08:56:39.2942774Z       %71 = arith.extsi %37 : i32 to i64
2026-02-21T08:56:39.2942953Z       %72 = tt.splat %70 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:39.2943383Z       %73 = arith.addi %72, %17 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:39.2943657Z       %74 = tt.expand_dims %73 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma>
2026-02-21T08:56:39.2943969Z       %75 = arith.muli %74, %cst_3 : tensor<16x1xi64, #mma>
2026-02-21T08:56:39.2944184Z       %76 = tt.broadcast %75 : tensor<16x1xi64, #mma> -> tensor<16x4096xi64, #mma>
2026-02-21T08:56:39.2944411Z       %77 = tt.splat %71 : i64 -> tensor<4096xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:39.2944650Z       %78 = arith.addi %77, %18 : tensor<4096xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:39.2944940Z       %79 = tt.expand_dims %78 {axis = 0 : i32} : tensor<4096xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x4096xi64, #mma>
2026-02-21T08:56:39.2945239Z       %80 = tt.broadcast %79 : tensor<1x4096xi64, #mma> -> tensor<16x4096xi64, #mma>
2026-02-21T08:56:39.2945446Z       %81 = arith.addi %76, %80 : tensor<16x4096xi64, #mma>
2026-02-21T08:56:39.2945663Z       %82 = tt.addptr %16, %81 : tensor<16x4096x!tt.ptr<bf16>, #mma>, tensor<16x4096xi64, #mma>
2026-02-21T08:56:39.2945902Z       %83 = arith.cmpi sge, %74, %cst_2 : tensor<16x1xi64, #mma>
2026-02-21T08:56:39.2946076Z       %84 = arith.cmpi slt, %74, %cst_1 : tensor<16x1xi64, #mma>
2026-02-21T08:56:39.2946543Z       %85 = arith.andi %83, %84 : tensor<16x1xi1, #mma>
2026-02-21T08:56:39.2946756Z       %86 = tt.broadcast %85 : tensor<16x1xi1, #mma> -> tensor<16x4096xi1, #mma>
2026-02-21T08:56:39.2946951Z       %87 = arith.cmpi sge, %79, %cst_0 : tensor<1x4096xi64, #mma>
2026-02-21T08:56:39.2947155Z       %88 = arith.cmpi slt, %79, %cst : tensor<1x4096xi64, #mma>
2026-02-21T08:56:39.2947331Z       %89 = arith.andi %87, %88 : tensor<1x4096xi1, #mma>
2026-02-21T08:56:39.2947525Z       %90 = tt.broadcast %89 : tensor<1x4096xi1, #mma> -> tensor<16x4096xi1, #mma>
2026-02-21T08:56:39.2947730Z       %91 = arith.andi %86, %90 : tensor<16x4096xi1, #mma>
2026-02-21T08:56:39.2947961Z       tt.store %82, %69, %91 : tensor<16x4096x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:39.2948165Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32}
2026-02-21T08:56:39.2948348Z     tt.return
2026-02-21T08:56:39.2948462Z   }
2026-02-21T08:56:39.2948558Z }
2026-02-21T08:56:39.2948802Z 
2026-02-21T08:56:39.2948853Z {-#
2026-02-21T08:56:39.2948954Z   external_resources: {
2026-02-21T08:56:39.2949084Z     mlir_reproducer: {
2026-02-21T08:56:39.2950127Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:56:39.2951152Z       disable_threading: false,
2026-02-21T08:56:39.2951278Z       verify_each: true
2026-02-21T08:56:39.2951408Z     }
2026-02-21T08:56:39.2951503Z   }
2026-02-21T08:56:39.2951606Z #-}
2026-02-21T08:56:39.2951914Z /tmp/torchinductor_root/xo/cxobs6s22xqqcebp7prdcno4xjeayhmev525ddwkr7mbhksquilk.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:56:39.2952660Z /tmp/torchinductor_root/xo/cxobs6s22xqqcebp7prdcno4xjeayhmev525ddwkr7mbhksquilk.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:56:39.2953432Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:56:39.2954262Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 4096], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 3], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T08:56:39.2955058Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:56:39.2955263Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:56:51.2251313Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:56:51.2260173Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:56:51.2261737Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [64, 1], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T08:56:51.2263048Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T08:56:51.2264194Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}>
2026-02-21T08:56:51.2265313Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 32, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:56:51.2266175Z #smem = #ttg.shared_memory
2026-02-21T08:56:51.2267216Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:56:51.2278209Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:56:51.2279819Z     %cst = arith.constant dense<0.000000e+00> : tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2280105Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:56:51.2280306Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:56:51.2280504Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:56:51.2280776Z     %cst_0 = arith.constant dense<0> : tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2281181Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T08:56:51.2281489Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:56:51.2281758Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:56:51.2281988Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:56:51.2282216Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:56:51.2282443Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T08:56:51.2282922Z     %cst_1 = arith.constant dense<4161536> : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2283499Z     %cst_2 = arith.constant dense<4128768> : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2283871Z     %cst_3 = arith.constant dense<4096000> : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2284131Z     %c15_i32 = arith.constant 15 : i32
2026-02-21T08:56:51.2284287Z     %c14_i32 = arith.constant 14 : i32
2026-02-21T08:56:51.2284435Z     %c13_i32 = arith.constant 13 : i32
2026-02-21T08:56:51.2284791Z     %cst_4 = arith.constant dense<22> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2285292Z     %cst_5 = arith.constant dense<20> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2285794Z     %cst_6 = arith.constant dense<18> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2286281Z     %cst_7 = arith.constant dense<16> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2286727Z     %cst_8 = arith.constant dense<14> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2287221Z     %cst_9 = arith.constant dense<12> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2287713Z     %cst_10 = arith.constant dense<10> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2288346Z     %cst_11 = arith.constant dense<8> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2288659Z     %cst_12 = arith.constant dense<6> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2288974Z     %cst_13 = arith.constant dense<4> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2289281Z     %cst_14 = arith.constant dense<2> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2289524Z     %c12_i32 = arith.constant 12 : i32
2026-02-21T08:56:51.2289677Z     %c500_i32 = arith.constant 500 : i32
2026-02-21T08:56:51.2289833Z     %c11_i32 = arith.constant 11 : i32
2026-02-21T08:56:51.2289981Z     %c10_i32 = arith.constant 10 : i32
2026-02-21T08:56:51.2290130Z     %c9_i32 = arith.constant 9 : i32
2026-02-21T08:56:51.2290280Z     %c7_i32 = arith.constant 7 : i32
2026-02-21T08:56:51.2290418Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T08:56:51.2290567Z     %c5_i32 = arith.constant 5 : i32
2026-02-21T08:56:51.2290756Z     %cst_15 = arith.constant dense<1024> : tensor<2048x1xi32, #blocked1>
2026-02-21T08:56:51.2291045Z     %cst_16 = arith.constant dense<4> : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2291321Z     %cst_17 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:51.2291557Z     %cst_18 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:51.2291781Z     %cst_19 = arith.constant dense<8192> : tensor<2048x1xi64, #mma>
2026-02-21T08:56:51.2291996Z     %cst_20 = arith.constant dense<0> : tensor<2048x1xi64, #mma>
2026-02-21T08:56:51.2292239Z     %cst_21 = arith.constant dense<16384> : tensor<2048x1xi64, #mma>
2026-02-21T08:56:51.2292414Z     %cst_22 = arith.constant dense<0> : tensor<1x16xi64, #mma>
2026-02-21T08:56:51.2292640Z     %cst_23 = arith.constant dense<8192> : tensor<1x16xi64, #mma>
2026-02-21T08:56:51.2292793Z     %0 = tt.get_program_id x : i32
2026-02-21T08:56:51.2292920Z     %1 = arith.divsi %0, %c2048_i32 : i32
2026-02-21T08:56:51.2293048Z     %2 = arith.muli %1, %c4_i32 : i32
2026-02-21T08:56:51.2293162Z     %3 = arith.subi %c8_i32, %2 : i32
2026-02-21T08:56:51.2293280Z     %4 = arith.minsi %3, %c4_i32 : i32
2026-02-21T08:56:51.2293396Z     %5 = arith.remsi %0, %c2048_i32 : i32
2026-02-21T08:56:51.2293519Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T08:56:51.2293632Z     %7 = arith.addi %2, %6 : i32
2026-02-21T08:56:51.2293745Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T08:56:51.2293856Z     %9 = arith.muli %7, %c2048_i32 : i32
2026-02-21T08:56:51.2294075Z     %10 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:51.2294371Z     %11 = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:51.2294640Z     %12 = tt.splat %9 : i32 -> tensor<2048xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:51.2294881Z     %13 = arith.addi %12, %10 : tensor<2048xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:51.2295064Z     %14 = arith.muli %8, %c16_i32 : i32
2026-02-21T08:56:51.2295310Z     %15 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:56:51.2295632Z     %16 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:51.2295920Z     %17 = tt.splat %14 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:56:51.2296223Z     %18 = arith.addi %17, %15 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T08:56:51.2296514Z     %19 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2296837Z     %20 = tt.expand_dims %13 {axis = 1 : i32} : tensor<2048xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2048x1xi32, #blocked1>
2026-02-21T08:56:51.2297138Z     %21 = arith.muli %20, %cst_15 : tensor<2048x1xi32, #blocked1>
2026-02-21T08:56:51.2297345Z     %22 = tt.broadcast %21 : tensor<2048x1xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2297578Z     %23 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2297936Z     %24 = tt.expand_dims %18 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2298324Z     %25 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2298642Z     %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:56:51.2299063Z     %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:56:51.2299478Z     %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:51.2299740Z     %29 = arith.cmpi eq, %28, %cst_17 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:51.2299944Z     %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x16xi1, #blocked>
2026-02-21T08:56:51.2300152Z     %31 = arith.cmpi eq, %28, %cst_18 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:56:51.2300348Z     %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x16xi1, #blocked>
2026-02-21T08:56:51.2300576Z     %33 = ttg.local_alloc : () -> !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable>
2026-02-21T08:56:51.2300833Z     %34 = ttg.local_alloc : () -> !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable>
2026-02-21T08:56:51.2301096Z     %35 = ttg.local_alloc : () -> !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable>
2026-02-21T08:56:51.2301310Z     %36 = ttg.local_alloc : () -> !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable>
2026-02-21T08:56:51.2301580Z     %37 = tt.expand_dims %19 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2301859Z     %38 = tt.broadcast %37 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2302056Z     %39 = arith.addi %22, %38 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2302258Z     %40 = tt.addptr %23, %39 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2302471Z     %41 = tt.load %40 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2302669Z     %42 = arith.addi %19, %cst_14 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2302953Z     %43 = tt.expand_dims %42 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2303229Z     %44 = tt.broadcast %43 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2303420Z     %45 = arith.addi %22, %44 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2303626Z     %46 = tt.addptr %23, %45 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2303831Z     %47 = tt.load %46 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2304025Z     %48 = arith.addi %19, %cst_13 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2304297Z     %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2304567Z     %50 = tt.broadcast %49 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2304763Z     %51 = arith.addi %22, %50 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2304961Z     %52 = tt.addptr %23, %51 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2305206Z     %53 = tt.load %52 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2305395Z     %54 = arith.addi %19, %cst_12 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2305671Z     %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2305943Z     %56 = tt.broadcast %55 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2306131Z     %57 = arith.addi %22, %56 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2306329Z     %58 = tt.addptr %23, %57 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2306528Z     %59 = tt.load %58 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2306825Z     %60 = ttg.memdesc_index %33[%c0_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2307201Z     ttg.local_store %41, %60 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2307566Z     %61 = ttg.memdesc_index %34[%c0_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2307931Z     ttg.local_store %47, %61 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2308297Z     %62 = ttg.memdesc_index %35[%c0_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2308700Z     ttg.local_store %53, %62 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2309063Z     %63 = ttg.memdesc_index %36[%c0_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2309423Z     ttg.local_store %59, %63 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2309706Z     %64 = arith.addi %19, %cst_11 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2309987Z     %65 = tt.expand_dims %64 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2310260Z     %66 = tt.broadcast %65 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2310456Z     %67 = arith.addi %22, %66 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2310655Z     %68 = tt.addptr %23, %67 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2310864Z     %69 = tt.load %68 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2311058Z     %70 = arith.addi %19, %cst_10 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2311331Z     %71 = tt.expand_dims %70 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2311605Z     %72 = tt.broadcast %71 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2311795Z     %73 = arith.addi %22, %72 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2311996Z     %74 = tt.addptr %23, %73 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2312202Z     %75 = tt.load %74 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2312392Z     %76 = arith.addi %19, %cst_9 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2312671Z     %77 = tt.expand_dims %76 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2312940Z     %78 = tt.broadcast %77 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2333763Z     %79 = arith.addi %22, %78 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2333966Z     %80 = tt.addptr %23, %79 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2334172Z     %81 = tt.load %80 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2334367Z     %82 = arith.addi %19, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2334645Z     %83 = tt.expand_dims %82 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2334922Z     %84 = tt.broadcast %83 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2335117Z     %85 = arith.addi %22, %84 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2335318Z     %86 = tt.addptr %23, %85 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2335526Z     %87 = tt.load %86 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2335816Z     %88 = ttg.memdesc_index %33[%c1_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2336190Z     ttg.local_store %69, %88 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2336559Z     %89 = ttg.memdesc_index %34[%c1_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2336921Z     ttg.local_store %75, %89 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2337336Z     %90 = ttg.memdesc_index %35[%c1_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2337696Z     ttg.local_store %81, %90 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2338064Z     %91 = ttg.memdesc_index %36[%c1_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2338434Z     ttg.local_store %87, %91 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2338713Z     %92 = arith.addi %19, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2338996Z     %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2339276Z     %94 = tt.broadcast %93 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2339469Z     %95 = arith.addi %22, %94 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2339677Z     %96 = tt.addptr %23, %95 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2339879Z     %97 = tt.load %96 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2340073Z     %98 = arith.addi %19, %cst_6 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2340344Z     %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2340620Z     %100 = tt.broadcast %99 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2340823Z     %101 = arith.addi %22, %100 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2341031Z     %102 = tt.addptr %23, %101 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2341249Z     %103 = tt.load %102 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2341453Z     %104 = arith.addi %19, %cst_5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2341737Z     %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2342060Z     %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2342260Z     %107 = arith.addi %22, %106 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2342472Z     %108 = tt.addptr %23, %107 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2342684Z     %109 = tt.load %108 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2342886Z     %110 = arith.addi %19, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2343169Z     %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2343451Z     %112 = tt.broadcast %111 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2343651Z     %113 = arith.addi %22, %112 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2343860Z     %114 = tt.addptr %23, %113 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2344075Z     %115 = tt.load %114 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2344366Z     %116 = ttg.memdesc_index %33[%c2_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2344734Z     ttg.local_store %97, %116 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2345107Z     %117 = ttg.memdesc_index %34[%c2_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2345512Z     ttg.local_store %103, %117 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2345885Z     %118 = ttg.memdesc_index %35[%c2_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2346258Z     ttg.local_store %109, %118 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2346623Z     %119 = ttg.memdesc_index %36[%c2_i32] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2346998Z     ttg.local_store %115, %119 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2348710Z     %120:23 = scf.for %arg3 = %c0_i32 to %c500_i32 step %c4_i32 iter_args(%arg4 = %cst, %arg5 = %c2_i32, %arg6 = %60, %arg7 = %88, %arg8 = %116, %arg9 = %61, %arg10 = %89, %arg11 = %117, %arg12 = %c1_i32, %arg13 = %c5_i32, %arg14 = %c9_i32, %arg15 = %62, %arg16 = %90, %arg17 = %118, %arg18 = %c2_i32, %arg19 = %c6_i32, %arg20 = %c10_i32, %arg21 = %63, %arg22 = %91, %arg23 = %119, %arg24 = %c3_i32, %arg25 = %c7_i32, %arg26 = %c11_i32) -> (tensor<2048x16xf32, #mma>, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32)  : i32 {
2026-02-21T08:56:51.2350407Z       %381 = arith.addi %arg3, %c12_i32 : i32
2026-02-21T08:56:51.2350539Z       %382 = arith.muli %381, %c2_i32 : i32
2026-02-21T08:56:51.2350750Z       %383 = tt.splat %382 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2350979Z       %384 = arith.addi %383, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2351259Z       %385 = tt.expand_dims %384 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2351550Z       %386 = tt.broadcast %385 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2351761Z       %387 = arith.addi %22, %386 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2351973Z       %388 = tt.addptr %23, %387 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2352192Z       %389 = tt.load %388 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2352503Z       %390 = ttg.local_load %arg6 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2352962Z       %391 = arith.extf %390 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2353260Z       %392 = arith.muli %arg3, %c8192_i32 : i32
2026-02-21T08:56:51.2353442Z       %393 = tt.splat %392 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2353679Z       %394 = arith.addi %393, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2353992Z       %395 = tt.addptr %25, %394 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2354309Z       %396 = tt.load %395 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2354586Z       %397 = arith.shli %396, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2354833Z       %398 = arith.shrsi %397, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2355077Z       %399 = arith.shrsi %396, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2355371Z       %400 = tt.expand_dims %398 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2355712Z       %401 = tt.expand_dims %399 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2356001Z       %402 = tt.broadcast %400 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2356244Z       %403 = arith.select %30, %402, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2356491Z       %404 = tt.broadcast %401 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2356730Z       %405 = arith.select %32, %404, %403 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2356964Z       %406 = tt.reshape %405 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2357194Z       %407 = arith.sitofp %406 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2357493Z       %408 = ttg.convert_layout %407 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2357980Z       %409 = tt.dot %391, %408, %arg4, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2358346Z       %410 = arith.addi %arg3, %c13_i32 : i32
2026-02-21T08:56:51.2358472Z       %411 = arith.muli %410, %c2_i32 : i32
2026-02-21T08:56:51.2358651Z       %412 = tt.splat %411 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2358879Z       %413 = arith.addi %412, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2359199Z       %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2359486Z       %415 = tt.broadcast %414 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2359687Z       %416 = arith.addi %22, %415 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2359901Z       %417 = tt.addptr %23, %416 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2360116Z       %418 = tt.load %417 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2360430Z       %419 = ttg.local_load %arg9 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2360882Z       %420 = arith.extf %419 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2361181Z       %421 = arith.muli %arg12, %c8192_i32 : i32
2026-02-21T08:56:51.2361365Z       %422 = tt.splat %421 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2361593Z       %423 = arith.addi %422, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2361907Z       %424 = tt.addptr %25, %423 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2362222Z       %425 = tt.load %424 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2362456Z       %426 = arith.shli %425, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2362828Z       %427 = arith.shrsi %426, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2363066Z       %428 = arith.shrsi %425, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2363360Z       %429 = tt.expand_dims %427 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2363696Z       %430 = tt.expand_dims %428 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2363973Z       %431 = tt.broadcast %429 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2364211Z       %432 = arith.select %30, %431, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2364445Z       %433 = tt.broadcast %430 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2364681Z       %434 = arith.select %32, %433, %432 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2364909Z       %435 = tt.reshape %434 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2365132Z       %436 = arith.sitofp %435 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2365430Z       %437 = ttg.convert_layout %436 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2365895Z       %438 = tt.dot %420, %437, %409, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2366248Z       %439 = arith.addi %arg3, %c14_i32 : i32
2026-02-21T08:56:51.2366371Z       %440 = arith.muli %439, %c2_i32 : i32
2026-02-21T08:56:51.2366539Z       %441 = tt.splat %440 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2366769Z       %442 = arith.addi %441, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2367044Z       %443 = tt.expand_dims %442 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2367382Z       %444 = tt.broadcast %443 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2367581Z       %445 = arith.addi %22, %444 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2367783Z       %446 = tt.addptr %23, %445 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2367995Z       %447 = tt.load %446 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2368302Z       %448 = ttg.local_load %arg15 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2368749Z       %449 = arith.extf %448 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2369037Z       %450 = arith.muli %arg18, %c8192_i32 : i32
2026-02-21T08:56:51.2369215Z       %451 = tt.splat %450 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2369444Z       %452 = arith.addi %451, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2369755Z       %453 = tt.addptr %25, %452 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2370059Z       %454 = tt.load %453 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2370291Z       %455 = arith.shli %454, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2370528Z       %456 = arith.shrsi %455, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2370806Z       %457 = arith.shrsi %454, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2371093Z       %458 = tt.expand_dims %456 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2371426Z       %459 = tt.expand_dims %457 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2371707Z       %460 = tt.broadcast %458 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2371943Z       %461 = arith.select %30, %460, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2372183Z       %462 = tt.broadcast %459 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2372418Z       %463 = arith.select %32, %462, %461 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2372651Z       %464 = tt.reshape %463 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2372873Z       %465 = arith.sitofp %464 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2373167Z       %466 = ttg.convert_layout %465 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2373638Z       %467 = tt.dot %449, %466, %438, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2373986Z       %468 = arith.addi %arg3, %c15_i32 : i32
2026-02-21T08:56:51.2374107Z       %469 = arith.muli %468, %c2_i32 : i32
2026-02-21T08:56:51.2374277Z       %470 = tt.splat %469 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2374500Z       %471 = arith.addi %470, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:56:51.2374782Z       %472 = tt.expand_dims %471 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T08:56:51.2375096Z       %473 = tt.broadcast %472 : tensor<1x2xi32, #blocked1> -> tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2375292Z       %474 = arith.addi %22, %473 : tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2375501Z       %475 = tt.addptr %23, %474 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>, tensor<2048x2xi32, #blocked1>
2026-02-21T08:56:51.2375713Z       %476 = tt.load %475 : tensor<2048x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:56:51.2376024Z       %477 = ttg.local_load %arg21 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2376475Z       %478 = arith.extf %477 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2376765Z       %479 = arith.muli %arg24, %c8192_i32 : i32
2026-02-21T08:56:51.2376946Z       %480 = tt.splat %479 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2377175Z       %481 = arith.addi %480, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2377481Z       %482 = tt.addptr %25, %481 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2377790Z       %483 = tt.load %482 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2378020Z       %484 = arith.shli %483, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2378259Z       %485 = arith.shrsi %484, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2378494Z       %486 = arith.shrsi %483, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2378814Z       %487 = tt.expand_dims %485 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2379151Z       %488 = tt.expand_dims %486 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2379430Z       %489 = tt.broadcast %487 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2379668Z       %490 = arith.select %30, %489, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2379904Z       %491 = tt.broadcast %488 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2380131Z       %492 = arith.select %32, %491, %490 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2380362Z       %493 = tt.reshape %492 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2380584Z       %494 = arith.sitofp %493 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2380880Z       %495 = ttg.convert_layout %494 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2381347Z       %496 = tt.dot %478, %495, %467, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2381692Z       %497 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T08:56:51.2381821Z       %498 = arith.cmpi slt, %497, %c3_i32 : i32
2026-02-21T08:56:51.2381949Z       %499 = arith.select %498, %497, %c0_i32 : i32
2026-02-21T08:56:51.2382229Z       %500 = ttg.memdesc_index %33[%499] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2382602Z       ttg.local_store %389, %500 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2382970Z       %501 = ttg.memdesc_index %34[%499] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2383376Z       ttg.local_store %418, %501 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2383743Z       %502 = ttg.memdesc_index %35[%499] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2384108Z       ttg.local_store %447, %502 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2384474Z       %503 = ttg.memdesc_index %36[%499] : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2384839Z       ttg.local_store %476, %503 : tensor<2048x2xbf16, #blocked1> -> !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>
2026-02-21T08:56:51.2386301Z       scf.yield %496, %499, %arg7, %arg8, %500, %arg10, %arg11, %501, %arg13, %arg14, %410, %arg16, %arg17, %502, %arg19, %arg20, %439, %arg22, %arg23, %503, %arg25, %arg26, %468 : tensor<2048x16xf32, #mma>, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2>, i32, i32, i32
2026-02-21T08:56:51.2387585Z     }
2026-02-21T08:56:51.2387832Z     %121 = ttg.local_load %120#2 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2388274Z     %122 = arith.extf %121 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2388614Z     %123 = arith.addi %24, %cst_3 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2388928Z     %124 = tt.addptr %25, %123 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2389233Z     %125 = tt.load %124 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2389464Z     %126 = arith.shli %125, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2389699Z     %127 = arith.shrsi %126, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2389939Z     %128 = arith.shrsi %125, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2390226Z     %129 = tt.expand_dims %127 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2390553Z     %130 = tt.expand_dims %128 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2390833Z     %131 = tt.broadcast %129 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2391071Z     %132 = arith.select %30, %131, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2391303Z     %133 = tt.broadcast %130 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2391539Z     %134 = arith.select %32, %133, %132 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2391763Z     %135 = tt.reshape %134 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2392024Z     %136 = arith.sitofp %135 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2392319Z     %137 = ttg.convert_layout %136 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2392787Z     %138 = tt.dot %122, %137, %120#0, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2393300Z     %139 = ttg.local_load %120#5 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2393747Z     %140 = arith.extf %139 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2394033Z     %141 = arith.muli %120#8, %c8192_i32 : i32
2026-02-21T08:56:51.2394209Z     %142 = tt.splat %141 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2394433Z     %143 = arith.addi %142, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2394741Z     %144 = tt.addptr %25, %143 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2395049Z     %145 = tt.load %144 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2395279Z     %146 = arith.shli %145, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2395515Z     %147 = arith.shrsi %146, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2395786Z     %148 = arith.shrsi %145, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2396079Z     %149 = tt.expand_dims %147 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2396410Z     %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2396685Z     %151 = tt.broadcast %149 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2396922Z     %152 = arith.select %30, %151, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2397155Z     %153 = tt.broadcast %150 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2397386Z     %154 = arith.select %32, %153, %152 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2397617Z     %155 = tt.reshape %154 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2397835Z     %156 = arith.sitofp %155 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2398129Z     %157 = ttg.convert_layout %156 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2398589Z     %158 = tt.dot %140, %157, %138, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2399089Z     %159 = ttg.local_load %120#11 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2399530Z     %160 = arith.extf %159 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2399818Z     %161 = arith.muli %120#14, %c8192_i32 : i32
2026-02-21T08:56:51.2399994Z     %162 = tt.splat %161 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2400252Z     %163 = arith.addi %162, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2400554Z     %164 = tt.addptr %25, %163 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2400859Z     %165 = tt.load %164 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2401088Z     %166 = arith.shli %165, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2401326Z     %167 = arith.shrsi %166, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2401560Z     %168 = arith.shrsi %165, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2401846Z     %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2402178Z     %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2402459Z     %171 = tt.broadcast %169 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2402808Z     %172 = arith.select %30, %171, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2403046Z     %173 = tt.broadcast %170 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2403275Z     %174 = arith.select %32, %173, %172 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2403518Z     %175 = tt.reshape %174 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2403733Z     %176 = arith.sitofp %175 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2404065Z     %177 = ttg.convert_layout %176 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2404528Z     %178 = tt.dot %160, %177, %158, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2405027Z     %179 = ttg.local_load %120#17 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2405464Z     %180 = arith.extf %179 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2405748Z     %181 = arith.muli %120#20, %c8192_i32 : i32
2026-02-21T08:56:51.2405919Z     %182 = tt.splat %181 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2406142Z     %183 = arith.addi %182, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2406444Z     %184 = tt.addptr %25, %183 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2406748Z     %185 = tt.load %184 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2406980Z     %186 = arith.shli %185, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2407212Z     %187 = arith.shrsi %186, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2407445Z     %188 = arith.shrsi %185, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2407726Z     %189 = tt.expand_dims %187 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2408061Z     %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2408378Z     %191 = tt.broadcast %189 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2408611Z     %192 = arith.select %30, %191, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2408848Z     %193 = tt.broadcast %190 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2409083Z     %194 = arith.select %32, %193, %192 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2409322Z     %195 = tt.reshape %194 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2409546Z     %196 = arith.sitofp %195 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2409840Z     %197 = ttg.convert_layout %196 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2410314Z     %198 = tt.dot %180, %197, %178, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2410823Z     %199 = ttg.local_load %120#3 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2411262Z     %200 = arith.extf %199 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2411602Z     %201 = arith.addi %24, %cst_2 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2411915Z     %202 = tt.addptr %25, %201 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2412277Z     %203 = tt.load %202 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2412514Z     %204 = arith.shli %203, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2412752Z     %205 = arith.shrsi %204, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2412994Z     %206 = arith.shrsi %203, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2413284Z     %207 = tt.expand_dims %205 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2413622Z     %208 = tt.expand_dims %206 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2413912Z     %209 = tt.broadcast %207 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2414149Z     %210 = arith.select %30, %209, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2414391Z     %211 = tt.broadcast %208 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2414624Z     %212 = arith.select %32, %211, %210 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2414856Z     %213 = tt.reshape %212 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2415080Z     %214 = arith.sitofp %213 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2415371Z     %215 = ttg.convert_layout %214 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2415837Z     %216 = tt.dot %200, %215, %198, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2416351Z     %217 = ttg.local_load %120#6 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2416799Z     %218 = arith.extf %217 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2417122Z     %219 = arith.muli %120#9, %c8192_i32 : i32
2026-02-21T08:56:51.2417297Z     %220 = tt.splat %219 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2417527Z     %221 = arith.addi %220, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2417841Z     %222 = tt.addptr %25, %221 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2418150Z     %223 = tt.load %222 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2418387Z     %224 = arith.shli %223, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2418626Z     %225 = arith.shrsi %224, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2418871Z     %226 = arith.shrsi %223, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2419162Z     %227 = tt.expand_dims %225 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2419495Z     %228 = tt.expand_dims %226 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2419781Z     %229 = tt.broadcast %227 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2420020Z     %230 = arith.select %30, %229, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2420265Z     %231 = tt.broadcast %228 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2420548Z     %232 = arith.select %32, %231, %230 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2420776Z     %233 = tt.reshape %232 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2421002Z     %234 = arith.sitofp %233 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2421294Z     %235 = ttg.convert_layout %234 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2421761Z     %236 = tt.dot %218, %235, %216, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2422265Z     %237 = ttg.local_load %120#12 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2422710Z     %238 = arith.extf %237 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2423003Z     %239 = arith.muli %120#15, %c8192_i32 : i32
2026-02-21T08:56:51.2423184Z     %240 = tt.splat %239 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2423407Z     %241 = arith.addi %240, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2423720Z     %242 = tt.addptr %25, %241 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2424030Z     %243 = tt.load %242 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2424267Z     %244 = arith.shli %243, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2424508Z     %245 = arith.shrsi %244, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2424751Z     %246 = arith.shrsi %243, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2425074Z     %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2425411Z     %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2425692Z     %249 = tt.broadcast %247 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2425932Z     %250 = arith.select %30, %249, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2426167Z     %251 = tt.broadcast %248 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2426402Z     %252 = arith.select %32, %251, %250 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2426635Z     %253 = tt.reshape %252 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2426855Z     %254 = arith.sitofp %253 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2427155Z     %255 = ttg.convert_layout %254 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2427623Z     %256 = tt.dot %238, %255, %236, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2428130Z     %257 = ttg.local_load %120#18 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2428577Z     %258 = arith.extf %257 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2428899Z     %259 = arith.muli %120#21, %c8192_i32 : i32
2026-02-21T08:56:51.2429080Z     %260 = tt.splat %259 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2429310Z     %261 = arith.addi %260, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2429621Z     %262 = tt.addptr %25, %261 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2429931Z     %263 = tt.load %262 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2430163Z     %264 = arith.shli %263, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2430403Z     %265 = arith.shrsi %264, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2430639Z     %266 = arith.shrsi %263, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2430931Z     %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2431275Z     %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2431556Z     %269 = tt.broadcast %267 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2431796Z     %270 = arith.select %30, %269, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2432037Z     %271 = tt.broadcast %268 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2432266Z     %272 = arith.select %32, %271, %270 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2432500Z     %273 = tt.reshape %272 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2432719Z     %274 = arith.sitofp %273 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2433026Z     %275 = ttg.convert_layout %274 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2433530Z     %276 = tt.dot %258, %275, %256, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2434030Z     %277 = ttg.local_load %120#4 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2434473Z     %278 = arith.extf %277 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2434810Z     %279 = arith.addi %24, %cst_1 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2435127Z     %280 = tt.addptr %25, %279 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2435440Z     %281 = tt.load %280 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2435672Z     %282 = arith.shli %281, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2435912Z     %283 = arith.shrsi %282, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2436151Z     %284 = arith.shrsi %281, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2436440Z     %285 = tt.expand_dims %283 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2436772Z     %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2437084Z     %287 = tt.broadcast %285 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2437325Z     %288 = arith.select %30, %287, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2437566Z     %289 = tt.broadcast %286 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2437796Z     %290 = arith.select %32, %289, %288 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2438031Z     %291 = tt.reshape %290 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2438251Z     %292 = arith.sitofp %291 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2438555Z     %293 = ttg.convert_layout %292 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2439024Z     %294 = tt.dot %278, %293, %276, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2439527Z     %295 = ttg.local_load %120#7 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2439973Z     %296 = arith.extf %295 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2440261Z     %297 = arith.muli %120#10, %c8192_i32 : i32
2026-02-21T08:56:51.2440437Z     %298 = tt.splat %297 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2440666Z     %299 = arith.addi %298, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2440970Z     %300 = tt.addptr %25, %299 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2441286Z     %301 = tt.load %300 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2441522Z     %302 = arith.shli %301, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2441792Z     %303 = arith.shrsi %302, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2442033Z     %304 = arith.shrsi %301, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2442322Z     %305 = tt.expand_dims %303 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2442878Z     %306 = tt.expand_dims %304 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2443218Z     %307 = tt.broadcast %305 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2443464Z     %308 = arith.select %30, %307, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2443705Z     %309 = tt.broadcast %306 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2443940Z     %310 = arith.select %32, %309, %308 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2444171Z     %311 = tt.reshape %310 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2444395Z     %312 = arith.sitofp %311 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2444686Z     %313 = ttg.convert_layout %312 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2445153Z     %314 = tt.dot %296, %313, %294, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2445772Z     %315 = ttg.local_load %120#13 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2446222Z     %316 = arith.extf %315 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2446515Z     %317 = arith.muli %120#16, %c8192_i32 : i32
2026-02-21T08:56:51.2446691Z     %318 = tt.splat %317 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2446924Z     %319 = arith.addi %318, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2447233Z     %320 = tt.addptr %25, %319 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2447540Z     %321 = tt.load %320 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2447780Z     %322 = arith.shli %321, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2448015Z     %323 = arith.shrsi %322, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2448257Z     %324 = arith.shrsi %321, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2448547Z     %325 = tt.expand_dims %323 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2448876Z     %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2449158Z     %327 = tt.broadcast %325 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2449393Z     %328 = arith.select %30, %327, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2449630Z     %329 = tt.broadcast %326 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2449866Z     %330 = arith.select %32, %329, %328 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2450139Z     %331 = tt.reshape %330 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2450360Z     %332 = arith.sitofp %331 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2450650Z     %333 = ttg.convert_layout %332 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2451114Z     %334 = tt.dot %316, %333, %314, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2451615Z     %335 = ttg.local_load %120#19 : !ttg.memdesc<2048x2xbf16, #shared, #smem, mutable, 3x2048x2> -> tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2452053Z     %336 = arith.extf %335 : tensor<2048x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2452344Z     %337 = arith.muli %120#22, %c8192_i32 : i32
2026-02-21T08:56:51.2452523Z     %338 = tt.splat %337 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2452750Z     %339 = arith.addi %338, %24 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2453059Z     %340 = tt.addptr %25, %339 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2453366Z     %341 = tt.load %340 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2453604Z     %342 = arith.shli %341, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2453845Z     %343 = arith.shrsi %342, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2454289Z     %344 = arith.shrsi %341, %cst_16 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:51.2454584Z     %345 = tt.expand_dims %343 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2454912Z     %346 = tt.expand_dims %344 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T08:56:51.2455194Z     %347 = tt.broadcast %345 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2455433Z     %348 = arith.select %30, %347, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2455670Z     %349 = tt.broadcast %346 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2455905Z     %350 = arith.select %32, %349, %348 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T08:56:51.2456135Z     %351 = tt.reshape %350 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T08:56:51.2456359Z     %352 = arith.sitofp %351 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T08:56:51.2456658Z     %353 = ttg.convert_layout %352 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:51.2457119Z     %354 = tt.dot %336, %353, %334, inputPrecision = tf32 : tensor<2048x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<2048x16xf32, #mma>
2026-02-21T08:56:51.2457511Z     ttg.local_dealloc %36 : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable>
2026-02-21T08:56:51.2457725Z     ttg.local_dealloc %35 : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable>
2026-02-21T08:56:51.2457930Z     ttg.local_dealloc %34 : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable>
2026-02-21T08:56:51.2458139Z     ttg.local_dealloc %33 : !ttg.memdesc<3x2048x2xbf16, #shared, #smem, mutable>
2026-02-21T08:56:51.2458352Z     %355 = arith.truncf %354 : tensor<2048x16xf32, #mma> to tensor<2048x16xbf16, #mma>
2026-02-21T08:56:51.2458564Z     %356 = arith.extsi %9 : i32 to i64
2026-02-21T08:56:51.2458684Z     %357 = arith.extsi %14 : i32 to i64
2026-02-21T08:56:51.2458848Z     %358 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<2048x16x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:51.2459058Z     %359 = tt.splat %356 : i64 -> tensor<2048xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:51.2459341Z     %360 = arith.extsi %11 : tensor<2048xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<2048xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:51.2459631Z     %361 = arith.addi %359, %360 : tensor<2048xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:51.2459908Z     %362 = tt.expand_dims %361 {axis = 1 : i32} : tensor<2048xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<2048x1xi64, #mma>
2026-02-21T08:56:51.2460159Z     %363 = arith.muli %362, %cst_19 : tensor<2048x1xi64, #mma>
2026-02-21T08:56:51.2460353Z     %364 = tt.broadcast %363 : tensor<2048x1xi64, #mma> -> tensor<2048x16xi64, #mma>
2026-02-21T08:56:51.2460573Z     %365 = tt.splat %357 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:51.2460849Z     %366 = arith.extsi %16 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:51.2461127Z     %367 = arith.addi %365, %366 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:51.2461395Z     %368 = tt.expand_dims %367 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi64, #mma>
2026-02-21T08:56:51.2461661Z     %369 = tt.broadcast %368 : tensor<1x16xi64, #mma> -> tensor<2048x16xi64, #mma>
2026-02-21T08:56:51.2461845Z     %370 = arith.addi %364, %369 : tensor<2048x16xi64, #mma>
2026-02-21T08:56:51.2462044Z     %371 = tt.addptr %358, %370 : tensor<2048x16x!tt.ptr<bf16>, #mma>, tensor<2048x16xi64, #mma>
2026-02-21T08:56:51.2462284Z     %372 = arith.cmpi sge, %362, %cst_20 : tensor<2048x1xi64, #mma>
2026-02-21T08:56:51.2462461Z     %373 = arith.cmpi slt, %362, %cst_21 : tensor<2048x1xi64, #mma>
2026-02-21T08:56:51.2462628Z     %374 = arith.andi %372, %373 : tensor<2048x1xi1, #mma>
2026-02-21T08:56:51.2462814Z     %375 = tt.broadcast %374 : tensor<2048x1xi1, #mma> -> tensor<2048x16xi1, #mma>
2026-02-21T08:56:51.2463008Z     %376 = arith.cmpi sge, %368, %cst_22 : tensor<1x16xi64, #mma>
2026-02-21T08:56:51.2463175Z     %377 = arith.cmpi slt, %368, %cst_23 : tensor<1x16xi64, #mma>
2026-02-21T08:56:51.2463337Z     %378 = arith.andi %376, %377 : tensor<1x16xi1, #mma>
2026-02-21T08:56:51.2463505Z     %379 = tt.broadcast %378 : tensor<1x16xi1, #mma> -> tensor<2048x16xi1, #mma>
2026-02-21T08:56:51.2463686Z     %380 = arith.andi %375, %379 : tensor<2048x16xi1, #mma>
2026-02-21T08:56:51.2463851Z     tt.store %371, %355, %380 : tensor<2048x16x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:51.2463994Z     tt.return
2026-02-21T08:56:51.2464090Z   }
2026-02-21T08:56:51.2464174Z }
2026-02-21T08:56:51.2464219Z 
2026-02-21T08:56:51.2464259Z {-#
2026-02-21T08:56:51.2464345Z   external_resources: {
2026-02-21T08:56:51.2464454Z     mlir_reproducer: {
2026-02-21T08:56:51.2465455Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:56:51.2466459Z       disable_threading: false,
2026-02-21T08:56:51.2466569Z       verify_each: true
2026-02-21T08:56:51.2466668Z     }
2026-02-21T08:56:51.2466747Z   }
2026-02-21T08:56:51.2466825Z #-}
2026-02-21T08:56:51.2467105Z /tmp/torchinductor_root/w6/cw6ainxzjafdthoodfldr5o2eaj3ll2vel4ift4ys25lrduk4l3i.py:12:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:56:51.2467858Z /tmp/torchinductor_root/w6/cw6ainxzjafdthoodfldr5o2eaj3ll2vel4ift4ys25lrduk4l3i.py:12:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:56:51.2468421Z [90s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:56:51.2469152Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T08:56:51.2469814Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:56:51.2469988Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:56:53.2573820Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:56:53.2580571Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}>
2026-02-21T08:56:53.2581104Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}>
2026-02-21T08:56:53.2581770Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T08:56:53.2582788Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}>
2026-02-21T08:56:53.2583275Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:56:53.2583717Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:56:53.2584048Z #smem = #ttg.shared_memory
2026-02-21T08:56:53.2584479Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:56:53.2585365Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:56:53.2586222Z     %cst = arith.constant dense<16384> : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2586532Z     %cst_0 = arith.constant dense<0> : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2586840Z     %cst_1 = arith.constant dense<8192> : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2587153Z     %cst_2 = arith.constant dense<8192> : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2587470Z     %cst_3 = arith.constant dense<0> : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2587780Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2588096Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2588421Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2588740Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T08:56:53.2589063Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T08:56:53.2589387Z     %cst_9 = arith.constant dense<1024> : tensor<32x1xi32, #blocked2>
2026-02-21T08:56:53.2589668Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:56:53.2589957Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<32x256xf32, #mma>
2026-02-21T08:56:53.2590266Z     %c16991_i32 = arith.constant 16991 : i32
2026-02-21T08:56:53.2590434Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:56:53.2590733Z     %c-1_i32 = arith.constant -1 : i32
2026-02-21T08:56:53.2590910Z     %cst_11 = arith.constant dense<0> : tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2591120Z     %cst_12 = arith.constant dense<0> : tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2591357Z     %cst_13 = arith.constant dense<false> : tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2591662Z     %c255_i32 = arith.constant 255 : i32
2026-02-21T08:56:53.2591857Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T08:56:53.2592042Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:56:53.2592227Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:56:53.2592416Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:56:53.2592598Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T08:56:53.2592789Z     %c1216_i32 = arith.constant 1216 : i32
2026-02-21T08:56:53.2592971Z     %c1824_i32 = arith.constant 1824 : i32
2026-02-21T08:56:53.2593199Z     %cst_14 = arith.constant dense<0> : tensor<2x256xi8, #blocked>
2026-02-21T08:56:53.2593478Z     %cst_15 = arith.constant dense<8192> : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2593756Z     %cst_16 = arith.constant dense<0> : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2594040Z     %cst_17 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2594277Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T08:56:53.2594462Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T08:56:53.2594622Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:56:53.2594743Z     %c608_i32 = arith.constant 608 : i32
2026-02-21T08:56:53.2594984Z     %cst_18 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2595288Z     %0 = tt.get_program_id x : i32
2026-02-21T08:56:53.2595569Z     %1 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:53.2596044Z     %2 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2596326Z     %3 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:53.2596578Z     %4 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:53.2596787Z     %5 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T08:56:53.2597032Z     %6 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2597353Z     %7 = arith.extsi %6 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2597684Z     %8 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2598017Z     %9 = arith.extsi %8 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2598387Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T08:56:53.2598870Z     %11 = tt.expand_dims %10 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T08:56:53.2599281Z     %12 = tt.expand_dims %11 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T08:56:53.2599546Z     %13 = arith.cmpi eq, %12, %cst_8 : tensor<1x2x1xi32, #blocked1>
2026-02-21T08:56:53.2599762Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1>
2026-02-21T08:56:53.2599970Z     %15 = arith.cmpi eq, %12, %cst_7 : tensor<1x2x1xi32, #blocked1>
2026-02-21T08:56:53.2600177Z     %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1>
2026-02-21T08:56:53.2600393Z     %17 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:53.2600722Z     %18 = arith.extsi %2 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2601042Z     %19 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2601350Z     %20 = arith.extsi %19 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2601601Z     %21 = arith.subi %c16991_i32, %0 : i32
2026-02-21T08:56:53.2601742Z     %22 = arith.divui %21, %c608_i32 : i32
2026-02-21T08:56:53.2601865Z     %23 = arith.remsi %22, %c4_i32 : i32
2026-02-21T08:56:53.2601987Z     %24 = arith.subi %22, %23 : i32
2026-02-21T08:56:53.2602105Z     %25 = arith.muli %24, %c608_i32 : i32
2026-02-21T08:56:53.2602230Z     %26 = arith.addi %0, %25 : i32
2026-02-21T08:56:53.2602363Z     scf.for %arg3 = %0 to %26 step %c2432_i32  : i32 {
2026-02-21T08:56:53.2602513Z       %32 = arith.divsi %arg3, %c1024_i32 : i32
2026-02-21T08:56:53.2602725Z       %33 = arith.muli %32, %c32_i32 : i32
2026-02-21T08:56:53.2602850Z       %34 = arith.subi %c512_i32, %33 : i32
2026-02-21T08:56:53.2602969Z       %35 = arith.minsi %34, %c32_i32 : i32
2026-02-21T08:56:53.2603099Z       %36 = arith.remsi %arg3, %c1024_i32 : i32
2026-02-21T08:56:53.2603227Z       %37 = arith.remsi %36, %35 : i32
2026-02-21T08:56:53.2603345Z       %38 = arith.addi %33, %37 : i32
2026-02-21T08:56:53.2603466Z       %39 = arith.divsi %36, %35 : i32
2026-02-21T08:56:53.2603580Z       %40 = arith.muli %38, %c32_i32 : i32
2026-02-21T08:56:53.2603759Z       %41 = tt.splat %40 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:53.2604035Z       %42 = arith.addi %41, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:53.2604213Z       %43 = arith.muli %39, %c256_i32 : i32
2026-02-21T08:56:53.2604441Z       %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<32x1xi32, #blocked2>
2026-02-21T08:56:53.2604694Z       %45 = arith.muli %44, %cst_9 : tensor<32x1xi32, #blocked2>
2026-02-21T08:56:53.2604893Z       %46 = tt.broadcast %45 : tensor<32x1xi32, #blocked2> -> tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2605068Z       %47 = arith.extsi %43 : i32 to i64
2026-02-21T08:56:53.2605240Z       %48 = tt.splat %47 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2605457Z       %49 = arith.addi %48, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2605734Z       %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2606019Z       %51 = tt.broadcast %50 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2606219Z       %52 = arith.cmpi sge, %50, %cst_16 : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2606399Z       %53 = arith.cmpi slt, %50, %cst_15 : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2606563Z       %54 = arith.andi %52, %53 : tensor<1x256xi1, #blocked>
2026-02-21T08:56:53.2606752Z       %55 = tt.broadcast %54 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2607028Z       %56 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst_10) -> (tensor<32x256xf32, #mma>)  : i32 {
2026-02-21T08:56:53.2607250Z         %223 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:56:53.2607434Z         %224 = tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:53.2607663Z         %225 = arith.addi %224, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:53.2607949Z         %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:56:53.2608232Z         %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked2> -> tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2608475Z         %228 = arith.addi %46, %227 : tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2608682Z         %229 = tt.addptr %4, %228 : tensor<32x4x!tt.ptr<bf16>, #blocked2>, tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2608890Z         %230 = tt.load %229 : tensor<32x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:53.2609118Z         %231 = ttg.local_alloc %230 : (tensor<32x4xbf16, #blocked2>) -> !ttg.memdesc<32x4xbf16, #shared, #smem>
2026-02-21T08:56:53.2609455Z         %232 = ttg.local_load %231 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2609868Z         %233 = arith.extf %232 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2610156Z         %234 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:56:53.2610337Z         %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2610564Z         %236 = arith.addi %235, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2610841Z         %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2611088Z         %238 = arith.muli %237, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2611288Z         %239 = tt.broadcast %238 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2611481Z         %240 = arith.addi %239, %51 : tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2611686Z         %241 = tt.addptr %5, %240 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2611939Z         %242 = arith.cmpi sge, %237, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2612111Z         %243 = arith.cmpi slt, %237, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2612283Z         %244 = arith.andi %242, %243 : tensor<2x1xi1, #blocked>
2026-02-21T08:56:53.2612471Z         %245 = tt.broadcast %244 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2612665Z         %246 = arith.andi %245, %55 : tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2612836Z         %247 = tt.load %241, %246, %cst_14 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T08:56:53.2613107Z         %248 = ttg.convert_layout %247 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2613407Z         %249 = arith.shli %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2613657Z         %250 = arith.shrsi %249, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2613916Z         %251 = arith.shrsi %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2614222Z         %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T08:56:53.2614577Z         %253 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T08:56:53.2614880Z         %254 = tt.broadcast %252 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2615134Z         %255 = arith.select %14, %254, %cst_17 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2615391Z         %256 = tt.broadcast %253 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2615643Z         %257 = arith.select %16, %256, %255 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2615887Z         %258 = tt.reshape %257 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked>
2026-02-21T08:56:53.2616117Z         %259 = arith.sitofp %258 : tensor<4x256xi8, #blocked> to tensor<4x256xf32, #blocked>
2026-02-21T08:56:53.2616418Z         %260 = ttg.local_alloc %259 : (tensor<4x256xf32, #blocked>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T08:56:53.2616754Z         %261 = ttg.local_load %260 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2617249Z         %262 = tt.dot %233, %261, %arg5, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x256xf32, #mma>
2026-02-21T08:56:53.2617608Z         scf.yield %262 : tensor<32x256xf32, #mma>
2026-02-21T08:56:53.2617782Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32}
2026-02-21T08:56:53.2617991Z       %57 = arith.truncf %56 : tensor<32x256xf32, #mma> to tensor<32x256xbf16, #mma>
2026-02-21T08:56:53.2618166Z       %58 = arith.extsi %40 : i32 to i64
2026-02-21T08:56:53.2618337Z       %59 = tt.splat %58 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2618545Z       %60 = arith.addi %59, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2618807Z       %61 = tt.expand_dims %60 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2619042Z       %62 = arith.muli %61, %cst_1 : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2619222Z       %63 = tt.broadcast %62 : tensor<32x1xi64, #mma> -> tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2619431Z       %64 = tt.splat %47 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2619638Z       %65 = arith.addi %64, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2619943Z       %66 = tt.expand_dims %65 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2620414Z       %67 = tt.broadcast %66 : tensor<1x256xi64, #mma> -> tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2620613Z       %68 = arith.addi %63, %67 : tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2620851Z       %69 = tt.addptr %17, %68 : tensor<32x256x!tt.ptr<bf16>, #mma>, tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2628256Z       %70 = arith.cmpi sge, %61, %cst_0 : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2628450Z       %71 = arith.cmpi slt, %61, %cst : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2628609Z       %72 = arith.andi %70, %71 : tensor<32x1xi1, #mma>
2026-02-21T08:56:53.2628785Z       %73 = tt.broadcast %72 : tensor<32x1xi1, #mma> -> tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2628973Z       %74 = arith.cmpi sge, %66, %cst_3 : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2629144Z       %75 = arith.cmpi slt, %66, %cst_2 : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2629310Z       %76 = arith.andi %74, %75 : tensor<1x256xi1, #mma>
2026-02-21T08:56:53.2629488Z       %77 = tt.broadcast %76 : tensor<1x256xi1, #mma> -> tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2629668Z       %78 = arith.andi %73, %77 : tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2629826Z       tt.store %69, %57, %78 : tensor<32x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:53.2629981Z       %79 = arith.addi %arg3, %c608_i32 : i32
2026-02-21T08:56:53.2630108Z       %80 = arith.divsi %79, %c1024_i32 : i32
2026-02-21T08:56:53.2630237Z       %81 = arith.muli %80, %c32_i32 : i32
2026-02-21T08:56:53.2630361Z       %82 = arith.subi %c512_i32, %81 : i32
2026-02-21T08:56:53.2630488Z       %83 = arith.minsi %82, %c32_i32 : i32
2026-02-21T08:56:53.2630612Z       %84 = arith.remsi %79, %c1024_i32 : i32
2026-02-21T08:56:53.2630735Z       %85 = arith.remsi %84, %83 : i32
2026-02-21T08:56:53.2630857Z       %86 = arith.addi %81, %85 : i32
2026-02-21T08:56:53.2630972Z       %87 = arith.divsi %84, %83 : i32
2026-02-21T08:56:53.2631100Z       %88 = arith.muli %86, %c32_i32 : i32
2026-02-21T08:56:53.2631274Z       %89 = tt.splat %88 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:53.2631566Z       %90 = arith.addi %89, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:53.2631742Z       %91 = arith.muli %87, %c256_i32 : i32
2026-02-21T08:56:53.2631974Z       %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<32x1xi32, #blocked2>
2026-02-21T08:56:53.2632232Z       %93 = arith.muli %92, %cst_9 : tensor<32x1xi32, #blocked2>
2026-02-21T08:56:53.2632427Z       %94 = tt.broadcast %93 : tensor<32x1xi32, #blocked2> -> tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2632606Z       %95 = arith.extsi %91 : i32 to i64
2026-02-21T08:56:53.2632777Z       %96 = tt.splat %95 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2633002Z       %97 = arith.addi %96, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2633289Z       %98 = tt.expand_dims %97 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2633573Z       %99 = tt.broadcast %98 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2633783Z       %100 = arith.cmpi sge, %98, %cst_16 : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2633964Z       %101 = arith.cmpi slt, %98, %cst_15 : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2634141Z       %102 = arith.andi %100, %101 : tensor<1x256xi1, #blocked>
2026-02-21T08:56:53.2634339Z       %103 = tt.broadcast %102 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2634610Z       %104 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst_10) -> (tensor<32x256xf32, #mma>)  : i32 {
2026-02-21T08:56:53.2634835Z         %223 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:56:53.2635062Z         %224 = tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:53.2635297Z         %225 = arith.addi %224, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:53.2635586Z         %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:56:53.2635868Z         %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked2> -> tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2636072Z         %228 = arith.addi %94, %227 : tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2636278Z         %229 = tt.addptr %4, %228 : tensor<32x4x!tt.ptr<bf16>, #blocked2>, tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2636492Z         %230 = tt.load %229 : tensor<32x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:53.2636718Z         %231 = ttg.local_alloc %230 : (tensor<32x4xbf16, #blocked2>) -> !ttg.memdesc<32x4xbf16, #shared, #smem>
2026-02-21T08:56:53.2637063Z         %232 = ttg.local_load %231 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2637481Z         %233 = arith.extf %232 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2637771Z         %234 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:56:53.2637953Z         %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2638183Z         %236 = arith.addi %235, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2638462Z         %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2638718Z         %238 = arith.muli %237, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2638917Z         %239 = tt.broadcast %238 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2639122Z         %240 = arith.addi %239, %99 : tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2639323Z         %241 = tt.addptr %5, %240 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2639575Z         %242 = arith.cmpi sge, %237, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2639755Z         %243 = arith.cmpi slt, %237, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2639922Z         %244 = arith.andi %242, %243 : tensor<2x1xi1, #blocked>
2026-02-21T08:56:53.2640118Z         %245 = tt.broadcast %244 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2640313Z         %246 = arith.andi %245, %103 : tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2640492Z         %247 = tt.load %241, %246, %cst_14 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T08:56:53.2640765Z         %248 = ttg.convert_layout %247 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2641059Z         %249 = arith.shli %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2641314Z         %250 = arith.shrsi %249, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2641569Z         %251 = arith.shrsi %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2641878Z         %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T08:56:53.2642235Z         %253 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T08:56:53.2642535Z         %254 = tt.broadcast %252 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2642864Z         %255 = arith.select %14, %254, %cst_17 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2643160Z         %256 = tt.broadcast %253 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2643414Z         %257 = arith.select %16, %256, %255 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2643662Z         %258 = tt.reshape %257 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked>
2026-02-21T08:56:53.2643891Z         %259 = arith.sitofp %258 : tensor<4x256xi8, #blocked> to tensor<4x256xf32, #blocked>
2026-02-21T08:56:53.2644150Z         %260 = ttg.local_alloc %259 : (tensor<4x256xf32, #blocked>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T08:56:53.2644483Z         %261 = ttg.local_load %260 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2644974Z         %262 = tt.dot %233, %261, %arg5, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x256xf32, #mma>
2026-02-21T08:56:53.2645347Z         scf.yield %262 : tensor<32x256xf32, #mma>
2026-02-21T08:56:53.2645519Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32}
2026-02-21T08:56:53.2645737Z       %105 = arith.truncf %104 : tensor<32x256xf32, #mma> to tensor<32x256xbf16, #mma>
2026-02-21T08:56:53.2645918Z       %106 = arith.extsi %88 : i32 to i64
2026-02-21T08:56:53.2646088Z       %107 = tt.splat %106 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2646308Z       %108 = arith.addi %107, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2646581Z       %109 = tt.expand_dims %108 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2646827Z       %110 = arith.muli %109, %cst_1 : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2647013Z       %111 = tt.broadcast %110 : tensor<32x1xi64, #mma> -> tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2647231Z       %112 = tt.splat %95 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2647451Z       %113 = arith.addi %112, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2647764Z       %114 = tt.expand_dims %113 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2648035Z       %115 = tt.broadcast %114 : tensor<1x256xi64, #mma> -> tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2648221Z       %116 = arith.addi %111, %115 : tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2648418Z       %117 = tt.addptr %17, %116 : tensor<32x256x!tt.ptr<bf16>, #mma>, tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2648630Z       %118 = arith.cmpi sge, %109, %cst_0 : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2648797Z       %119 = arith.cmpi slt, %109, %cst : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2648963Z       %120 = arith.andi %118, %119 : tensor<32x1xi1, #mma>
2026-02-21T08:56:53.2649141Z       %121 = tt.broadcast %120 : tensor<32x1xi1, #mma> -> tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2649336Z       %122 = arith.cmpi sge, %114, %cst_3 : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2649508Z       %123 = arith.cmpi slt, %114, %cst_2 : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2649674Z       %124 = arith.andi %122, %123 : tensor<1x256xi1, #mma>
2026-02-21T08:56:53.2649856Z       %125 = tt.broadcast %124 : tensor<1x256xi1, #mma> -> tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2650039Z       %126 = arith.andi %121, %125 : tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2650210Z       tt.store %117, %105, %126 : tensor<32x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:53.2650367Z       %127 = arith.addi %arg3, %c1216_i32 : i32
2026-02-21T08:56:53.2650502Z       %128 = arith.divsi %127, %c1024_i32 : i32
2026-02-21T08:56:53.2650629Z       %129 = arith.muli %128, %c32_i32 : i32
2026-02-21T08:56:53.2650757Z       %130 = arith.subi %c512_i32, %129 : i32
2026-02-21T08:56:53.2650885Z       %131 = arith.minsi %130, %c32_i32 : i32
2026-02-21T08:56:53.2651049Z       %132 = arith.remsi %127, %c1024_i32 : i32
2026-02-21T08:56:53.2651177Z       %133 = arith.remsi %132, %131 : i32
2026-02-21T08:56:53.2651301Z       %134 = arith.addi %129, %133 : i32
2026-02-21T08:56:53.2651424Z       %135 = arith.divsi %132, %131 : i32
2026-02-21T08:56:53.2651545Z       %136 = arith.muli %134, %c32_i32 : i32
2026-02-21T08:56:53.2651727Z       %137 = tt.splat %136 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:53.2651964Z       %138 = arith.addi %137, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:53.2652144Z       %139 = arith.muli %135, %c256_i32 : i32
2026-02-21T08:56:53.2652381Z       %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<32x1xi32, #blocked2>
2026-02-21T08:56:53.2652640Z       %141 = arith.muli %140, %cst_9 : tensor<32x1xi32, #blocked2>
2026-02-21T08:56:53.2652850Z       %142 = tt.broadcast %141 : tensor<32x1xi32, #blocked2> -> tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2653032Z       %143 = arith.extsi %139 : i32 to i64
2026-02-21T08:56:53.2653209Z       %144 = tt.splat %143 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2653440Z       %145 = arith.addi %144, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2653723Z       %146 = tt.expand_dims %145 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2654013Z       %147 = tt.broadcast %146 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2654222Z       %148 = arith.cmpi sge, %146, %cst_16 : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2654407Z       %149 = arith.cmpi slt, %146, %cst_15 : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2654583Z       %150 = arith.andi %148, %149 : tensor<1x256xi1, #blocked>
2026-02-21T08:56:53.2654783Z       %151 = tt.broadcast %150 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2655061Z       %152 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst_10) -> (tensor<32x256xf32, #mma>)  : i32 {
2026-02-21T08:56:53.2679480Z         %223 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:56:53.2679665Z         %224 = tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:53.2679901Z         %225 = arith.addi %224, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:53.2680187Z         %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:56:53.2680475Z         %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked2> -> tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2680675Z         %228 = arith.addi %142, %227 : tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2680886Z         %229 = tt.addptr %4, %228 : tensor<32x4x!tt.ptr<bf16>, #blocked2>, tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2681104Z         %230 = tt.load %229 : tensor<32x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:53.2681331Z         %231 = ttg.local_alloc %230 : (tensor<32x4xbf16, #blocked2>) -> !ttg.memdesc<32x4xbf16, #shared, #smem>
2026-02-21T08:56:53.2681674Z         %232 = ttg.local_load %231 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2682087Z         %233 = arith.extf %232 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2682380Z         %234 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:56:53.2682610Z         %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2682842Z         %236 = arith.addi %235, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2683177Z         %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2683428Z         %238 = arith.muli %237, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2683631Z         %239 = tt.broadcast %238 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2683835Z         %240 = arith.addi %239, %147 : tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2684042Z         %241 = tt.addptr %5, %240 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2684265Z         %242 = arith.cmpi sge, %237, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2684447Z         %243 = arith.cmpi slt, %237, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2684624Z         %244 = arith.andi %242, %243 : tensor<2x1xi1, #blocked>
2026-02-21T08:56:53.2684822Z         %245 = tt.broadcast %244 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2685028Z         %246 = arith.andi %245, %151 : tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2685211Z         %247 = tt.load %241, %246, %cst_14 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T08:56:53.2685482Z         %248 = ttg.convert_layout %247 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2685781Z         %249 = arith.shli %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2686031Z         %250 = arith.shrsi %249, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2686285Z         %251 = arith.shrsi %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2686597Z         %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T08:56:53.2686952Z         %253 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T08:56:53.2687264Z         %254 = tt.broadcast %252 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2687574Z         %255 = arith.select %14, %254, %cst_17 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2687830Z         %256 = tt.broadcast %253 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2688086Z         %257 = arith.select %16, %256, %255 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2688330Z         %258 = tt.reshape %257 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked>
2026-02-21T08:56:53.2688563Z         %259 = arith.sitofp %258 : tensor<4x256xi8, #blocked> to tensor<4x256xf32, #blocked>
2026-02-21T08:56:53.2688829Z         %260 = ttg.local_alloc %259 : (tensor<4x256xf32, #blocked>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T08:56:53.2689169Z         %261 = ttg.local_load %260 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2689657Z         %262 = tt.dot %233, %261, %arg5, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x256xf32, #mma>
2026-02-21T08:56:53.2690021Z         scf.yield %262 : tensor<32x256xf32, #mma>
2026-02-21T08:56:53.2690195Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32}
2026-02-21T08:56:53.2690415Z       %153 = arith.truncf %152 : tensor<32x256xf32, #mma> to tensor<32x256xbf16, #mma>
2026-02-21T08:56:53.2690592Z       %154 = arith.extsi %136 : i32 to i64
2026-02-21T08:56:53.2690765Z       %155 = tt.splat %154 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2690982Z       %156 = arith.addi %155, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2691295Z       %157 = tt.expand_dims %156 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2691540Z       %158 = arith.muli %157, %cst_1 : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2691727Z       %159 = tt.broadcast %158 : tensor<32x1xi64, #mma> -> tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2691943Z       %160 = tt.splat %143 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2692158Z       %161 = arith.addi %160, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2692435Z       %162 = tt.expand_dims %161 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2692710Z       %163 = tt.broadcast %162 : tensor<1x256xi64, #mma> -> tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2692898Z       %164 = arith.addi %159, %163 : tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2693097Z       %165 = tt.addptr %17, %164 : tensor<32x256x!tt.ptr<bf16>, #mma>, tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2693305Z       %166 = arith.cmpi sge, %157, %cst_0 : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2693477Z       %167 = arith.cmpi slt, %157, %cst : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2693650Z       %168 = arith.andi %166, %167 : tensor<32x1xi1, #mma>
2026-02-21T08:56:53.2693830Z       %169 = tt.broadcast %168 : tensor<32x1xi1, #mma> -> tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2694021Z       %170 = arith.cmpi sge, %162, %cst_3 : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2694194Z       %171 = arith.cmpi slt, %162, %cst_2 : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2694354Z       %172 = arith.andi %170, %171 : tensor<1x256xi1, #mma>
2026-02-21T08:56:53.2694537Z       %173 = tt.broadcast %172 : tensor<1x256xi1, #mma> -> tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2694757Z       %174 = arith.andi %169, %173 : tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2694931Z       tt.store %165, %153, %174 : tensor<32x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:53.2695093Z       %175 = arith.addi %arg3, %c1824_i32 : i32
2026-02-21T08:56:53.2695230Z       %176 = arith.divsi %175, %c1024_i32 : i32
2026-02-21T08:56:53.2695357Z       %177 = arith.muli %176, %c32_i32 : i32
2026-02-21T08:56:53.2695525Z       %178 = arith.subi %c512_i32, %177 : i32
2026-02-21T08:56:53.2695653Z       %179 = arith.minsi %178, %c32_i32 : i32
2026-02-21T08:56:53.2695780Z       %180 = arith.remsi %175, %c1024_i32 : i32
2026-02-21T08:56:53.2695910Z       %181 = arith.remsi %180, %179 : i32
2026-02-21T08:56:53.2696030Z       %182 = arith.addi %177, %181 : i32
2026-02-21T08:56:53.2696154Z       %183 = arith.divsi %180, %179 : i32
2026-02-21T08:56:53.2696276Z       %184 = arith.muli %182, %c32_i32 : i32
2026-02-21T08:56:53.2696457Z       %185 = tt.splat %184 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:53.2696694Z       %186 = arith.addi %185, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:53.2696875Z       %187 = arith.muli %183, %c256_i32 : i32
2026-02-21T08:56:53.2697117Z       %188 = tt.expand_dims %186 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<32x1xi32, #blocked2>
2026-02-21T08:56:53.2697378Z       %189 = arith.muli %188, %cst_9 : tensor<32x1xi32, #blocked2>
2026-02-21T08:56:53.2697592Z       %190 = tt.broadcast %189 : tensor<32x1xi32, #blocked2> -> tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2697777Z       %191 = arith.extsi %187 : i32 to i64
2026-02-21T08:56:53.2697951Z       %192 = tt.splat %191 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2698182Z       %193 = arith.addi %192, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2698464Z       %194 = tt.expand_dims %193 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2698752Z       %195 = tt.broadcast %194 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2699010Z       %196 = arith.cmpi sge, %194, %cst_16 : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2699196Z       %197 = arith.cmpi slt, %194, %cst_15 : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2699374Z       %198 = arith.andi %196, %197 : tensor<1x256xi1, #blocked>
2026-02-21T08:56:53.2699565Z       %199 = tt.broadcast %198 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2699839Z       %200 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst_10) -> (tensor<32x256xf32, #mma>)  : i32 {
2026-02-21T08:56:53.2700059Z         %223 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T08:56:53.2700239Z         %224 = tt.splat %223 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:53.2700470Z         %225 = arith.addi %224, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:53.2700750Z         %226 = tt.expand_dims %225 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:56:53.2701041Z         %227 = tt.broadcast %226 : tensor<1x4xi32, #blocked2> -> tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2701240Z         %228 = arith.addi %190, %227 : tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2701450Z         %229 = tt.addptr %4, %228 : tensor<32x4x!tt.ptr<bf16>, #blocked2>, tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2701666Z         %230 = tt.load %229 : tensor<32x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:53.2701896Z         %231 = ttg.local_alloc %230 : (tensor<32x4xbf16, #blocked2>) -> !ttg.memdesc<32x4xbf16, #shared, #smem>
2026-02-21T08:56:53.2702235Z         %232 = ttg.local_load %231 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2702649Z         %233 = arith.extf %232 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2702945Z         %234 = arith.extsi %arg4 : i32 to i64
2026-02-21T08:56:53.2703123Z         %235 = tt.splat %234 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2703385Z         %236 = arith.addi %235, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2703663Z         %237 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2703912Z         %238 = arith.muli %237, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2704115Z         %239 = tt.broadcast %238 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2704312Z         %240 = arith.addi %239, %195 : tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2704512Z         %241 = tt.addptr %5, %240 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2704725Z         %242 = arith.cmpi sge, %237, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2704901Z         %243 = arith.cmpi slt, %237, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2705072Z         %244 = arith.andi %242, %243 : tensor<2x1xi1, #blocked>
2026-02-21T08:56:53.2705268Z         %245 = tt.broadcast %244 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2705463Z         %246 = arith.andi %245, %199 : tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2705641Z         %247 = tt.load %241, %246, %cst_14 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T08:56:53.2705910Z         %248 = ttg.convert_layout %247 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2706206Z         %249 = arith.shli %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2706454Z         %250 = arith.shrsi %249, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2706704Z         %251 = arith.shrsi %248, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2707052Z         %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T08:56:53.2707408Z         %253 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T08:56:53.2707710Z         %254 = tt.broadcast %252 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2707966Z         %255 = arith.select %14, %254, %cst_17 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2708220Z         %256 = tt.broadcast %253 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2708473Z         %257 = arith.select %16, %256, %255 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2708715Z         %258 = tt.reshape %257 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked>
2026-02-21T08:56:53.2708948Z         %259 = arith.sitofp %258 : tensor<4x256xi8, #blocked> to tensor<4x256xf32, #blocked>
2026-02-21T08:56:53.2709211Z         %260 = ttg.local_alloc %259 : (tensor<4x256xf32, #blocked>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T08:56:53.2709542Z         %261 = ttg.local_load %260 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2710032Z         %262 = tt.dot %233, %261, %arg5, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x256xf32, #mma>
2026-02-21T08:56:53.2710390Z         scf.yield %262 : tensor<32x256xf32, #mma>
2026-02-21T08:56:53.2710563Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 2 : i32}
2026-02-21T08:56:53.2710780Z       %201 = arith.truncf %200 : tensor<32x256xf32, #mma> to tensor<32x256xbf16, #mma>
2026-02-21T08:56:53.2710958Z       %202 = arith.extsi %184 : i32 to i64
2026-02-21T08:56:53.2711129Z       %203 = tt.splat %202 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2711375Z       %204 = arith.addi %203, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2711646Z       %205 = tt.expand_dims %204 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2711889Z       %206 = arith.muli %205, %cst_1 : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2712073Z       %207 = tt.broadcast %206 : tensor<32x1xi64, #mma> -> tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2712288Z       %208 = tt.splat %191 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2712503Z       %209 = arith.addi %208, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2712780Z       %210 = tt.expand_dims %209 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2713047Z       %211 = tt.broadcast %210 : tensor<1x256xi64, #mma> -> tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2713236Z       %212 = arith.addi %207, %211 : tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2713437Z       %213 = tt.addptr %17, %212 : tensor<32x256x!tt.ptr<bf16>, #mma>, tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2713638Z       %214 = arith.cmpi sge, %205, %cst_0 : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2713810Z       %215 = arith.cmpi slt, %205, %cst : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2713970Z       %216 = arith.andi %214, %215 : tensor<32x1xi1, #mma>
2026-02-21T08:56:53.2714152Z       %217 = tt.broadcast %216 : tensor<32x1xi1, #mma> -> tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2714346Z       %218 = arith.cmpi sge, %210, %cst_3 : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2714517Z       %219 = arith.cmpi slt, %210, %cst_2 : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2714683Z       %220 = arith.andi %218, %219 : tensor<1x256xi1, #mma>
2026-02-21T08:56:53.2714903Z       %221 = tt.broadcast %220 : tensor<1x256xi1, #mma> -> tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2715091Z       %222 = arith.andi %217, %221 : tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2715257Z       tt.store %213, %201, %222 : tensor<32x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:53.2715414Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T08:56:53.2715544Z     %27 = arith.subi %c16384_i32, %26 : i32
2026-02-21T08:56:53.2715675Z     %28 = arith.ceildivsi %27, %c608_i32 : i32
2026-02-21T08:56:53.2715803Z     %29 = arith.muli %28, %c256_i32 : i32
2026-02-21T08:56:53.2715922Z     %30 = arith.subi %26, %c608_i32 : i32
2026-02-21T08:56:53.2716429Z     %31:9 = scf.for %arg3 = %c0_i32 to %29 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %30, %arg6 = %c0_i32, %arg7 = %cst_10, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_11, %arg11 = %cst_12, %arg12 = %cst_13) -> (i32, i32, i32, tensor<32x256xf32, #mma>, i32, i32, tensor<32x4xi32, #blocked2>, tensor<2x256xi64, #blocked>, tensor<2x256xi1, #blocked>)  : i32 {
2026-02-21T08:56:53.2716934Z       %32 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T08:56:53.2717071Z       %33 = arith.cmpi eq, %arg4, %c255_i32 : i32
2026-02-21T08:56:53.2717211Z       %34 = arith.select %33, %c0_i32, %32 : i32
2026-02-21T08:56:53.2717341Z       %35 = arith.cmpi eq, %34, %c0_i32 : i32
2026-02-21T08:56:53.2717474Z       %36 = arith.select %35, %c0_i32, %arg6 : i32
2026-02-21T08:56:53.2717707Z       %37:6 = scf.if %35 -> (i32, i32, tensor<32x4xi32, #blocked2>, tensor<2x256xi64, #blocked>, tensor<2x256xi1, #blocked>, i32) {
2026-02-21T08:56:53.2717943Z         %81 = arith.addi %arg5, %c608_i32 : i32
2026-02-21T08:56:53.2718075Z         %82 = arith.divsi %81, %c1024_i32 : i32
2026-02-21T08:56:53.2718200Z         %83 = arith.muli %82, %c32_i32 : i32
2026-02-21T08:56:53.2718326Z         %84 = arith.subi %c512_i32, %83 : i32
2026-02-21T08:56:53.2718448Z         %85 = arith.minsi %84, %c32_i32 : i32
2026-02-21T08:56:53.2718576Z         %86 = arith.remsi %81, %c1024_i32 : i32
2026-02-21T08:56:53.2718700Z         %87 = arith.remsi %86, %85 : i32
2026-02-21T08:56:53.2718823Z         %88 = arith.addi %83, %87 : i32
2026-02-21T08:56:53.2718988Z         %89 = arith.divsi %86, %85 : i32
2026-02-21T08:56:53.2719107Z         %90 = arith.muli %88, %c32_i32 : i32
2026-02-21T08:56:53.2719283Z         %91 = tt.splat %90 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:53.2719511Z         %92 = arith.addi %91, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:56:53.2719690Z         %93 = arith.muli %89, %c256_i32 : i32
2026-02-21T08:56:53.2719918Z         %94 = tt.expand_dims %92 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<32x1xi32, #blocked2>
2026-02-21T08:56:53.2720175Z         %95 = arith.muli %94, %cst_9 : tensor<32x1xi32, #blocked2>
2026-02-21T08:56:53.2720373Z         %96 = tt.broadcast %95 : tensor<32x1xi32, #blocked2> -> tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2720552Z         %97 = arith.extsi %93 : i32 to i64
2026-02-21T08:56:53.2720724Z         %98 = tt.splat %97 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2720947Z         %99 = arith.addi %98, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T08:56:53.2721254Z         %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2721543Z         %101 = tt.broadcast %100 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2721751Z         %102 = arith.cmpi sge, %100, %cst_16 : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2721935Z         %103 = arith.cmpi slt, %100, %cst_15 : tensor<1x256xi64, #blocked>
2026-02-21T08:56:53.2722114Z         %104 = arith.andi %102, %103 : tensor<1x256xi1, #blocked>
2026-02-21T08:56:53.2722306Z         %105 = tt.broadcast %104 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2722705Z         scf.yield %90, %93, %96, %101, %105, %81 : i32, i32, tensor<32x4xi32, #blocked2>, tensor<2x256xi64, #blocked>, tensor<2x256xi1, #blocked>, i32
2026-02-21T08:56:53.2722949Z       } else {
2026-02-21T08:56:53.2723190Z         scf.yield %arg8, %arg9, %arg10, %arg11, %arg12, %arg5 : i32, i32, tensor<32x4xi32, #blocked2>, tensor<2x256xi64, #blocked>, tensor<2x256xi1, #blocked>, i32
2026-02-21T08:56:53.2723452Z       }
2026-02-21T08:56:53.2723540Z       %38 = arith.muli %36, %c2_i32 : i32
2026-02-21T08:56:53.2723716Z       %39 = tt.splat %38 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:53.2723935Z       %40 = arith.addi %39, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:56:53.2724212Z       %41 = tt.expand_dims %40 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T08:56:53.2724493Z       %42 = tt.broadcast %41 : tensor<1x4xi32, #blocked2> -> tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2724698Z       %43 = arith.addi %37#2, %42 : tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2724904Z       %44 = tt.addptr %4, %43 : tensor<32x4x!tt.ptr<bf16>, #blocked2>, tensor<32x4xi32, #blocked2>
2026-02-21T08:56:53.2725105Z       %45 = tt.load %44 : tensor<32x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T08:56:53.2725325Z       %46 = ttg.local_alloc %45 : (tensor<32x4xbf16, #blocked2>) -> !ttg.memdesc<32x4xbf16, #shared, #smem>
2026-02-21T08:56:53.2725651Z       %47 = ttg.local_load %46 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2726056Z       %48 = arith.extf %47 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2726336Z       %49 = arith.extsi %36 : i32 to i64
2026-02-21T08:56:53.2726505Z       %50 = tt.splat %49 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2726728Z       %51 = arith.addi %50, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:56:53.2727044Z       %52 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2727288Z       %53 = arith.muli %52, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2727480Z       %54 = tt.broadcast %53 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2727671Z       %55 = arith.addi %54, %37#3 : tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2727870Z       %56 = tt.addptr %5, %55 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T08:56:53.2728075Z       %57 = arith.cmpi sge, %52, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2728247Z       %58 = arith.cmpi slt, %52, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T08:56:53.2728412Z       %59 = arith.andi %57, %58 : tensor<2x1xi1, #blocked>
2026-02-21T08:56:53.2728594Z       %60 = tt.broadcast %59 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2728785Z       %61 = arith.andi %60, %37#4 : tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2728955Z       %62 = tt.load %56, %61, %cst_14 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T08:56:53.2729219Z       %63 = ttg.convert_layout %62 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2729510Z       %64 = arith.shli %63, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2729750Z       %65 = arith.shrsi %64, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2729996Z       %66 = arith.shrsi %63, %cst_18 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:56:53.2730293Z       %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T08:56:53.2730675Z       %68 = tt.expand_dims %66 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T08:56:53.2730969Z       %69 = tt.broadcast %67 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2731215Z       %70 = arith.select %14, %69, %cst_17 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2731461Z       %71 = tt.broadcast %68 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2731700Z       %72 = arith.select %16, %71, %70 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T08:56:53.2731932Z       %73 = tt.reshape %72 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked>
2026-02-21T08:56:53.2732152Z       %74 = arith.sitofp %73 : tensor<4x256xi8, #blocked> to tensor<4x256xf32, #blocked>
2026-02-21T08:56:53.2732394Z       %75 = ttg.local_alloc %74 : (tensor<4x256xf32, #blocked>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T08:56:53.2732726Z       %76 = ttg.local_load %75 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:56:53.2733201Z       %77 = tt.dot %48, %76, %arg7, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x256xf32, #mma>
2026-02-21T08:56:53.2733550Z       %78 = arith.addi %36, %c2_i32 : i32
2026-02-21T08:56:53.2733683Z       %79 = arith.cmpi eq, %34, %c255_i32 : i32
2026-02-21T08:56:53.2733835Z       %80 = arith.select %79, %cst_10, %77 : tensor<32x256xf32, #mma>
2026-02-21T08:56:53.2733980Z       scf.if %79 {
2026-02-21T08:56:53.2734119Z         %81 = arith.truncf %77 : tensor<32x256xf32, #mma> to tensor<32x256xbf16, #mma>
2026-02-21T08:56:53.2734292Z         %82 = arith.extsi %37#0 : i32 to i64
2026-02-21T08:56:53.2734415Z         %83 = arith.extsi %37#1 : i32 to i64
2026-02-21T08:56:53.2734586Z         %84 = tt.splat %82 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2734806Z         %85 = arith.addi %84, %18 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:56:53.2735108Z         %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2735352Z         %87 = arith.muli %86, %cst_1 : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2735535Z         %88 = tt.broadcast %87 : tensor<32x1xi64, #mma> -> tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2735749Z         %89 = tt.splat %83 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2735967Z         %90 = arith.addi %89, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:56:53.2736235Z         %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2736504Z         %92 = tt.broadcast %91 : tensor<1x256xi64, #mma> -> tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2736687Z         %93 = arith.addi %88, %92 : tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2736882Z         %94 = tt.addptr %17, %93 : tensor<32x256x!tt.ptr<bf16>, #mma>, tensor<32x256xi64, #mma>
2026-02-21T08:56:53.2737084Z         %95 = arith.cmpi sge, %86, %cst_0 : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2737249Z         %96 = arith.cmpi slt, %86, %cst : tensor<32x1xi64, #mma>
2026-02-21T08:56:53.2737409Z         %97 = arith.andi %95, %96 : tensor<32x1xi1, #mma>
2026-02-21T08:56:53.2737582Z         %98 = tt.broadcast %97 : tensor<32x1xi1, #mma> -> tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2737771Z         %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2737940Z         %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x256xi64, #mma>
2026-02-21T08:56:53.2738108Z         %101 = arith.andi %99, %100 : tensor<1x256xi1, #mma>
2026-02-21T08:56:53.2738325Z         %102 = tt.broadcast %101 : tensor<1x256xi1, #mma> -> tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2738510Z         %103 = arith.andi %98, %102 : tensor<32x256xi1, #mma>
2026-02-21T08:56:53.2738681Z         tt.store %94, %81, %103 : tensor<32x256x!tt.ptr<bf16>, #mma>
2026-02-21T08:56:53.2738820Z       }
2026-02-21T08:56:53.2739091Z       scf.yield %34, %37#5, %78, %80, %37#0, %37#1, %37#2, %37#3, %37#4 : i32, i32, i32, tensor<32x256xf32, #mma>, i32, i32, tensor<32x4xi32, #blocked2>, tensor<2x256xi64, #blocked>, tensor<2x256xi1, #blocked>
2026-02-21T08:56:53.2739411Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 2 : i32}
2026-02-21T08:56:53.2739553Z     tt.return
2026-02-21T08:56:53.2739642Z   }
2026-02-21T08:56:53.2739728Z }
2026-02-21T08:56:53.2739775Z 
2026-02-21T08:56:53.2739814Z {-#
2026-02-21T08:56:53.2739901Z   external_resources: {
2026-02-21T08:56:53.2740011Z     mlir_reproducer: {
2026-02-21T08:56:53.2741014Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:56:53.2742018Z       disable_threading: false,
2026-02-21T08:56:53.2742130Z       verify_each: true
2026-02-21T08:56:53.2742229Z     }
2026-02-21T08:56:53.2742309Z   }
2026-02-21T08:56:53.2742386Z #-}
2026-02-21T08:56:53.2742668Z /tmp/torchinductor_root/gz/cgzzn756yxc7hskf4mdl2sgzbtg7provhizvgiphj4oto5qophj2.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:56:53.2743387Z /tmp/torchinductor_root/gz/cgzzn756yxc7hskf4mdl2sgzbtg7provhizvgiphj4oto5qophj2.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:56:53.2743981Z [92s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:56:53.2744765Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 32, 256], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T08:56:53.2745486Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:56:53.2745660Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:57:10.0402125Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T08:57:10.0406848Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T08:57:10.0407738Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:57:10.0408624Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:57:10.0409445Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T08:57:10.0410220Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T08:57:10.0411382Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T08:57:10.0411852Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T08:57:10.0412220Z #smem = #ttg.shared_memory
2026-02-21T08:57:10.0412687Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T08:57:10.0413643Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T08:57:10.0414478Z     %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma>
2026-02-21T08:57:10.0414835Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:57:10.0415180Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T08:57:10.0415546Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0415866Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T08:57:10.0416107Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T08:57:10.0416341Z     %c31_i32 = arith.constant 31 : i32
2026-02-21T08:57:10.0416703Z     %cst_3 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0417206Z     %cst_4 = arith.constant dense<16> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0417694Z     %cst_5 = arith.constant dense<24> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0418118Z     %cst_6 = arith.constant dense<8192> : tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0418472Z     %cst_7 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1>
2026-02-21T08:57:10.0418772Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T08:57:10.0419014Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T08:57:10.0419242Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T08:57:10.0419474Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T08:57:10.0419707Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T08:57:10.0419946Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T08:57:10.0420355Z     %c12_i32 = arith.constant 12 : i32
2026-02-21T08:57:10.0420641Z     %cst_8 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0420868Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T08:57:10.0421037Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T08:57:10.0421198Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T08:57:10.0421364Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T08:57:10.0421580Z     %cst_9 = arith.constant dense<4> : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0421884Z     %cst_10 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0422157Z     %0 = tt.get_program_id x : i32
2026-02-21T08:57:10.0422431Z     %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:57:10.0422838Z     %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:57:10.0423247Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:57:10.0423627Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:57:10.0423999Z     %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0424372Z     %6 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0424726Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:57:10.0425015Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T08:57:10.0425454Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T08:57:10.0426053Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T08:57:10.0426625Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T08:57:10.0426985Z     %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:57:10.0427266Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T08:57:10.0427544Z     %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T08:57:10.0427813Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T08:57:10.0428111Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T08:57:10.0428347Z     %17 = arith.subi %c16384_i32, %0 : i32
2026-02-21T08:57:10.0428529Z     %18 = arith.ceildivsi %17, %c2432_i32 : i32
2026-02-21T08:57:10.0428709Z     %19 = arith.muli %18, %c32_i32 : i32
2026-02-21T08:57:10.0428946Z     %20 = ttg.local_alloc : () -> !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable>
2026-02-21T08:57:10.0429238Z     %21 = ttg.local_alloc : () -> !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable>
2026-02-21T08:57:10.0429529Z     %22 = ttg.local_alloc : () -> !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable>
2026-02-21T08:57:10.0429817Z     %23 = ttg.local_alloc : () -> !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable>
2026-02-21T08:57:10.0430057Z     %24 = arith.cmpi sgt, %19, %c0_i32 : i32
2026-02-21T08:57:10.0430231Z     %25 = arith.divsi %0, %c2048_i32 : i32
2026-02-21T08:57:10.0430396Z     %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T08:57:10.0430567Z     %27 = arith.subi %c64_i32, %26 : i32
2026-02-21T08:57:10.0430711Z     %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T08:57:10.0430846Z     %29 = arith.remsi %0, %c2048_i32 : i32
2026-02-21T08:57:10.0430972Z     %30 = arith.remsi %29, %28 : i32
2026-02-21T08:57:10.0431148Z     %31 = arith.addi %26, %30 : i32
2026-02-21T08:57:10.0431275Z     %32 = arith.divsi %29, %28 : i32
2026-02-21T08:57:10.0431401Z     %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T08:57:10.0431583Z     %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:57:10.0431825Z     %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:57:10.0432063Z     %36 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:57:10.0432296Z     %37 = arith.addi %35, %2 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:57:10.0432491Z     %38 = arith.muli %32, %c64_i32 : i32
2026-02-21T08:57:10.0432679Z     %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:57:10.0432918Z     %40 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:57:10.0433227Z     %41 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T08:57:10.0433506Z     %42 = arith.muli %41, %cst_7 : tensor<64x1xi32, #blocked1>
2026-02-21T08:57:10.0433719Z     %43 = tt.broadcast %42 : tensor<64x1xi32, #blocked1> -> tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0434037Z     %44 = tt.expand_dims %37 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi32, #blocked2>
2026-02-21T08:57:10.0434347Z     %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked2> -> tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0434654Z     %46 = tt.expand_dims %6 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T08:57:10.0434956Z     %47 = tt.broadcast %46 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0435211Z     %48 = arith.addi %43, %47 : tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0435433Z     %49 = tt.addptr %7, %48 : tensor<64x8x!tt.ptr<bf16>, #blocked1>, tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0435652Z     %50 = tt.splat %24 : i1 -> tensor<64x8xi1, #blocked1>
2026-02-21T08:57:10.0435941Z     %51 = tt.load %49, %50 {amd.pipeliner_part = "prologue"} : tensor<64x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:57:10.0436210Z     %52 = arith.addi %6, %cst_3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0436512Z     %53 = tt.expand_dims %52 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T08:57:10.0436816Z     %54 = tt.broadcast %53 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0437029Z     %55 = arith.addi %43, %54 : tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0437250Z     %56 = tt.addptr %7, %55 : tensor<64x8x!tt.ptr<bf16>, #blocked1>, tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0437521Z     %57 = tt.load %56, %50 {amd.pipeliner_part = "prologue"} : tensor<64x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:57:10.0437783Z     %58 = arith.addi %6, %cst_4 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0438089Z     %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T08:57:10.0438386Z     %60 = tt.broadcast %59 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0438587Z     %61 = arith.addi %43, %60 : tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0438799Z     %62 = tt.addptr %7, %61 : tensor<64x8x!tt.ptr<bf16>, #blocked1>, tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0439063Z     %63 = tt.load %62, %50 {amd.pipeliner_part = "prologue"} : tensor<64x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:57:10.0439322Z     %64 = arith.addi %6, %cst_5 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0439629Z     %65 = tt.expand_dims %64 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T08:57:10.0439966Z     %66 = tt.broadcast %65 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0440175Z     %67 = arith.addi %43, %66 : tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0440387Z     %68 = tt.addptr %7, %67 : tensor<64x8x!tt.ptr<bf16>, #blocked1>, tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0440652Z     %69 = tt.load %68, %50 {amd.pipeliner_part = "prologue"} : tensor<64x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:57:10.0440987Z     %70 = ttg.memdesc_index %20[%c0_i32] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0441341Z     ttg.local_store %51, %70 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0441696Z     %71 = ttg.memdesc_index %21[%c0_i32] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0442049Z     ttg.local_store %57, %71 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0442396Z     %72 = ttg.memdesc_index %22[%c0_i32] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0442813Z     ttg.local_store %63, %72 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0443157Z     %73 = ttg.memdesc_index %23[%c0_i32] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0443508Z     ttg.local_store %69, %73 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0443728Z     %74 = arith.subi %19, %c1_i32 : i32
2026-02-21T08:57:10.0444798Z     %75:17 = scf.for %arg3 = %c0_i32 to %74 step %c1_i32 iter_args(%arg4 = %c0_i32, %arg5 = %0, %arg6 = %cst_2, %arg7 = %37, %arg8 = %38, %arg9 = %43, %arg10 = %45, %arg11 = %c0_i32, %arg12 = %c0_i32, %arg13 = %70, %arg14 = %c4_i32, %arg15 = %71, %arg16 = %c8_i32, %arg17 = %72, %arg18 = %c12_i32, %arg19 = %73, %arg20 = %36) -> (i32, i32, tensor<64x128xf32, #mma>, tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>, i32, tensor<64x8xi32, #blocked1>, tensor<4x128xi32, #blocked2>, i32, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>)  : i32 {
2026-02-21T08:57:10.0445833Z       %184 = arith.addi %arg12, %c16_i32 : i32
2026-02-21T08:57:10.0445960Z       %185 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T08:57:10.0446099Z       %186 = arith.cmpi eq, %arg4, %c31_i32 : i32
2026-02-21T08:57:10.0446232Z       %187 = arith.select %186, %c0_i32, %185 : i32
2026-02-21T08:57:10.0446372Z       %188 = arith.cmpi eq, %187, %c0_i32 : i32
2026-02-21T08:57:10.0446501Z       %189 = arith.select %188, %c0_i32, %184 : i32
2026-02-21T08:57:10.0446851Z       %190:6 = scf.if %188 -> (tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>, i32, tensor<64x8xi32, #blocked1>, tensor<4x128xi32, #blocked2>, i32, tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>) {
2026-02-21T08:57:10.0447191Z         %334 = arith.addi %arg5, %c2432_i32 : i32
2026-02-21T08:57:10.0447319Z         %335 = arith.divsi %334, %c2048_i32 : i32
2026-02-21T08:57:10.0447449Z         %336 = arith.muli %335, %c8_i32 : i32
2026-02-21T08:57:10.0447570Z         %337 = arith.subi %c64_i32, %336 : i32
2026-02-21T08:57:10.0447695Z         %338 = arith.minsi %337, %c8_i32 : i32
2026-02-21T08:57:10.0447816Z         %339 = arith.remsi %334, %c2048_i32 : i32
2026-02-21T08:57:10.0447946Z         %340 = arith.remsi %339, %338 : i32
2026-02-21T08:57:10.0448069Z         %341 = arith.addi %336, %340 : i32
2026-02-21T08:57:10.0448185Z         %342 = arith.divsi %339, %338 : i32
2026-02-21T08:57:10.0448363Z         %343 = arith.muli %341, %c128_i32 : i32
2026-02-21T08:57:10.0448534Z         %344 = tt.splat %343 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:57:10.0448789Z         %345 = tt.splat %343 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:57:10.0449016Z         %346 = arith.addi %344, %1 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:57:10.0449236Z         %347 = arith.addi %345, %2 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T08:57:10.0449423Z         %348 = arith.muli %342, %c64_i32 : i32
2026-02-21T08:57:10.0449596Z         %349 = tt.splat %348 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:57:10.0449828Z         %350 = arith.addi %349, %3 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T08:57:10.0450120Z         %351 = tt.expand_dims %350 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T08:57:10.0450382Z         %352 = arith.muli %351, %cst_7 : tensor<64x1xi32, #blocked1>
2026-02-21T08:57:10.0450583Z         %353 = tt.broadcast %352 : tensor<64x1xi32, #blocked1> -> tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0450875Z         %354 = tt.expand_dims %347 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi32, #blocked2>
2026-02-21T08:57:10.0451166Z         %355 = tt.broadcast %354 : tensor<1x128xi32, #blocked2> -> tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0451576Z         scf.yield %347, %348, %353, %355, %334, %346 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>, i32, tensor<64x8xi32, #blocked1>, tensor<4x128xi32, #blocked2>, i32, tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:57:10.0451955Z       } else {
2026-02-21T08:57:10.0452295Z         scf.yield %arg7, %arg8, %arg9, %arg10, %arg5, %arg20 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>, i32, tensor<64x8xi32, #blocked1>, tensor<4x128xi32, #blocked2>, i32, tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:57:10.0452659Z       }
2026-02-21T08:57:10.0452808Z       %191 = tt.splat %arg12 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0453042Z       %192 = arith.addi %191, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0453221Z       %193 = arith.muli %189, %c2_i32 : i32
2026-02-21T08:57:10.0453392Z       %194 = tt.splat %193 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0453614Z       %195 = arith.addi %194, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0453893Z       %196 = tt.expand_dims %195 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T08:57:10.0454176Z       %197 = tt.broadcast %196 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0454376Z       %198 = arith.addi %190#2, %197 : tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0454586Z       %199 = tt.addptr %7, %198 : tensor<64x8x!tt.ptr<bf16>, #blocked1>, tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0454793Z       %200 = tt.load %199 : tensor<64x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:57:10.0455104Z       %201 = ttg.local_load %arg13 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0455556Z       %202 = arith.extf %201 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0455946Z       %203 = tt.expand_dims %192 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0456212Z       %204 = arith.muli %203, %cst_6 : tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0456451Z       %205 = tt.broadcast %204 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0456651Z       %206 = arith.addi %205, %arg10 : tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0456859Z       %207 = tt.addptr %8, %206 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0457063Z       %208 = tt.load %207 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T08:57:10.0457314Z       %209 = ttg.convert_layout %208 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0457602Z       %210 = arith.shli %209, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0457841Z       %211 = arith.shrsi %210, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0458092Z       %212 = arith.shrsi %209, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0458391Z       %213 = tt.expand_dims %211 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0458738Z       %214 = tt.expand_dims %212 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0459028Z       %215 = tt.broadcast %213 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0459271Z       %216 = arith.select %13, %215, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0459514Z       %217 = tt.broadcast %214 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0459744Z       %218 = arith.select %15, %217, %216 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0460011Z       %219 = tt.reshape %218 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T08:57:10.0460257Z       %220 = arith.sitofp %219 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T08:57:10.0460514Z       %221 = ttg.local_alloc %220 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T08:57:10.0460842Z       %222 = ttg.local_load %221 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0461316Z       %223 = tt.dot %202, %222, %arg6, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0461669Z       %224 = arith.addi %189, %c4_i32 : i32
2026-02-21T08:57:10.0461847Z       %225 = tt.splat %arg14 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0462074Z       %226 = arith.addi %225, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0462247Z       %227 = arith.muli %224, %c2_i32 : i32
2026-02-21T08:57:10.0462413Z       %228 = tt.splat %227 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0462634Z       %229 = arith.addi %228, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0462910Z       %230 = tt.expand_dims %229 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T08:57:10.0463181Z       %231 = tt.broadcast %230 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0463377Z       %232 = arith.addi %190#2, %231 : tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0463579Z       %233 = tt.addptr %7, %232 : tensor<64x8x!tt.ptr<bf16>, #blocked1>, tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0463786Z       %234 = tt.load %233 : tensor<64x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:57:10.0464086Z       %235 = ttg.local_load %arg15 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0464555Z       %236 = arith.extf %235 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0464938Z       %237 = tt.expand_dims %226 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0465184Z       %238 = arith.muli %237, %cst_6 : tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0465379Z       %239 = tt.broadcast %238 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0465580Z       %240 = arith.addi %239, %arg10 : tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0465782Z       %241 = tt.addptr %8, %240 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0465991Z       %242 = tt.load %241 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T08:57:10.0466240Z       %243 = ttg.convert_layout %242 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0466528Z       %244 = arith.shli %243, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0466765Z       %245 = arith.shrsi %244, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0467001Z       %246 = arith.shrsi %243, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0467293Z       %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0467635Z       %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0467920Z       %249 = tt.broadcast %247 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0468198Z       %250 = arith.select %13, %249, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0468438Z       %251 = tt.broadcast %248 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0468677Z       %252 = arith.select %15, %251, %250 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0468911Z       %253 = tt.reshape %252 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T08:57:10.0469131Z       %254 = arith.sitofp %253 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T08:57:10.0469388Z       %255 = ttg.local_alloc %254 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T08:57:10.0469715Z       %256 = ttg.local_load %255 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0470189Z       %257 = tt.dot %236, %256, %223, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0470531Z       %258 = arith.addi %189, %c8_i32 : i32
2026-02-21T08:57:10.0470702Z       %259 = tt.splat %arg16 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0470929Z       %260 = arith.addi %259, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0471098Z       %261 = arith.muli %258, %c2_i32 : i32
2026-02-21T08:57:10.0471265Z       %262 = tt.splat %261 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0471482Z       %263 = arith.addi %262, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0471758Z       %264 = tt.expand_dims %263 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T08:57:10.0472042Z       %265 = tt.broadcast %264 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0472235Z       %266 = arith.addi %190#2, %265 : tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0472469Z       %267 = tt.addptr %7, %266 : tensor<64x8x!tt.ptr<bf16>, #blocked1>, tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0472671Z       %268 = tt.load %267 : tensor<64x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:57:10.0472964Z       %269 = ttg.local_load %arg17 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0473393Z       %270 = arith.extf %269 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0473765Z       %271 = tt.expand_dims %260 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0474015Z       %272 = arith.muli %271, %cst_6 : tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0474206Z       %273 = tt.broadcast %272 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0474402Z       %274 = arith.addi %273, %arg10 : tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0474605Z       %275 = tt.addptr %8, %274 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0474806Z       %276 = tt.load %275 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T08:57:10.0475054Z       %277 = ttg.convert_layout %276 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0475337Z       %278 = arith.shli %277, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0475570Z       %279 = arith.shrsi %278, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0475805Z       %280 = arith.shrsi %277, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0476137Z       %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0476476Z       %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0476760Z       %283 = tt.broadcast %281 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0476996Z       %284 = arith.select %13, %283, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0477234Z       %285 = tt.broadcast %282 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0477464Z       %286 = arith.select %15, %285, %284 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0477698Z       %287 = tt.reshape %286 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T08:57:10.0477927Z       %288 = arith.sitofp %287 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T08:57:10.0478182Z       %289 = ttg.local_alloc %288 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T08:57:10.0478512Z       %290 = ttg.local_load %289 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0478977Z       %291 = tt.dot %270, %290, %257, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0479322Z       %292 = arith.addi %189, %c12_i32 : i32
2026-02-21T08:57:10.0479500Z       %293 = tt.splat %arg18 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0479725Z       %294 = arith.addi %293, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0479903Z       %295 = arith.muli %292, %c2_i32 : i32
2026-02-21T08:57:10.0480069Z       %296 = tt.splat %295 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0489832Z       %297 = arith.addi %296, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T08:57:10.0490106Z       %298 = tt.expand_dims %297 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T08:57:10.0490377Z       %299 = tt.broadcast %298 : tensor<1x8xi32, #blocked1> -> tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0490573Z       %300 = arith.addi %190#2, %299 : tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0490824Z       %301 = tt.addptr %7, %300 : tensor<64x8x!tt.ptr<bf16>, #blocked1>, tensor<64x8xi32, #blocked1>
2026-02-21T08:57:10.0491028Z       %302 = tt.load %301 : tensor<64x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T08:57:10.0491332Z       %303 = ttg.local_load %arg19 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0491760Z       %304 = arith.extf %303 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0492140Z       %305 = tt.expand_dims %294 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0492388Z       %306 = arith.muli %305, %cst_6 : tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0492582Z       %307 = tt.broadcast %306 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0492775Z       %308 = arith.addi %307, %arg10 : tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0492977Z       %309 = tt.addptr %8, %308 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0493177Z       %310 = tt.load %309 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T08:57:10.0493463Z       %311 = ttg.convert_layout %310 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0493745Z       %312 = arith.shli %311, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0493981Z       %313 = arith.shrsi %312, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0494216Z       %314 = arith.shrsi %311, %cst_10 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0494507Z       %315 = tt.expand_dims %313 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0494845Z       %316 = tt.expand_dims %314 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0495131Z       %317 = tt.broadcast %315 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0495376Z       %318 = arith.select %13, %317, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0495615Z       %319 = tt.broadcast %316 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0495847Z       %320 = arith.select %15, %319, %318 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0496082Z       %321 = tt.reshape %320 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T08:57:10.0496306Z       %322 = arith.sitofp %321 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T08:57:10.0496558Z       %323 = ttg.local_alloc %322 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T08:57:10.0496884Z       %324 = ttg.local_load %323 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0497359Z       %325 = tt.dot %304, %324, %291, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0497763Z       %326 = arith.select %186, %cst_2, %325 : tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0497907Z       scf.if %186 {
2026-02-21T08:57:10.0498050Z         %334 = tt.splat %arg8 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:57:10.0498270Z         %335 = arith.addi %334, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:57:10.0498484Z         %336 = arith.truncf %325 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T08:57:10.0498751Z         %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T08:57:10.0498991Z         %338 = arith.muli %337, %cst : tensor<64x1xi32, #mma>
2026-02-21T08:57:10.0499234Z         %339 = tt.expand_dims %arg20 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T08:57:10.0499501Z         %340 = tt.broadcast %338 : tensor<64x1xi32, #mma> -> tensor<64x128xi32, #mma>
2026-02-21T08:57:10.0499709Z         %341 = tt.broadcast %339 : tensor<1x128xi32, #mma> -> tensor<64x128xi32, #mma>
2026-02-21T08:57:10.0499897Z         %342 = arith.addi %340, %341 : tensor<64x128xi32, #mma>
2026-02-21T08:57:10.0500093Z         %343 = tt.addptr %16, %342 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi32, #mma>
2026-02-21T08:57:10.0500292Z         tt.store %343, %336 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T08:57:10.0500425Z       }
2026-02-21T08:57:10.0500512Z       %327 = arith.addi %arg11, %c1_i32 : i32
2026-02-21T08:57:10.0500643Z       %328 = arith.cmpi slt, %327, %c1_i32 : i32
2026-02-21T08:57:10.0500772Z       %329 = arith.select %328, %327, %c0_i32 : i32
2026-02-21T08:57:10.0501040Z       %330 = ttg.memdesc_index %20[%329] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0501435Z       ttg.local_store %200, %330 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0501792Z       %331 = ttg.memdesc_index %21[%329] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0502136Z       ttg.local_store %234, %331 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0502478Z       %332 = ttg.memdesc_index %22[%329] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0502818Z       ttg.local_store %268, %332 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0503159Z       %333 = ttg.memdesc_index %23[%329] : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0503501Z       ttg.local_store %302, %333 : tensor<64x8xbf16, #blocked1> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>
2026-02-21T08:57:10.0504380Z       scf.yield %187, %190#4, %326, %190#0, %190#1, %190#2, %190#3, %329, %189, %330, %224, %331, %258, %332, %292, %333, %190#5 : i32, i32, tensor<64x128xf32, #mma>, tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>, i32, tensor<64x8xi32, #blocked1>, tensor<4x128xi32, #blocked2>, i32, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8>, tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:57:10.0505137Z     }
2026-02-21T08:57:10.0505229Z     %76 = arith.cmpi sge, %19, %c1_i32 : i32
2026-02-21T08:57:10.0505401Z     %77 = tt.splat %75#8 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0505627Z     %78 = arith.addi %77, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0505941Z     %79 = ttg.local_load %75#9 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0506392Z     %80 = arith.extf %79 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0506761Z     %81 = tt.expand_dims %78 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0507001Z     %82 = arith.muli %81, %cst_6 : tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0507188Z     %83 = tt.broadcast %82 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0507379Z     %84 = arith.addi %83, %75#6 : tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0507574Z     %85 = tt.addptr %8, %84 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0507769Z     %86 = tt.splat %76 : i1 -> tensor<4x128xi1, #blocked2>
2026-02-21T08:57:10.0507924Z     %87 = tt.load %85, %86 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T08:57:10.0508085Z     %88 = arith.shli %87, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0508243Z     %89 = arith.shrsi %88, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0508485Z     %90 = ttg.convert_layout %89 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0508727Z     %91 = arith.shrsi %87, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0508965Z     %92 = ttg.convert_layout %91 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0509296Z     %93 = tt.expand_dims %90 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0509669Z     %94 = tt.expand_dims %92 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0509947Z     %95 = tt.broadcast %93 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0510303Z     %96 = arith.select %13, %95, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0510560Z     %97 = tt.broadcast %94 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0510827Z     %98 = arith.select %15, %97, %96 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0511240Z     %99 = tt.reshape %98 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T08:57:10.0519256Z     %100 = arith.sitofp %99 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T08:57:10.0519537Z     %101 = ttg.local_alloc %100 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T08:57:10.0519879Z     %102 = ttg.local_load %101 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0520154Z     %103 = scf.if %76 -> (tensor<64x128xf32, #mma>) {
2026-02-21T08:57:10.0520511Z       %184 = tt.dot %80, %102, %75#2, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0520870Z       scf.yield %184 : tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0520995Z     } else {
2026-02-21T08:57:10.0521097Z       scf.yield %75#2 : tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0521222Z     }
2026-02-21T08:57:10.0521364Z     %104 = tt.splat %75#10 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0521596Z     %105 = arith.addi %104, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0521925Z     %106 = ttg.local_load %75#11 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0522414Z     %107 = arith.extf %106 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0522843Z     %108 = tt.expand_dims %105 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0523095Z     %109 = arith.muli %108, %cst_6 : tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0523299Z     %110 = tt.broadcast %109 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0523497Z     %111 = arith.addi %110, %75#6 : tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0523711Z     %112 = tt.addptr %8, %111 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0523929Z     %113 = tt.load %112, %86 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T08:57:10.0524103Z     %114 = arith.shli %113, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0524277Z     %115 = arith.shrsi %114, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0524536Z     %116 = ttg.convert_layout %115 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0524792Z     %117 = arith.shrsi %113, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0525042Z     %118 = ttg.convert_layout %117 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0525378Z     %119 = tt.expand_dims %116 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0525721Z     %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0526050Z     %121 = tt.broadcast %119 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0526299Z     %122 = arith.select %13, %121, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0526549Z     %123 = tt.broadcast %120 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0526785Z     %124 = arith.select %15, %123, %122 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0527020Z     %125 = tt.reshape %124 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T08:57:10.0527244Z     %126 = arith.sitofp %125 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T08:57:10.0527501Z     %127 = ttg.local_alloc %126 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T08:57:10.0527832Z     %128 = ttg.local_load %127 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0528099Z     %129 = scf.if %76 -> (tensor<64x128xf32, #mma>) {
2026-02-21T08:57:10.0528456Z       %184 = tt.dot %107, %128, %103, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0528804Z       scf.yield %184 : tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0528929Z     } else {
2026-02-21T08:57:10.0529029Z       scf.yield %103 : tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0529143Z     }
2026-02-21T08:57:10.0529288Z     %130 = tt.splat %75#12 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0529516Z     %131 = arith.addi %130, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0529848Z     %132 = ttg.local_load %75#13 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0530278Z     %133 = arith.extf %132 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0530715Z     %134 = tt.expand_dims %131 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0530970Z     %135 = arith.muli %134, %cst_6 : tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0531164Z     %136 = tt.broadcast %135 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0531365Z     %137 = arith.addi %136, %75#6 : tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0531572Z     %138 = tt.addptr %8, %137 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0531782Z     %139 = tt.load %138, %86 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T08:57:10.0531954Z     %140 = arith.shli %139, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0532121Z     %141 = arith.shrsi %140, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0532380Z     %142 = ttg.convert_layout %141 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0532633Z     %143 = arith.shrsi %139, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0532881Z     %144 = ttg.convert_layout %143 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0533220Z     %145 = tt.expand_dims %142 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0533558Z     %146 = tt.expand_dims %144 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0533851Z     %147 = tt.broadcast %145 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0534095Z     %148 = arith.select %13, %147, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0534372Z     %149 = tt.broadcast %146 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0534611Z     %150 = arith.select %15, %149, %148 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0534845Z     %151 = tt.reshape %150 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T08:57:10.0535075Z     %152 = arith.sitofp %151 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T08:57:10.0535330Z     %153 = ttg.local_alloc %152 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T08:57:10.0535662Z     %154 = ttg.local_load %153 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0535927Z     %155 = scf.if %76 -> (tensor<64x128xf32, #mma>) {
2026-02-21T08:57:10.0536285Z       %184 = tt.dot %133, %154, %129, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0536643Z       scf.yield %184 : tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0536765Z     } else {
2026-02-21T08:57:10.0536861Z       scf.yield %129 : tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0537023Z     }
2026-02-21T08:57:10.0537176Z     %156 = tt.splat %75#14 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0537407Z     %157 = arith.addi %156, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T08:57:10.0537736Z     %158 = ttg.local_load %75#15 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 1x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0538167Z     %159 = arith.extf %158 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0538556Z     %160 = tt.expand_dims %157 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0538845Z     %161 = arith.muli %160, %cst_6 : tensor<4x1xi32, #blocked2>
2026-02-21T08:57:10.0539044Z     %162 = tt.broadcast %161 : tensor<4x1xi32, #blocked2> -> tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0539239Z     %163 = arith.addi %162, %75#6 : tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0539445Z     %164 = tt.addptr %8, %163 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi32, #blocked2>
2026-02-21T08:57:10.0539657Z     %165 = tt.load %164, %86 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T08:57:10.0539823Z     %166 = arith.shli %165, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0539994Z     %167 = arith.shrsi %166, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0540244Z     %168 = ttg.convert_layout %167 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0540505Z     %169 = arith.shrsi %165, %cst_9 : tensor<4x128xi8, #blocked2>
2026-02-21T08:57:10.0540757Z     %170 = ttg.convert_layout %169 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T08:57:10.0541091Z     %171 = tt.expand_dims %168 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0541429Z     %172 = tt.expand_dims %170 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T08:57:10.0541711Z     %173 = tt.broadcast %171 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0541956Z     %174 = arith.select %13, %173, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0542202Z     %175 = tt.broadcast %172 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0542477Z     %176 = arith.select %15, %175, %174 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T08:57:10.0542716Z     %177 = tt.reshape %176 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T08:57:10.0542942Z     %178 = arith.sitofp %177 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T08:57:10.0543200Z     %179 = ttg.local_alloc %178 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T08:57:10.0543532Z     %180 = ttg.local_load %179 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T08:57:10.0543792Z     %181 = scf.if %76 -> (tensor<64x128xf32, #mma>) {
2026-02-21T08:57:10.0544153Z       %184 = tt.dot %159, %180, %155, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0544499Z       scf.yield %184 : tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0544626Z     } else {
2026-02-21T08:57:10.0544727Z       scf.yield %155 : tensor<64x128xf32, #mma>
2026-02-21T08:57:10.0544843Z     }
2026-02-21T08:57:10.0544939Z     %182 = arith.cmpi eq, %75#0, %c31_i32 : i32
2026-02-21T08:57:10.0545060Z     %183 = arith.andi %76, %182 : i1
2026-02-21T08:57:10.0545174Z     scf.if %183 {
2026-02-21T08:57:10.0545318Z       %184 = tt.splat %75#4 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:57:10.0545538Z       %185 = arith.addi %184, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T08:57:10.0545755Z       %186 = arith.truncf %181 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T08:57:10.0546027Z       %187 = tt.expand_dims %185 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T08:57:10.0546268Z       %188 = arith.muli %187, %cst : tensor<64x1xi32, #mma>
2026-02-21T08:57:10.0546550Z       %189 = ttg.convert_layout %75#3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T08:57:10.0546908Z       %190 = tt.expand_dims %189 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T08:57:10.0547218Z       %191 = tt.broadcast %188 : tensor<64x1xi32, #mma> -> tensor<64x128xi32, #mma>
2026-02-21T08:57:10.0547427Z       %192 = tt.broadcast %190 : tensor<1x128xi32, #mma> -> tensor<64x128xi32, #mma>
2026-02-21T08:57:10.0547618Z       %193 = arith.addi %191, %192 : tensor<64x128xi32, #mma>
2026-02-21T08:57:10.0547813Z       %194 = tt.addptr %16, %193 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi32, #mma>
2026-02-21T08:57:10.0548019Z       tt.store %194, %186 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T08:57:10.0548154Z     }
2026-02-21T08:57:10.0548289Z     ttg.local_dealloc %23 : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable>
2026-02-21T08:57:10.0548506Z     ttg.local_dealloc %22 : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable>
2026-02-21T08:57:10.0548710Z     ttg.local_dealloc %21 : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable>
2026-02-21T08:57:10.0548917Z     ttg.local_dealloc %20 : !ttg.memdesc<1x64x8xbf16, #shared, #smem, mutable>
2026-02-21T08:57:10.0549074Z     tt.return
2026-02-21T08:57:10.0549161Z   }
2026-02-21T08:57:10.0549245Z }
2026-02-21T08:57:10.0549297Z 
2026-02-21T08:57:10.0549330Z {-#
2026-02-21T08:57:10.0549414Z   external_resources: {
2026-02-21T08:57:10.0549521Z     mlir_reproducer: {
2026-02-21T08:57:10.0550574Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T08:57:10.0551569Z       disable_threading: false,
2026-02-21T08:57:10.0551677Z       verify_each: true
2026-02-21T08:57:10.0551774Z     }
2026-02-21T08:57:10.0551847Z   }
2026-02-21T08:57:10.0551922Z #-}
2026-02-21T08:57:10.0552199Z /tmp/torchinductor_root/rf/crfyp33noxhmzkqm66tofoowim2no45tvuuxuk6t2jwtqco6f6eg.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T08:57:10.0555660Z /tmp/torchinductor_root/rf/crfyp33noxhmzkqm66tofoowim2no45tvuuxuk6t2jwtqco6f6eg.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T08:57:10.0556252Z [109s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T08:57:10.0557050Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 64, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T08:57:10.0557770Z Error: RuntimeError: PassManager::run failed
2026-02-21T08:57:10.0557951Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T08:57:10.0558174Z WARNING:tritonbench.utils.triton_op:Completed input ID 21:
2026-02-21T08:57:10.0558323Z x_val
2026-02-21T08:57:10.0558410Z ---------------------
2026-02-21T08:57:10.0558511Z (4, 4096, 8192, 1024)
2026-02-21T08:57:10.0558570Z 
2026-02-21T08:57:10.0558984Z  70%|███████   | 7/10 [47:49<14:02, 280.74s/it]WARNING:tritonbench.utils.triton_op:Running input ID 24:
2026-02-21T08:57:10.0559178Z x_val
2026-02-21T08:57:10.0559264Z ----------------------
2026-02-21T08:57:10.0559359Z (16, 4096, 1280, 8192)
2026-02-21T08:57:10.0559625Z INFO:tritonbench.utils.triton_op:Took 0.22ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T08:57:10.0559943Z Initial population exploring neighbors  36% ━━━━━           36/100 0.5 configs/s
2026-02-21T08:57:10.9811783Z INFO:tritonbench.utils.triton_op:Took 2.46ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T08:57:14.2895580Z Autotune Choices Stats:
2026-02-21T08:57:14.2897478Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "mm", "best_time": 1.6694060564041138, "best_triton_pos": 1, "best_triton_time": 2.027919054031372, "best_triton_kernel": "triton_mm_185", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4"}
2026-02-21T08:57:14.2902260Z AUTOTUNE mm(65536x8192, 8192x1280)
2026-02-21T08:57:14.2902557Z strides: [8192, 1], [1280, 1]
2026-02-21T08:57:14.2902859Z dtypes: torch.bfloat16, torch.bfloat16
2026-02-21T08:57:14.2903136Z   mm 1.6694 ms 100.0% 
2026-02-21T08:57:14.2903970Z   triton_mm_185 2.0279 ms 82.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:57:14.2905383Z   triton_mm_191 2.1667 ms 77.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:57:14.2906755Z   triton_mm_188 2.2115 ms 75.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:57:14.2908649Z   triton_mm_186 2.2370 ms 74.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:57:14.2909762Z   triton_mm_182 2.3020 ms 72.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:57:14.2910755Z   triton_mm_190 2.3526 ms 71.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:57:14.2911757Z   triton_mm_179 2.3692 ms 70.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:57:14.2912769Z   triton_mm_183 2.3976 ms 69.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:57:14.2913776Z   triton_mm_184 2.5001 ms 66.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:57:14.2914543Z SingleProcess AUTOTUNE benchmarking takes 2.8034 seconds and 0.3202 seconds precompiling for 37 choices
2026-02-21T08:57:16.0504219Z INFO:tritonbench.utils.triton_op:Took 0.14ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T08:57:16.0513572Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:57:16.0513887Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:57:16.0514158Z               'dtype': 'torch.bfloat16',
2026-02-21T08:57:16.0514440Z               'shape': (16, 4096, 8192),
2026-02-21T08:57:16.0514696Z               'stride': (33554432, 8192, 1)},
2026-02-21T08:57:16.0515254Z             { 'device': 'cuda:0',
2026-02-21T08:57:16.0515504Z               'dtype': 'torch.int32',
2026-02-21T08:57:16.0515751Z               'shape': (8192, 1280),
2026-02-21T08:57:16.0515999Z               'stride': (1280, 1)}),
2026-02-21T08:57:16.0516230Z   'kwargs': {}}
2026-02-21T08:57:16.0527350Z INFO:tritonbench.utils.triton_op:Took 1.54ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T08:57:16.2279140Z [0s] Autotune random seed: 2134834638
2026-02-21T08:57:16.3658360Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:57:48.4588303Z [32s] Timeout after 30s compiling Config(block_sizes=[32, 8192, 1], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[1, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T08:57:54.9420278Z [38s] Timeout after 30s compiling Config(block_sizes=[64, 2048, 2], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:57:55.2390630Z [38s] Timeout after 30s compiling Config(block_sizes=[256, 128, 4], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[0, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:57:58.3724666Z [42s] Timeout after 30s compiling Config(block_sizes=[16, 4096, 1], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:57:58.3746993Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T08:58:09.1532312Z Initial population exploring neighbors   2%                  2/100 0.1 configs/s
2026-02-21T08:58:09.1536833Z WARNING:tritonbench.utils.triton_op:Completed input ID 24:
2026-02-21T08:58:09.1537032Z x_val
2026-02-21T08:58:09.1537171Z ----------------------
2026-02-21T08:58:09.1537276Z (16, 4096, 1280, 8192)
2026-02-21T08:58:09.1538068Z 
2026-02-21T08:58:09.1562217Z  80%|████████  | 8/10 [48:48<07:00, 210.18s/it]WARNING:tritonbench.utils.triton_op:Running input ID 28:
2026-02-21T08:58:09.1562503Z x_val
2026-02-21T08:58:09.1562672Z ----------------------
2026-02-21T08:58:09.1562806Z (64, 4096, 1280, 8192)
2026-02-21T08:58:09.1602333Z INFO:tritonbench.utils.triton_op:Took 0.45ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T08:58:10.2205357Z INFO:tritonbench.utils.triton_op:Took 2.77ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T08:58:19.4087613Z Autotune Choices Stats:
2026-02-21T08:58:19.4089300Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "mm", "best_time": 7.480846881866455, "best_triton_pos": 1, "best_triton_time": 7.912878036499023, "best_triton_kernel": "triton_mm_221", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4"}
2026-02-21T08:58:19.4094693Z AUTOTUNE mm(262144x8192, 8192x1280)
2026-02-21T08:58:19.4094925Z strides: [8192, 1], [1280, 1]
2026-02-21T08:58:19.4095164Z dtypes: torch.bfloat16, torch.bfloat16
2026-02-21T08:58:19.4095390Z   mm 7.4808 ms 100.0% 
2026-02-21T08:58:19.4096118Z   triton_mm_221 7.9129 ms 94.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:58:19.4097326Z   triton_mm_227 8.2071 ms 91.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:58:19.4098525Z   triton_mm_222 8.7857 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:58:19.4099708Z   triton_mm_224 8.8305 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:58:19.4100896Z   triton_mm_218 9.3142 ms 80.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:58:19.4102079Z   triton_mm_226 9.8029 ms 76.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:58:19.4103675Z   triton_mm_219 10.5175 ms 71.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:58:19.4104639Z   triton_mm_220 10.9395 ms 68.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:58:19.4105579Z   triton_mm_215 11.1460 ms 67.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:58:19.4106301Z SingleProcess AUTOTUNE benchmarking takes 8.6244 seconds and 0.3689 seconds precompiling for 37 choices
2026-02-21T08:58:21.6975171Z INFO:tritonbench.utils.triton_op:Took 0.19ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T08:58:21.6987961Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:58:21.6988118Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:58:21.6988300Z               'dtype': 'torch.bfloat16',
2026-02-21T08:58:21.6988951Z               'shape': (64, 4096, 8192),
2026-02-21T08:58:21.6989078Z               'stride': (33554432, 8192, 1)},
2026-02-21T08:58:21.6989195Z             { 'device': 'cuda:0',
2026-02-21T08:58:21.6989308Z               'dtype': 'torch.int32',
2026-02-21T08:58:21.6989420Z               'shape': (8192, 1280),
2026-02-21T08:58:21.6989529Z               'stride': (1280, 1)}),
2026-02-21T08:58:21.6989635Z   'kwargs': {}}
2026-02-21T08:58:21.7005443Z INFO:tritonbench.utils.triton_op:Took 1.96ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T08:58:21.8825372Z [0s] Autotune random seed: 2134834638
2026-02-21T08:58:22.1632267Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T08:58:54.3769464Z [32s] Timeout after 30s compiling Config(block_sizes=[32, 8192, 1], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[1, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T08:58:59.4880226Z [37s] Timeout after 30s compiling Config(block_sizes=[8, 2048, 4], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:59:01.7031087Z [39s] Timeout after 30s compiling Config(block_sizes=[64, 2048, 2], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T08:59:01.9392850Z [39s] Timeout after 30s compiling Config(block_sizes=[256, 128, 4], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[0, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:59:05.0507072Z [42s] Timeout after 30s compiling Config(block_sizes=[16, 4096, 1], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, None], range_num_stages=[0, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T08:59:05.0526288Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T08:59:06.7653505Z Initial population exploring neighbors   1%                    1/100 - configs/s
2026-02-21T08:59:06.7658066Z WARNING:tritonbench.utils.triton_op:Completed input ID 28:
2026-02-21T08:59:06.7660426Z x_val
2026-02-21T08:59:06.7660631Z ----------------------
2026-02-21T08:59:06.7660803Z (64, 4096, 1280, 8192)
2026-02-21T08:59:06.7660967Z 
2026-02-21T08:59:06.7679448Z  90%|█████████ | 9/10 [49:45<02:42, 162.49s/it]WARNING:tritonbench.utils.triton_op:Running input ID 31:
2026-02-21T08:59:06.7679683Z x_val
2026-02-21T08:59:06.7679779Z ----------------------
2026-02-21T08:59:06.7679890Z (64, 4096, 8192, 3584)
2026-02-21T08:59:06.7728090Z INFO:tritonbench.utils.triton_op:Took 0.37ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T08:59:07.7681621Z INFO:tritonbench.utils.triton_op:Took 3.05ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T08:59:30.7765039Z Autotune Choices Stats:
2026-02-21T08:59:30.7766960Z {"num_choices": 37, "num_triton_choices": 36, "best_kernel": "mm", "best_time": 20.915571212768555, "best_triton_pos": 1, "best_triton_time": 23.032167434692383, "best_triton_kernel": "triton_mm_263", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8"}
2026-02-21T08:59:30.7772538Z AUTOTUNE mm(262144x3584, 3584x8192)
2026-02-21T08:59:30.7772735Z strides: [3584, 1], [8192, 1]
2026-02-21T08:59:30.7773741Z dtypes: torch.bfloat16, torch.bfloat16
2026-02-21T08:59:30.7774035Z   mm 20.9156 ms 100.0% 
2026-02-21T08:59:30.7774432Z   triton_mm_263 23.0322 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:59:30.7775152Z   triton_mm_260 24.1291 ms 86.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:59:30.7775765Z   triton_mm_257 25.0306 ms 83.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:59:30.7776996Z   triton_mm_262 25.7462 ms 81.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:59:30.7777589Z   triton_mm_254 26.9698 ms 77.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4
2026-02-21T08:59:30.7778191Z   triton_mm_251 27.2256 ms 76.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4
2026-02-21T08:59:30.7778783Z   triton_mm_258 28.3907 ms 73.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:59:30.7779392Z   triton_mm_261 29.7136 ms 70.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:59:30.7779990Z   triton_mm_256 29.8412 ms 70.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8
2026-02-21T08:59:30.7780450Z SingleProcess AUTOTUNE benchmarking takes 22.4539 seconds and 0.3177 seconds precompiling for 37 choices
2026-02-21T08:59:32.9099278Z INFO:tritonbench.utils.triton_op:Took 0.21ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T08:59:32.9113799Z WARNING:__main__:Input tensor metadata:
2026-02-21T08:59:32.9114823Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T08:59:32.9115062Z               'dtype': 'torch.bfloat16',
2026-02-21T08:59:32.9115194Z               'shape': (64, 4096, 3584),
2026-02-21T08:59:32.9115316Z               'stride': (14680064, 3584, 1)},
2026-02-21T08:59:32.9115437Z             { 'device': 'cuda:0',
2026-02-21T08:59:32.9115600Z               'dtype': 'torch.int32',
2026-02-21T08:59:32.9115714Z               'shape': (3584, 8192),
2026-02-21T08:59:32.9116338Z               'stride': (8192, 1)}),
2026-02-21T08:59:32.9116442Z   'kwargs': {}}
2026-02-21T08:59:32.9130806Z INFO:tritonbench.utils.triton_op:Took 1.89ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T08:59:33.1007419Z [0s] Autotune random seed: 2134834638
2026-02-21T08:59:33.7407183Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:00:06.0429158Z [32s] Timeout after 30s compiling Config(block_sizes=[32, 8192, 1], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[1, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T09:00:12.7103906Z [38s] Timeout after 30s compiling Config(block_sizes=[64, 2048, 2], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:00:12.9324544Z [39s] Timeout after 30s compiling Config(block_sizes=[256, 128, 4], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[0, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:00:12.9342644Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T09:00:17.8615556Z Initial population exploring neighbors   1%                    1/100 - configs/s
2026-02-21T09:00:17.8618944Z WARNING:tritonbench.utils.triton_op:Completed input ID 31:
2026-02-21T09:00:17.8619259Z x_val
2026-02-21T09:00:17.8619382Z ----------------------
2026-02-21T09:00:17.8619571Z (64, 4096, 8192, 3584)
2026-02-21T09:00:17.8619677Z 
2026-02-21T09:00:17.8620237Z 100%|██████████| 10/10 [50:57<00:00, 134.27s/it]
2026-02-21T09:00:17.8620521Z 100%|██████████| 10/10 [50:57<00:00, 305.71s/it]
2026-02-21T09:00:17.8632242Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmpxao986lk.csv
2026-02-21T09:00:17.8644949Z                  x_val    preprocessed_torch_compile_int4_gemm-speedup    preprocessed_torch_compile_int4_gemm-accuracy                                                                                               preprocessed_triton_int4_gemm-speedup    preprocessed_triton_int4_gemm-accuracy                                                                                                                                                                                                                                                                                                                                                                                                              helion_int4_gemm_tritonbench-speedup    helion_int4_gemm_tritonbench-accuracy
2026-02-21T09:00:17.8647488Z ----------------------  ----------------------------------------------  -----------------------------------------------  ----------------------------------------------------------------------------------------------------------------------------------  ----------------------------------------  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  ---------------------------------------
2026-02-21T09:00:17.8650083Z     (1, 1, 1280, 8192)                                        18.543                                                  1  Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.                                                                                                                                                                                                                                                                                                                                                                                                                                                                          10.202746020611514                                      1
2026-02-21T09:00:17.8651152Z ERROR:__main__:failed to process results
2026-02-21T09:00:17.8651371Z Traceback (most recent call last):
2026-02-21T09:00:17.8657986Z   File "/__w/helion/helion/benchmarks/run.py", line 1312, in run_kernel_variants
2026-02-21T09:00:17.8658254Z     process_result(
2026-02-21T09:00:17.8658476Z   File "/__w/helion/helion/benchmarks/run.py", line 1380, in process_result
2026-02-21T09:00:17.8658732Z     metrics[active_metrics[kernel_name][name]].append(float(item))
2026-02-21T09:00:17.8658941Z                                                       ^^^^^^^^^^^
2026-02-21T09:00:17.8659353Z ValueError: could not convert string to float: 'Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.'
2026-02-21T09:00:17.8660597Z     (1, 1, 8192, 3584)                                        13.5655                                                 1  Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.                                                                                                                                                                                                                                                                                                                                                                                                                                                                          11.663682014903282                                      1
2026-02-21T09:00:17.8661980Z     (4, 1, 8192, 3584)                                         4.89291                                                1  Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.                                                                                                                                                                                                                                                                                                                                                                                                                                                                             6.8418023230135                                      1
2026-02-21T09:00:17.8663359Z    (16, 1, 7168, 8192)                                         9.25921                                                1  Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.                                                                                                                                                                                                                                                                                                                                                                                                                                                                          10.242236859684546                                      1
2026-02-21T09:00:17.8664789Z    (64, 1, 7168, 8192)                                         7.82407                                                1  Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.                                                                                                                                                                                                                                                                                                                                                                                                                                                                           7.149499113923205                                      1
2026-02-21T09:00:17.8666171Z  (1, 4096, 8192, 1024)                                         1.55308                                                1  Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Error from Triton code:
2026-02-21T09:00:17.8667449Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:00:17.8667894Z 
2026-02-21T09:00:17.8668285Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Error running generated Triton program:
2026-02-21T09:00:17.8669159Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    SystemError: <built-in method __reversed__ of list object at 0x7ff1a86aba00> returned a result with an exception set
2026-02-21T09:00:17.8670375Z                                                                                                                                                                                                                                                                                                        @helion.kernel(config=helion.Config(block_sizes=[512, 1, 2], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:00:17.8671600Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning.
2026-02-21T09:00:17.8672629Z  (4, 4096, 8192, 1024)                                         1.15597                                                1  Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Error from Triton code:
2026-02-21T09:00:17.8673588Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:00:17.8674037Z 
2026-02-21T09:00:17.8674432Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Error running generated Triton program:
2026-02-21T09:00:17.8675305Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    SystemError: <built-in method __reversed__ of list object at 0x7ff1a86aba00> returned a result with an exception set
2026-02-21T09:00:17.8676521Z                                                                                                                                                                                                                                                                                                           @helion.kernel(config=helion.Config(block_sizes=[256, 4, 1], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:00:17.8677694Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning.
2026-02-21T09:00:17.8678717Z (16, 4096, 1280, 8192)                                         1.05667                                                1  Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Error from Triton code:
2026-02-21T09:00:17.8679657Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:00:17.8680089Z 
2026-02-21T09:00:17.8680474Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Error running generated Triton program:
2026-02-21T09:00:17.8681324Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    SystemError: <built-in method __reversed__ of list object at 0x7ff1a86aba00> returned a result with an exception set
2026-02-21T09:00:17.8682521Z                                                                                                                                                                                                                                                                                                             @helion.kernel(config=helion.Config(block_sizes=[128, 1, 1], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:00:17.8683798Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning.
2026-02-21T09:00:17.8684767Z (64, 4096, 1280, 8192)                                         1.0136                                                 1  Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Error from Triton code:
2026-02-21T09:00:17.8685710Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:00:17.8686151Z 
2026-02-21T09:00:17.8686532Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Error running generated Triton program:
2026-02-21T09:00:17.8687445Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    SystemError: <built-in method __reversed__ of list object at 0x7ff1a86aba00> returned a result with an exception set
2026-02-21T09:00:17.8688651Z                                                                                                                                                                                                                                                                                                        @helion.kernel(config=helion.Config(block_sizes=[4, 32, 2], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:00:17.8689812Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning.
2026-02-21T09:00:17.8690785Z (64, 4096, 8192, 3584)                                         1.01338                                                1  Triton OOR: out of resource: shared memory, Required: 90112, Hardware limit: 65536. Reducing block sizes or `num_stages` may help.                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Error from Triton code:
2026-02-21T09:00:17.8691736Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:00:17.8692196Z 
2026-02-21T09:00:17.8692581Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Error running generated Triton program:
2026-02-21T09:00:17.8693427Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    SystemError: <built-in method __reversed__ of list object at 0x7ff1a86aba00> returned a result with an exception set
2026-02-21T09:00:17.8694628Z                                                                                                                                                                                                                                                                                                        @helion.kernel(config=helion.Config(block_sizes=[4, 32, 2], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:00:17.8695782Z                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning.
2026-02-21T09:00:20.7083453Z                average                                         5.98773                                                1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               4.609996633213604                                      0.5
2026-02-21T09:02:38.5936741Z Applying custom args for int4_gemm: {'num_inputs': 10}
2026-02-21T09:02:38.6022615Z Running int4_gemm benchmark with Helion implementation...
2026-02-21T09:02:38.6022883Z 
2026-02-21T09:02:38.7622259Z Equally-spaced-k mode: Selected 10 equally spaced inputs (total available: 32)
2026-02-21T09:02:38.7622885Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 3, 7, 10, 14, 17, 21, 24, 28, 31]
2026-02-21T09:02:38.7629195Z 
2026-02-21T09:02:38.7635185Z   0%|          | 0/10 [00:00<?, ?it/s]WARNING:tritonbench.utils.triton_op:Running input ID 0:
2026-02-21T09:02:38.7635569Z x_val
2026-02-21T09:02:38.7635749Z ------------------
2026-02-21T09:02:38.7635939Z (1, 1, 1280, 8192)
2026-02-21T09:02:38.7940030Z INFO:tritonbench.utils.triton_op:Took 19.11ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T09:02:41.3419354Z INFO:tritonbench.utils.triton_op:Took 31.56ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T09:02:43.8229514Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T09:02:43.8459630Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:02:43.8460060Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:02:43.8460392Z               'dtype': 'torch.bfloat16',
2026-02-21T09:02:43.8460715Z               'shape': (1, 1, 8192),
2026-02-21T09:02:43.8461021Z               'stride': (8192, 8192, 1)},
2026-02-21T09:02:43.8461326Z             { 'device': 'cuda:0',
2026-02-21T09:02:43.8461630Z               'dtype': 'torch.int32',
2026-02-21T09:02:43.8461925Z               'shape': (8192, 1280),
2026-02-21T09:02:43.8462211Z               'stride': (1280, 1)}),
2026-02-21T09:02:43.8462488Z   'kwargs': {}}
2026-02-21T09:02:43.8463021Z INFO:tritonbench.utils.triton_op:Took 0.53ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T09:02:44.0978459Z [0s] Autotune random seed: 2138032649
2026-02-21T09:02:44.1207078Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:03:46.3226449Z [62s] Timeout after 60s compiling Config(block_sizes=[512, 1, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[4, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T09:03:47.7968367Z [63s] Timeout after 60s compiling Config(block_sizes=[1024, 1, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[1, 1], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:03:47.7984696Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.1 configs/s
2026-02-21T09:03:54.0808648Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.0 configs/s
2026-02-21T09:03:54.0816859Z [69s] Adaptive compile timeout: 30s (90% percentile=8.7s, bounds=[30.0s, 60s])
2026-02-21T09:03:54.0818828Z [69s] Initial random population of 100, 5 starting points: 
2026-02-21T09:03:54.0821465Z timeout=2
2026-02-21T09:03:54.0821697Z ok=98
2026-02-21T09:03:54.0821916Z min=0.0207
2026-02-21T09:03:54.0822104Z mid=0.2392
2026-02-21T09:03:54.0822294Z max=24.4661
2026-02-21T09:03:54.0822508Z best={'block_sizes': [512, 1, 8],
2026-02-21T09:03:54.0822863Z  'indexing': ['pointer', 'block_ptr', 'pointer'],
2026-02-21T09:03:54.0823191Z  'l2_groupings': [2],
2026-02-21T09:03:54.0823447Z  'load_eviction_policies': ['', ''],
2026-02-21T09:03:54.0823760Z  'loop_orders': [[1, 0]],
2026-02-21T09:03:54.0824016Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:03:54.0824311Z  'num_stages': 1,
2026-02-21T09:03:54.0824523Z  'num_warps': 8,
2026-02-21T09:03:54.0824720Z  'pid_type': 'flat',
2026-02-21T09:03:54.0825481Z  'range_flattens': [None, None],
2026-02-21T09:03:54.0825740Z  'range_multi_buffers': [None, None],
2026-02-21T09:03:54.0825994Z  'range_num_stages': [0, 2],
2026-02-21T09:03:54.0826228Z  'range_unroll_factors': [0, 0],
2026-02-21T09:03:54.0826469Z  'range_warp_specializes': [],
2026-02-21T09:03:54.0826704Z  'waves_per_eu': 1}
2026-02-21T09:03:54.0832672Z [69s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:03:54.9682705Z [70s] Generation 1 starting: 85 neighbors, 5 active search path(s)
2026-02-21T09:04:29.4588848Z [105s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 8], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, None], range_num_stages=[4, 1], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:04:29.4604583Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 0.4 configs/s
2026-02-21T09:04:35.5037111Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 15.0 configs/s
2026-02-21T09:04:37.7858265Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 431.8 configs/s
2026-02-21T09:04:38.1164108Z [113s] Generation 1 complete: 
2026-02-21T09:04:38.1164385Z timeout=1
2026-02-21T09:04:38.1164503Z ok=90
2026-02-21T09:04:38.1164586Z min=0.0201
2026-02-21T09:04:38.1164671Z mid=0.0329
2026-02-21T09:04:38.1164750Z max=0.2445
2026-02-21T09:04:38.1164838Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:04:38.1164984Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:04:38.1165119Z  'l2_groupings': [2],
2026-02-21T09:04:38.1165794Z  'load_eviction_policies': ['', ''],
2026-02-21T09:04:38.1165911Z  'loop_orders': [[1, 0]],
2026-02-21T09:04:38.1166077Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:04:38.1166243Z  'num_stages': 1,
2026-02-21T09:04:38.1166371Z  'num_warps': 16,
2026-02-21T09:04:38.1166502Z  'pid_type': 'flat',
2026-02-21T09:04:38.1166656Z  'range_flattens': [None, None],
2026-02-21T09:04:38.1166842Z  'range_multi_buffers': [None, None],
2026-02-21T09:04:38.1167015Z  'range_num_stages': [0, 2],
2026-02-21T09:04:38.1167174Z  'range_unroll_factors': [0, 0],
2026-02-21T09:04:38.1167346Z  'range_warp_specializes': [],
2026-02-21T09:04:38.1167495Z  'waves_per_eu': 1}
2026-02-21T09:04:38.1190820Z [113s] Fitting surrogate: 191 points, 191 targets
2026-02-21T09:04:38.9184789Z [114s] Generation 2 starting: 74 neighbors, 5 active search path(s)
2026-02-21T09:04:43.9757762Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 15.0 configs/s
2026-02-21T09:04:49.0688825Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 15.5 configs/s
2026-02-21T09:04:53.6515737Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 230.7 configs/s
2026-02-21T09:04:54.1715535Z [130s] Generation 2 complete: 
2026-02-21T09:04:54.1715737Z ok=80
2026-02-21T09:04:54.1715884Z min=0.0198
2026-02-21T09:04:54.1715980Z mid=0.0249
2026-02-21T09:04:54.1716071Z max=0.1133
2026-02-21T09:04:54.1716177Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:04:54.1716336Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:04:54.1716494Z  'l2_groupings': [2],
2026-02-21T09:04:54.1718004Z  'load_eviction_policies': ['', ''],
2026-02-21T09:04:54.1718946Z  'loop_orders': [[1, 0]],
2026-02-21T09:04:54.1719165Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:04:54.1719284Z  'num_stages': 1,
2026-02-21T09:04:54.1719383Z  'num_warps': 8,
2026-02-21T09:04:54.1719478Z  'pid_type': 'flat',
2026-02-21T09:04:54.1719647Z  'range_flattens': [None, None],
2026-02-21T09:04:54.1719769Z  'range_multi_buffers': [None, None],
2026-02-21T09:04:54.1720378Z  'range_num_stages': [0, 2],
2026-02-21T09:04:54.1720491Z  'range_unroll_factors': [0, 0],
2026-02-21T09:04:54.1720609Z  'range_warp_specializes': [],
2026-02-21T09:04:54.1720716Z  'waves_per_eu': 1}
2026-02-21T09:04:54.1890270Z [130s] Fitting surrogate: 271 points, 271 targets
2026-02-21T09:04:55.0235522Z [130s] Generation 3 starting: 75 neighbors, 5 active search path(s)
2026-02-21T09:05:31.2263578Z [167s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:05:31.2275302Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 0.1 configs/s
2026-02-21T09:05:35.8033020Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 16.5 configs/s
2026-02-21T09:05:40.0830383Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 252.7 configs/s
2026-02-21T09:05:40.4838735Z [176s] Generation 3 complete: 
2026-02-21T09:05:40.4838978Z timeout=1
2026-02-21T09:05:40.4839112Z ok=79
2026-02-21T09:05:40.4839227Z min=0.0197
2026-02-21T09:05:40.4839345Z mid=0.0225
2026-02-21T09:05:40.4839457Z max=0.4289
2026-02-21T09:05:40.4841148Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:05:40.4841378Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:05:40.4841571Z  'l2_groupings': [2],
2026-02-21T09:05:40.4841718Z  'load_eviction_policies': ['', ''],
2026-02-21T09:05:40.4841885Z  'loop_orders': [[1, 0]],
2026-02-21T09:05:40.4842036Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:05:40.4842735Z  'num_stages': 1,
2026-02-21T09:05:40.4842868Z  'num_warps': 8,
2026-02-21T09:05:40.4842992Z  'pid_type': 'flat',
2026-02-21T09:05:40.4843154Z  'range_flattens': [None, True],
2026-02-21T09:05:40.4843314Z  'range_multi_buffers': [None, None],
2026-02-21T09:05:40.4843481Z  'range_num_stages': [0, 2],
2026-02-21T09:05:40.4843629Z  'range_unroll_factors': [0, 0],
2026-02-21T09:05:40.4843792Z  'range_warp_specializes': [],
2026-02-21T09:05:40.4843938Z  'waves_per_eu': 1}
2026-02-21T09:05:40.5758408Z [176s] Fitting surrogate: 351 points, 351 targets
2026-02-21T09:05:41.3716312Z [177s] Generation 4 starting: 68 neighbors, 5 active search path(s)
2026-02-21T09:05:50.9676549Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 20.5 configs/s
2026-02-21T09:05:55.5504488Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 15.3 configs/s
2026-02-21T09:05:59.1368740Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 278.4 configs/s
2026-02-21T09:05:59.6685108Z [195s] Generation 4 complete: 
2026-02-21T09:05:59.6685354Z ok=73
2026-02-21T09:05:59.6685441Z min=0.0198
2026-02-21T09:05:59.6685536Z mid=0.0232
2026-02-21T09:05:59.6685614Z max=0.1129
2026-02-21T09:05:59.6685709Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:05:59.6685855Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:05:59.6685994Z  'l2_groupings': [2],
2026-02-21T09:05:59.6686106Z  'load_eviction_policies': ['', ''],
2026-02-21T09:05:59.6686225Z  'loop_orders': [[1, 0]],
2026-02-21T09:05:59.6686340Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:05:59.6686444Z  'num_stages': 1,
2026-02-21T09:05:59.6686537Z  'num_warps': 8,
2026-02-21T09:05:59.6686627Z  'pid_type': 'flat',
2026-02-21T09:05:59.6686734Z  'range_flattens': [None, True],
2026-02-21T09:05:59.6686850Z  'range_multi_buffers': [None, None],
2026-02-21T09:05:59.6686970Z  'range_num_stages': [0, 2],
2026-02-21T09:05:59.6687092Z  'range_unroll_factors': [0, 0],
2026-02-21T09:05:59.6687205Z  'range_warp_specializes': [],
2026-02-21T09:05:59.6687338Z  'waves_per_eu': 1}
2026-02-21T09:05:59.8392058Z [195s] Fitting surrogate: 424 points, 424 targets
2026-02-21T09:06:00.9264623Z [196s] Generation 5 starting: 70 neighbors, 5 active search path(s)
2026-02-21T09:06:32.2246625Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 0.2 configs/s
2026-02-21T09:06:36.7438159Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 16.0 configs/s
2026-02-21T09:06:41.4790294Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 228.6 configs/s
2026-02-21T09:06:41.9794665Z [237s] Generation 5 complete: 
2026-02-21T09:06:41.9795047Z ok=75
2026-02-21T09:06:41.9795294Z min=0.0201
2026-02-21T09:06:41.9795505Z mid=0.0223
2026-02-21T09:06:41.9795705Z max=0.4276
2026-02-21T09:06:41.9795930Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:06:41.9796349Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:06:41.9796718Z  'l2_groupings': [2],
2026-02-21T09:06:41.9797018Z  'load_eviction_policies': ['', ''],
2026-02-21T09:06:41.9797622Z  'loop_orders': [[1, 0]],
2026-02-21T09:06:41.9807713Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:06:41.9807921Z  'num_stages': 1,
2026-02-21T09:06:41.9808093Z  'num_warps': 8,
2026-02-21T09:06:41.9808270Z  'pid_type': 'flat',
2026-02-21T09:06:41.9808455Z  'range_flattens': [None, True],
2026-02-21T09:06:41.9808660Z  'range_multi_buffers': [None, None],
2026-02-21T09:06:41.9808832Z  'range_num_stages': [0, 2],
2026-02-21T09:06:41.9808982Z  'range_unroll_factors': [0, 0],
2026-02-21T09:06:41.9809142Z  'range_warp_specializes': [],
2026-02-21T09:06:41.9809304Z  'waves_per_eu': 1}
2026-02-21T09:06:42.1019060Z [237s] Fitting surrogate: 499 points, 499 targets
2026-02-21T09:06:42.7348603Z [238s] Generation 6 starting: 53 neighbors, 4 active search path(s)
2026-02-21T09:07:02.1884925Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55/55 0.6 configs/s
2026-02-21T09:07:05.8625663Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 55/55 15.6 configs/s
2026-02-21T09:07:09.0636725Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 342.3 configs/s
2026-02-21T09:07:09.3573475Z [265s] Generation 6 complete: 
2026-02-21T09:07:09.3573895Z ok=57
2026-02-21T09:07:09.3574111Z min=0.0198
2026-02-21T09:07:09.3574328Z mid=0.0254
2026-02-21T09:07:09.3574526Z max=0.5049
2026-02-21T09:07:09.3574678Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:07:09.3574826Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:07:09.3574967Z  'l2_groupings': [2],
2026-02-21T09:07:09.3575074Z  'load_eviction_policies': ['', ''],
2026-02-21T09:07:09.3575194Z  'loop_orders': [[1, 0]],
2026-02-21T09:07:09.3575301Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:07:09.3575486Z  'num_stages': 1,
2026-02-21T09:07:09.3575576Z  'num_warps': 8,
2026-02-21T09:07:09.3575672Z  'pid_type': 'flat',
2026-02-21T09:07:09.3575795Z  'range_flattens': [None, True],
2026-02-21T09:07:09.3575909Z  'range_multi_buffers': [None, None],
2026-02-21T09:07:09.3576029Z  'range_num_stages': [0, 2],
2026-02-21T09:07:09.3576137Z  'range_unroll_factors': [0, 0],
2026-02-21T09:07:09.3576255Z  'range_warp_specializes': [],
2026-02-21T09:07:09.3576364Z  'waves_per_eu': 1}
2026-02-21T09:07:09.4523691Z [265s] Fitting surrogate: 556 points, 556 targets
2026-02-21T09:07:09.9424592Z [265s] Generation 7 starting: 41 neighbors, 3 active search path(s)
2026-02-21T09:07:21.7635883Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43/43 1.3 configs/s
2026-02-21T09:07:24.5788858Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 43/43 16.1 configs/s
2026-02-21T09:07:26.3385349Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 550.8 configs/s
2026-02-21T09:07:26.6313616Z [282s] Generation 7 complete: 
2026-02-21T09:07:26.6314112Z ok=45
2026-02-21T09:07:26.6314199Z min=0.0200
2026-02-21T09:07:26.6314280Z mid=0.0272
2026-02-21T09:07:26.6314361Z max=0.5049
2026-02-21T09:07:26.6314449Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:07:26.6314597Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:07:26.6314735Z  'l2_groupings': [2],
2026-02-21T09:07:26.6314844Z  'load_eviction_policies': ['', ''],
2026-02-21T09:07:26.6314964Z  'loop_orders': [[1, 0]],
2026-02-21T09:07:26.6315072Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:07:26.6315180Z  'num_stages': 1,
2026-02-21T09:07:26.6315272Z  'num_warps': 8,
2026-02-21T09:07:26.6315370Z  'pid_type': 'flat',
2026-02-21T09:07:26.6315471Z  'range_flattens': [None, True],
2026-02-21T09:07:26.6315591Z  'range_multi_buffers': [None, None],
2026-02-21T09:07:26.6315708Z  'range_num_stages': [0, 2],
2026-02-21T09:07:26.6315985Z  'range_unroll_factors': [0, 0],
2026-02-21T09:07:26.6316096Z  'range_warp_specializes': [],
2026-02-21T09:07:26.6316207Z  'waves_per_eu': 1}
2026-02-21T09:07:26.6689647Z [282s] Fitting surrogate: 601 points, 601 targets
2026-02-21T09:07:27.0927969Z [282s] Generation 8 starting: 30 neighbors, 2 active search path(s)
2026-02-21T09:07:37.6374970Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 0.6 configs/s
2026-02-21T09:07:39.7555833Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.2 configs/s
2026-02-21T09:07:41.7232729Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 581.2 configs/s
2026-02-21T09:07:42.0368029Z [297s] Generation 8 complete: 
2026-02-21T09:07:42.0368387Z ok=33
2026-02-21T09:07:42.0369503Z min=0.0200
2026-02-21T09:07:42.0369658Z mid=0.0250
2026-02-21T09:07:42.0369742Z max=0.1114
2026-02-21T09:07:42.0369840Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:07:42.0370087Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:07:42.0370224Z  'l2_groupings': [2],
2026-02-21T09:07:42.0370359Z  'load_eviction_policies': ['', ''],
2026-02-21T09:07:42.0370472Z  'loop_orders': [[1, 0]],
2026-02-21T09:07:42.0370589Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:07:42.0370692Z  'num_stages': 1,
2026-02-21T09:07:42.0370776Z  'num_warps': 8,
2026-02-21T09:07:42.0370863Z  'pid_type': 'flat',
2026-02-21T09:07:42.0370958Z  'range_flattens': [None, True],
2026-02-21T09:07:42.0371070Z  'range_multi_buffers': [None, None],
2026-02-21T09:07:42.0371180Z  'range_num_stages': [0, 2],
2026-02-21T09:07:42.0371283Z  'range_unroll_factors': [0, 0],
2026-02-21T09:07:42.0371390Z  'range_warp_specializes': [],
2026-02-21T09:07:42.0371497Z  'waves_per_eu': 1}
2026-02-21T09:07:42.0821541Z [297s] Fitting surrogate: 634 points, 634 targets
2026-02-21T09:07:42.4849748Z [298s] Generation 9 starting: 31 neighbors, 2 active search path(s)
2026-02-21T09:08:16.0183701Z [331s] Timeout after 30s compiling Config(block_sizes=[4096, 1, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=1, num_warps=4, pid_type='xyz', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:08:16.0199534Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 0.2 configs/s
2026-02-21T09:08:18.0516179Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 33/33 16.6 configs/s
2026-02-21T09:08:19.3874014Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 717.2 configs/s
2026-02-21T09:08:19.6585175Z [335s] Generation 9 complete: 
2026-02-21T09:08:19.6585560Z timeout=1
2026-02-21T09:08:19.6585774Z ok=33
2026-02-21T09:08:19.6585978Z min=0.0200
2026-02-21T09:08:19.6586238Z mid=0.0274
2026-02-21T09:08:19.6586438Z max=0.4325
2026-02-21T09:08:19.6586669Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:08:19.6587065Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:08:19.6587436Z  'l2_groupings': [2],
2026-02-21T09:08:19.6587712Z  'load_eviction_policies': ['', ''],
2026-02-21T09:08:19.6588026Z  'loop_orders': [[1, 0]],
2026-02-21T09:08:19.6588302Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:08:19.6588576Z  'num_stages': 1,
2026-02-21T09:08:19.6588825Z  'num_warps': 8,
2026-02-21T09:08:19.6589059Z  'pid_type': 'flat',
2026-02-21T09:08:19.6589328Z  'range_flattens': [None, True],
2026-02-21T09:08:19.6589630Z  'range_multi_buffers': [None, None],
2026-02-21T09:08:19.6589934Z  'range_num_stages': [0, 2],
2026-02-21T09:08:19.6590207Z  'range_unroll_factors': [0, 0],
2026-02-21T09:08:19.6590506Z  'range_warp_specializes': [],
2026-02-21T09:08:19.6590780Z  'waves_per_eu': 1}
2026-02-21T09:08:19.6931528Z [335s] Fitting surrogate: 668 points, 668 targets
2026-02-21T09:08:20.0824044Z [335s] Generation 10 starting: 31 neighbors, 2 active search path(s)
2026-02-21T09:08:53.8471908Z [369s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=1, num_warps=4, pid_type='xyz', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:08:53.8487129Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 0.2 configs/s
2026-02-21T09:08:56.1487019Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 33/33 14.6 configs/s
2026-02-21T09:08:57.2586375Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 857.2 configs/s
2026-02-21T09:08:57.5046220Z [373s] Generation 10 complete: 
2026-02-21T09:08:57.5046493Z timeout=1
2026-02-21T09:08:57.5047036Z ok=33
2026-02-21T09:08:57.5047351Z min=0.0200
2026-02-21T09:08:57.5047579Z mid=0.0272
2026-02-21T09:08:57.5050698Z max=0.2262
2026-02-21T09:08:57.5051896Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:08:57.5052424Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:08:57.5052811Z  'l2_groupings': [2],
2026-02-21T09:08:57.5053121Z  'load_eviction_policies': ['', ''],
2026-02-21T09:08:57.5053436Z  'loop_orders': [[1, 0]],
2026-02-21T09:08:57.5053729Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:08:57.5054002Z  'num_stages': 1,
2026-02-21T09:08:57.5054264Z  'num_warps': 8,
2026-02-21T09:08:57.5054498Z  'pid_type': 'flat',
2026-02-21T09:08:57.5054767Z  'range_flattens': [None, True],
2026-02-21T09:08:57.5055068Z  'range_multi_buffers': [None, None],
2026-02-21T09:08:57.5055382Z  'range_num_stages': [0, 2],
2026-02-21T09:08:57.5055656Z  'range_unroll_factors': [0, 0],
2026-02-21T09:08:57.5056017Z  'range_warp_specializes': [],
2026-02-21T09:08:57.5056311Z  'waves_per_eu': 1}
2026-02-21T09:08:57.5274063Z [373s] Fitting surrogate: 702 points, 702 targets
2026-02-21T09:08:57.7708668Z [373s] Generation 11 starting: 14 neighbors, 1 active search path(s)
2026-02-21T09:08:59.5622751Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 20.6 configs/s
2026-02-21T09:09:00.4824364Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 15.9 configs/s
2026-02-21T09:09:01.3517744Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1085.1 configs/s
2026-02-21T09:09:01.6008047Z [377s] Generation 11 complete: 
2026-02-21T09:09:01.6008268Z ok=16
2026-02-21T09:09:01.6008352Z min=0.0198
2026-02-21T09:09:01.6008442Z mid=0.0221
2026-02-21T09:09:01.6008530Z max=0.0583
2026-02-21T09:09:01.6008627Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:09:01.6008832Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:09:01.6008977Z  'l2_groupings': [2],
2026-02-21T09:09:01.6009087Z  'load_eviction_policies': ['', ''],
2026-02-21T09:09:01.6009224Z  'loop_orders': [[1, 0]],
2026-02-21T09:09:01.6009333Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:09:01.6009444Z  'num_stages': 1,
2026-02-21T09:09:01.6009542Z  'num_warps': 8,
2026-02-21T09:09:01.6009632Z  'pid_type': 'flat',
2026-02-21T09:09:01.6009740Z  'range_flattens': [None, True],
2026-02-21T09:09:01.6009857Z  'range_multi_buffers': [None, None],
2026-02-21T09:09:01.6009979Z  'range_num_stages': [0, 2],
2026-02-21T09:09:01.6010098Z  'range_unroll_factors': [0, 0],
2026-02-21T09:09:01.6010216Z  'range_warp_specializes': [],
2026-02-21T09:09:01.6010322Z  'waves_per_eu': 1}
2026-02-21T09:09:01.6189077Z [377s] Fitting surrogate: 718 points, 718 targets
2026-02-21T09:09:01.8440297Z [377s] Generation 12 starting: 13 neighbors, 1 active search path(s)
2026-02-21T09:09:03.4810996Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 13.4 configs/s
2026-02-21T09:09:04.3152508Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 13/13 16.4 configs/s
2026-02-21T09:09:05.0505023Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1263.8 configs/s
2026-02-21T09:09:05.2733889Z [381s] Generation 12 complete: 
2026-02-21T09:09:05.2734112Z ok=15
2026-02-21T09:09:05.2734199Z min=0.0200
2026-02-21T09:09:05.2734288Z mid=0.0215
2026-02-21T09:09:05.2734366Z max=0.0731
2026-02-21T09:09:05.2734461Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:09:05.2734610Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:09:05.2734753Z  'l2_groupings': [2],
2026-02-21T09:09:05.2734861Z  'load_eviction_policies': ['', ''],
2026-02-21T09:09:05.2734979Z  'loop_orders': [[1, 0]],
2026-02-21T09:09:05.2735094Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:09:05.2735200Z  'num_stages': 1,
2026-02-21T09:09:05.2735344Z  'num_warps': 8,
2026-02-21T09:09:05.2735434Z  'pid_type': 'flat',
2026-02-21T09:09:05.2735539Z  'range_flattens': [None, True],
2026-02-21T09:09:05.2735672Z  'range_multi_buffers': [None, None],
2026-02-21T09:09:05.2735793Z  'range_num_stages': [0, 2],
2026-02-21T09:09:05.2735904Z  'range_unroll_factors': [0, 0],
2026-02-21T09:09:05.2736020Z  'range_warp_specializes': [],
2026-02-21T09:09:05.2736129Z  'waves_per_eu': 1}
2026-02-21T09:09:05.2910359Z [381s] Fitting surrogate: 733 points, 733 targets
2026-02-21T09:09:05.5213787Z [381s] Generation 13 starting: 14 neighbors, 1 active search path(s)
2026-02-21T09:09:07.2896218Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 10.0 configs/s
2026-02-21T09:09:08.2686248Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 16.8 configs/s
2026-02-21T09:09:09.4855454Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1097.1 configs/s
2026-02-21T09:09:09.7201745Z [385s] Generation 13 complete: 
2026-02-21T09:09:09.7202077Z ok=16
2026-02-21T09:09:09.7203048Z min=0.0196
2026-02-21T09:09:09.7203271Z mid=0.0216
2026-02-21T09:09:09.7203472Z max=0.0369
2026-02-21T09:09:09.7207047Z best={'block_sizes': [1024, 1, 8],
2026-02-21T09:09:09.7207453Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:09:09.7207838Z  'l2_groupings': [2],
2026-02-21T09:09:09.7208124Z  'load_eviction_policies': ['', ''],
2026-02-21T09:09:09.7208449Z  'loop_orders': [[1, 0]],
2026-02-21T09:09:09.7208737Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:09:09.7209017Z  'num_stages': 1,
2026-02-21T09:09:09.7209254Z  'num_warps': 8,
2026-02-21T09:09:09.7209485Z  'pid_type': 'flat',
2026-02-21T09:09:09.7209755Z  'range_flattens': [None, True],
2026-02-21T09:09:09.7210063Z  'range_multi_buffers': [None, None],
2026-02-21T09:09:09.7210379Z  'range_num_stages': [0, 2],
2026-02-21T09:09:09.7210670Z  'range_unroll_factors': [0, 0],
2026-02-21T09:09:09.7210976Z  'range_warp_specializes': [],
2026-02-21T09:09:09.7211272Z  'waves_per_eu': 1}
2026-02-21T09:09:09.7387365Z [385s] Fitting surrogate: 749 points, 749 targets
2026-02-21T09:09:09.8548386Z [385s] Autotuning complete in 385.7s after searching 711 configs.
2026-02-21T09:09:09.8548935Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:09:09.8550811Z     @helion.kernel(config=helion.Config(block_sizes=[1024, 1, 8], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:09:09.8552583Z 
2026-02-21T09:09:09.8553061Z [385s] Code of selected kernel: /tmp/torchinductor_root/we/cwez5efefdlwhe2ccy46m737q6cmtmh4mrdxinr34izyuxz7sbun.py
2026-02-21T09:09:09.8696812Z from __future__ import annotations
2026-02-21T09:09:09.8697092Z 
2026-02-21T09:09:09.8697195Z import torch
2026-02-21T09:09:09.8697430Z import triton
2026-02-21T09:09:09.8697688Z import triton.language as tl
2026-02-21T09:09:09.8698109Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:09:09.8698444Z 
2026-02-21T09:09:09.8698567Z _BLOCK_SIZE_2 = tl.constexpr(8)
2026-02-21T09:09:09.8698865Z _BLOCK_SIZE_1 = tl.constexpr(1)
2026-02-21T09:09:09.8699167Z _BLOCK_SIZE_0 = tl.constexpr(1024)
2026-02-21T09:09:09.8699369Z 
2026-02-21T09:09:09.8699457Z @triton.jit
2026-02-21T09:09:09.8699991Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr, _SHAPE_DIM_3: tl.constexpr):
2026-02-21T09:09:09.8700690Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:09:09.8701113Z     num_pid_m = tl.cdiv(1280, _BLOCK_SIZE_2)
2026-02-21T09:09:09.8701427Z     num_pid_n = 1
2026-02-21T09:09:09.8701690Z     inner_2d_pid = tl.program_id(0)
2026-02-21T09:09:09.8702024Z     num_pid_in_group = 2 * num_pid_n
2026-02-21T09:09:09.8702371Z     group_id = inner_2d_pid // num_pid_in_group
2026-02-21T09:09:09.8702733Z     first_pid_m = group_id * 2
2026-02-21T09:09:09.8703046Z     group_size_m = min(num_pid_m - first_pid_m, 2)
2026-02-21T09:09:09.8703279Z     pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m
2026-02-21T09:09:09.8703482Z     offset_2 = pid_0 * _BLOCK_SIZE_2
2026-02-21T09:09:09.8703669Z     indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32)
2026-02-21T09:09:09.8703923Z     # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:09:09.8704165Z     acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32)
2026-02-21T09:09:09.8704437Z     # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:09:09.8704770Z     # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:09:09.8705099Z     # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:09:09.8705399Z     # src[int4_gemm.py:60-89]: ...
2026-02-21T09:09:09.8705597Z     for offset_3 in tl.range(0, 4096, _BLOCK_SIZE_0, num_stages=2, flatten=True):
2026-02-21T09:09:09.8705804Z         acc_copy = acc
2026-02-21T09:09:09.8705929Z         acc_copy_0 = acc_copy
2026-02-21T09:09:09.8706105Z         # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2
2026-02-21T09:09:09.8706291Z         mul = 2 * offset_3
2026-02-21T09:09:09.8706494Z         # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:09:09.8706726Z         iota = mul + tl.arange(0, mul_1)
2026-02-21T09:09:09.8706958Z         load = tl.broadcast_to(tl.load(A + iota[None, :] * 1, None), [_BLOCK_SIZE_1, _SHAPE_DIM_2])
2026-02-21T09:09:09.8707263Z         # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:09:09.8707497Z         # src[int4_gemm.py:66]:     torch.float32
2026-02-21T09:09:09.8707684Z         # src[int4_gemm.py:67]: )  # [BLOCK_SIZE_M, BLOCK_SIZE_K]
2026-02-21T09:09:09.8707872Z         v_0 = tl.cast(load, tl.float32)
2026-02-21T09:09:09.8708090Z         # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n]  # [BLOCK_SIZE_K//2, BLOCK_SIZE_N]
2026-02-21T09:09:09.8708531Z         b_tile = tl.load(tl.make_block_ptr(B, [4096, 1280], [1280, 1], [offset_3, offset_2], [_BLOCK_SIZE_0, _BLOCK_SIZE_2], [1, 0]), boundary_check=[0, 1], padding_option='zero')
2026-02-21T09:09:09.8708952Z         # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8)  # Sign-extend low 4 bits
2026-02-21T09:09:09.8709181Z         v_1 = tl.full([], 4, tl.int8)
2026-02-21T09:09:09.8709326Z         v_2 = b_tile << v_1
2026-02-21T09:09:09.8709456Z         v_3 = tl.full([], 4, tl.int8)
2026-02-21T09:09:09.8709593Z         v_4 = v_2 >> v_3
2026-02-21T09:09:09.8709831Z         # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8)  # Sign-extend high 4 bits
2026-02-21T09:09:09.8710046Z         v_5 = tl.full([], 4, tl.int8)
2026-02-21T09:09:09.8710185Z         v_6 = b_tile >> v_5
2026-02-21T09:09:09.8710358Z         # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1)
2026-02-21T09:09:09.8710554Z         stack_idx = tl.arange(0, 2)
2026-02-21T09:09:09.8710708Z         broadcast_idx = stack_idx[None, :, None]
2026-02-21T09:09:09.8710876Z         expanded_0 = tl.expand_dims(v_4, 1)
2026-02-21T09:09:09.8711044Z         expanded_1 = tl.expand_dims(v_6, 1)
2026-02-21T09:09:09.8711205Z         stacked_result = tl.zeros_like(expanded_0)
2026-02-21T09:09:09.8711367Z         mask_0 = broadcast_idx == 0
2026-02-21T09:09:09.8711549Z         stacked_result = tl.where(mask_0, expanded_0, stacked_result)
2026-02-21T09:09:09.8711739Z         mask_1 = broadcast_idx == 1
2026-02-21T09:09:09.8711919Z         stacked_result = tl.where(mask_1, expanded_1, stacked_result)
2026-02-21T09:09:09.8712149Z         # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:09:09.8712384Z         # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:09:09.8712605Z         # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:09:09.8712795Z         view = tl.reshape(stacked_result, [_SHAPE_DIM_3, _BLOCK_SIZE_2])
2026-02-21T09:09:09.8712962Z         v_7 = tl.cast(view, tl.float32)
2026-02-21T09:09:09.8713145Z         # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2)  # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1]
2026-02-21T09:09:09.8713323Z         a_tile_1 = v_0[:, :, None]
2026-02-21T09:09:09.8713468Z         # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0)
2026-02-21T09:09:09.8713617Z         b_unpacked_1 = v_7[None, :, :]
2026-02-21T09:09:09.8713808Z         # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1)  # [BLOCK_SIZE_M, BLOCK_SIZE_N]
2026-02-21T09:09:09.8714002Z         v_8 = a_tile_1 * b_unpacked_1
2026-02-21T09:09:09.8714129Z         sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32)
2026-02-21T09:09:09.8714262Z         acc = acc_copy_0 + sum_1
2026-02-21T09:09:09.8714406Z     # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16)
2026-02-21T09:09:09.8714597Z     v_10 = tl.cast(acc, tl.bfloat16)
2026-02-21T09:09:09.8714782Z     tl.store(C + tl.broadcast_to(indices_2[None, :] * 1, [_BLOCK_SIZE_1, _BLOCK_SIZE_2]), v_10, None)
2026-02-21T09:09:09.8714938Z 
2026-02-21T09:09:09.8715025Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher):
2026-02-21T09:09:09.8715187Z     """
2026-02-21T09:09:09.8715298Z     BFloat16 x INT4 General Matrix Multiplication (GEMM).
2026-02-21T09:09:09.8715401Z 
2026-02-21T09:09:09.8715465Z     This kernel performs matrix multiplication where:
2026-02-21T09:09:09.8715612Z     - A is a bfloat16 matrix of shape [M, K]
2026-02-21T09:09:09.8715780Z     - B is an int8 matrix of shape [K//2, N] containing packed int4 values
2026-02-21T09:09:09.8715947Z       (two 4-bit values packed into each int8)
2026-02-21T09:09:09.8716037Z 
2026-02-21T09:09:09.8716072Z     Args:
2026-02-21T09:09:09.8716193Z         A (Tensor): Input tensor of shape [M, K] in bfloat16 format.
2026-02-21T09:09:09.8716377Z         B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format.
2026-02-21T09:09:09.8716493Z 
2026-02-21T09:09:09.8716528Z     Returns:
2026-02-21T09:09:09.8716644Z         Tensor: Output tensor of shape [M, N] in bfloat16 format.
2026-02-21T09:09:09.8716785Z     """
2026-02-21T09:09:09.8716892Z     # src[int4_gemm.py:50]: M, K = A.shape
2026-02-21T09:09:09.8717010Z     M, K = A.shape
2026-02-21T09:09:09.8717111Z     # src[int4_gemm.py:51]: _, N = B.shape
2026-02-21T09:09:09.8717222Z     _, N = B.shape
2026-02-21T09:09:09.8717371Z     # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:09:09.8717574Z     C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:09:09.8717755Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:09:09.8717939Z     _BLOCK_SIZE_2 = 8
2026-02-21T09:09:09.8718108Z     # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:09:09.8718379Z     # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:09:09.8718631Z     # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:09:09.8718813Z     # src[int4_gemm.py:60-89]: ...
2026-02-21T09:09:09.8718924Z     _BLOCK_SIZE_0 = 1024
2026-02-21T09:09:09.8719086Z     # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:09:09.8719264Z     _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0
2026-02-21T09:09:09.8719406Z     # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:09:09.8719589Z     # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:09:09.8719756Z     # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:09:09.8719889Z     _SHAPE_DIM_3 = 2 * _BLOCK_SIZE_0
2026-02-21T09:09:09.8720029Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:09:09.8720225Z     # src[int4_gemm.py:58]:     acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:09:09.8720392Z     # src[int4_gemm.py:57-91]: ...
2026-02-21T09:09:09.8720532Z     _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0)
2026-02-21T09:09:09.8720878Z     _launcher(_helion_matmul_bf16_int4, (triton.cdiv(1280, _BLOCK_SIZE_2) * 1,), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, _SHAPE_DIM_3, num_warps=8, num_stages=1, waves_per_eu=1, matrix_instr_nonkdim=16)
2026-02-21T09:09:09.8721200Z     # src[int4_gemm.py:93]: return C
2026-02-21T09:09:09.8721313Z     return C
2026-02-21T09:09:10.6688424Z WARNING:tritonbench.utils.triton_op:Completed input ID 0:
2026-02-21T09:09:10.6688684Z x_val
2026-02-21T09:09:10.6688822Z ------------------
2026-02-21T09:09:10.6688965Z (1, 1, 1280, 8192)
2026-02-21T09:09:10.6689057Z 
2026-02-21T09:09:10.6703881Z  10%|█         | 1/10 [06:31<58:47, 391.91s/it]
2026-02-21T09:09:10.6703881Z WARNING:tritonbench.utils.triton_op:Running input ID 3:
2026-02-21T09:09:10.6704437Z x_val
2026-02-21T09:09:10.6704564Z ------------------
2026-02-21T09:09:10.6704696Z (1, 1, 8192, 3584)
2026-02-21T09:09:10.6706725Z INFO:tritonbench.utils.triton_op:Took 0.21ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T09:09:11.8554674Z INFO:tritonbench.utils.triton_op:Took 4.61ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T09:09:14.0902513Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T09:09:14.0933825Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:09:14.0934103Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:09:14.0934329Z               'dtype': 'torch.bfloat16',
2026-02-21T09:09:14.0934553Z               'shape': (1, 1, 3584),
2026-02-21T09:09:14.0934756Z               'stride': (3584, 3584, 1)},
2026-02-21T09:09:14.0935014Z             { 'device': 'cuda:0',
2026-02-21T09:09:14.0935211Z               'dtype': 'torch.int32',
2026-02-21T09:09:14.0935410Z               'shape': (3584, 8192),
2026-02-21T09:09:14.0935615Z               'stride': (8192, 1)}),
2026-02-21T09:09:14.0935805Z   'kwargs': {}}
2026-02-21T09:09:14.0954463Z INFO:tritonbench.utils.triton_op:Took 2.27ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T09:09:14.2851289Z [0s] Autotune random seed: 2138032649
2026-02-21T09:09:14.3061363Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:09:49.4575707Z [35s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[4, 1], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:09:51.1427018Z [36s] Timeout after 30s compiling Config(block_sizes=[512, 1, 1024], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:09:52.7224775Z [38s] Timeout after 30s compiling Config(block_sizes=[256, 1, 512], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[1, 1], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:09:53.4099709Z [39s] Timeout after 30s compiling Config(block_sizes=[128, 1, 1024], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T09:09:53.4117217Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T09:10:00.7250001Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.9 configs/s
2026-02-21T09:10:00.7258308Z [46s] Adaptive compile timeout: 30s (90% percentile=10.8s, bounds=[30.0s, 30s])
2026-02-21T09:10:01.6816005Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 992.0 configs/s
2026-02-21T09:10:01.9553771Z [47s] Initial random population of 100, 5 starting points: 
2026-02-21T09:10:01.9553999Z timeout=4
2026-02-21T09:10:01.9554125Z ok=96
2026-02-21T09:10:01.9554213Z min=0.0358
2026-02-21T09:10:01.9554321Z mid=0.2003
2026-02-21T09:10:01.9555111Z max=111.4510
2026-02-21T09:10:01.9555263Z best={'block_sizes': [128, 1, 16],
2026-02-21T09:10:01.9555478Z  'indexing': ['pointer', 'block_ptr', 'block_ptr'],
2026-02-21T09:10:01.9555664Z  'l2_groupings': [1],
2026-02-21T09:10:01.9555770Z  'load_eviction_policies': ['', ''],
2026-02-21T09:10:01.9555933Z  'loop_orders': [[1, 0]],
2026-02-21T09:10:01.9556060Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:10:01.9556215Z  'num_stages': 3,
2026-02-21T09:10:01.9556342Z  'num_warps': 16,
2026-02-21T09:10:01.9556441Z  'pid_type': 'flat',
2026-02-21T09:10:01.9556597Z  'range_flattens': [None, False],
2026-02-21T09:10:01.9556783Z  'range_multi_buffers': [None, False],
2026-02-21T09:10:01.9556945Z  'range_num_stages': [0, 4],
2026-02-21T09:10:01.9557081Z  'range_unroll_factors': [0, 1],
2026-02-21T09:10:01.9557218Z  'range_warp_specializes': [],
2026-02-21T09:10:01.9557387Z  'waves_per_eu': 3}
2026-02-21T09:10:01.9723198Z [47s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:10:02.8412429Z [48s] Generation 1 starting: 85 neighbors, 5 active search path(s)
2026-02-21T09:10:37.9396204Z [83s] Timeout after 30s compiling Config(block_sizes=[512, 1, 8], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[4, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:10:43.9639659Z [89s] Timeout after 30s compiling Config(block_sizes=[128, 1, 128], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:10:43.9655554Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 0.5 configs/s
2026-02-21T09:10:49.3645393Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 16.4 configs/s
2026-02-21T09:10:54.5585336Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 190.4 configs/s
2026-02-21T09:10:55.0704429Z [100s] Generation 1 complete: 
2026-02-21T09:10:55.0704639Z timeout=2
2026-02-21T09:10:55.0704751Z ok=88
2026-02-21T09:10:55.0704839Z min=0.0351
2026-02-21T09:10:55.0704921Z mid=0.0392
2026-02-21T09:10:55.0705005Z max=0.4275
2026-02-21T09:10:55.0705096Z best={'block_sizes': [128, 1, 16],
2026-02-21T09:10:55.0705237Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:10:55.0705369Z  'l2_groupings': [1],
2026-02-21T09:10:55.0705519Z  'load_eviction_policies': ['', ''],
2026-02-21T09:10:55.0705639Z  'loop_orders': [[1, 0]],
2026-02-21T09:10:55.0705741Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:10:55.0706231Z  'num_stages': 3,
2026-02-21T09:10:55.0706316Z  'num_warps': 16,
2026-02-21T09:10:55.0706404Z  'pid_type': 'flat',
2026-02-21T09:10:55.0706501Z  'range_flattens': [None, False],
2026-02-21T09:10:55.0706622Z  'range_multi_buffers': [None, False],
2026-02-21T09:10:55.0706738Z  'range_num_stages': [0, 4],
2026-02-21T09:10:55.0706845Z  'range_unroll_factors': [0, 1],
2026-02-21T09:10:55.0706954Z  'range_warp_specializes': [],
2026-02-21T09:10:55.0707060Z  'waves_per_eu': 3}
2026-02-21T09:10:55.1812718Z [100s] Fitting surrogate: 190 points, 190 targets
2026-02-21T09:10:56.5712820Z [102s] Generation 2 starting: 77 neighbors, 5 active search path(s)
2026-02-21T09:11:10.0446815Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 1.8 configs/s
2026-02-21T09:11:14.9957939Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 16.0 configs/s
2026-02-21T09:11:20.7201098Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 185.0 configs/s
2026-02-21T09:11:21.2576765Z [126s] Generation 2 complete: 
2026-02-21T09:11:21.2577035Z ok=82
2026-02-21T09:11:21.2577156Z min=0.0322
2026-02-21T09:11:21.2577283Z mid=0.0369
2026-02-21T09:11:21.2577412Z max=0.1388
2026-02-21T09:11:21.2577545Z best={'block_sizes': [256, 1, 16],
2026-02-21T09:11:21.2577746Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T09:11:21.2577981Z  'l2_groupings': [32],
2026-02-21T09:11:21.2578107Z  'load_eviction_policies': ['', ''],
2026-02-21T09:11:21.2578232Z  'loop_orders': [[0, 1]],
2026-02-21T09:11:21.2578345Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:11:21.2578452Z  'num_stages': 1,
2026-02-21T09:11:21.2578546Z  'num_warps': 4,
2026-02-21T09:11:21.2578642Z  'pid_type': 'xyz',
2026-02-21T09:11:21.2579318Z  'range_flattens': [None, True],
2026-02-21T09:11:21.2579436Z  'range_multi_buffers': [None, None],
2026-02-21T09:11:21.2579557Z  'range_num_stages': [0, 1],
2026-02-21T09:11:21.2579684Z  'range_unroll_factors': [0, 4],
2026-02-21T09:11:21.2579801Z  'range_warp_specializes': [],
2026-02-21T09:11:21.2579905Z  'waves_per_eu': 2}
2026-02-21T09:11:21.3821831Z [127s] Fitting surrogate: 272 points, 272 targets
2026-02-21T09:11:22.3566519Z [128s] Generation 3 starting: 76 neighbors, 5 active search path(s)
2026-02-21T09:11:54.8460389Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 0.1 configs/s
2026-02-21T09:11:59.8831114Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 15.9 configs/s
2026-02-21T09:12:05.3760701Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 193.1 configs/s
2026-02-21T09:12:05.9858946Z [171s] Generation 3 complete: 
2026-02-21T09:12:05.9859302Z ok=81
2026-02-21T09:12:05.9859542Z min=0.0316
2026-02-21T09:12:05.9859749Z mid=0.0368
2026-02-21T09:12:05.9859934Z max=0.3577
2026-02-21T09:12:05.9860185Z best={'block_sizes': [128, 1, 32],
2026-02-21T09:12:05.9860535Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:12:05.9860885Z  'l2_groupings': [32],
2026-02-21T09:12:05.9861157Z  'load_eviction_policies': ['', ''],
2026-02-21T09:12:05.9861449Z  'loop_orders': [[1, 0]],
2026-02-21T09:12:05.9861722Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:12:05.9861980Z  'num_stages': 4,
2026-02-21T09:12:05.9862205Z  'num_warps': 16,
2026-02-21T09:12:05.9862420Z  'pid_type': 'xyz',
2026-02-21T09:12:05.9862667Z  'range_flattens': [None, False],
2026-02-21T09:12:05.9862957Z  'range_multi_buffers': [None, None],
2026-02-21T09:12:05.9863259Z  'range_num_stages': [0, 3],
2026-02-21T09:12:05.9863527Z  'range_unroll_factors': [0, 0],
2026-02-21T09:12:05.9863815Z  'range_warp_specializes': [],
2026-02-21T09:12:05.9864085Z  'waves_per_eu': 2}
2026-02-21T09:12:06.1523082Z [171s] Fitting surrogate: 353 points, 353 targets
2026-02-21T09:12:06.9503847Z [172s] Generation 4 starting: 75 neighbors, 5 active search path(s)
2026-02-21T09:12:33.1259175Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 0.2 configs/s
2026-02-21T09:12:38.0933133Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 15.7 configs/s
2026-02-21T09:12:43.6594190Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 192.1 configs/s
2026-02-21T09:12:44.2292308Z [209s] Generation 4 complete: 
2026-02-21T09:12:44.2292679Z ok=81
2026-02-21T09:12:44.2292924Z min=0.0276
2026-02-21T09:12:44.2293156Z mid=0.0339
2026-02-21T09:12:44.2293358Z max=0.3608
2026-02-21T09:12:44.2293591Z best={'block_sizes': [256, 1, 32],
2026-02-21T09:12:44.2293956Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:12:44.2294321Z  'l2_groupings': [32],
2026-02-21T09:12:44.2294639Z  'load_eviction_policies': ['', ''],
2026-02-21T09:12:44.2294950Z  'loop_orders': [[1, 0]],
2026-02-21T09:12:44.2295234Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:12:44.2295533Z  'num_stages': 4,
2026-02-21T09:12:44.2295777Z  'num_warps': 4,
2026-02-21T09:12:44.2296010Z  'pid_type': 'xyz',
2026-02-21T09:12:44.2296267Z  'range_flattens': [None, False],
2026-02-21T09:12:44.2296577Z  'range_multi_buffers': [None, None],
2026-02-21T09:12:44.2296899Z  'range_num_stages': [0, 3],
2026-02-21T09:12:44.2297181Z  'range_unroll_factors': [0, 0],
2026-02-21T09:12:44.2297491Z  'range_warp_specializes': [],
2026-02-21T09:12:44.2297776Z  'waves_per_eu': 2}
2026-02-21T09:12:44.3959665Z [210s] Fitting surrogate: 434 points, 434 targets
2026-02-21T09:12:45.2570411Z [210s] Generation 5 starting: 72 neighbors, 5 active search path(s)
2026-02-21T09:13:23.0140716Z [248s] Timeout after 30s compiling Config(block_sizes=[256, 1, 32], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:13:23.0157762Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.2 configs/s
2026-02-21T09:13:27.5328345Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 16.1 configs/s
2026-02-21T09:13:31.8574304Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 229.9 configs/s
2026-02-21T09:13:32.3665090Z [258s] Generation 5 complete: 
2026-02-21T09:13:32.3665481Z timeout=1
2026-02-21T09:13:32.3665686Z ok=76
2026-02-21T09:13:32.3665894Z min=0.0259
2026-02-21T09:13:32.3666100Z mid=0.0320
2026-02-21T09:13:32.3666297Z max=0.3357
2026-02-21T09:13:32.3666524Z best={'block_sizes': [256, 1, 32],
2026-02-21T09:13:32.3666932Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:13:32.3667298Z  'l2_groupings': [2],
2026-02-21T09:13:32.3667605Z  'load_eviction_policies': ['', ''],
2026-02-21T09:13:32.3667921Z  'loop_orders': [[1, 0]],
2026-02-21T09:13:32.3668202Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:13:32.3668473Z  'num_stages': 3,
2026-02-21T09:13:32.3668699Z  'num_warps': 8,
2026-02-21T09:13:32.3668932Z  'pid_type': 'xyz',
2026-02-21T09:13:32.3669188Z  'range_flattens': [None, False],
2026-02-21T09:13:32.3669498Z  'range_multi_buffers': [None, False],
2026-02-21T09:13:32.3669812Z  'range_num_stages': [0, 2],
2026-02-21T09:13:32.3670093Z  'range_unroll_factors': [0, 1],
2026-02-21T09:13:32.3670392Z  'range_warp_specializes': [],
2026-02-21T09:13:32.3670666Z  'waves_per_eu': 2}
2026-02-21T09:13:32.4912121Z [258s] Fitting surrogate: 511 points, 511 targets
2026-02-21T09:13:33.1593940Z [258s] Generation 6 starting: 58 neighbors, 4 active search path(s)
2026-02-21T09:13:55.7952510Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 0.4 configs/s
2026-02-21T09:13:59.6566097Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 15.9 configs/s
2026-02-21T09:14:03.3483864Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 268.1 configs/s
2026-02-21T09:14:03.7996160Z [289s] Generation 6 complete: 
2026-02-21T09:14:03.7996568Z ok=63
2026-02-21T09:14:03.7996781Z min=0.0258
2026-02-21T09:14:03.7996997Z mid=0.0318
2026-02-21T09:14:03.7997196Z max=0.3579
2026-02-21T09:14:03.7997428Z best={'block_sizes': [256, 1, 32],
2026-02-21T09:14:03.7997793Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:14:03.7998161Z  'l2_groupings': [32],
2026-02-21T09:14:03.7998448Z  'load_eviction_policies': ['', ''],
2026-02-21T09:14:03.7998760Z  'loop_orders': [[1, 0]],
2026-02-21T09:14:03.7999038Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:14:03.7999311Z  'num_stages': 4,
2026-02-21T09:14:03.7999580Z  'num_warps': 8,
2026-02-21T09:14:03.7999811Z  'pid_type': 'xyz',
2026-02-21T09:14:03.8000067Z  'range_flattens': [None, False],
2026-02-21T09:14:03.8000408Z  'range_multi_buffers': [None, False],
2026-02-21T09:14:03.8000723Z  'range_num_stages': [0, 2],
2026-02-21T09:14:03.8001004Z  'range_unroll_factors': [0, 1],
2026-02-21T09:14:03.8001300Z  'range_warp_specializes': [],
2026-02-21T09:14:03.8001575Z  'waves_per_eu': 2}
2026-02-21T09:14:03.9071742Z [289s] Fitting surrogate: 574 points, 574 targets
2026-02-21T09:14:04.4392546Z [290s] Generation 7 starting: 43 neighbors, 3 active search path(s)
2026-02-21T09:14:38.9837460Z [324s] Timeout after 30s compiling Config(block_sizes=[512, 1, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:14:38.9857975Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 0.3 configs/s
2026-02-21T09:14:41.8458909Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 45/45 16.0 configs/s
2026-02-21T09:14:43.7996203Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 496.8 configs/s
2026-02-21T09:14:44.0880878Z [329s] Generation 7 complete: 
2026-02-21T09:14:44.0881058Z timeout=1
2026-02-21T09:14:44.0881142Z ok=45
2026-02-21T09:14:44.0881223Z min=0.0257
2026-02-21T09:14:44.0881304Z mid=0.0313
2026-02-21T09:14:44.0881381Z max=0.2706
2026-02-21T09:14:44.0881468Z best={'block_sizes': [256, 1, 32],
2026-02-21T09:14:44.0881608Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:14:44.0881741Z  'l2_groupings': [32],
2026-02-21T09:14:44.0881847Z  'load_eviction_policies': ['', ''],
2026-02-21T09:14:44.0881981Z  'loop_orders': [[1, 0]],
2026-02-21T09:14:44.0882090Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:14:44.0882192Z  'num_stages': 4,
2026-02-21T09:14:44.0882299Z  'num_warps': 8,
2026-02-21T09:14:44.0882388Z  'pid_type': 'xyz',
2026-02-21T09:14:44.0882481Z  'range_flattens': [None, False],
2026-02-21T09:14:44.0882705Z  'range_multi_buffers': [None, False],
2026-02-21T09:14:44.0882819Z  'range_num_stages': [0, 2],
2026-02-21T09:14:44.0882924Z  'range_unroll_factors': [0, 1],
2026-02-21T09:14:44.0883032Z  'range_warp_specializes': [],
2026-02-21T09:14:44.0883136Z  'waves_per_eu': 2}
2026-02-21T09:14:44.1390430Z [329s] Fitting surrogate: 620 points, 620 targets
2026-02-21T09:14:44.6744650Z [330s] Generation 8 starting: 41 neighbors, 3 active search path(s)
2026-02-21T09:15:02.5677929Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 0.4 configs/s
2026-02-21T09:15:05.3206913Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 16.1 configs/s
2026-02-21T09:15:07.1090122Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 538.1 configs/s
2026-02-21T09:15:07.3954703Z [353s] Generation 8 complete: 
2026-02-21T09:15:07.3955064Z ok=44
2026-02-21T09:15:07.3955279Z min=0.0257
2026-02-21T09:15:07.3955486Z mid=0.0307
2026-02-21T09:15:07.3955685Z max=0.6465
2026-02-21T09:15:07.3955915Z best={'block_sizes': [256, 1, 32],
2026-02-21T09:15:07.3956287Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:15:07.3956652Z  'l2_groupings': [32],
2026-02-21T09:15:07.3956929Z  'load_eviction_policies': ['', ''],
2026-02-21T09:15:07.3957244Z  'loop_orders': [[1, 0]],
2026-02-21T09:15:07.3957525Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:15:07.3957792Z  'num_stages': 4,
2026-02-21T09:15:07.3958024Z  'num_warps': 8,
2026-02-21T09:15:07.3958260Z  'pid_type': 'xyz',
2026-02-21T09:15:07.3958513Z  'range_flattens': [None, False],
2026-02-21T09:15:07.3958843Z  'range_multi_buffers': [None, False],
2026-02-21T09:15:07.3959153Z  'range_num_stages': [0, 2],
2026-02-21T09:15:07.3959436Z  'range_unroll_factors': [0, 1],
2026-02-21T09:15:07.3959752Z  'range_warp_specializes': [],
2026-02-21T09:15:07.3960030Z  'waves_per_eu': 2}
2026-02-21T09:15:07.4401018Z [353s] Fitting surrogate: 664 points, 664 targets
2026-02-21T09:15:08.4259654Z [354s] Generation 9 starting: 30 neighbors, 2 active search path(s)
2026-02-21T09:15:42.6144290Z [388s] Timeout after 30s compiling Config(block_sizes=[1024, 1, 128], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:15:42.6163065Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0.2 configs/s
2026-02-21T09:15:44.4929624Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 16.9 configs/s
2026-02-21T09:15:45.5864837Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 852.1 configs/s
2026-02-21T09:15:45.8590960Z [391s] Generation 9 complete: 
2026-02-21T09:15:45.8591242Z timeout=1
2026-02-21T09:15:45.8591358Z ok=32
2026-02-21T09:15:45.8591475Z min=0.0247
2026-02-21T09:15:45.8591591Z mid=0.0352
2026-02-21T09:15:45.8591711Z max=0.4406
2026-02-21T09:15:45.8591845Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:15:45.8592072Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:15:45.8592289Z  'l2_groupings': [2],
2026-02-21T09:15:45.8592455Z  'load_eviction_policies': ['', ''],
2026-02-21T09:15:45.8592673Z  'loop_orders': [[1, 0]],
2026-02-21T09:15:45.8592842Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:15:45.8593000Z  'num_stages': 4,
2026-02-21T09:15:45.8593142Z  'num_warps': 8,
2026-02-21T09:15:45.8593324Z  'pid_type': 'flat',
2026-02-21T09:15:45.8593474Z  'range_flattens': [None, False],
2026-02-21T09:15:45.8593677Z  'range_multi_buffers': [None, False],
2026-02-21T09:15:45.8593860Z  'range_num_stages': [0, 2],
2026-02-21T09:15:45.8594031Z  'range_unroll_factors': [0, 1],
2026-02-21T09:15:45.8594206Z  'range_warp_specializes': [],
2026-02-21T09:15:45.8594370Z  'waves_per_eu': 2}
2026-02-21T09:15:45.8882396Z [391s] Fitting surrogate: 697 points, 697 targets
2026-02-21T09:15:46.2022448Z [391s] Generation 10 starting: 24 neighbors, 2 active search path(s)
2026-02-21T09:16:05.6904651Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 0.7 configs/s
2026-02-21T09:16:07.3401289Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 25/25 16.6 configs/s
2026-02-21T09:16:08.1670678Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1111.4 configs/s
2026-02-21T09:16:08.3998352Z [414s] Generation 10 complete: 
2026-02-21T09:16:08.3998686Z ok=26
2026-02-21T09:16:08.3998901Z min=0.0248
2026-02-21T09:16:08.4000189Z mid=0.0438
2026-02-21T09:16:08.4000395Z max=0.6669
2026-02-21T09:16:08.4000628Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:16:08.4000998Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:16:08.4001364Z  'l2_groupings': [2],
2026-02-21T09:16:08.4001637Z  'load_eviction_policies': ['', ''],
2026-02-21T09:16:08.4001944Z  'loop_orders': [[1, 0]],
2026-02-21T09:16:08.4002219Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:16:08.4002490Z  'num_stages': 4,
2026-02-21T09:16:08.4002805Z  'num_warps': 8,
2026-02-21T09:16:08.4003039Z  'pid_type': 'flat',
2026-02-21T09:16:08.4003300Z  'range_flattens': [None, False],
2026-02-21T09:16:08.4003632Z  'range_multi_buffers': [None, False],
2026-02-21T09:16:08.4003947Z  'range_num_stages': [0, 2],
2026-02-21T09:16:08.4004222Z  'range_unroll_factors': [0, 1],
2026-02-21T09:16:08.4004759Z  'range_warp_specializes': [],
2026-02-21T09:16:08.4004979Z  'waves_per_eu': 2}
2026-02-21T09:16:08.4173532Z [414s] Fitting surrogate: 723 points, 723 targets
2026-02-21T09:16:08.6530104Z [414s] Generation 11 starting: 14 neighbors, 1 active search path(s)
2026-02-21T09:16:17.9666449Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 1.0 configs/s
2026-02-21T09:16:18.9692723Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.5 configs/s
2026-02-21T09:16:19.1735455Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3920.2 configs/s
2026-02-21T09:16:19.3750137Z [425s] Generation 11 complete: 
2026-02-21T09:16:19.3750395Z ok=16
2026-02-21T09:16:19.3750565Z min=0.0248
2026-02-21T09:16:19.3750732Z mid=0.0826
2026-02-21T09:16:19.3750881Z max=0.7023
2026-02-21T09:16:19.3751041Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:16:19.3751347Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:16:19.3751603Z  'l2_groupings': [2],
2026-02-21T09:16:19.3751807Z  'load_eviction_policies': ['', ''],
2026-02-21T09:16:19.3752063Z  'loop_orders': [[1, 0]],
2026-02-21T09:16:19.3752262Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:16:19.3752455Z  'num_stages': 4,
2026-02-21T09:16:19.3752626Z  'num_warps': 8,
2026-02-21T09:16:19.3752787Z  'pid_type': 'flat',
2026-02-21T09:16:19.3752977Z  'range_flattens': [None, False],
2026-02-21T09:16:19.3753200Z  'range_multi_buffers': [None, False],
2026-02-21T09:16:19.3753422Z  'range_num_stages': [0, 2],
2026-02-21T09:16:19.3753625Z  'range_unroll_factors': [0, 1],
2026-02-21T09:16:19.3753836Z  'range_warp_specializes': [],
2026-02-21T09:16:19.3754035Z  'waves_per_eu': 2}
2026-02-21T09:16:19.3821861Z [425s] Fitting surrogate: 739 points, 739 targets
2026-02-21T09:16:19.6135057Z [425s] Generation 12 starting: 15 neighbors, 1 active search path(s)
2026-02-21T09:16:31.6249252Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 0.6 configs/s
2026-02-21T09:16:32.6901517Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.4 configs/s
2026-02-21T09:16:32.8333508Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 6458.5 configs/s
2026-02-21T09:16:33.0058626Z [438s] Generation 12 complete: 
2026-02-21T09:16:33.0059108Z ok=17
2026-02-21T09:16:33.0059322Z min=0.0246
2026-02-21T09:16:33.0059548Z mid=0.0963
2026-02-21T09:16:33.0059745Z max=0.3697
2026-02-21T09:16:33.0059977Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:16:33.0060388Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:16:33.0060762Z  'l2_groupings': [2],
2026-02-21T09:16:33.0061038Z  'load_eviction_policies': ['', ''],
2026-02-21T09:16:33.0061348Z  'loop_orders': [[1, 0]],
2026-02-21T09:16:33.0061626Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:16:33.0061904Z  'num_stages': 4,
2026-02-21T09:16:33.0062133Z  'num_warps': 8,
2026-02-21T09:16:33.0062427Z  'pid_type': 'flat',
2026-02-21T09:16:33.0077990Z  'range_flattens': [None, False],
2026-02-21T09:16:33.0078211Z  'range_multi_buffers': [None, False],
2026-02-21T09:16:33.0078393Z  'range_num_stages': [0, 2],
2026-02-21T09:16:33.0078521Z  'range_unroll_factors': [0, 1],
2026-02-21T09:16:33.0078634Z  'range_warp_specializes': [],
2026-02-21T09:16:33.0078738Z  'waves_per_eu': 2}
2026-02-21T09:16:33.0120985Z [438s] Fitting surrogate: 756 points, 756 targets
2026-02-21T09:16:33.6667972Z [439s] Generation 13 starting: 13 neighbors, 1 active search path(s)
2026-02-21T09:16:44.4435120Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 0.8 configs/s
2026-02-21T09:16:45.3885438Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.5 configs/s
2026-02-21T09:16:45.4630749Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━━ 1000/1000 - configs/s
2026-02-21T09:16:45.5549050Z [451s] Generation 13 complete: 
2026-02-21T09:16:45.5549300Z ok=15
2026-02-21T09:16:45.5550044Z min=0.0245
2026-02-21T09:16:45.5550126Z mid=0.0961
2026-02-21T09:16:45.5550200Z max=0.7028
2026-02-21T09:16:45.5550307Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:16:45.5550446Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:16:45.5550585Z  'l2_groupings': [2],
2026-02-21T09:16:45.5550693Z  'load_eviction_policies': ['', ''],
2026-02-21T09:16:45.5550809Z  'loop_orders': [[1, 0]],
2026-02-21T09:16:45.5550917Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:16:45.5551019Z  'num_stages': 4,
2026-02-21T09:16:45.5551105Z  'num_warps': 8,
2026-02-21T09:16:45.5551192Z  'pid_type': 'flat',
2026-02-21T09:16:45.5551292Z  'range_flattens': [None, False],
2026-02-21T09:16:45.5551406Z  'range_multi_buffers': [None, False],
2026-02-21T09:16:45.5551525Z  'range_num_stages': [0, 2],
2026-02-21T09:16:45.5551629Z  'range_unroll_factors': [0, 1],
2026-02-21T09:16:45.5551740Z  'range_warp_specializes': [],
2026-02-21T09:16:45.5551847Z  'waves_per_eu': 2}
2026-02-21T09:16:45.5606402Z [451s] Fitting surrogate: 771 points, 771 targets
2026-02-21T09:16:45.8413726Z [451s] Generation 14 starting: 16 neighbors, 1 active search path(s)
2026-02-21T09:16:51.9184662Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 2.7 configs/s
2026-02-21T09:16:53.0543201Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.2 configs/s
2026-02-21T09:16:53.4215618Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2337.7 configs/s
2026-02-21T09:16:53.6426080Z [459s] Generation 14 complete: 
2026-02-21T09:16:53.6426434Z ok=18
2026-02-21T09:16:53.6426652Z min=0.0244
2026-02-21T09:16:53.6426864Z mid=0.1417
2026-02-21T09:16:53.6427075Z max=0.5240
2026-02-21T09:16:53.6427308Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:16:53.6427681Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:16:53.6428044Z  'l2_groupings': [2],
2026-02-21T09:16:53.6428372Z  'load_eviction_policies': ['', ''],
2026-02-21T09:16:53.6428686Z  'loop_orders': [[1, 0]],
2026-02-21T09:16:53.6428962Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:16:53.6429724Z  'num_stages': 4,
2026-02-21T09:16:53.6429954Z  'num_warps': 8,
2026-02-21T09:16:53.6430189Z  'pid_type': 'flat',
2026-02-21T09:16:53.6430455Z  'range_flattens': [None, False],
2026-02-21T09:16:53.6430769Z  'range_multi_buffers': [None, False],
2026-02-21T09:16:53.6431081Z  'range_num_stages': [0, 2],
2026-02-21T09:16:53.6431332Z  'range_unroll_factors': [0, 1],
2026-02-21T09:16:53.6431506Z  'range_warp_specializes': [],
2026-02-21T09:16:53.6431625Z  'waves_per_eu': 2}
2026-02-21T09:16:53.6527561Z [459s] Fitting surrogate: 789 points, 789 targets
2026-02-21T09:16:53.9038389Z [459s] Generation 15 starting: 15 neighbors, 1 active search path(s)
2026-02-21T09:17:07.3077413Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 0.7 configs/s
2026-02-21T09:17:08.3927487Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.0 configs/s
2026-02-21T09:17:08.6133927Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 4230.6 configs/s
2026-02-21T09:17:08.8129443Z [474s] Generation 15 complete: 
2026-02-21T09:17:08.8129733Z ok=17
2026-02-21T09:17:08.8129912Z min=0.0245
2026-02-21T09:17:08.8130088Z mid=0.0855
2026-02-21T09:17:08.8130256Z max=0.8225
2026-02-21T09:17:08.8130477Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:17:08.8130787Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:17:08.8131096Z  'l2_groupings': [2],
2026-02-21T09:17:08.8131328Z  'load_eviction_policies': ['', ''],
2026-02-21T09:17:08.8131584Z  'loop_orders': [[1, 0]],
2026-02-21T09:17:08.8131811Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:17:08.8132039Z  'num_stages': 4,
2026-02-21T09:17:08.8132222Z  'num_warps': 8,
2026-02-21T09:17:08.8132416Z  'pid_type': 'flat',
2026-02-21T09:17:08.8132630Z  'range_flattens': [None, False],
2026-02-21T09:17:08.8133338Z  'range_multi_buffers': [None, False],
2026-02-21T09:17:08.8133596Z  'range_num_stages': [0, 2],
2026-02-21T09:17:08.8133844Z  'range_unroll_factors': [0, 1],
2026-02-21T09:17:08.8134095Z  'range_warp_specializes': [],
2026-02-21T09:17:08.8134318Z  'waves_per_eu': 2}
2026-02-21T09:17:08.8196949Z [474s] Fitting surrogate: 806 points, 806 targets
2026-02-21T09:17:09.0388991Z [474s] Generation 16 starting: 13 neighbors, 1 active search path(s)
2026-02-21T09:17:11.2750677Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 7.0 configs/s
2026-02-21T09:17:12.2687932Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 16.4 configs/s
2026-02-21T09:17:12.9808796Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1256.6 configs/s
2026-02-21T09:17:13.1968732Z [478s] Generation 16 complete: 
2026-02-21T09:17:13.1969141Z ok=15
2026-02-21T09:17:13.1969412Z min=0.0249
2026-02-21T09:17:13.1969622Z mid=0.0344
2026-02-21T09:17:13.1969823Z max=0.0958
2026-02-21T09:17:13.1970084Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:17:13.1970454Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:17:13.1970810Z  'l2_groupings': [2],
2026-02-21T09:17:13.1971083Z  'load_eviction_policies': ['', ''],
2026-02-21T09:17:13.1971397Z  'loop_orders': [[1, 0]],
2026-02-21T09:17:13.1971672Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:17:13.1971943Z  'num_stages': 4,
2026-02-21T09:17:13.1972168Z  'num_warps': 8,
2026-02-21T09:17:13.1972398Z  'pid_type': 'flat',
2026-02-21T09:17:13.1972655Z  'range_flattens': [None, False],
2026-02-21T09:17:13.1972962Z  'range_multi_buffers': [None, False],
2026-02-21T09:17:13.1973268Z  'range_num_stages': [0, 2],
2026-02-21T09:17:13.1973546Z  'range_unroll_factors': [0, 1],
2026-02-21T09:17:13.1973842Z  'range_warp_specializes': [],
2026-02-21T09:17:13.1974115Z  'waves_per_eu': 2}
2026-02-21T09:17:13.2093224Z [478s] Fitting surrogate: 821 points, 821 targets
2026-02-21T09:17:13.4386165Z [479s] Generation 17 starting: 15 neighbors, 1 active search path(s)
2026-02-21T09:17:45.2432912Z [510s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 32], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:17:45.2454482Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 0.3 configs/s
2026-02-21T09:17:46.1656661Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 18.3 configs/s
2026-02-21T09:17:46.3825184Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 4026.2 configs/s
2026-02-21T09:17:46.5799871Z [512s] Generation 17 complete: 
2026-02-21T09:17:46.5800120Z timeout=1
2026-02-21T09:17:46.5800223Z ok=16
2026-02-21T09:17:46.5800314Z min=0.0247
2026-02-21T09:17:46.5800395Z mid=0.0858
2026-02-21T09:17:46.5800487Z max=0.8225
2026-02-21T09:17:46.5800575Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:17:46.5800724Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:17:46.5800865Z  'l2_groupings': [2],
2026-02-21T09:17:46.5800975Z  'load_eviction_policies': ['', ''],
2026-02-21T09:17:46.5801096Z  'loop_orders': [[1, 0]],
2026-02-21T09:17:46.5801212Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:17:46.5801321Z  'num_stages': 4,
2026-02-21T09:17:46.5801411Z  'num_warps': 8,
2026-02-21T09:17:46.5801506Z  'pid_type': 'flat',
2026-02-21T09:17:46.5801608Z  'range_flattens': [None, False],
2026-02-21T09:17:46.5801732Z  'range_multi_buffers': [None, False],
2026-02-21T09:17:46.5801851Z  'range_num_stages': [0, 2],
2026-02-21T09:17:46.5801963Z  'range_unroll_factors': [0, 1],
2026-02-21T09:17:46.5802790Z  'range_warp_specializes': [],
2026-02-21T09:17:46.5802977Z  'waves_per_eu': 2}
2026-02-21T09:17:46.5864861Z [512s] Fitting surrogate: 838 points, 838 targets
2026-02-21T09:17:46.8124259Z [512s] Generation 18 starting: 12 neighbors, 1 active search path(s)
2026-02-21T09:17:48.5671757Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 12.0 configs/s
2026-02-21T09:17:49.4264881Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 13/13 15.9 configs/s
2026-02-21T09:17:49.8052736Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2211.7 configs/s
2026-02-21T09:17:50.0034912Z [515s] Generation 18 complete: 
2026-02-21T09:17:50.0035137Z ok=14
2026-02-21T09:17:50.0035268Z min=0.0247
2026-02-21T09:17:50.0035372Z mid=0.0376
2026-02-21T09:17:50.0035466Z max=0.0976
2026-02-21T09:17:50.0035579Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:17:50.0035810Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:17:50.0035974Z  'l2_groupings': [2],
2026-02-21T09:17:50.0036680Z  'load_eviction_policies': ['', ''],
2026-02-21T09:17:50.0036821Z  'loop_orders': [[1, 0]],
2026-02-21T09:17:50.0036941Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:17:50.0037064Z  'num_stages': 4,
2026-02-21T09:17:50.0037173Z  'num_warps': 8,
2026-02-21T09:17:50.0037277Z  'pid_type': 'flat',
2026-02-21T09:17:50.0037402Z  'range_flattens': [None, False],
2026-02-21T09:17:50.0037542Z  'range_multi_buffers': [None, False],
2026-02-21T09:17:50.0037677Z  'range_num_stages': [0, 2],
2026-02-21T09:17:50.0037803Z  'range_unroll_factors': [0, 1],
2026-02-21T09:17:50.0037936Z  'range_warp_specializes': [],
2026-02-21T09:17:50.0038063Z  'waves_per_eu': 2}
2026-02-21T09:17:50.0135076Z [515s] Fitting surrogate: 852 points, 852 targets
2026-02-21T09:17:50.6759629Z [516s] Generation 19 starting: 13 neighbors, 1 active search path(s)
2026-02-21T09:17:55.3140195Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 2.0 configs/s
2026-02-21T09:17:56.2805116Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.0 configs/s
2026-02-21T09:17:56.5926298Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2674.6 configs/s
2026-02-21T09:17:56.7790452Z [522s] Generation 19 complete: 
2026-02-21T09:17:56.7790674Z ok=15
2026-02-21T09:17:56.7790766Z min=0.0247
2026-02-21T09:17:56.7790854Z mid=0.0396
2026-02-21T09:17:56.7790939Z max=0.4143
2026-02-21T09:17:56.7791029Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:17:56.7791179Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:17:56.7791318Z  'l2_groupings': [2],
2026-02-21T09:17:56.7791435Z  'load_eviction_policies': ['', ''],
2026-02-21T09:17:56.7791556Z  'loop_orders': [[1, 0]],
2026-02-21T09:17:56.7791671Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:17:56.7791777Z  'num_stages': 4,
2026-02-21T09:17:56.7792473Z  'num_warps': 8,
2026-02-21T09:17:56.7792570Z  'pid_type': 'flat',
2026-02-21T09:17:56.7792691Z  'range_flattens': [None, False],
2026-02-21T09:17:56.7792816Z  'range_multi_buffers': [None, False],
2026-02-21T09:17:56.7792934Z  'range_num_stages': [0, 2],
2026-02-21T09:17:56.7793048Z  'range_unroll_factors': [0, 1],
2026-02-21T09:17:56.7793161Z  'range_warp_specializes': [],
2026-02-21T09:17:56.7793272Z  'waves_per_eu': 2}
2026-02-21T09:17:56.7864208Z [522s] Fitting surrogate: 867 points, 867 targets
2026-02-21T09:17:57.0347063Z [522s] Generation 20 starting: 15 neighbors, 1 active search path(s)
2026-02-21T09:18:01.6267658Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.2 configs/s
2026-02-21T09:18:02.7355850Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.5 configs/s
2026-02-21T09:18:03.4305113Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1288.7 configs/s
2026-02-21T09:18:03.6656571Z [529s] Generation 20 complete: 
2026-02-21T09:18:03.6656953Z ok=17
2026-02-21T09:18:03.6657179Z min=0.0248
2026-02-21T09:18:03.6657397Z mid=0.0351
2026-02-21T09:18:03.6657600Z max=0.1104
2026-02-21T09:18:03.6657835Z best={'block_sizes': [512, 1, 32],
2026-02-21T09:18:03.6658212Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:18:03.6658583Z  'l2_groupings': [2],
2026-02-21T09:18:03.6658862Z  'load_eviction_policies': ['', ''],
2026-02-21T09:18:03.6659180Z  'loop_orders': [[1, 0]],
2026-02-21T09:18:03.6659472Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:18:03.6659747Z  'num_stages': 4,
2026-02-21T09:18:03.6659980Z  'num_warps': 8,
2026-02-21T09:18:03.6660213Z  'pid_type': 'flat',
2026-02-21T09:18:03.6660476Z  'range_flattens': [None, False],
2026-02-21T09:18:03.6660785Z  'range_multi_buffers': [None, False],
2026-02-21T09:18:03.6661098Z  'range_num_stages': [0, 2],
2026-02-21T09:18:03.6661385Z  'range_unroll_factors': [0, 1],
2026-02-21T09:18:03.6661704Z  'range_warp_specializes': [],
2026-02-21T09:18:03.6661988Z  'waves_per_eu': 2}
2026-02-21T09:18:03.6803947Z [529s] Fitting surrogate: 884 points, 884 targets
2026-02-21T09:18:03.8024768Z [529s] Autotuning complete in 529.5s after searching 832 configs.
2026-02-21T09:18:03.8025181Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:18:03.8026565Z     @helion.kernel(config=helion.Config(block_sizes=[512, 1, 32], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:18:03.8027832Z 
2026-02-21T09:18:03.8028176Z [529s] Code of selected kernel: /tmp/torchinductor_root/xy/cxyczichkyjbyzbbk57upsfj6bffbu55pjaku6spkqal2zheky63.py
2026-02-21T09:18:03.8202700Z from __future__ import annotations
2026-02-21T09:18:03.8202860Z 
2026-02-21T09:18:03.8202928Z import torch
2026-02-21T09:18:03.8203086Z import triton
2026-02-21T09:18:03.8203255Z import triton.language as tl
2026-02-21T09:18:03.8203520Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:18:03.8203737Z 
2026-02-21T09:18:03.8203825Z _BLOCK_SIZE_2 = tl.constexpr(32)
2026-02-21T09:18:03.8204032Z _BLOCK_SIZE_1 = tl.constexpr(1)
2026-02-21T09:18:03.8204234Z _BLOCK_SIZE_0 = tl.constexpr(512)
2026-02-21T09:18:03.8204365Z 
2026-02-21T09:18:03.8204427Z @triton.jit
2026-02-21T09:18:03.8204760Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr, _SHAPE_DIM_3: tl.constexpr):
2026-02-21T09:18:03.8205212Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:18:03.8205504Z     num_pid_m = tl.cdiv(8192, _BLOCK_SIZE_2)
2026-02-21T09:18:03.8205705Z     num_pid_n = 1
2026-02-21T09:18:03.8206215Z     inner_2d_pid = tl.program_id(0)
2026-02-21T09:18:03.8206426Z     num_pid_in_group = 2 * num_pid_n
2026-02-21T09:18:03.8206649Z     group_id = inner_2d_pid // num_pid_in_group
2026-02-21T09:18:03.8206876Z     first_pid_m = group_id * 2
2026-02-21T09:18:03.8207095Z     group_size_m = min(num_pid_m - first_pid_m, 2)
2026-02-21T09:18:03.8207393Z     pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m
2026-02-21T09:18:03.8207666Z     offset_2 = pid_0 * _BLOCK_SIZE_2
2026-02-21T09:18:03.8207923Z     indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32)
2026-02-21T09:18:03.8208274Z     # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:18:03.8208616Z     acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32)
2026-02-21T09:18:03.8208995Z     # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:18:03.8209461Z     # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:18:03.8209912Z     # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:18:03.8210225Z     # src[int4_gemm.py:60-89]: ...
2026-02-21T09:18:03.8210640Z     for offset_3 in tl.range(0, 1792, _BLOCK_SIZE_0, loop_unroll_factor=1, num_stages=2, disallow_acc_multi_buffer=True, flatten=False):
2026-02-21T09:18:03.8211124Z         indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
2026-02-21T09:18:03.8211392Z         mask_0 = indices_3 < 1792
2026-02-21T09:18:03.8211580Z         acc_copy = acc
2026-02-21T09:18:03.8211753Z         acc_copy_0 = acc_copy
2026-02-21T09:18:03.8212000Z         # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2
2026-02-21T09:18:03.8212247Z         mul = 2 * offset_3
2026-02-21T09:18:03.8212534Z         # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:18:03.8212852Z         iota = mul + tl.arange(0, mul_1)
2026-02-21T09:18:03.8213174Z         load = tl.broadcast_to(tl.load(A + iota[None, :] * 1, None), [_BLOCK_SIZE_1, _SHAPE_DIM_2])
2026-02-21T09:18:03.8213686Z         # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:18:03.8213975Z         # src[int4_gemm.py:66]:     torch.float32
2026-02-21T09:18:03.8214181Z         # src[int4_gemm.py:67]: )  # [BLOCK_SIZE_M, BLOCK_SIZE_K]
2026-02-21T09:18:03.8214379Z         v_0 = tl.cast(load, tl.float32)
2026-02-21T09:18:03.8214625Z         # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n]  # [BLOCK_SIZE_K//2, BLOCK_SIZE_N]
2026-02-21T09:18:03.8214965Z         b_tile = tl.load(B + (indices_3[:, None] * 8192 + indices_2[None, :] * 1), mask_0[:, None], other=0)
2026-02-21T09:18:03.8215304Z         # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8)  # Sign-extend low 4 bits
2026-02-21T09:18:03.8215544Z         v_1 = tl.full([], 4, tl.int8)
2026-02-21T09:18:03.8215701Z         v_2 = b_tile << v_1
2026-02-21T09:18:03.8215850Z         v_3 = tl.full([], 4, tl.int8)
2026-02-21T09:18:03.8215994Z         v_4 = v_2 >> v_3
2026-02-21T09:18:03.8216208Z         # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8)  # Sign-extend high 4 bits
2026-02-21T09:18:03.8216438Z         v_5 = tl.full([], 4, tl.int8)
2026-02-21T09:18:03.8216583Z         v_6 = b_tile >> v_5
2026-02-21T09:18:03.8216769Z         # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1)
2026-02-21T09:18:03.8216981Z         stack_idx = tl.arange(0, 2)
2026-02-21T09:18:03.8217148Z         broadcast_idx = stack_idx[None, :, None]
2026-02-21T09:18:03.8217321Z         expanded_0 = tl.expand_dims(v_4, 1)
2026-02-21T09:18:03.8217496Z         expanded_1 = tl.expand_dims(v_6, 1)
2026-02-21T09:18:03.8217665Z         stacked_result = tl.zeros_like(expanded_0)
2026-02-21T09:18:03.8217844Z         mask_1 = broadcast_idx == 0
2026-02-21T09:18:03.8218044Z         stacked_result = tl.where(mask_1, expanded_0, stacked_result)
2026-02-21T09:18:03.8218302Z         mask_2 = broadcast_idx == 1
2026-02-21T09:18:03.8218497Z         stacked_result = tl.where(mask_2, expanded_1, stacked_result)
2026-02-21T09:18:03.8218740Z         # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:18:03.8218990Z         # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:18:03.8219216Z         # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:18:03.8219439Z         view = tl.reshape(stacked_result, [_SHAPE_DIM_3, _BLOCK_SIZE_2])
2026-02-21T09:18:03.8219653Z         v_7 = tl.cast(view, tl.float32)
2026-02-21T09:18:03.8219874Z         # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2)  # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1]
2026-02-21T09:18:03.8220121Z         a_tile_1 = v_0[:, :, None]
2026-02-21T09:18:03.8220309Z         # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0)
2026-02-21T09:18:03.8220514Z         b_unpacked_1 = v_7[None, :, :]
2026-02-21T09:18:03.8220771Z         # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1)  # [BLOCK_SIZE_M, BLOCK_SIZE_N]
2026-02-21T09:18:03.8221029Z         v_8 = a_tile_1 * b_unpacked_1
2026-02-21T09:18:03.8221200Z         sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32)
2026-02-21T09:18:03.8221371Z         acc = acc_copy_0 + sum_1
2026-02-21T09:18:03.8221601Z     # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16)
2026-02-21T09:18:03.8221805Z     v_10 = tl.cast(acc, tl.bfloat16)
2026-02-21T09:18:03.8222137Z     tl.store(tl.make_block_ptr(C, [1, 8192], [8192, 1], [0, offset_2], [1, _BLOCK_SIZE_2], [1, 0]), tl.reshape(v_10, [1, _BLOCK_SIZE_2]), boundary_check=[1])
2026-02-21T09:18:03.8222421Z 
2026-02-21T09:18:03.8222536Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher):
2026-02-21T09:18:03.8222749Z     """
2026-02-21T09:18:03.8222905Z     BFloat16 x INT4 General Matrix Multiplication (GEMM).
2026-02-21T09:18:03.8223043Z 
2026-02-21T09:18:03.8223132Z     This kernel performs matrix multiplication where:
2026-02-21T09:18:03.8223338Z     - A is a bfloat16 matrix of shape [M, K]
2026-02-21T09:18:03.8223552Z     - B is an int8 matrix of shape [K//2, N] containing packed int4 values
2026-02-21T09:18:03.8223827Z       (two 4-bit values packed into each int8)
2026-02-21T09:18:03.8223947Z 
2026-02-21T09:18:03.8223996Z     Args:
2026-02-21T09:18:03.8224172Z         A (Tensor): Input tensor of shape [M, K] in bfloat16 format.
2026-02-21T09:18:03.8224408Z         B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format.
2026-02-21T09:18:03.8224533Z 
2026-02-21T09:18:03.8224571Z     Returns:
2026-02-21T09:18:03.8224699Z         Tensor: Output tensor of shape [M, N] in bfloat16 format.
2026-02-21T09:18:03.8224846Z     """
2026-02-21T09:18:03.8224945Z     # src[int4_gemm.py:50]: M, K = A.shape
2026-02-21T09:18:03.8225069Z     M, K = A.shape
2026-02-21T09:18:03.8225172Z     # src[int4_gemm.py:51]: _, N = B.shape
2026-02-21T09:18:03.8225297Z     _, N = B.shape
2026-02-21T09:18:03.8225453Z     # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:18:03.8225675Z     C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:18:03.8225866Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:18:03.8226023Z     _BLOCK_SIZE_2 = 32
2026-02-21T09:18:03.8226199Z     # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:18:03.8226483Z     # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:18:03.8226756Z     # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:18:03.8226944Z     # src[int4_gemm.py:60-89]: ...
2026-02-21T09:18:03.8227064Z     _BLOCK_SIZE_0 = 512
2026-02-21T09:18:03.8227234Z     # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:18:03.8227426Z     _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0
2026-02-21T09:18:03.8227614Z     # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:18:03.8227808Z     # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:18:03.8227994Z     # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:18:03.8228125Z     _SHAPE_DIM_3 = 2 * _BLOCK_SIZE_0
2026-02-21T09:18:03.8228279Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:18:03.8228514Z     # src[int4_gemm.py:58]:     acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:18:03.8228696Z     # src[int4_gemm.py:57-91]: ...
2026-02-21T09:18:03.8228842Z     _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0)
2026-02-21T09:18:03.8229220Z     _launcher(_helion_matmul_bf16_int4, (triton.cdiv(8192, _BLOCK_SIZE_2) * 1,), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, _SHAPE_DIM_3, num_warps=8, num_stages=4, waves_per_eu=2, matrix_instr_nonkdim=32)
2026-02-21T09:18:03.8229566Z     # src[int4_gemm.py:93]: return C
2026-02-21T09:18:03.8229681Z     return C
2026-02-21T09:18:04.6690181Z WARNING:tritonbench.utils.triton_op:Completed input ID 3:
2026-02-21T09:18:04.6690459Z x_val
2026-02-21T09:18:04.6690572Z ------------------
2026-02-21T09:18:04.6690716Z (1, 1, 8192, 3584)
2026-02-21T09:18:04.6690782Z 
2026-02-21T09:18:04.6698751Z  20%|██        | 2/10 [15:25<1:03:23, 475.49s/it]WARNING:tritonbench.utils.triton_op:Running input ID 7:
2026-02-21T09:18:04.6699025Z x_val
2026-02-21T09:18:04.6699135Z ------------------
2026-02-21T09:18:04.6699255Z (4, 1, 8192, 3584)
2026-02-21T09:18:04.6704057Z INFO:tritonbench.utils.triton_op:Took 0.43ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T09:18:05.8439032Z INFO:tritonbench.utils.triton_op:Took 4.81ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T09:18:07.7728431Z INFO:tritonbench.utils.triton_op:Took 0.18ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T09:18:07.7847587Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:18:07.7847905Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:18:07.7848123Z               'dtype': 'torch.bfloat16',
2026-02-21T09:18:07.7848316Z               'shape': (4, 1, 3584),
2026-02-21T09:18:07.7849047Z               'stride': (3584, 3584, 1)},
2026-02-21T09:18:07.7849172Z             { 'device': 'cuda:0',
2026-02-21T09:18:07.7849290Z               'dtype': 'torch.int32',
2026-02-21T09:18:07.7849410Z               'shape': (3584, 8192),
2026-02-21T09:18:07.7849520Z               'stride': (8192, 1)}),
2026-02-21T09:18:07.7849666Z   'kwargs': {}}
2026-02-21T09:18:07.7882151Z INFO:tritonbench.utils.triton_op:Took 3.72ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T09:18:08.0113463Z [0s] Autotune random seed: 2138032649
2026-02-21T09:18:08.0385374Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:18:41.5978666Z [33s] Timeout after 30s compiling Config(block_sizes=[512, 1, 16], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:18:42.5755890Z [34s] Timeout after 30s compiling Config(block_sizes=[512, 1, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T09:18:42.9950703Z [34s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 64], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[3, 3], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T09:18:43.2622888Z [35s] Timeout after 30s compiling Config(block_sizes=[2048, 1, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[4, 1], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:18:43.7770059Z [35s] Timeout after 30s compiling Config(block_sizes=[8, 1, 2048], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:18:49.5235774Z [41s] Timeout after 30s compiling Config(block_sizes=[128, 1, 256], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T09:18:50.1277516Z [42s] Timeout after 30s compiling Config(block_sizes=[512, 1, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[4, 1], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:18:50.1290358Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.5 configs/s
2026-02-21T09:18:56.0853413Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.9 configs/s
2026-02-21T09:18:56.0863189Z [48s] Adaptive compile timeout: 30s (90% percentile=10.9s, bounds=[30.0s, 30s])
2026-02-21T09:18:56.2517242Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 5458.9 configs/s
2026-02-21T09:18:56.5238593Z [48s] Initial random population of 100, 5 starting points: 
2026-02-21T09:18:56.5239109Z timeout=7
2026-02-21T09:18:56.5239328Z ok=93
2026-02-21T09:18:56.5239539Z min=0.0455
2026-02-21T09:18:56.5239754Z mid=0.2931
2026-02-21T09:18:56.5239965Z max=21.1505
2026-02-21T09:18:56.5240213Z best={'block_sizes': [128, 2, 16],
2026-02-21T09:18:56.5240577Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T09:18:56.5240949Z  'l2_groupings': [2],
2026-02-21T09:18:56.5241300Z  'load_eviction_policies': ['', ''],
2026-02-21T09:18:56.5241620Z  'loop_orders': [[0, 1]],
2026-02-21T09:18:56.5241897Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:18:56.5242201Z  'num_stages': 2,
2026-02-21T09:18:56.5242442Z  'num_warps': 1,
2026-02-21T09:18:56.5242779Z  'pid_type': 'xyz',
2026-02-21T09:18:56.5243044Z  'range_flattens': [None, None],
2026-02-21T09:18:56.5243362Z  'range_multi_buffers': [None, False],
2026-02-21T09:18:56.5243682Z  'range_num_stages': [0, 3],
2026-02-21T09:18:56.5243972Z  'range_unroll_factors': [0, 1],
2026-02-21T09:18:56.5244277Z  'range_warp_specializes': [],
2026-02-21T09:18:56.5244561Z  'waves_per_eu': 1}
2026-02-21T09:18:56.5277234Z [48s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:18:57.8168880Z [49s] Generation 1 starting: 90 neighbors, 5 active search path(s)
2026-02-21T09:19:29.4778370Z [81s] Timeout after 30s compiling Config(block_sizes=[512, 2, 16], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[4, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:19:30.3199521Z [82s] Timeout after 30s compiling Config(block_sizes=[512, 2, 16], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='xyz', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:19:30.3221152Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 0.7 configs/s
2026-02-21T09:19:36.0033249Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 16.7 configs/s
2026-02-21T09:19:37.9571444Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 479.9 configs/s
2026-02-21T09:19:38.3038593Z [90s] Generation 1 complete: 
2026-02-21T09:19:38.3038963Z timeout=2
2026-02-21T09:19:38.3039172Z ok=94
2026-02-21T09:19:38.3039388Z min=0.0399
2026-02-21T09:19:38.3039597Z mid=0.0730
2026-02-21T09:19:38.3039802Z max=0.5802
2026-02-21T09:19:38.3040030Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:19:38.3040412Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:19:38.3040782Z  'l2_groupings': [32],
2026-02-21T09:19:38.3041065Z  'load_eviction_policies': ['', ''],
2026-02-21T09:19:38.3041380Z  'loop_orders': [[1, 0]],
2026-02-21T09:19:38.3041662Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:19:38.3041935Z  'num_stages': 3,
2026-02-21T09:19:38.3042171Z  'num_warps': 1,
2026-02-21T09:19:38.3042410Z  'pid_type': 'xyz',
2026-02-21T09:19:38.3042741Z  'range_flattens': [None, None],
2026-02-21T09:19:38.3043096Z  'range_multi_buffers': [None, None],
2026-02-21T09:19:38.3043409Z  'range_num_stages': [0, 2],
2026-02-21T09:19:38.3043520Z  'range_unroll_factors': [0, 1],
2026-02-21T09:19:38.3044015Z  'range_warp_specializes': [],
2026-02-21T09:19:38.3044129Z  'waves_per_eu': 1}
2026-02-21T09:19:38.3379682Z [90s] Fitting surrogate: 196 points, 196 targets
2026-02-21T09:19:39.2627573Z [91s] Generation 2 starting: 93 neighbors, 5 active search path(s)
2026-02-21T09:20:12.8403599Z [124s] Timeout after 30s compiling Config(block_sizes=[128, 2, 32], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:20:13.0041845Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 0.6 configs/s
2026-02-21T09:20:18.8929806Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 16.7 configs/s
2026-02-21T09:20:24.0346967Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 190.8 configs/s
2026-02-21T09:20:24.5604149Z [136s] Generation 2 complete: 
2026-02-21T09:20:24.5604580Z timeout=1
2026-02-21T09:20:24.5604791Z ok=98
2026-02-21T09:20:24.5605006Z min=0.0390
2026-02-21T09:20:24.5605215Z mid=0.0567
2026-02-21T09:20:24.5605425Z max=0.7450
2026-02-21T09:20:24.5605659Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:20:24.5606065Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:20:24.5606463Z  'l2_groupings': [32],
2026-02-21T09:20:24.5606747Z  'load_eviction_policies': ['', ''],
2026-02-21T09:20:24.5607068Z  'loop_orders': [[1, 0]],
2026-02-21T09:20:24.5607352Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:20:24.5607631Z  'num_stages': 3,
2026-02-21T09:20:24.5608607Z  'num_warps': 2,
2026-02-21T09:20:24.5608848Z  'pid_type': 'flat',
2026-02-21T09:20:24.5609089Z  'range_flattens': [None, None],
2026-02-21T09:20:24.5609360Z  'range_multi_buffers': [None, None],
2026-02-21T09:20:24.5609609Z  'range_num_stages': [0, 2],
2026-02-21T09:20:24.5609836Z  'range_unroll_factors': [0, 1],
2026-02-21T09:20:24.5610073Z  'range_warp_specializes': [],
2026-02-21T09:20:24.5610306Z  'waves_per_eu': 1}
2026-02-21T09:20:24.6803951Z [136s] Fitting surrogate: 295 points, 295 targets
2026-02-21T09:20:25.6412062Z [137s] Generation 3 starting: 97 neighbors, 5 active search path(s)
2026-02-21T09:20:54.1576317Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 0.6 configs/s
2026-02-21T09:21:00.1958119Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 16.5 configs/s
2026-02-21T09:21:05.3419639Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 190.8 configs/s
2026-02-21T09:21:05.8574618Z [177s] Generation 3 complete: 
2026-02-21T09:21:05.8574975Z ok=102
2026-02-21T09:21:05.8575227Z min=0.0388
2026-02-21T09:21:05.8575400Z mid=0.0545
2026-02-21T09:21:05.8575601Z max=1.0205
2026-02-21T09:21:05.8575789Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:21:05.8576081Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:21:05.8576371Z  'l2_groupings': [32],
2026-02-21T09:21:05.8576593Z  'load_eviction_policies': ['', ''],
2026-02-21T09:21:05.8576838Z  'loop_orders': [[1, 0]],
2026-02-21T09:21:05.8577074Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:21:05.8577289Z  'num_stages': 3,
2026-02-21T09:21:05.8577477Z  'num_warps': 2,
2026-02-21T09:21:05.8577663Z  'pid_type': 'flat',
2026-02-21T09:21:05.8577877Z  'range_flattens': [None, None],
2026-02-21T09:21:05.8578115Z  'range_multi_buffers': [None, None],
2026-02-21T09:21:05.8578360Z  'range_num_stages': [0, 2],
2026-02-21T09:21:05.8578580Z  'range_unroll_factors': [0, 1],
2026-02-21T09:21:05.8578844Z  'range_warp_specializes': [],
2026-02-21T09:21:05.8579066Z  'waves_per_eu': 1}
2026-02-21T09:21:05.9751681Z [177s] Fitting surrogate: 397 points, 397 targets
2026-02-21T09:21:07.1919024Z [179s] Generation 4 starting: 79 neighbors, 5 active search path(s)
2026-02-21T09:21:43.3819417Z [215s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 8], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[1, 2], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:21:46.1295606Z [218s] Timeout after 30s compiling Config(block_sizes=[1024, 4, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:21:46.1312852Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.4 configs/s
2026-02-21T09:21:48.5359476Z [220s] Skipping config with accuracy mismatch: helion.Config(block_sizes=[512, 4, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:21:48.5361417Z Tensor-likes are not close!
2026-02-21T09:21:48.5361625Z 
2026-02-21T09:21:48.5361768Z Mismatched elements: 8006 / 32768 (24.4%)
2026-02-21T09:21:48.5363015Z Greatest absolute difference: 342.0 at index (0, 6175) (up to 0.01 allowed)
2026-02-21T09:21:48.5363599Z Greatest relative difference: 4608.0 at index (0, 549) (up to 0.01 allowed)
2026-02-21T09:21:48.5364145Z Use HELION_AUTOTUNE_ACCURACY_CHECK=0 to disable this check.
2026-02-21T09:21:48.5364444Z 
2026-02-21T09:21:50.9293569Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 16.6 configs/s
2026-02-21T09:21:55.3020185Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 224.2 configs/s
2026-02-21T09:21:55.7656657Z [227s] Generation 4 complete: 
2026-02-21T09:21:55.7656866Z error=1
2026-02-21T09:21:55.7656956Z timeout=2
2026-02-21T09:21:55.7657036Z ok=81
2026-02-21T09:21:55.7657121Z min=0.0388
2026-02-21T09:21:55.7657201Z mid=0.0466
2026-02-21T09:21:55.7657282Z max=1.3751
2026-02-21T09:21:55.7657370Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:21:55.7657514Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:21:55.7657717Z  'l2_groupings': [32],
2026-02-21T09:21:55.7658128Z  'load_eviction_policies': ['', ''],
2026-02-21T09:21:55.7658267Z  'loop_orders': [[1, 0]],
2026-02-21T09:21:55.7658373Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:21:55.7658482Z  'num_stages': 3,
2026-02-21T09:21:55.7658572Z  'num_warps': 2,
2026-02-21T09:21:55.7658666Z  'pid_type': 'flat',
2026-02-21T09:21:55.7658765Z  'range_flattens': [None, None],
2026-02-21T09:21:55.7662395Z  'range_multi_buffers': [None, None],
2026-02-21T09:21:55.7662510Z  'range_num_stages': [0, 2],
2026-02-21T09:21:55.7662620Z  'range_unroll_factors': [0, 1],
2026-02-21T09:21:55.7662731Z  'range_warp_specializes': [],
2026-02-21T09:21:55.7662843Z  'waves_per_eu': 1}
2026-02-21T09:21:55.8644654Z [227s] Fitting surrogate: 481 points, 481 targets
2026-02-21T09:21:56.6291063Z [228s] Generation 5 starting: 71 neighbors, 4 active search path(s)
2026-02-21T09:22:31.0463921Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.3 configs/s
2026-02-21T09:22:35.5289564Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 16.4 configs/s
2026-02-21T09:22:39.5962487Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 240.7 configs/s
2026-02-21T09:22:40.0664947Z [272s] Generation 5 complete: 
2026-02-21T09:22:40.0665287Z ok=76
2026-02-21T09:22:40.0665480Z min=0.0388
2026-02-21T09:22:40.0665678Z mid=0.0464
2026-02-21T09:22:40.0665861Z max=1.5840
2026-02-21T09:22:40.0666072Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:22:40.0666407Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:22:40.0666745Z  'l2_groupings': [32],
2026-02-21T09:22:40.0667008Z  'load_eviction_policies': ['', ''],
2026-02-21T09:22:40.0667288Z  'loop_orders': [[1, 0]],
2026-02-21T09:22:40.0667545Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:22:40.0667786Z  'num_stages': 3,
2026-02-21T09:22:40.0668002Z  'num_warps': 2,
2026-02-21T09:22:40.0668244Z  'pid_type': 'flat',
2026-02-21T09:22:40.0668598Z  'range_flattens': [None, None],
2026-02-21T09:22:40.0668886Z  'range_multi_buffers': [None, None],
2026-02-21T09:22:40.0669198Z  'range_num_stages': [0, 2],
2026-02-21T09:22:40.0669457Z  'range_unroll_factors': [0, 1],
2026-02-21T09:22:40.0669721Z  'range_warp_specializes': [],
2026-02-21T09:22:40.0669981Z  'waves_per_eu': 1}
2026-02-21T09:22:40.1820699Z [272s] Fitting surrogate: 557 points, 557 targets
2026-02-21T09:22:40.7469271Z [272s] Generation 6 starting: 52 neighbors, 3 active search path(s)
2026-02-21T09:22:48.8849724Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 3.0 configs/s
2026-02-21T09:22:52.3112198Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 15.8 configs/s
2026-02-21T09:22:56.1384759Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 305.6 configs/s
2026-02-21T09:22:56.5418584Z [288s] Generation 6 complete: 
2026-02-21T09:22:56.5419033Z ok=56
2026-02-21T09:22:56.5419245Z min=0.0389
2026-02-21T09:22:56.5419484Z mid=0.0446
2026-02-21T09:22:56.5419689Z max=0.7937
2026-02-21T09:22:56.5419925Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:22:56.5420300Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:22:56.5420669Z  'l2_groupings': [32],
2026-02-21T09:22:56.5420945Z  'load_eviction_policies': ['', ''],
2026-02-21T09:22:56.5421259Z  'loop_orders': [[1, 0]],
2026-02-21T09:22:56.5421538Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:22:56.5421822Z  'num_stages': 3,
2026-02-21T09:22:56.5422057Z  'num_warps': 2,
2026-02-21T09:22:56.5422244Z  'pid_type': 'flat',
2026-02-21T09:22:56.5422412Z  'range_flattens': [None, None],
2026-02-21T09:22:56.5422600Z  'range_multi_buffers': [None, None],
2026-02-21T09:22:56.5422783Z  'range_num_stages': [0, 2],
2026-02-21T09:22:56.5422955Z  'range_unroll_factors': [0, 1],
2026-02-21T09:22:56.5423143Z  'range_warp_specializes': [],
2026-02-21T09:22:56.5423322Z  'waves_per_eu': 1}
2026-02-21T09:22:56.6176628Z [288s] Fitting surrogate: 613 points, 613 targets
2026-02-21T09:22:57.2349523Z [289s] Generation 7 starting: 48 neighbors, 3 active search path(s)
2026-02-21T09:23:04.8150373Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 6.4 configs/s
2026-02-21T09:23:07.8628791Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 16.5 configs/s
2026-02-21T09:23:10.7200929Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 337.6 configs/s
2026-02-21T09:23:11.0884376Z [303s] Generation 7 complete: 
2026-02-21T09:23:11.0884683Z ok=52
2026-02-21T09:23:11.0884856Z min=0.0387
2026-02-21T09:23:11.0885029Z mid=0.0460
2026-02-21T09:23:11.0885190Z max=0.2855
2026-02-21T09:23:11.0885371Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:23:11.0885673Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:23:11.0886019Z  'l2_groupings': [32],
2026-02-21T09:23:11.0886240Z  'load_eviction_policies': ['', ''],
2026-02-21T09:23:11.0886520Z  'loop_orders': [[1, 0]],
2026-02-21T09:23:11.0886742Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:23:11.0886995Z  'num_stages': 3,
2026-02-21T09:23:11.0887182Z  'num_warps': 2,
2026-02-21T09:23:11.0887377Z  'pid_type': 'flat',
2026-02-21T09:23:11.0887591Z  'range_flattens': [None, None],
2026-02-21T09:23:11.0887835Z  'range_multi_buffers': [None, None],
2026-02-21T09:23:11.0888085Z  'range_num_stages': [0, 2],
2026-02-21T09:23:11.0888308Z  'range_unroll_factors': [0, 1],
2026-02-21T09:23:11.0888563Z  'range_warp_specializes': [],
2026-02-21T09:23:11.0888803Z  'waves_per_eu': 1}
2026-02-21T09:23:11.1468505Z [303s] Fitting surrogate: 665 points, 665 targets
2026-02-21T09:23:11.5804276Z [303s] Generation 8 starting: 37 neighbors, 2 active search path(s)
2026-02-21T09:23:17.9289164Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 3.6 configs/s
2026-02-21T09:23:20.3179935Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 16.5 configs/s
2026-02-21T09:23:22.4407838Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 448.1 configs/s
2026-02-21T09:23:22.7455284Z [314s] Generation 8 complete: 
2026-02-21T09:23:22.7455649Z ok=40
2026-02-21T09:23:22.7455952Z min=0.0384
2026-02-21T09:23:22.7456212Z mid=0.0445
2026-02-21T09:23:22.7456419Z max=0.3020
2026-02-21T09:23:22.7456651Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:23:22.7457020Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:23:22.7457403Z  'l2_groupings': [1],
2026-02-21T09:23:22.7457679Z  'load_eviction_policies': ['', ''],
2026-02-21T09:23:22.7458002Z  'loop_orders': [[0, 1]],
2026-02-21T09:23:22.7458284Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:23:22.7458552Z  'num_stages': 1,
2026-02-21T09:23:22.7458785Z  'num_warps': 4,
2026-02-21T09:23:22.7459057Z  'pid_type': 'flat',
2026-02-21T09:23:22.7459318Z  'range_flattens': [None, None],
2026-02-21T09:23:22.7459575Z  'range_multi_buffers': [None, False],
2026-02-21T09:23:22.7459859Z  'range_num_stages': [0, 2],
2026-02-21T09:23:22.7460097Z  'range_unroll_factors': [0, 0],
2026-02-21T09:23:22.7460350Z  'range_warp_specializes': [],
2026-02-21T09:23:22.7460582Z  'waves_per_eu': 3}
2026-02-21T09:23:22.7830875Z [314s] Fitting surrogate: 705 points, 705 targets
2026-02-21T09:23:23.2116073Z [315s] Generation 9 starting: 36 neighbors, 2 active search path(s)
2026-02-21T09:23:28.7091777Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 4.5 configs/s
2026-02-21T09:23:31.0566854Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 16.3 configs/s
2026-02-21T09:23:32.7000023Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 564.2 configs/s
2026-02-21T09:23:32.9872466Z [324s] Generation 9 complete: 
2026-02-21T09:23:32.9872865Z ok=38
2026-02-21T09:23:32.9873058Z min=0.0385
2026-02-21T09:23:32.9873867Z mid=0.0523
2026-02-21T09:23:32.9874001Z max=0.3922
2026-02-21T09:23:32.9874139Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:23:32.9874359Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:23:32.9874575Z  'l2_groupings': [1],
2026-02-21T09:23:32.9874730Z  'load_eviction_policies': ['', ''],
2026-02-21T09:23:32.9874912Z  'loop_orders': [[0, 1]],
2026-02-21T09:23:32.9875080Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:23:32.9875243Z  'num_stages': 1,
2026-02-21T09:23:32.9875384Z  'num_warps': 4,
2026-02-21T09:23:32.9875521Z  'pid_type': 'flat',
2026-02-21T09:23:32.9875668Z  'range_flattens': [None, None],
2026-02-21T09:23:32.9875850Z  'range_multi_buffers': [None, False],
2026-02-21T09:23:32.9876028Z  'range_num_stages': [0, 2],
2026-02-21T09:23:32.9876195Z  'range_unroll_factors': [0, 0],
2026-02-21T09:23:32.9876372Z  'range_warp_specializes': [],
2026-02-21T09:23:32.9876545Z  'waves_per_eu': 3}
2026-02-21T09:23:33.0146879Z [324s] Fitting surrogate: 743 points, 743 targets
2026-02-21T09:23:33.3910158Z [325s] Generation 10 starting: 31 neighbors, 2 active search path(s)
2026-02-21T09:23:37.8973572Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 3.1 configs/s
2026-02-21T09:23:39.9229548Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 16.5 configs/s
2026-02-21T09:23:42.3415334Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 496.4 configs/s
2026-02-21T09:23:42.6715929Z [334s] Generation 10 complete: 
2026-02-21T09:23:42.6716448Z ok=33
2026-02-21T09:23:42.6716812Z min=0.0383
2026-02-21T09:23:42.6717057Z mid=0.0433
2026-02-21T09:23:42.6717261Z max=0.2369
2026-02-21T09:23:42.6717498Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:23:42.6717978Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:23:42.6719001Z  'l2_groupings': [1],
2026-02-21T09:23:42.6719356Z  'load_eviction_policies': ['', ''],
2026-02-21T09:23:42.6719704Z  'loop_orders': [[0, 1]],
2026-02-21T09:23:42.6719981Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:23:42.6720258Z  'num_stages': 1,
2026-02-21T09:23:42.6720482Z  'num_warps': 4,
2026-02-21T09:23:42.6720718Z  'pid_type': 'flat',
2026-02-21T09:23:42.6720988Z  'range_flattens': [None, None],
2026-02-21T09:23:42.6721294Z  'range_multi_buffers': [None, False],
2026-02-21T09:23:42.6721611Z  'range_num_stages': [0, 2],
2026-02-21T09:23:42.6721897Z  'range_unroll_factors': [0, 0],
2026-02-21T09:23:42.6722199Z  'range_warp_specializes': [],
2026-02-21T09:23:42.6722474Z  'waves_per_eu': 3}
2026-02-21T09:23:42.7080351Z [334s] Fitting surrogate: 776 points, 776 targets
2026-02-21T09:23:42.9664962Z [334s] Generation 11 starting: 17 neighbors, 1 active search path(s)
2026-02-21T09:23:59.3342554Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.4 configs/s
2026-02-21T09:24:00.4723009Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.1 configs/s
2026-02-21T09:24:01.2589896Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1073.0 configs/s
2026-02-21T09:24:01.4972339Z [353s] Generation 11 complete: 
2026-02-21T09:24:01.4972514Z ok=19
2026-02-21T09:24:01.4972622Z min=0.0388
2026-02-21T09:24:01.4972710Z mid=0.0528
2026-02-21T09:24:01.4972786Z max=0.4943
2026-02-21T09:24:01.4972881Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:24:01.4973025Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:24:01.4973159Z  'l2_groupings': [1],
2026-02-21T09:24:01.4973266Z  'load_eviction_policies': ['', ''],
2026-02-21T09:24:01.4973381Z  'loop_orders': [[0, 1]],
2026-02-21T09:24:01.4973493Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:24:01.4973599Z  'num_stages': 1,
2026-02-21T09:24:01.4973691Z  'num_warps': 4,
2026-02-21T09:24:01.4973830Z  'pid_type': 'flat',
2026-02-21T09:24:01.4973934Z  'range_flattens': [None, None],
2026-02-21T09:24:01.4974054Z  'range_multi_buffers': [None, False],
2026-02-21T09:24:01.4974513Z  'range_num_stages': [0, 2],
2026-02-21T09:24:01.4974626Z  'range_unroll_factors': [0, 0],
2026-02-21T09:24:01.4974738Z  'range_warp_specializes': [],
2026-02-21T09:24:01.4974849Z  'waves_per_eu': 3}
2026-02-21T09:24:01.5087115Z [353s] Fitting surrogate: 795 points, 795 targets
2026-02-21T09:24:01.7514106Z [353s] Generation 12 starting: 17 neighbors, 1 active search path(s)
2026-02-21T09:24:04.2018753Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 15.0 configs/s
2026-02-21T09:24:05.3051971Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 16.7 configs/s
2026-02-21T09:24:06.0772578Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1122.0 configs/s
2026-02-21T09:24:06.3117267Z [358s] Generation 12 complete: 
2026-02-21T09:24:06.3117487Z ok=19
2026-02-21T09:24:06.3117582Z min=0.0384
2026-02-21T09:24:06.3117683Z mid=0.0442
2026-02-21T09:24:06.3117780Z max=0.1085
2026-02-21T09:24:06.3117874Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:24:06.3118016Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:24:06.3118163Z  'l2_groupings': [1],
2026-02-21T09:24:06.3118272Z  'load_eviction_policies': ['', ''],
2026-02-21T09:24:06.3118396Z  'loop_orders': [[0, 1]],
2026-02-21T09:24:06.3118504Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:24:06.3118610Z  'num_stages': 1,
2026-02-21T09:24:06.3118700Z  'num_warps': 4,
2026-02-21T09:24:06.3118796Z  'pid_type': 'flat',
2026-02-21T09:24:06.3118897Z  'range_flattens': [None, None],
2026-02-21T09:24:06.3119019Z  'range_multi_buffers': [None, False],
2026-02-21T09:24:06.3119143Z  'range_num_stages': [0, 2],
2026-02-21T09:24:06.3119251Z  'range_unroll_factors': [0, 0],
2026-02-21T09:24:06.3119367Z  'range_warp_specializes': [],
2026-02-21T09:24:06.3120218Z  'waves_per_eu': 3}
2026-02-21T09:24:06.3299411Z [358s] Fitting surrogate: 814 points, 814 targets
2026-02-21T09:24:06.5960270Z [358s] Generation 13 starting: 19 neighbors, 1 active search path(s)
2026-02-21T09:24:10.5732590Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 3.7 configs/s
2026-02-21T09:24:11.9268263Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 15.6 configs/s
2026-02-21T09:24:13.2329529Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 724.0 configs/s
2026-02-21T09:24:13.5303897Z [365s] Generation 13 complete: 
2026-02-21T09:24:13.5304041Z ok=21
2026-02-21T09:24:13.5305196Z min=0.0385
2026-02-21T09:24:13.5305370Z mid=0.0431
2026-02-21T09:24:13.5305455Z max=0.0685
2026-02-21T09:24:13.5305556Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:24:13.5305739Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:24:13.5305929Z  'l2_groupings': [1],
2026-02-21T09:24:13.5306035Z  'load_eviction_policies': ['', ''],
2026-02-21T09:24:13.5306203Z  'loop_orders': [[0, 1]],
2026-02-21T09:24:13.5306308Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:24:13.5306414Z  'num_stages': 1,
2026-02-21T09:24:13.5306505Z  'num_warps': 4,
2026-02-21T09:24:13.5306592Z  'pid_type': 'flat',
2026-02-21T09:24:13.5306693Z  'range_flattens': [None, None],
2026-02-21T09:24:13.5306808Z  'range_multi_buffers': [None, False],
2026-02-21T09:24:13.5306925Z  'range_num_stages': [0, 2],
2026-02-21T09:24:13.5307031Z  'range_unroll_factors': [0, 0],
2026-02-21T09:24:13.5307142Z  'range_warp_specializes': [],
2026-02-21T09:24:13.5307245Z  'waves_per_eu': 3}
2026-02-21T09:24:13.5589558Z [365s] Fitting surrogate: 835 points, 835 targets
2026-02-21T09:24:13.7924104Z [365s] Generation 14 starting: 18 neighbors, 1 active search path(s)
2026-02-21T09:24:16.2966609Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 9.0 configs/s
2026-02-21T09:24:17.5376808Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 16.4 configs/s
2026-02-21T09:24:18.9314304Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 685.8 configs/s
2026-02-21T09:24:19.2361778Z [371s] Generation 14 complete: 
2026-02-21T09:24:19.2363645Z ok=20
2026-02-21T09:24:19.2363809Z min=0.0386
2026-02-21T09:24:19.2363964Z mid=0.0395
2026-02-21T09:24:19.2364204Z max=0.0684
2026-02-21T09:24:19.2364378Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:24:19.2364657Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:24:19.2364933Z  'l2_groupings': [4],
2026-02-21T09:24:19.2365139Z  'load_eviction_policies': ['', ''],
2026-02-21T09:24:19.2365372Z  'loop_orders': [[1, 0]],
2026-02-21T09:24:19.2365583Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:24:19.2365790Z  'num_stages': 4,
2026-02-21T09:24:19.2365958Z  'num_warps': 4,
2026-02-21T09:24:19.2366177Z  'pid_type': 'flat',
2026-02-21T09:24:19.2366371Z  'range_flattens': [None, False],
2026-02-21T09:24:19.2366607Z  'range_multi_buffers': [None, None],
2026-02-21T09:24:19.2366868Z  'range_num_stages': [0, 1],
2026-02-21T09:24:19.2367081Z  'range_unroll_factors': [0, 2],
2026-02-21T09:24:19.2367303Z  'range_warp_specializes': [],
2026-02-21T09:24:19.2367515Z  'waves_per_eu': 1}
2026-02-21T09:24:19.2687924Z [371s] Fitting surrogate: 855 points, 855 targets
2026-02-21T09:24:19.9688387Z [371s] Generation 15 starting: 12 neighbors, 1 active search path(s)
2026-02-21T09:24:22.3247371Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 4.0 configs/s
2026-02-21T09:24:23.1972909Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 12/12 16.4 configs/s
2026-02-21T09:24:23.7810357Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1483.9 configs/s
2026-02-21T09:24:24.0306722Z [375s] Generation 15 complete: 
2026-02-21T09:24:24.0307181Z ok=13
2026-02-21T09:24:24.0307397Z min=0.0389
2026-02-21T09:24:24.0307644Z mid=0.0416
2026-02-21T09:24:24.0307850Z max=0.1087
2026-02-21T09:24:24.0308072Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:24:24.0308444Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:24:24.0308806Z  'l2_groupings': [8],
2026-02-21T09:24:24.0309085Z  'load_eviction_policies': ['', ''],
2026-02-21T09:24:24.0309396Z  'loop_orders': [[1, 0]],
2026-02-21T09:24:24.0309677Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:24:24.0309946Z  'num_stages': 4,
2026-02-21T09:24:24.0310176Z  'num_warps': 4,
2026-02-21T09:24:24.0310410Z  'pid_type': 'flat',
2026-02-21T09:24:24.0310678Z  'range_flattens': [None, False],
2026-02-21T09:24:24.0310988Z  'range_multi_buffers': [None, None],
2026-02-21T09:24:24.0311298Z  'range_num_stages': [0, 1],
2026-02-21T09:24:24.0311586Z  'range_unroll_factors': [0, 2],
2026-02-21T09:24:24.0311883Z  'range_warp_specializes': [],
2026-02-21T09:24:24.0312172Z  'waves_per_eu': 1}
2026-02-21T09:24:24.0419949Z [376s] Fitting surrogate: 868 points, 868 targets
2026-02-21T09:24:24.3746806Z [376s] Generation 16 starting: 17 neighbors, 1 active search path(s)
2026-02-21T09:24:27.1652840Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 21.9 configs/s
2026-02-21T09:24:28.3887807Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 15.7 configs/s
2026-02-21T09:24:29.7249630Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 695.9 configs/s
2026-02-21T09:24:30.0443518Z [382s] Generation 16 complete: 
2026-02-21T09:24:30.0443720Z ok=18
2026-02-21T09:24:30.0443812Z min=0.0389
2026-02-21T09:24:30.0443894Z mid=0.0413
2026-02-21T09:24:30.0443979Z max=0.0633
2026-02-21T09:24:30.0444076Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:24:30.0444225Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:24:30.0444431Z  'l2_groupings': [8],
2026-02-21T09:24:30.0444564Z  'load_eviction_policies': ['', ''],
2026-02-21T09:24:30.0444705Z  'loop_orders': [[1, 0]],
2026-02-21T09:24:30.0444812Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:24:30.0444923Z  'num_stages': 4,
2026-02-21T09:24:30.0445010Z  'num_warps': 4,
2026-02-21T09:24:30.0445104Z  'pid_type': 'flat',
2026-02-21T09:24:30.0445206Z  'range_flattens': [None, False],
2026-02-21T09:24:30.0445325Z  'range_multi_buffers': [None, True],
2026-02-21T09:24:30.0445447Z  'range_num_stages': [0, 1],
2026-02-21T09:24:30.0445553Z  'range_unroll_factors': [0, 2],
2026-02-21T09:24:30.0445669Z  'range_warp_specializes': [],
2026-02-21T09:24:30.0445777Z  'waves_per_eu': 1}
2026-02-21T09:24:30.0758198Z [382s] Fitting surrogate: 886 points, 886 targets
2026-02-21T09:24:30.3659035Z [382s] Generation 17 starting: 16 neighbors, 1 active search path(s)
2026-02-21T09:24:33.0543170Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 13.4 configs/s
2026-02-21T09:24:34.1866883Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.1 configs/s
2026-02-21T09:24:35.3241749Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 803.0 configs/s
2026-02-21T09:24:35.6022112Z [387s] Generation 17 complete: 
2026-02-21T09:24:35.6022482Z ok=17
2026-02-21T09:24:35.6022702Z min=0.0390
2026-02-21T09:24:35.6022923Z mid=0.0411
2026-02-21T09:24:35.6023123Z max=0.0684
2026-02-21T09:24:35.6023355Z best={'block_sizes': [256, 4, 8],
2026-02-21T09:24:35.6023732Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:24:35.6024097Z  'l2_groupings': [8],
2026-02-21T09:24:35.6024384Z  'load_eviction_policies': ['', ''],
2026-02-21T09:24:35.6024693Z  'loop_orders': [[1, 0]],
2026-02-21T09:24:35.6024979Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:24:35.6025255Z  'num_stages': 4,
2026-02-21T09:24:35.6025492Z  'num_warps': 4,
2026-02-21T09:24:35.6025773Z  'pid_type': 'flat',
2026-02-21T09:24:35.6026050Z  'range_flattens': [None, False],
2026-02-21T09:24:35.6026359Z  'range_multi_buffers': [None, True],
2026-02-21T09:24:35.6026687Z  'range_num_stages': [0, 1],
2026-02-21T09:24:35.6026976Z  'range_unroll_factors': [0, 2],
2026-02-21T09:24:35.6027280Z  'range_warp_specializes': [],
2026-02-21T09:24:35.6027564Z  'waves_per_eu': 1}
2026-02-21T09:24:35.6280588Z [387s] Fitting surrogate: 903 points, 903 targets
2026-02-21T09:24:35.7598803Z [387s] Autotuning complete in 387.7s after searching 845 configs.
2026-02-21T09:24:35.7599355Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:24:35.7601253Z     @helion.kernel(config=helion.Config(block_sizes=[256, 4, 8], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:24:35.7603116Z 
2026-02-21T09:24:35.7603977Z [387s] Code of selected kernel: /tmp/torchinductor_root/nu/cnukfdkpviyw2zwyyyx532dwweldqjajvionaw2sprqyzlaiy3s3.py
2026-02-21T09:24:35.7774652Z from __future__ import annotations
2026-02-21T09:24:35.7774985Z 
2026-02-21T09:24:35.7775087Z import torch
2026-02-21T09:24:35.7775319Z import triton
2026-02-21T09:24:35.7775576Z import triton.language as tl
2026-02-21T09:24:35.7775983Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:24:35.7776312Z 
2026-02-21T09:24:35.7776435Z _BLOCK_SIZE_2 = tl.constexpr(8)
2026-02-21T09:24:35.7776730Z _BLOCK_SIZE_1 = tl.constexpr(4)
2026-02-21T09:24:35.7777023Z _BLOCK_SIZE_0 = tl.constexpr(256)
2026-02-21T09:24:35.7777226Z 
2026-02-21T09:24:35.7777312Z @triton.jit
2026-02-21T09:24:35.7777723Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr):
2026-02-21T09:24:35.7778336Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:24:35.7778757Z     num_pid_m = tl.cdiv(8192, _BLOCK_SIZE_2)
2026-02-21T09:24:35.7779103Z     num_pid_n = tl.cdiv(4, _BLOCK_SIZE_1)
2026-02-21T09:24:35.7779430Z     inner_2d_pid = tl.program_id(0)
2026-02-21T09:24:35.7779748Z     num_pid_in_group = 8 * num_pid_n
2026-02-21T09:24:35.7780086Z     group_id = inner_2d_pid // num_pid_in_group
2026-02-21T09:24:35.7780440Z     first_pid_m = group_id * 8
2026-02-21T09:24:35.7780777Z     group_size_m = min(num_pid_m - first_pid_m, 8)
2026-02-21T09:24:35.7781215Z     pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m
2026-02-21T09:24:35.7781482Z     pid_1 = inner_2d_pid % num_pid_in_group // group_size_m
2026-02-21T09:24:35.7781674Z     offset_2 = pid_0 * _BLOCK_SIZE_2
2026-02-21T09:24:35.7781878Z     indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32)
2026-02-21T09:24:35.7782084Z     offset_1 = pid_1 * _BLOCK_SIZE_1
2026-02-21T09:24:35.7782470Z     indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32)
2026-02-21T09:24:35.7782739Z     # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:24:35.7783010Z     acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32)
2026-02-21T09:24:35.7783304Z     # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:24:35.7783664Z     # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:24:35.7784018Z     # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:24:35.7784264Z     # src[int4_gemm.py:60-89]: ...
2026-02-21T09:24:35.7784590Z     for offset_3 in tl.range(0, 1792, _BLOCK_SIZE_0, loop_unroll_factor=2, num_stages=1, disallow_acc_multi_buffer=False, flatten=False):
2026-02-21T09:24:35.7784982Z         indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
2026-02-21T09:24:35.7785183Z         acc_copy = acc
2026-02-21T09:24:35.7785317Z         acc_copy_0 = acc_copy
2026-02-21T09:24:35.7785510Z         # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2
2026-02-21T09:24:35.7785713Z         mul = 2 * offset_3
2026-02-21T09:24:35.7785929Z         # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:24:35.7786186Z         iota = mul + tl.arange(0, mul_1)
2026-02-21T09:24:35.7786401Z         load = tl.load(A + (indices_1[:, None] * 3584 + iota[None, :] * 1), None)
2026-02-21T09:24:35.7786693Z         # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:24:35.7786953Z         # src[int4_gemm.py:66]:     torch.float32
2026-02-21T09:24:35.7787155Z         # src[int4_gemm.py:67]: )  # [BLOCK_SIZE_M, BLOCK_SIZE_K]
2026-02-21T09:24:35.7787354Z         v_0 = tl.cast(load, tl.float32)
2026-02-21T09:24:35.7787597Z         # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n]  # [BLOCK_SIZE_K//2, BLOCK_SIZE_N]
2026-02-21T09:24:35.7787913Z         b_tile = tl.load(B + (indices_3[:, None] * 8192 + indices_2[None, :] * 1), None)
2026-02-21T09:24:35.7788273Z         # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8)  # Sign-extend low 4 bits
2026-02-21T09:24:35.7788518Z         v_1 = tl.full([], 4, tl.int8)
2026-02-21T09:24:35.7788669Z         v_2 = b_tile << v_1
2026-02-21T09:24:35.7788802Z         v_3 = tl.full([], 4, tl.int8)
2026-02-21T09:24:35.7788952Z         v_4 = v_2 >> v_3
2026-02-21T09:24:35.7789165Z         # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8)  # Sign-extend high 4 bits
2026-02-21T09:24:35.7789405Z         v_5 = tl.full([], 4, tl.int8)
2026-02-21T09:24:35.7789551Z         v_6 = b_tile >> v_5
2026-02-21T09:24:35.7789736Z         # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1)
2026-02-21T09:24:35.7789957Z         stack_idx = tl.arange(0, 2)
2026-02-21T09:24:35.7790117Z         broadcast_idx = stack_idx[None, :, None]
2026-02-21T09:24:35.7790297Z         expanded_0 = tl.expand_dims(v_4, 1)
2026-02-21T09:24:35.7790469Z         expanded_1 = tl.expand_dims(v_6, 1)
2026-02-21T09:24:35.7790650Z         stacked_result = tl.zeros_like(expanded_0)
2026-02-21T09:24:35.7790831Z         mask_0 = broadcast_idx == 0
2026-02-21T09:24:35.7791032Z         stacked_result = tl.where(mask_0, expanded_0, stacked_result)
2026-02-21T09:24:35.7791259Z         mask_1 = broadcast_idx == 1
2026-02-21T09:24:35.7791443Z         stacked_result = tl.where(mask_1, expanded_1, stacked_result)
2026-02-21T09:24:35.7791630Z         # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:24:35.7791828Z         # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:24:35.7792013Z         # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:24:35.7792184Z         view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2])
2026-02-21T09:24:35.7792347Z         v_7 = tl.cast(view, tl.float32)
2026-02-21T09:24:35.7792570Z         # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2)  # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1]
2026-02-21T09:24:35.7792755Z         a_tile_1 = v_0[:, :, None]
2026-02-21T09:24:35.7792910Z         # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0)
2026-02-21T09:24:35.7793070Z         b_unpacked_1 = v_7[None, :, :]
2026-02-21T09:24:35.7793268Z         # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1)  # [BLOCK_SIZE_M, BLOCK_SIZE_N]
2026-02-21T09:24:35.7793471Z         v_8 = a_tile_1 * b_unpacked_1
2026-02-21T09:24:35.7793605Z         sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32)
2026-02-21T09:24:35.7793743Z         acc = acc_copy_0 + sum_1
2026-02-21T09:24:35.7793893Z     # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16)
2026-02-21T09:24:35.7794055Z     v_10 = tl.cast(acc, tl.bfloat16)
2026-02-21T09:24:35.7794217Z     tl.store(C + (indices_1[:, None] * 8192 + indices_2[None, :] * 1), v_10, None)
2026-02-21T09:24:35.7794354Z 
2026-02-21T09:24:35.7794445Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher):
2026-02-21T09:24:35.7794612Z     """
2026-02-21T09:24:35.7794729Z     BFloat16 x INT4 General Matrix Multiplication (GEMM).
2026-02-21T09:24:35.7794839Z 
2026-02-21T09:24:35.7794905Z     This kernel performs matrix multiplication where:
2026-02-21T09:24:35.7795059Z     - A is a bfloat16 matrix of shape [M, K]
2026-02-21T09:24:35.7795255Z     - B is an int8 matrix of shape [K//2, N] containing packed int4 values
2026-02-21T09:24:35.7795434Z       (two 4-bit values packed into each int8)
2026-02-21T09:24:35.7795523Z 
2026-02-21T09:24:35.7795560Z     Args:
2026-02-21T09:24:35.7795688Z         A (Tensor): Input tensor of shape [M, K] in bfloat16 format.
2026-02-21T09:24:35.7795877Z         B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format.
2026-02-21T09:24:35.7796002Z 
2026-02-21T09:24:35.7796041Z     Returns:
2026-02-21T09:24:35.7796165Z         Tensor: Output tensor of shape [M, N] in bfloat16 format.
2026-02-21T09:24:35.7796314Z     """
2026-02-21T09:24:35.7796411Z     # src[int4_gemm.py:50]: M, K = A.shape
2026-02-21T09:24:35.7796532Z     M, K = A.shape
2026-02-21T09:24:35.7796671Z     # src[int4_gemm.py:51]: _, N = B.shape
2026-02-21T09:24:35.7796785Z     _, N = B.shape
2026-02-21T09:24:35.7796942Z     # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:24:35.7797156Z     C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:24:35.7797345Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:24:35.7797499Z     _BLOCK_SIZE_2 = 8
2026-02-21T09:24:35.7797600Z     _BLOCK_SIZE_1 = 4
2026-02-21T09:24:35.7797772Z     # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:24:35.7798059Z     # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:24:35.7798332Z     # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:24:35.7798526Z     # src[int4_gemm.py:60-89]: ...
2026-02-21T09:24:35.7798641Z     _BLOCK_SIZE_0 = 256
2026-02-21T09:24:35.7798775Z     # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:24:35.7798968Z     # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:24:35.7799142Z     # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:24:35.7799276Z     _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0
2026-02-21T09:24:35.7799427Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:24:35.7799621Z     # src[int4_gemm.py:58]:     acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:24:35.7799801Z     # src[int4_gemm.py:57-91]: ...
2026-02-21T09:24:35.7799944Z     _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0)
2026-02-21T09:24:35.7800375Z     _launcher(_helion_matmul_bf16_int4, (triton.cdiv(8192, _BLOCK_SIZE_2) * triton.cdiv(4, _BLOCK_SIZE_1),), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=4, num_stages=4, waves_per_eu=1, matrix_instr_nonkdim=0)
2026-02-21T09:24:35.7800742Z     # src[int4_gemm.py:93]: return C
2026-02-21T09:24:35.7800865Z     return C
2026-02-21T09:24:36.6153979Z WARNING:tritonbench.utils.triton_op:Completed input ID 7:
2026-02-21T09:24:36.6154429Z x_val
2026-02-21T09:24:36.6154639Z ------------------
2026-02-21T09:24:36.6154869Z (4, 1, 8192, 3584)
2026-02-21T09:24:36.6155002Z 
2026-02-21T09:24:36.6186032Z  30%|███       | 3/10 [21:57<51:01, 437.34s/it]  WARNING:tritonbench.utils.triton_op:Running input ID 10:
2026-02-21T09:24:36.6186470Z x_val
2026-02-21T09:24:36.6186650Z -------------------
2026-02-21T09:24:36.6186847Z (16, 1, 7168, 8192)
2026-02-21T09:24:36.6189760Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T09:24:37.6151837Z INFO:tritonbench.utils.triton_op:Took 5.01ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T09:24:39.0638347Z INFO:tritonbench.utils.triton_op:Took 0.22ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T09:24:39.0702758Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:24:39.0703124Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:24:39.0703424Z               'dtype': 'torch.bfloat16',
2026-02-21T09:24:39.0703712Z               'shape': (16, 1, 8192),
2026-02-21T09:24:39.0703995Z               'stride': (8192, 8192, 1)},
2026-02-21T09:24:39.0704269Z             { 'device': 'cuda:0',
2026-02-21T09:24:39.0704531Z               'dtype': 'torch.int32',
2026-02-21T09:24:39.0704800Z               'shape': (8192, 7168),
2026-02-21T09:24:39.0705054Z               'stride': (7168, 1)}),
2026-02-21T09:24:39.0705301Z   'kwargs': {}}
2026-02-21T09:24:39.0743181Z INFO:tritonbench.utils.triton_op:Took 4.25ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T09:24:39.2682857Z [0s] Autotune random seed: 2138032649
2026-02-21T09:24:39.2942162Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:25:22.1995909Z [42s] Timeout after 30s compiling Config(block_sizes=[8, 2, 4096], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[2, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:25:22.6261084Z [43s] Timeout after 30s compiling Config(block_sizes=[128, 1, 512], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[0, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:25:22.6284519Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s
2026-02-21T09:25:29.9183542Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.8 configs/s
2026-02-21T09:25:29.9193949Z [50s] Adaptive compile timeout: 30s (90% percentile=12.0s, bounds=[30.0s, 30s])
2026-02-21T09:25:29.9195523Z [50s] Initial random population of 100, 5 starting points: 
2026-02-21T09:25:29.9195663Z error=2
2026-02-21T09:25:29.9195754Z timeout=2
2026-02-21T09:25:29.9195829Z ok=96
2026-02-21T09:25:29.9195920Z min=0.1809
2026-02-21T09:25:29.9195995Z mid=1.2689
2026-02-21T09:25:29.9196072Z max=36.9792
2026-02-21T09:25:29.9196159Z best={'block_sizes': [16, 16, 16],
2026-02-21T09:25:29.9196287Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:25:29.9196412Z  'l2_groupings': [1],
2026-02-21T09:25:29.9196515Z  'load_eviction_policies': ['', ''],
2026-02-21T09:25:29.9196627Z  'loop_orders': [[0, 1]],
2026-02-21T09:25:29.9196728Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:25:29.9197238Z  'num_stages': 1,
2026-02-21T09:25:29.9197323Z  'num_warps': 4,
2026-02-21T09:25:29.9197411Z  'pid_type': 'flat',
2026-02-21T09:25:29.9197515Z  'range_flattens': [None, None],
2026-02-21T09:25:29.9197628Z  'range_multi_buffers': [None, None],
2026-02-21T09:25:29.9197742Z  'range_num_stages': [0, 0],
2026-02-21T09:25:29.9197846Z  'range_unroll_factors': [0, 0],
2026-02-21T09:25:29.9197953Z  'range_warp_specializes': [],
2026-02-21T09:25:29.9198059Z  'waves_per_eu': 1}
2026-02-21T09:25:29.9212052Z [50s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:25:31.6211877Z [52s] Generation 1 starting: 97 neighbors, 5 active search path(s)
2026-02-21T09:26:13.3841914Z [94s] Timeout after 30s compiling Config(block_sizes=[256, 2, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:26:13.3862589Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s
2026-02-21T09:26:14.2341451Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:26:14.2345968Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:26:14.2346874Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:26:14.2347671Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:26:14.2348467Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:26:14.2349121Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:26:14.2350397Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:26:14.2350804Z #smem = #ttg.shared_memory
2026-02-21T09:26:14.2351265Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:26:14.2352184Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:26:14.2352950Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x16xf32, #mma>
2026-02-21T09:26:14.2353269Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:26:14.2353495Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T09:26:14.2353722Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:26:14.2354011Z     %cst_0 = arith.constant dense<0> : tensor<4x2x16xi8, #blocked>
2026-02-21T09:26:14.2354301Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:26:14.2354531Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:26:14.2354902Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:26:14.2355113Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:26:14.2355383Z     %cst_1 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:26:14.2355810Z     %cst_2 = arith.constant dense<7168> : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2356314Z     %cst_3 = arith.constant dense<4> : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2356726Z     %cst_4 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:26:14.2357055Z     %cst_5 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:26:14.2357377Z     %cst_6 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:26:14.2357865Z     %0 = tt.get_program_id x : i32
2026-02-21T09:26:14.2358083Z     %1 = arith.muli %0, %c2_i32 : i32
2026-02-21T09:26:14.2358306Z     %2 = arith.addi %1, %c2_i32 : i32
2026-02-21T09:26:14.2358526Z     %3 = arith.minsi %2, %c448_i32 : i32
2026-02-21T09:26:14.2358984Z     %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:26:14.2359583Z     %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:26:14.2360165Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:26:14.2360756Z     %7 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:26:14.2361133Z     %8 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:26:14.2361575Z     %9 = tt.expand_dims %8 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:26:14.2361931Z     %10 = arith.muli %9, %cst_1 : tensor<16x1xi32, #blocked1>
2026-02-21T09:26:14.2362197Z     %11 = tt.broadcast %10 : tensor<16x1xi32, #blocked1> -> tensor<16x8xi32, #blocked1>
2026-02-21T09:26:14.2362503Z     %12 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:26:14.2362959Z     %13 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2363381Z     %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:26:14.2364077Z     %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:26:14.2364638Z     %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:26:14.2365056Z     %17 = arith.cmpi eq, %16, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:26:14.2365328Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked>
2026-02-21T09:26:14.2365605Z     %19 = arith.cmpi eq, %16, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:26:14.2365873Z     %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked>
2026-02-21T09:26:14.2366202Z     %21 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:26:14.2366613Z     %22 = tt.expand_dims %21 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:26:14.2366949Z     %23 = arith.muli %22, %cst_6 : tensor<16x1xi32, #mma>
2026-02-21T09:26:14.2367196Z     %24 = tt.broadcast %23 : tensor<16x1xi32, #mma> -> tensor<16x16xi32, #mma>
2026-02-21T09:26:14.2367508Z     %25 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:26:14.2368093Z     scf.for %arg3 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T09:26:14.2368287Z       %26 = arith.muli %arg3, %c16_i32 : i32
2026-02-21T09:26:14.2368584Z       %27 = tt.splat %26 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:26:14.2368922Z       %28 = tt.splat %26 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:26:14.2369271Z       %29 = arith.addi %27, %4 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:26:14.2369627Z       %30 = arith.addi %28, %5 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:26:14.2370173Z       %31 = tt.expand_dims %29 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2370745Z       %32 = tt.broadcast %31 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2371124Z       %33 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst) -> (tensor<16x16xf32, #mma>)  : i32 {
2026-02-21T09:26:14.2371453Z         %39 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:26:14.2371782Z         %40 = arith.addi %39, %6 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:26:14.2372016Z         %41 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:26:14.2372206Z         %42 = tt.splat %41 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:26:14.2372526Z         %43 = arith.addi %42, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:26:14.2372830Z         %44 = tt.expand_dims %43 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:26:14.2373141Z         %45 = tt.broadcast %44 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1>
2026-02-21T09:26:14.2373354Z         %46 = arith.addi %11, %45 : tensor<16x8xi32, #blocked1>
2026-02-21T09:26:14.2373577Z         %47 = tt.addptr %12, %46 : tensor<16x8x!tt.ptr<bf16>, #blocked1>, tensor<16x8xi32, #blocked1>
2026-02-21T09:26:14.2373797Z         %48 = tt.load %47 : tensor<16x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:26:14.2374042Z         %49 = ttg.local_alloc %48 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem>
2026-02-21T09:26:14.2374405Z         %50 = ttg.local_load %49 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:26:14.2374858Z         %51 = arith.extf %50 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:26:14.2375392Z         %52 = tt.expand_dims %40 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2375786Z         %53 = arith.muli %52, %cst_2 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2376123Z         %54 = tt.broadcast %53 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2376452Z         %55 = arith.addi %54, %32 : tensor<4x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2376784Z         %56 = tt.addptr %13, %55 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2377124Z         %57 = tt.load %56 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2377378Z         %58 = arith.shli %57, %cst_3 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2377634Z         %59 = arith.shrsi %58, %cst_3 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2377893Z         %60 = arith.shrsi %57, %cst_3 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:26:14.2378210Z         %61 = tt.expand_dims %59 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:26:14.2378574Z         %62 = tt.expand_dims %60 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:26:14.2378879Z         %63 = tt.broadcast %61 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:26:14.2379133Z         %64 = arith.select %18, %63, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:26:14.2379426Z         %65 = tt.broadcast %62 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:26:14.2379679Z         %66 = arith.select %20, %65, %64 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:26:14.2379922Z         %67 = tt.reshape %66 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2>
2026-02-21T09:26:14.2380165Z         %68 = arith.sitofp %67 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2>
2026-02-21T09:26:14.2380433Z         %69 = ttg.local_alloc %68 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem>
2026-02-21T09:26:14.2380766Z         %70 = ttg.local_load %69 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:26:14.2381239Z         %71 = tt.dot %51, %70, %arg5, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x16xf32, #mma>
2026-02-21T09:26:14.2381588Z         scf.yield %71 : tensor<16x16xf32, #mma>
2026-02-21T09:26:14.2381714Z       }
2026-02-21T09:26:14.2381845Z       %34 = arith.truncf %33 : tensor<16x16xf32, #mma> to tensor<16x16xbf16, #mma>
2026-02-21T09:26:14.2382109Z       %35 = tt.expand_dims %30 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma>
2026-02-21T09:26:14.2382369Z       %36 = tt.broadcast %35 : tensor<1x16xi32, #mma> -> tensor<16x16xi32, #mma>
2026-02-21T09:26:14.2382547Z       %37 = arith.addi %24, %36 : tensor<16x16xi32, #mma>
2026-02-21T09:26:14.2382736Z       %38 = tt.addptr %25, %37 : tensor<16x16x!tt.ptr<bf16>, #mma>, tensor<16x16xi32, #mma>
2026-02-21T09:26:14.2382929Z       tt.store %38, %34 : tensor<16x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:26:14.2383062Z     }
2026-02-21T09:26:14.2383144Z     tt.return
2026-02-21T09:26:14.2383232Z   }
2026-02-21T09:26:14.2383313Z }
2026-02-21T09:26:14.2383362Z 
2026-02-21T09:26:14.2383401Z {-#
2026-02-21T09:26:14.2383489Z   external_resources: {
2026-02-21T09:26:14.2383591Z     mlir_reproducer: {
2026-02-21T09:26:14.2384648Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:26:14.2385833Z       disable_threading: false,
2026-02-21T09:26:14.2385943Z       verify_each: true
2026-02-21T09:26:14.2386040Z     }
2026-02-21T09:26:14.2386119Z   }
2026-02-21T09:26:14.2386193Z #-}
2026-02-21T09:26:14.2386486Z /tmp/torchinductor_root/sr/csrcdavv3qtyuok5hacwwd33omqwh7txpyktxxydqp7domrkedaf.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:26:14.2387178Z /tmp/torchinductor_root/sr/csrcdavv3qtyuok5hacwwd33omqwh7txpyktxxydqp7domrkedaf.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:26:14.2387733Z [94s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:26:14.2388543Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 16], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:26:14.2389243Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:26:14.2389424Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:26:19.4430413Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 16.6 configs/s
2026-02-21T09:26:20.6903287Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 609.0         
2026-02-21T09:26:20.6903887Z                                                                   configs/s     
2026-02-21T09:26:21.2661187Z [101s] Generation 1 complete: 
2026-02-21T09:26:21.2661401Z error=1
2026-02-21T09:26:21.2661556Z timeout=1
2026-02-21T09:26:21.2661692Z ok=101
2026-02-21T09:26:21.2661846Z min=0.1282
2026-02-21T09:26:21.2661973Z mid=0.3725
2026-02-21T09:26:21.2662092Z max=4.6295
2026-02-21T09:26:21.2662229Z best={'block_sizes': [32, 16, 16],
2026-02-21T09:26:21.2662457Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:26:21.2662694Z  'l2_groupings': [1],
2026-02-21T09:26:21.2662856Z  'load_eviction_policies': ['', ''],
2026-02-21T09:26:21.2663023Z  'loop_orders': [[0, 1]],
2026-02-21T09:26:21.2663203Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:26:21.2663363Z  'num_stages': 1,
2026-02-21T09:26:21.2663502Z  'num_warps': 4,
2026-02-21T09:26:21.2663632Z  'pid_type': 'flat',
2026-02-21T09:26:21.2663787Z  'range_flattens': [None, True],
2026-02-21T09:26:21.2663965Z  'range_multi_buffers': [None, None],
2026-02-21T09:26:21.2664138Z  'range_num_stages': [0, 0],
2026-02-21T09:26:21.2664308Z  'range_unroll_factors': [0, 0],
2026-02-21T09:26:21.2664486Z  'range_warp_specializes': [],
2026-02-21T09:26:21.2664652Z  'waves_per_eu': 3}
2026-02-21T09:26:21.2748952Z [101s] Fitting surrogate: 203 points, 203 targets
2026-02-21T09:26:22.2445187Z [102s] Generation 2 starting: 99 neighbors, 5 active search path(s)
2026-02-21T09:27:00.5068496Z [141s] Timeout after 30s compiling Config(block_sizes=[256, 1, 64], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:27:01.4158841Z [142s] Timeout after 30s compiling Config(block_sizes=[256, 1, 64], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[4, 1], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:27:01.8154235Z [142s] Timeout after 30s compiling Config(block_sizes=[256, 1, 64], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[4, 2], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:27:05.7736181Z [146s] Timeout after 30s compiling Config(block_sizes=[256, 2, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[4, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:27:05.7755194Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 102/102 0.8 configs/s
2026-02-21T09:27:11.9985837Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 102/102 16.5 configs/s
2026-02-21T09:27:13.4022692Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 595.6         
2026-02-21T09:27:13.4023093Z                                                                   configs/s     
2026-02-21T09:27:13.8739128Z [154s] Generation 2 complete: 
2026-02-21T09:27:13.8739291Z timeout=4
2026-02-21T09:27:13.8739368Z ok=101
2026-02-21T09:27:13.8739449Z min=0.0941
2026-02-21T09:27:13.8739525Z mid=0.3238
2026-02-21T09:27:13.8739603Z max=5.1746
2026-02-21T09:27:13.8739689Z best={'block_sizes': [32, 16, 16],
2026-02-21T09:27:13.8739829Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:27:13.8739961Z  'l2_groupings': [1],
2026-02-21T09:27:13.8740068Z  'load_eviction_policies': ['', ''],
2026-02-21T09:27:13.8740184Z  'loop_orders': [[0, 1]],
2026-02-21T09:27:13.8740286Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:27:13.8740387Z  'num_stages': 2,
2026-02-21T09:27:13.8740472Z  'num_warps': 2,
2026-02-21T09:27:13.8740560Z  'pid_type': 'flat',
2026-02-21T09:27:13.8740657Z  'range_flattens': [None, True],
2026-02-21T09:27:13.8740806Z  'range_multi_buffers': [None, False],
2026-02-21T09:27:13.8740924Z  'range_num_stages': [0, 0],
2026-02-21T09:27:13.8741043Z  'range_unroll_factors': [0, 0],
2026-02-21T09:27:13.8741150Z  'range_warp_specializes': [],
2026-02-21T09:27:13.8741254Z  'waves_per_eu': 3}
2026-02-21T09:27:13.8855383Z [154s] Fitting surrogate: 308 points, 308 targets
2026-02-21T09:27:14.9986786Z [155s] Generation 3 starting: 105 neighbors, 5 active search path(s)
2026-02-21T09:27:57.7780549Z [198s] Timeout after 30s compiling Config(block_sizes=[512, 2, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[0, 2], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T09:27:58.7149374Z [199s] Timeout after 30s compiling Config(block_sizes=[128, 2, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:27:59.5394419Z [200s] Timeout after 30s compiling Config(block_sizes=[256, 2, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:27:59.5421049Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 108/108 0.7 configs/s
2026-02-21T09:28:05.9706888Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 108/108 16.9 configs/s
2026-02-21T09:28:09.4005520Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 268.2         
2026-02-21T09:28:09.4005774Z                                                                   configs/s     
2026-02-21T09:28:09.8516565Z [210s] Generation 3 complete: 
2026-02-21T09:28:09.8516865Z error=1
2026-02-21T09:28:09.8517031Z timeout=3
2026-02-21T09:28:09.8517201Z ok=106
2026-02-21T09:28:09.8517354Z min=0.0908
2026-02-21T09:28:09.8517513Z mid=0.3135
2026-02-21T09:28:09.8517665Z max=6.8092
2026-02-21T09:28:09.8517843Z best={'block_sizes': [64, 16, 16],
2026-02-21T09:28:09.8518178Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:28:09.8518479Z  'l2_groupings': [1],
2026-02-21T09:28:09.8518697Z  'load_eviction_policies': ['', ''],
2026-02-21T09:28:09.8518944Z  'loop_orders': [[0, 1]],
2026-02-21T09:28:09.8519167Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:28:09.8519378Z  'num_stages': 2,
2026-02-21T09:28:09.8520167Z  'num_warps': 4,
2026-02-21T09:28:09.8520352Z  'pid_type': 'flat',
2026-02-21T09:28:09.8520583Z  'range_flattens': [None, True],
2026-02-21T09:28:09.8520826Z  'range_multi_buffers': [None, False],
2026-02-21T09:28:09.8521074Z  'range_num_stages': [0, 0],
2026-02-21T09:28:09.8521297Z  'range_unroll_factors': [0, 0],
2026-02-21T09:28:09.8521530Z  'range_warp_specializes': [],
2026-02-21T09:28:09.8521751Z  'waves_per_eu': 1}
2026-02-21T09:28:09.8900377Z [210s] Fitting surrogate: 418 points, 418 targets
2026-02-21T09:28:11.5540623Z [212s] Generation 4 starting: 90 neighbors, 5 active search path(s)
2026-02-21T09:28:30.9855630Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 0.9 configs/s
2026-02-21T09:28:36.3444242Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 93/93 17.8 configs/s
2026-02-21T09:28:41.3357802Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 189.6 configs/s
2026-02-21T09:28:41.8844030Z [242s] Generation 4 complete: 
2026-02-21T09:28:41.8844334Z error=7
2026-02-21T09:28:41.8844497Z ok=89
2026-02-21T09:28:41.8844653Z min=0.0894
2026-02-21T09:28:41.8844807Z mid=0.2292
2026-02-21T09:28:41.8844976Z max=6.3389
2026-02-21T09:28:41.8845143Z best={'block_sizes': [128, 16, 16],
2026-02-21T09:28:41.8845424Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:28:41.8845693Z  'l2_groupings': [4],
2026-02-21T09:28:41.8845897Z  'load_eviction_policies': ['', ''],
2026-02-21T09:28:41.8846134Z  'loop_orders': [[0, 1]],
2026-02-21T09:28:41.8846340Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:28:41.8846544Z  'num_stages': 4,
2026-02-21T09:28:41.8846713Z  'num_warps': 2,
2026-02-21T09:28:41.8846889Z  'pid_type': 'flat',
2026-02-21T09:28:41.8847082Z  'range_flattens': [None, False],
2026-02-21T09:28:41.8847312Z  'range_multi_buffers': [None, None],
2026-02-21T09:28:41.8847539Z  'range_num_stages': [0, 1],
2026-02-21T09:28:41.8847755Z  'range_unroll_factors': [0, 1],
2026-02-21T09:28:41.8847996Z  'range_warp_specializes': [],
2026-02-21T09:28:41.8848574Z  'waves_per_eu': 2}
2026-02-21T09:28:41.9480663Z [242s] Fitting surrogate: 514 points, 514 targets
2026-02-21T09:28:42.9043826Z [243s] Generation 5 starting: 92 neighbors, 5 active search path(s)
2026-02-21T09:29:25.4196967Z [286s] Timeout after 30s compiling Config(block_sizes=[256, 2, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T09:29:25.4214669Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 0.3 configs/s
2026-02-21T09:29:31.1491564Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 16.5 configs/s
2026-02-21T09:29:37.2628781Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 156.9 configs/s
2026-02-21T09:29:37.8447399Z [298s] Generation 5 complete: 
2026-02-21T09:29:37.8447661Z error=2
2026-02-21T09:29:37.8447801Z timeout=1
2026-02-21T09:29:37.8447932Z ok=94
2026-02-21T09:29:37.8448066Z min=0.0881
2026-02-21T09:29:37.8448197Z mid=0.1505
2026-02-21T09:29:37.8448333Z max=6.4633
2026-02-21T09:29:37.8448476Z best={'block_sizes': [128, 16, 16],
2026-02-21T09:29:37.8448719Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:29:37.8448946Z  'l2_groupings': [16],
2026-02-21T09:29:37.8449132Z  'load_eviction_policies': ['', ''],
2026-02-21T09:29:37.8449334Z  'loop_orders': [[1, 0]],
2026-02-21T09:29:37.8449520Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:29:37.8449704Z  'num_sm_multiplier': 64,
2026-02-21T09:29:37.8449869Z  'num_stages': 2,
2026-02-21T09:29:37.8450459Z  'num_warps': 2,
2026-02-21T09:29:37.8450634Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:29:37.8450845Z  'range_flattens': [False, False],
2026-02-21T09:29:37.8451050Z  'range_multi_buffers': [False, True],
2026-02-21T09:29:37.8451249Z  'range_num_stages': [4, 1],
2026-02-21T09:29:37.8451430Z  'range_unroll_factors': [3, 0],
2026-02-21T09:29:37.8451623Z  'range_warp_specializes': [],
2026-02-21T09:29:37.8451799Z  'waves_per_eu': 2}
2026-02-21T09:29:37.9464489Z [298s] Fitting surrogate: 611 points, 611 targets
2026-02-21T09:29:38.9092982Z [299s] Generation 6 starting: 88 neighbors, 5 active search path(s)
2026-02-21T09:30:22.5002950Z [343s] Timeout after 30s compiling Config(block_sizes=[128, 2, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:30:23.3320766Z [344s] Timeout after 30s compiling Config(block_sizes=[128, 2, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:30:23.3338406Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 0.2 configs/s
2026-02-21T09:30:28.6129440Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 17.2 configs/s
2026-02-21T09:30:35.4679027Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 155.7 configs/s
2026-02-21T09:30:35.9966714Z [356s] Generation 6 complete: 
2026-02-21T09:30:35.9967064Z error=1
2026-02-21T09:30:35.9967278Z timeout=2
2026-02-21T09:30:35.9967509Z ok=90
2026-02-21T09:30:35.9967711Z min=0.0874
2026-02-21T09:30:35.9967919Z mid=0.1325
2026-02-21T09:30:35.9968114Z max=2.7252
2026-02-21T09:30:35.9968350Z best={'block_sizes': [128, 16, 16],
2026-02-21T09:30:35.9968727Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:30:35.9969093Z  'l2_groupings': [16],
2026-02-21T09:30:35.9969385Z  'load_eviction_policies': ['', ''],
2026-02-21T09:30:35.9969697Z  'loop_orders': [[1, 0]],
2026-02-21T09:30:35.9969982Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:30:35.9970264Z  'num_sm_multiplier': 64,
2026-02-21T09:30:35.9970532Z  'num_stages': 3,
2026-02-21T09:30:35.9970770Z  'num_warps': 2,
2026-02-21T09:30:35.9971040Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:30:35.9971367Z  'range_flattens': [False, False],
2026-02-21T09:30:35.9971687Z  'range_multi_buffers': [False, True],
2026-02-21T09:30:35.9972424Z  'range_num_stages': [4, 1],
2026-02-21T09:30:35.9972718Z  'range_unroll_factors': [3, 0],
2026-02-21T09:30:35.9973024Z  'range_warp_specializes': [],
2026-02-21T09:30:35.9973289Z  'waves_per_eu': 2}
2026-02-21T09:30:36.0802675Z [356s] Fitting surrogate: 704 points, 704 targets
2026-02-21T09:30:36.8882849Z [357s] Generation 7 starting: 72 neighbors, 4 active search path(s)
2026-02-21T09:31:16.3117502Z [397s] Timeout after 30s compiling Config(block_sizes=[128, 2, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:31:16.3141321Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 0.2 configs/s
2026-02-21T09:31:20.7460155Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.9 configs/s
2026-02-21T09:31:24.1190985Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 274.4 configs/s
2026-02-21T09:31:24.5464352Z [405s] Generation 7 complete: 
2026-02-21T09:31:24.5464608Z timeout=1
2026-02-21T09:31:24.5464759Z ok=75
2026-02-21T09:31:24.5464917Z min=0.0846
2026-02-21T09:31:24.5465068Z mid=0.1910
2026-02-21T09:31:24.5465225Z max=5.0350
2026-02-21T09:31:24.5465413Z best={'block_sizes': [256, 16, 16],
2026-02-21T09:31:24.5465688Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:31:24.5465953Z  'l2_groupings': [4],
2026-02-21T09:31:24.5466164Z  'load_eviction_policies': ['', ''],
2026-02-21T09:31:24.5466398Z  'loop_orders': [[0, 1]],
2026-02-21T09:31:24.5466607Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:31:24.5466810Z  'num_stages': 4,
2026-02-21T09:31:24.5467028Z  'num_warps': 2,
2026-02-21T09:31:24.5467209Z  'pid_type': 'flat',
2026-02-21T09:31:24.5467434Z  'range_flattens': [None, False],
2026-02-21T09:31:24.5467674Z  'range_multi_buffers': [None, True],
2026-02-21T09:31:24.5468216Z  'range_num_stages': [0, 1],
2026-02-21T09:31:24.5468428Z  'range_unroll_factors': [0, 1],
2026-02-21T09:31:24.5468645Z  'range_warp_specializes': [],
2026-02-21T09:31:24.5468855Z  'waves_per_eu': 1}
2026-02-21T09:31:24.5893720Z [405s] Fitting surrogate: 780 points, 780 targets
2026-02-21T09:31:25.3269593Z [406s] Generation 8 starting: 68 neighbors, 4 active search path(s)
2026-02-21T09:31:37.2207081Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 12.8 configs/s
2026-02-21T09:31:41.5274647Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 16.6 configs/s
2026-02-21T09:31:44.9429834Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 274.4 configs/s
2026-02-21T09:31:45.4239044Z [426s] Generation 8 complete: 
2026-02-21T09:31:45.4239450Z ok=72
2026-02-21T09:31:45.4239700Z min=0.0777
2026-02-21T09:31:45.4239919Z mid=0.1677
2026-02-21T09:31:45.4240149Z max=3.8257
2026-02-21T09:31:45.4240379Z best={'block_sizes': [256, 16, 16],
2026-02-21T09:31:45.4240772Z  'indexing': ['pointer', 'block_ptr', 'block_ptr'],
2026-02-21T09:31:45.4241144Z  'l2_groupings': [4],
2026-02-21T09:31:45.4241427Z  'load_eviction_policies': ['', ''],
2026-02-21T09:31:45.4241751Z  'loop_orders': [[0, 1]],
2026-02-21T09:31:45.4242038Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:31:45.4242327Z  'num_sm_multiplier': 64,
2026-02-21T09:31:45.4242668Z  'num_stages': 4,
2026-02-21T09:31:45.4242910Z  'num_warps': 2,
2026-02-21T09:31:45.4243880Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:31:45.4244325Z  'range_flattens': [None, False],
2026-02-21T09:31:45.4244635Z  'range_multi_buffers': [False, True],
2026-02-21T09:31:45.4244964Z  'range_num_stages': [2, 1],
2026-02-21T09:31:45.4245256Z  'range_unroll_factors': [3, 1],
2026-02-21T09:31:45.4245971Z  'range_warp_specializes': [],
2026-02-21T09:31:45.4246209Z  'waves_per_eu': 1}
2026-02-21T09:31:45.4667001Z [426s] Fitting surrogate: 852 points, 852 targets
2026-02-21T09:31:46.1222395Z [426s] Generation 9 starting: 57 neighbors, 3 active search path(s)
2026-02-21T09:32:02.1595389Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 0.5 configs/s
2026-02-21T09:32:05.8325064Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 16.7 configs/s
2026-02-21T09:32:09.3163146Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 268.5 configs/s
2026-02-21T09:32:09.8358648Z [450s] Generation 9 complete: 
2026-02-21T09:32:09.8358997Z ok=60
2026-02-21T09:32:09.8359198Z min=0.0778
2026-02-21T09:32:09.8359400Z mid=0.1385
2026-02-21T09:32:09.8359588Z max=1.6785
2026-02-21T09:32:09.8359796Z best={'block_sizes': [256, 16, 16],
2026-02-21T09:32:09.8360171Z  'indexing': ['pointer', 'block_ptr', 'block_ptr'],
2026-02-21T09:32:09.8360511Z  'l2_groupings': [4],
2026-02-21T09:32:09.8360754Z  'load_eviction_policies': ['', ''],
2026-02-21T09:32:09.8361061Z  'loop_orders': [[0, 1]],
2026-02-21T09:32:09.8361336Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:32:09.8361596Z  'num_sm_multiplier': 64,
2026-02-21T09:32:09.8361822Z  'num_stages': 4,
2026-02-21T09:32:09.8362035Z  'num_warps': 2,
2026-02-21T09:32:09.8362277Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:32:09.8362672Z  'range_flattens': [None, False],
2026-02-21T09:32:09.8362965Z  'range_multi_buffers': [False, True],
2026-02-21T09:32:09.8363234Z  'range_num_stages': [2, 1],
2026-02-21T09:32:09.8363479Z  'range_unroll_factors': [3, 1],
2026-02-21T09:32:09.8363739Z  'range_warp_specializes': [],
2026-02-21T09:32:09.8363979Z  'waves_per_eu': 1}
2026-02-21T09:32:09.8805534Z [450s] Fitting surrogate: 912 points, 912 targets
2026-02-21T09:32:10.4475242Z [451s] Generation 10 starting: 50 neighbors, 3 active search path(s)
2026-02-21T09:32:20.2747787Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 1.7 configs/s
2026-02-21T09:32:23.4046138Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 51/51 17.1 configs/s
2026-02-21T09:32:27.2943633Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 242.9 configs/s
2026-02-21T09:32:27.8003050Z [468s] Generation 10 complete: 
2026-02-21T09:32:27.8003386Z error=1
2026-02-21T09:32:27.8003598Z ok=52
2026-02-21T09:32:27.8003810Z min=0.0759
2026-02-21T09:32:27.8004020Z mid=0.0938
2026-02-21T09:32:27.8004223Z max=0.7182
2026-02-21T09:32:27.8004450Z best={'block_sizes': [256, 16, 16],
2026-02-21T09:32:27.8004832Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:32:27.8005200Z  'l2_groupings': [4],
2026-02-21T09:32:27.8005480Z  'load_eviction_policies': ['', ''],
2026-02-21T09:32:27.8005792Z  'loop_orders': [[0, 1]],
2026-02-21T09:32:27.8006119Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:32:27.8006404Z  'num_sm_multiplier': 32,
2026-02-21T09:32:27.8006671Z  'num_stages': 4,
2026-02-21T09:32:27.8006933Z  'num_warps': 2,
2026-02-21T09:32:27.8007197Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:32:27.8007528Z  'range_flattens': [None, False],
2026-02-21T09:32:27.8007838Z  'range_multi_buffers': [False, True],
2026-02-21T09:32:27.8008155Z  'range_num_stages': [2, 1],
2026-02-21T09:32:27.8008441Z  'range_unroll_factors': [3, 1],
2026-02-21T09:32:27.8008746Z  'range_warp_specializes': [],
2026-02-21T09:32:27.8009027Z  'waves_per_eu': 1}
2026-02-21T09:32:27.8610752Z [468s] Fitting surrogate: 965 points, 965 targets
2026-02-21T09:32:28.4748967Z [469s] Generation 11 starting: 55 neighbors, 3 active search path(s)
2026-02-21T09:32:41.8937392Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55/55 1.5 configs/s
2026-02-21T09:32:45.2110714Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 55/55 17.3 configs/s
2026-02-21T09:32:49.8527690Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 205.7 configs/s
2026-02-21T09:32:50.3904708Z [491s] Generation 11 complete: 
2026-02-21T09:32:50.3905032Z error=2
2026-02-21T09:32:50.3905228Z ok=56
2026-02-21T09:32:50.3905425Z min=0.0755
2026-02-21T09:32:50.3905607Z mid=0.0915
2026-02-21T09:32:50.3905782Z max=0.6898
2026-02-21T09:32:50.3905980Z best={'block_sizes': [256, 16, 16],
2026-02-21T09:32:50.3909251Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:32:50.3909638Z  'l2_groupings': [4],
2026-02-21T09:32:50.3909857Z  'load_eviction_policies': ['', ''],
2026-02-21T09:32:50.3910088Z  'loop_orders': [[0, 1]],
2026-02-21T09:32:50.3910292Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:32:50.3910518Z  'num_sm_multiplier': 32,
2026-02-21T09:32:50.3910708Z  'num_stages': 4,
2026-02-21T09:32:50.3910884Z  'num_warps': 2,
2026-02-21T09:32:50.3911146Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:32:50.3911394Z  'range_flattens': [None, False],
2026-02-21T09:32:50.3911614Z  'range_multi_buffers': [False, True],
2026-02-21T09:32:50.3911889Z  'range_num_stages': [1, 1],
2026-02-21T09:32:50.3912104Z  'range_unroll_factors': [3, 1],
2026-02-21T09:32:50.3912320Z  'range_warp_specializes': [],
2026-02-21T09:32:50.3912526Z  'waves_per_eu': 1}
2026-02-21T09:32:50.4675040Z [491s] Fitting surrogate: 1023 points, 1023 targets
2026-02-21T09:32:51.0720248Z [491s] Generation 12 starting: 54 neighbors, 3 active search path(s)
2026-02-21T09:33:24.9389440Z [525s] Timeout after 30s compiling Config(block_sizes=[256, 16, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[1, 1], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:33:28.3527471Z [529s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:33:28.5480732Z [529s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:33:28.8149004Z [529s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:33:29.0894460Z [529s] Timeout after 30s compiling Config(block_sizes=[256, 16, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:33:29.6155709Z [530s] Timeout after 30s compiling Config(block_sizes=[256, 16, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:33:29.7676109Z [530s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:33:29.7699059Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 0.4 configs/s
2026-02-21T09:33:32.4143331Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 54/54 20.8 configs/s
2026-02-21T09:33:37.0233871Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 207.0 configs/s
2026-02-21T09:33:37.5535534Z [538s] Generation 12 complete: 
2026-02-21T09:33:37.5535911Z error=3
2026-02-21T09:33:37.5536141Z timeout=7
2026-02-21T09:33:37.5536346Z ok=47
2026-02-21T09:33:37.5536553Z min=0.0758
2026-02-21T09:33:37.5536756Z mid=0.0864
2026-02-21T09:33:37.5536956Z max=0.2752
2026-02-21T09:33:37.5537173Z best={'block_sizes': [256, 16, 16],
2026-02-21T09:33:37.5537557Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:33:37.5537926Z  'l2_groupings': [4],
2026-02-21T09:33:37.5538201Z  'load_eviction_policies': ['', ''],
2026-02-21T09:33:37.5538516Z  'loop_orders': [[0, 1]],
2026-02-21T09:33:37.5538802Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:33:37.5539088Z  'num_sm_multiplier': 32,
2026-02-21T09:33:37.5539352Z  'num_stages': 4,
2026-02-21T09:33:37.5539583Z  'num_warps': 2,
2026-02-21T09:33:37.5539845Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:33:37.5540205Z  'range_flattens': [None, False],
2026-02-21T09:33:37.5540510Z  'range_multi_buffers': [False, True],
2026-02-21T09:33:37.5540824Z  'range_num_stages': [1, 1],
2026-02-21T09:33:37.5541507Z  'range_unroll_factors': [3, 1],
2026-02-21T09:33:37.5541813Z  'range_warp_specializes': [],
2026-02-21T09:33:37.5542087Z  'waves_per_eu': 1}
2026-02-21T09:33:37.6279722Z [538s] Fitting surrogate: 1080 points, 1080 targets
2026-02-21T09:33:37.9924709Z [538s] Generation 13 starting: 23 neighbors, 2 active search path(s)
2026-02-21T09:34:11.1392121Z [571s] Timeout after 30s compiling Config(block_sizes=[256, 16, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:34:11.1410447Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 0.4 configs/s
2026-02-21T09:34:12.2913014Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 23/23 21.0 configs/s
2026-02-21T09:34:13.4451232Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 717.5 configs/s
2026-02-21T09:34:13.7982102Z [574s] Generation 13 complete: 
2026-02-21T09:34:13.7982451Z error=3
2026-02-21T09:34:13.7982660Z timeout=1
2026-02-21T09:34:13.7982893Z ok=22
2026-02-21T09:34:13.7983089Z min=0.0759
2026-02-21T09:34:13.7983297Z mid=0.1191
2026-02-21T09:34:13.7983493Z max=0.9180
2026-02-21T09:34:13.7983721Z best={'block_sizes': [256, 16, 16],
2026-02-21T09:34:13.7984089Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:34:13.7984451Z  'l2_groupings': [4],
2026-02-21T09:34:13.7984731Z  'load_eviction_policies': ['', ''],
2026-02-21T09:34:13.7985044Z  'loop_orders': [[0, 1]],
2026-02-21T09:34:13.7985358Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:34:13.7986115Z  'num_sm_multiplier': 32,
2026-02-21T09:34:13.7986389Z  'num_stages': 4,
2026-02-21T09:34:13.7986645Z  'num_warps': 2,
2026-02-21T09:34:13.7986914Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:34:13.7987246Z  'range_flattens': [None, False],
2026-02-21T09:34:13.7998200Z  'range_multi_buffers': [False, True],
2026-02-21T09:34:13.7998416Z  'range_num_stages': [1, 1],
2026-02-21T09:34:13.7998583Z  'range_unroll_factors': [3, 1],
2026-02-21T09:34:13.7998756Z  'range_warp_specializes': [],
2026-02-21T09:34:13.7998918Z  'waves_per_eu': 1}
2026-02-21T09:34:13.8129179Z [574s] Fitting surrogate: 1106 points, 1106 targets
2026-02-21T09:34:14.0857200Z [574s] Generation 14 starting: 17 neighbors, 1 active search path(s)
2026-02-21T09:34:45.2173984Z [605s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:34:45.2195481Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.4 configs/s
2026-02-21T09:34:46.1777167Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 18.7 configs/s
2026-02-21T09:34:47.7082478Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 566.2 configs/s
2026-02-21T09:34:48.0492172Z [608s] Generation 14 complete: 
2026-02-21T09:34:48.0492304Z timeout=1
2026-02-21T09:34:48.0492379Z ok=18
2026-02-21T09:34:48.0492461Z min=0.0758
2026-02-21T09:34:48.0492538Z mid=0.0792
2026-02-21T09:34:48.0492616Z max=0.2255
2026-02-21T09:34:48.0492707Z best={'block_sizes': [256, 16, 16],
2026-02-21T09:34:48.0492852Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:34:48.0493012Z  'l2_groupings': [4],
2026-02-21T09:34:48.0493126Z  'load_eviction_policies': ['', ''],
2026-02-21T09:34:48.0493557Z  'loop_orders': [[0, 1]],
2026-02-21T09:34:48.0493672Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:34:48.0493778Z  'num_sm_multiplier': 32,
2026-02-21T09:34:48.0493874Z  'num_stages': 4,
2026-02-21T09:34:48.0493962Z  'num_warps': 2,
2026-02-21T09:34:48.0494664Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:34:48.0494789Z  'range_flattens': [None, False],
2026-02-21T09:34:48.0494903Z  'range_multi_buffers': [False, True],
2026-02-21T09:34:48.0495018Z  'range_num_stages': [1, 1],
2026-02-21T09:34:48.0495121Z  'range_unroll_factors': [3, 1],
2026-02-21T09:34:48.0495231Z  'range_warp_specializes': [],
2026-02-21T09:34:48.0495336Z  'waves_per_eu': 1}
2026-02-21T09:34:48.0711041Z [608s] Fitting surrogate: 1125 points, 1125 targets
2026-02-21T09:34:48.3637932Z [609s] Generation 15 starting: 19 neighbors, 1 active search path(s)
2026-02-21T09:35:19.9193208Z [640s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:35:20.0583118Z [640s] Timeout after 30s compiling Config(block_sizes=[256, 16, 32], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:35:20.0595605Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 0.5 configs/s
2026-02-21T09:35:21.0912238Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 19.4 configs/s
2026-02-21T09:35:22.7660122Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 521.7 configs/s
2026-02-21T09:35:23.1605332Z [643s] Generation 15 complete: 
2026-02-21T09:35:23.1605562Z timeout=2
2026-02-21T09:35:23.1605692Z ok=19
2026-02-21T09:35:23.1605826Z min=0.0759
2026-02-21T09:35:23.1605955Z mid=0.0876
2026-02-21T09:35:23.1606090Z max=0.2290
2026-02-21T09:35:23.1606234Z best={'block_sizes': [256, 16, 16],
2026-02-21T09:35:23.1606479Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:35:23.1606710Z  'l2_groupings': [4],
2026-02-21T09:35:23.1606891Z  'load_eviction_policies': ['', ''],
2026-02-21T09:35:23.1607093Z  'loop_orders': [[0, 1]],
2026-02-21T09:35:23.1607271Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:35:23.1607451Z  'num_sm_multiplier': 32,
2026-02-21T09:35:23.1607656Z  'num_stages': 4,
2026-02-21T09:35:23.1607805Z  'num_warps': 2,
2026-02-21T09:35:23.1607976Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:35:23.1608485Z  'range_flattens': [None, False],
2026-02-21T09:35:23.1608678Z  'range_multi_buffers': [False, True],
2026-02-21T09:35:23.1608882Z  'range_num_stages': [1, 1],
2026-02-21T09:35:23.1609067Z  'range_unroll_factors': [3, 1],
2026-02-21T09:35:23.1609270Z  'range_warp_specializes': [],
2026-02-21T09:35:23.1609441Z  'waves_per_eu': 1}
2026-02-21T09:35:23.1863949Z [643s] Fitting surrogate: 1146 points, 1146 targets
2026-02-21T09:35:23.4821953Z [644s] Generation 16 starting: 19 neighbors, 1 active search path(s)
2026-02-21T09:35:55.5368613Z [676s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:35:55.6533302Z [676s] Timeout after 30s compiling Config(block_sizes=[256, 8, 16], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:35:55.8984805Z [676s] Timeout after 30s compiling Config(block_sizes=[1024, 16, 16], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:35:55.9006618Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 0.6 configs/s
2026-02-21T09:35:56.8190738Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 22.1 configs/s
2026-02-21T09:35:58.7583021Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 668.7 configs/s
2026-02-21T09:35:59.1407684Z [679s] Generation 16 complete: 
2026-02-21T09:35:59.1408030Z error=1
2026-02-21T09:35:59.1408244Z timeout=3
2026-02-21T09:35:59.1408447Z ok=17
2026-02-21T09:35:59.1408648Z min=0.0759
2026-02-21T09:35:59.1408880Z mid=0.0876
2026-02-21T09:35:59.1409083Z max=0.3048
2026-02-21T09:35:59.1409322Z best={'block_sizes': [256, 16, 16],
2026-02-21T09:35:59.1409701Z  'indexing': ['pointer', 'pointer', 'block_ptr'],
2026-02-21T09:35:59.1410113Z  'l2_groupings': [4],
2026-02-21T09:35:59.1410394Z  'load_eviction_policies': ['', ''],
2026-02-21T09:35:59.1410708Z  'loop_orders': [[0, 1]],
2026-02-21T09:35:59.1411029Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:35:59.1411308Z  'num_sm_multiplier': 32,
2026-02-21T09:35:59.1411574Z  'num_stages': 4,
2026-02-21T09:35:59.1411799Z  'num_warps': 2,
2026-02-21T09:35:59.1412061Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:35:59.1412388Z  'range_flattens': [None, False],
2026-02-21T09:35:59.1412693Z  'range_multi_buffers': [False, True],
2026-02-21T09:35:59.1413001Z  'range_num_stages': [1, 1],
2026-02-21T09:35:59.1413286Z  'range_unroll_factors': [3, 1],
2026-02-21T09:35:59.1413583Z  'range_warp_specializes': [],
2026-02-21T09:35:59.1413851Z  'waves_per_eu': 1}
2026-02-21T09:35:59.1576237Z [679s] Fitting surrogate: 1167 points, 1167 targets
2026-02-21T09:35:59.2870808Z [679s] Autotuning complete in 680.0s after searching 1100 configs.
2026-02-21T09:35:59.2871353Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:35:59.2873413Z     @helion.kernel(config=helion.Config(block_sizes=[256, 16, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 1], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:35:59.2875673Z 
2026-02-21T09:35:59.2876131Z [679s] Code of selected kernel: /tmp/torchinductor_root/ll/cll53dqlm5eq55uqbwntdste57uy2gcyxcujca5g2h2jgsbmdcpl.py
2026-02-21T09:35:59.3034585Z from __future__ import annotations
2026-02-21T09:35:59.3034734Z 
2026-02-21T09:35:59.3034790Z import torch
2026-02-21T09:35:59.3034919Z import helion
2026-02-21T09:35:59.3035044Z import triton
2026-02-21T09:35:59.3035183Z import triton.language as tl
2026-02-21T09:35:59.3035430Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:35:59.3035620Z 
2026-02-21T09:35:59.3035691Z _BLOCK_SIZE_1 = tl.constexpr(16)
2026-02-21T09:35:59.3035868Z _BLOCK_SIZE_2 = tl.constexpr(16)
2026-02-21T09:35:59.3036038Z _BLOCK_SIZE_0 = tl.constexpr(256)
2026-02-21T09:35:59.3036152Z 
2026-02-21T09:35:59.3036212Z @triton.jit
2026-02-21T09:35:59.3036500Z def _helion_matmul_bf16_int4(A, B, C, _NUM_SM: tl.constexpr, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr):
2026-02-21T09:35:59.3036900Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:35:59.3037227Z     # src[int4_gemm.py:58]:     acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:35:59.3037494Z     # src[int4_gemm.py:57-91]: ...
2026-02-21T09:35:59.3037727Z     total_pids = tl.cdiv(16, _BLOCK_SIZE_1) * tl.cdiv(7168, _BLOCK_SIZE_2)
2026-02-21T09:35:59.3038195Z     for virtual_pid in tl.range(tl.program_id(0), total_pids, _NUM_SM * 32, loop_unroll_factor=3, num_stages=1, disallow_acc_multi_buffer=True):
2026-02-21T09:35:59.3038841Z         # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:35:59.3039093Z         num_pid_m = tl.cdiv(16, _BLOCK_SIZE_1)
2026-02-21T09:35:59.3039293Z         num_pid_n = tl.cdiv(7168, _BLOCK_SIZE_2)
2026-02-21T09:35:59.3039485Z         inner_2d_pid = virtual_pid
2026-02-21T09:35:59.3039678Z         num_pid_in_group = 4 * num_pid_n
2026-02-21T09:35:59.3039879Z         group_id = inner_2d_pid // num_pid_in_group
2026-02-21T09:35:59.3040079Z         first_pid_m = group_id * 4
2026-02-21T09:35:59.3040280Z         group_size_m = min(num_pid_m - first_pid_m, 4)
2026-02-21T09:35:59.3040540Z         pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m
2026-02-21T09:35:59.3040826Z         pid_1 = inner_2d_pid % num_pid_in_group // group_size_m
2026-02-21T09:35:59.3041050Z         offset_1 = pid_0 * _BLOCK_SIZE_1
2026-02-21T09:35:59.3041287Z         indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32)
2026-02-21T09:35:59.3041533Z         offset_2 = pid_1 * _BLOCK_SIZE_2
2026-02-21T09:35:59.3041759Z         indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32)
2026-02-21T09:35:59.3042068Z         # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:35:59.3042370Z         acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32)
2026-02-21T09:35:59.3042761Z         # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:35:59.3043177Z         # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:35:59.3043576Z         # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:35:59.3043853Z         # src[int4_gemm.py:60-89]: ...
2026-02-21T09:35:59.3044256Z         for offset_3 in tl.range(0, 4096, _BLOCK_SIZE_0, loop_unroll_factor=1, num_stages=1, disallow_acc_multi_buffer=False, flatten=False):
2026-02-21T09:35:59.3044698Z             indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
2026-02-21T09:35:59.3044936Z             acc_copy = acc
2026-02-21T09:35:59.3045162Z             acc_copy_0 = acc_copy
2026-02-21T09:35:59.3045379Z             # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2
2026-02-21T09:35:59.3045611Z             mul = 2 * offset_3
2026-02-21T09:35:59.3045870Z             # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:35:59.3046158Z             iota = mul + tl.arange(0, mul_1)
2026-02-21T09:35:59.3046405Z             load = tl.load(A + (indices_1[:, None] * 8192 + iota[None, :] * 1), None)
2026-02-21T09:35:59.3046739Z             # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:35:59.3047036Z             # src[int4_gemm.py:66]:     torch.float32
2026-02-21T09:35:59.3047274Z             # src[int4_gemm.py:67]: )  # [BLOCK_SIZE_M, BLOCK_SIZE_K]
2026-02-21T09:35:59.3047460Z             v_0 = tl.cast(load, tl.float32)
2026-02-21T09:35:59.3047686Z             # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n]  # [BLOCK_SIZE_K//2, BLOCK_SIZE_N]
2026-02-21T09:35:59.3047978Z             b_tile = tl.load(B + (indices_3[:, None] * 7168 + indices_2[None, :] * 1), None)
2026-02-21T09:35:59.3048267Z             # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8)  # Sign-extend low 4 bits
2026-02-21T09:35:59.3048487Z             v_1 = tl.full([], 4, tl.int8)
2026-02-21T09:35:59.3048633Z             v_2 = b_tile << v_1
2026-02-21T09:35:59.3048766Z             v_3 = tl.full([], 4, tl.int8)
2026-02-21T09:35:59.3048903Z             v_4 = v_2 >> v_3
2026-02-21T09:35:59.3049097Z             # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8)  # Sign-extend high 4 bits
2026-02-21T09:35:59.3049313Z             v_5 = tl.full([], 4, tl.int8)
2026-02-21T09:35:59.3049455Z             v_6 = b_tile >> v_5
2026-02-21T09:35:59.3049675Z             # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1)
2026-02-21T09:35:59.3049875Z             stack_idx = tl.arange(0, 2)
2026-02-21T09:35:59.3050042Z             broadcast_idx = stack_idx[None, :, None]
2026-02-21T09:35:59.3050216Z             expanded_0 = tl.expand_dims(v_4, 1)
2026-02-21T09:35:59.3050372Z             expanded_1 = tl.expand_dims(v_6, 1)
2026-02-21T09:35:59.3050538Z             stacked_result = tl.zeros_like(expanded_0)
2026-02-21T09:35:59.3050699Z             mask_0 = broadcast_idx == 0
2026-02-21T09:35:59.3050886Z             stacked_result = tl.where(mask_0, expanded_0, stacked_result)
2026-02-21T09:35:59.3051081Z             mask_1 = broadcast_idx == 1
2026-02-21T09:35:59.3051257Z             stacked_result = tl.where(mask_1, expanded_1, stacked_result)
2026-02-21T09:35:59.3051489Z             # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:35:59.3051716Z             # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:35:59.3051954Z             # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:35:59.3052157Z             view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2])
2026-02-21T09:35:59.3052355Z             v_7 = tl.cast(view, tl.float32)
2026-02-21T09:35:59.3052577Z             # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2)  # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1]
2026-02-21T09:35:59.3052798Z             a_tile_1 = v_0[:, :, None]
2026-02-21T09:35:59.3052982Z             # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0)
2026-02-21T09:35:59.3053172Z             b_unpacked_1 = v_7[None, :, :]
2026-02-21T09:35:59.3053407Z             # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1)  # [BLOCK_SIZE_M, BLOCK_SIZE_N]
2026-02-21T09:35:59.3053653Z             v_8 = a_tile_1 * b_unpacked_1
2026-02-21T09:35:59.3053810Z             sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32)
2026-02-21T09:35:59.3053969Z             acc = acc_copy_0 + sum_1
2026-02-21T09:35:59.3054157Z         # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16)
2026-02-21T09:35:59.3054348Z         v_10 = tl.cast(acc, tl.bfloat16)
2026-02-21T09:35:59.3054699Z         tl.store(tl.make_block_ptr(C, [16, 7168], [7168, 1], [offset_1, offset_2], [_BLOCK_SIZE_1, _BLOCK_SIZE_2], [1, 0]), v_10, boundary_check=[0, 1])
2026-02-21T09:35:59.3054969Z 
2026-02-21T09:35:59.3055075Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher):
2026-02-21T09:35:59.3055267Z     """
2026-02-21T09:35:59.3055399Z     BFloat16 x INT4 General Matrix Multiplication (GEMM).
2026-02-21T09:35:59.3055525Z 
2026-02-21T09:35:59.3055602Z     This kernel performs matrix multiplication where:
2026-02-21T09:35:59.3055798Z     - A is a bfloat16 matrix of shape [M, K]
2026-02-21T09:35:59.3055964Z     - B is an int8 matrix of shape [K//2, N] containing packed int4 values
2026-02-21T09:35:59.3056132Z       (two 4-bit values packed into each int8)
2026-02-21T09:35:59.3056220Z 
2026-02-21T09:35:59.3056253Z     Args:
2026-02-21T09:35:59.3056371Z         A (Tensor): Input tensor of shape [M, K] in bfloat16 format.
2026-02-21T09:35:59.3056549Z         B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format.
2026-02-21T09:35:59.3056670Z 
2026-02-21T09:35:59.3056702Z     Returns:
2026-02-21T09:35:59.3056817Z         Tensor: Output tensor of shape [M, N] in bfloat16 format.
2026-02-21T09:35:59.3056953Z     """
2026-02-21T09:35:59.3057037Z     # src[int4_gemm.py:50]: M, K = A.shape
2026-02-21T09:35:59.3057150Z     M, K = A.shape
2026-02-21T09:35:59.3057247Z     # src[int4_gemm.py:51]: _, N = B.shape
2026-02-21T09:35:59.3057355Z     _, N = B.shape
2026-02-21T09:35:59.3057501Z     # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:35:59.3057702Z     C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:35:59.3057880Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:35:59.3058040Z     _NUM_SM = helion.runtime.get_num_sm(A.device)
2026-02-21T09:35:59.3058292Z     # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:35:59.3058561Z     # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:35:59.3058818Z     # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:35:59.3058996Z     # src[int4_gemm.py:60-89]: ...
2026-02-21T09:35:59.3059104Z     _BLOCK_SIZE_0 = 256
2026-02-21T09:35:59.3059230Z     # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:35:59.3059410Z     # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:35:59.3059579Z     # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:35:59.3059705Z     _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0
2026-02-21T09:35:59.3059846Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:35:59.3060039Z     # src[int4_gemm.py:58]:     acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:35:59.3060208Z     # src[int4_gemm.py:57-91]: ...
2026-02-21T09:35:59.3060341Z     _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0)
2026-02-21T09:35:59.3060649Z     _launcher(_helion_matmul_bf16_int4, (_NUM_SM * 32,), A, B, C, _NUM_SM, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=2, num_stages=4, waves_per_eu=1, matrix_instr_nonkdim=16)
2026-02-21T09:35:59.3060935Z     # src[int4_gemm.py:93]: return C
2026-02-21T09:35:59.3061044Z     return C
2026-02-21T09:36:00.2613509Z WARNING:tritonbench.utils.triton_op:Completed input ID 10:
2026-02-21T09:36:00.2613958Z x_val
2026-02-21T09:36:00.2614176Z -------------------
2026-02-21T09:36:00.2614420Z (16, 1, 7168, 8192)
2026-02-21T09:36:00.2614562Z 
2026-02-21T09:36:00.2635038Z  40%|████      | 4/10 [33:21<53:27, 534.58s/it]WARNING:tritonbench.utils.triton_op:Running input ID 14:
2026-02-21T09:36:00.2635470Z x_val
2026-02-21T09:36:00.2635644Z -------------------
2026-02-21T09:36:00.2635838Z (64, 1, 7168, 8192)
2026-02-21T09:36:00.2637341Z INFO:tritonbench.utils.triton_op:Took 0.14ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T09:36:01.3270280Z INFO:tritonbench.utils.triton_op:Took 4.45ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T09:36:03.7131543Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T09:36:03.7149944Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:36:03.7150081Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:36:03.7150207Z               'dtype': 'torch.bfloat16',
2026-02-21T09:36:03.7150327Z               'shape': (64, 1, 8192),
2026-02-21T09:36:03.7150439Z               'stride': (8192, 8192, 1)},
2026-02-21T09:36:03.7150556Z             { 'device': 'cuda:0',
2026-02-21T09:36:03.7150668Z               'dtype': 'torch.int32',
2026-02-21T09:36:03.7150777Z               'shape': (8192, 7168),
2026-02-21T09:36:03.7150889Z               'stride': (7168, 1)}),
2026-02-21T09:36:03.7150993Z   'kwargs': {}}
2026-02-21T09:36:03.7186350Z INFO:tritonbench.utils.triton_op:Took 3.65ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T09:36:03.8956365Z [0s] Autotune random seed: 2138032649
2026-02-21T09:36:03.9162935Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:36:40.2709481Z [36s] Timeout after 30s compiling Config(block_sizes=[512, 2, 64], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:36:41.3564197Z [37s] Timeout after 30s compiling Config(block_sizes=[2048, 16, 8], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[0, 2], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:36:43.5023387Z [39s] Timeout after 30s compiling Config(block_sizes=[4096, 32, 4], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:36:44.4451176Z [40s] Timeout after 30s compiling Config(block_sizes=[1024, 2, 32], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[4, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:36:48.2928875Z [44s] Timeout after 30s compiling Config(block_sizes=[128, 1, 512], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[0, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:36:48.2948935Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s
2026-02-21T09:36:59.0164445Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 9.3 configs/s
2026-02-21T09:36:59.0173456Z [55s] Adaptive compile timeout: 30s (90% percentile=13.4s, bounds=[30.0s, 30s])
2026-02-21T09:36:59.1040960Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 754/754 - configs/s
2026-02-21T09:36:59.8466852Z [55s] Initial random population of 100, 5 starting points: 
2026-02-21T09:36:59.8469666Z error=2
2026-02-21T09:36:59.8469941Z timeout=5
2026-02-21T09:36:59.8470160Z ok=93
2026-02-21T09:36:59.8470362Z min=0.2463
2026-02-21T09:36:59.8473466Z mid=3.2506
2026-02-21T09:36:59.8473567Z max=153.5757
2026-02-21T09:36:59.8473671Z best={'block_sizes': [16, 16, 16],
2026-02-21T09:36:59.8473814Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:36:59.8473944Z  'l2_groupings': [1],
2026-02-21T09:36:59.8474048Z  'load_eviction_policies': ['', ''],
2026-02-21T09:36:59.8474164Z  'loop_orders': [[0, 1]],
2026-02-21T09:36:59.8474268Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:36:59.8474368Z  'num_stages': 1,
2026-02-21T09:36:59.8474455Z  'num_warps': 4,
2026-02-21T09:36:59.8474540Z  'pid_type': 'flat',
2026-02-21T09:36:59.8474648Z  'range_flattens': [None, None],
2026-02-21T09:36:59.8474770Z  'range_multi_buffers': [None, None],
2026-02-21T09:36:59.8474883Z  'range_num_stages': [0, 0],
2026-02-21T09:36:59.8474995Z  'range_unroll_factors': [0, 0],
2026-02-21T09:36:59.8475103Z  'range_warp_specializes': [],
2026-02-21T09:36:59.8475209Z  'waves_per_eu': 1}
2026-02-21T09:36:59.8489477Z [55s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:37:00.8420503Z [56s] Generation 1 starting: 100 neighbors, 5 active search path(s)
2026-02-21T09:37:18.3895832Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 104/104 3.4 configs/s
2026-02-21T09:37:21.5539237Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:21.5541766Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:37:21.5543258Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:37:21.5544118Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:37:21.5544901Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [16, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:37:21.5545599Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:37:21.5546240Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:21.5546457Z #smem = #ttg.shared_memory
2026-02-21T09:37:21.5546712Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:21.5547189Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:21.5547593Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x32xf32, #mma>
2026-02-21T09:37:21.5547760Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:21.5547901Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:37:21.5548015Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:21.5548130Z     %c224_i32 = arith.constant 224 : i32
2026-02-21T09:37:21.5548247Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:21.5548356Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:21.5548498Z     %cst_0 = arith.constant dense<0> : tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5548646Z     %c4092_i32 = arith.constant 4092 : i32
2026-02-21T09:37:21.5548764Z     %c12_i32 = arith.constant 12 : i32
2026-02-21T09:37:21.5548872Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:37:21.5549056Z     %cst_1 = arith.constant dense<8184> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.5549357Z     %cst_2 = arith.constant dense<4092> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5549762Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:21.5549979Z     %cst_4 = arith.constant dense<7168> : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5550232Z     %cst_5 = arith.constant dense<4> : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5550446Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:21.5550618Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:21.5550783Z     %cst_8 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:21.5550928Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:21.5551040Z     %1 = arith.divsi %0, %c16_i32 : i32
2026-02-21T09:37:21.5551157Z     %2 = arith.muli %1, %c4_i32 : i32
2026-02-21T09:37:21.5551271Z     %3 = arith.subi %c224_i32, %2 : i32
2026-02-21T09:37:21.5551385Z     %4 = arith.minsi %3, %c4_i32 : i32
2026-02-21T09:37:21.5551498Z     %5 = arith.remsi %0, %c16_i32 : i32
2026-02-21T09:37:21.5551608Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:37:21.5551718Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:37:21.5551820Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:37:21.5551929Z     %9 = arith.muli %7, %c32_i32 : i32
2026-02-21T09:37:21.5552162Z     %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5552471Z     %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:21.5552758Z     %12 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5553004Z     %13 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:21.5553289Z     %14 = arith.addi %12, %10 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5553537Z     %15 = arith.addi %13, %11 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:21.5553697Z     %16 = arith.muli %8, %c16_i32 : i32
2026-02-21T09:37:21.5553889Z     %17 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:21.5554156Z     %18 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:21.5554397Z     %19 = tt.splat %16 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:21.5554605Z     %20 = tt.splat %16 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:21.5554811Z     %21 = arith.addi %19, %17 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:21.5555019Z     %22 = arith.addi %20, %18 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:21.5555292Z     %23 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5555601Z     %24 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.5555899Z     %25 = tt.expand_dims %21 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:21.5556146Z     %26 = arith.muli %25, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:21.5556334Z     %27 = tt.broadcast %26 : tensor<16x1xi32, #blocked1> -> tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5556546Z     %28 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:21.5556890Z     %29 = tt.expand_dims %14 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5557313Z     %30 = tt.broadcast %29 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5557673Z     %31 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5557976Z     %32 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:21.5558378Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:21.5558772Z     %34 = tt.expand_dims %33 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:21.5559025Z     %35 = arith.cmpi eq, %34, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:21.5559220Z     %36 = tt.broadcast %35 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked>
2026-02-21T09:37:21.5559417Z     %37 = arith.cmpi eq, %34, %cst_7 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:21.5559602Z     %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked>
2026-02-21T09:37:21.5559866Z     %39 = scf.for %arg3 = %c0_i32 to %c4092_i32 step %c12_i32 iter_args(%arg4 = %cst) -> (tensor<16x32xf32, #mma>)  : i32 {
2026-02-21T09:37:21.5560172Z       %79 = tt.splat %arg3 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5560475Z       %80 = arith.addi %79, %23 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5560687Z       %81 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:37:21.5560857Z       %82 = tt.splat %81 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.5561106Z       %83 = arith.addi %82, %24 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.5561373Z       %84 = tt.expand_dims %83 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:37:21.5561646Z       %85 = tt.broadcast %84 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5561836Z       %86 = arith.addi %27, %85 : tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5562029Z       %87 = tt.addptr %28, %86 : tensor<16x8x!tt.ptr<bf16>, #blocked1>, tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5562232Z       %88 = tt.load %87 : tensor<16x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:21.5562444Z       %89 = ttg.local_alloc %88 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem>
2026-02-21T09:37:21.5562837Z       %90 = ttg.local_load %89 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5563240Z       %91 = arith.extf %90 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5563691Z       %92 = tt.expand_dims %80 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5564040Z       %93 = arith.muli %92, %cst_4 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5564342Z       %94 = tt.broadcast %93 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5564636Z       %95 = arith.addi %94, %30 : tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5564938Z       %96 = tt.addptr %31, %95 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5565242Z       %97 = tt.load %96 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5565512Z       %98 = arith.shli %97, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5565740Z       %99 = arith.shrsi %98, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5565971Z       %100 = arith.shrsi %97, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5566255Z       %101 = tt.expand_dims %99 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:37:21.5566587Z       %102 = tt.expand_dims %100 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:37:21.5566868Z       %103 = tt.broadcast %101 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5567108Z       %104 = arith.select %36, %103, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5567343Z       %105 = tt.broadcast %102 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5567583Z       %106 = arith.select %38, %105, %104 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5567807Z       %107 = tt.reshape %106 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:37:21.5568025Z       %108 = arith.sitofp %107 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:37:21.5568275Z       %109 = ttg.local_alloc %108 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:37:21.5568595Z       %110 = ttg.local_load %109 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5569101Z       %111 = tt.dot %91, %110, %arg4, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma>
2026-02-21T09:37:21.5569454Z       %112 = arith.addi %arg3, %c4_i32 : i32
2026-02-21T09:37:21.5569665Z       %113 = tt.splat %112 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5569969Z       %114 = arith.addi %113, %23 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5570181Z       %115 = arith.muli %112, %c2_i32 : i32
2026-02-21T09:37:21.5570351Z       %116 = tt.splat %115 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.5570577Z       %117 = arith.addi %116, %24 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.5570856Z       %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:37:21.5571135Z       %119 = tt.broadcast %118 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5571328Z       %120 = arith.addi %27, %119 : tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5571531Z       %121 = tt.addptr %28, %120 : tensor<16x8x!tt.ptr<bf16>, #blocked1>, tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5571737Z       %122 = tt.load %121 : tensor<16x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:21.5571961Z       %123 = ttg.local_alloc %122 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem>
2026-02-21T09:37:21.5572288Z       %124 = ttg.local_load %123 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5572690Z       %125 = arith.extf %124 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5573150Z       %126 = tt.expand_dims %114 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5573620Z       %127 = arith.muli %126, %cst_4 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5573931Z       %128 = tt.broadcast %127 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5574233Z       %129 = arith.addi %128, %30 : tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5574538Z       %130 = tt.addptr %31, %129 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5574846Z       %131 = tt.load %130 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5575076Z       %132 = arith.shli %131, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5575310Z       %133 = arith.shrsi %132, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5575546Z       %134 = arith.shrsi %131, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5575838Z       %135 = tt.expand_dims %133 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:37:21.5576164Z       %136 = tt.expand_dims %134 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:37:21.5576444Z       %137 = tt.broadcast %135 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5576678Z       %138 = arith.select %36, %137, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5576912Z       %139 = tt.broadcast %136 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5577183Z       %140 = arith.select %38, %139, %138 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5577409Z       %141 = tt.reshape %140 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:37:21.5577631Z       %142 = arith.sitofp %141 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:37:21.5577876Z       %143 = ttg.local_alloc %142 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:37:21.5578201Z       %144 = ttg.local_load %143 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5578667Z       %145 = tt.dot %125, %144, %111, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma>
2026-02-21T09:37:21.5579011Z       %146 = arith.addi %arg3, %c8_i32 : i32
2026-02-21T09:37:21.5579225Z       %147 = tt.splat %146 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5579520Z       %148 = arith.addi %147, %23 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5579734Z       %149 = arith.muli %146, %c2_i32 : i32
2026-02-21T09:37:21.5579905Z       %150 = tt.splat %149 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.5580126Z       %151 = arith.addi %150, %24 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.5580401Z       %152 = tt.expand_dims %151 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:37:21.5580674Z       %153 = tt.broadcast %152 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5580867Z       %154 = arith.addi %27, %153 : tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5581071Z       %155 = tt.addptr %28, %154 : tensor<16x8x!tt.ptr<bf16>, #blocked1>, tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5581277Z       %156 = tt.load %155 : tensor<16x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:21.5581533Z       %157 = ttg.local_alloc %156 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem>
2026-02-21T09:37:21.5581857Z       %158 = ttg.local_load %157 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5582257Z       %159 = arith.extf %158 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5582711Z       %160 = tt.expand_dims %148 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5583062Z       %161 = arith.muli %160, %cst_4 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5583369Z       %162 = tt.broadcast %161 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5583673Z       %163 = arith.addi %162, %30 : tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5583979Z       %164 = tt.addptr %31, %163 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5584287Z       %165 = tt.load %164 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5584513Z       %166 = arith.shli %165, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5584746Z       %167 = arith.shrsi %166, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5584979Z       %168 = arith.shrsi %165, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5585302Z       %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:37:21.5585638Z       %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:37:21.5585918Z       %171 = tt.broadcast %169 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5586156Z       %172 = arith.select %36, %171, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5586391Z       %173 = tt.broadcast %170 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5586629Z       %174 = arith.select %38, %173, %172 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5586853Z       %175 = tt.reshape %174 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:37:21.5587072Z       %176 = arith.sitofp %175 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:37:21.5587320Z       %177 = ttg.local_alloc %176 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:37:21.5587644Z       %178 = ttg.local_load %177 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5588107Z       %179 = tt.dot %159, %178, %145, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma>
2026-02-21T09:37:21.5588447Z       scf.yield %179 : tensor<16x32xf32, #mma>
2026-02-21T09:37:21.5588569Z     } {tt.flatten}
2026-02-21T09:37:21.5588753Z     %40 = arith.addi %23, %cst_2 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.5589016Z     %41 = arith.addi %24, %cst_1 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.5589294Z     %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:37:21.5589597Z     %43 = tt.broadcast %42 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5589787Z     %44 = arith.addi %27, %43 : tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5589980Z     %45 = tt.addptr %28, %44 : tensor<16x8x!tt.ptr<bf16>, #blocked1>, tensor<16x8xi32, #blocked1>
2026-02-21T09:37:21.5590180Z     %46 = tt.load %45 : tensor<16x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:21.5590393Z     %47 = ttg.local_alloc %46 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem>
2026-02-21T09:37:21.5590708Z     %48 = ttg.local_load %47 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5591104Z     %49 = arith.extf %48 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5591554Z     %50 = tt.expand_dims %40 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5591902Z     %51 = arith.muli %50, %cst_4 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5592199Z     %52 = tt.broadcast %51 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5592493Z     %53 = arith.addi %52, %30 : tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5592795Z     %54 = tt.addptr %31, %53 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5593094Z     %55 = tt.load %54 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5593345Z     %56 = arith.shli %55, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5593571Z     %57 = arith.shrsi %56, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5593798Z     %58 = arith.shrsi %55, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.5594076Z     %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:37:21.5594402Z     %60 = tt.expand_dims %58 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:37:21.5594674Z     %61 = tt.broadcast %59 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5594906Z     %62 = arith.select %36, %61, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5595131Z     %63 = tt.broadcast %60 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5595356Z     %64 = arith.select %38, %63, %62 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:37:21.5595576Z     %65 = tt.reshape %64 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:37:21.5595785Z     %66 = arith.sitofp %65 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:37:21.5596024Z     %67 = ttg.local_alloc %66 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:37:21.5596337Z     %68 = ttg.local_load %67 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.5596790Z     %69 = tt.dot %49, %68, %39, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma>
2026-02-21T09:37:21.5597166Z     %70 = arith.truncf %69 : tensor<16x32xf32, #mma> to tensor<16x32xbf16, #mma>
2026-02-21T09:37:21.5597420Z     %71 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:21.5606127Z     %72 = arith.muli %71, %cst_8 : tensor<16x1xi32, #mma>
2026-02-21T09:37:21.5606349Z     %73 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma>
2026-02-21T09:37:21.5606598Z     %74 = tt.broadcast %72 : tensor<16x1xi32, #mma> -> tensor<16x32xi32, #mma>
2026-02-21T09:37:21.5606792Z     %75 = tt.broadcast %73 : tensor<1x32xi32, #mma> -> tensor<16x32xi32, #mma>
2026-02-21T09:37:21.5606960Z     %76 = arith.addi %74, %75 : tensor<16x32xi32, #mma>
2026-02-21T09:37:21.5607129Z     %77 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:21.5607331Z     %78 = tt.addptr %77, %76 : tensor<16x32x!tt.ptr<bf16>, #mma>, tensor<16x32xi32, #mma>
2026-02-21T09:37:21.5607516Z     tt.store %78, %70 : tensor<16x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:21.5607644Z     tt.return
2026-02-21T09:37:21.5607729Z   }
2026-02-21T09:37:21.5607807Z }
2026-02-21T09:37:21.5607851Z 
2026-02-21T09:37:21.5607891Z {-#
2026-02-21T09:37:21.5607972Z   external_resources: {
2026-02-21T09:37:21.5608071Z     mlir_reproducer: {
2026-02-21T09:37:21.5609064Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:21.5610065Z       disable_threading: false,
2026-02-21T09:37:21.5610173Z       verify_each: true
2026-02-21T09:37:21.5610310Z     }
2026-02-21T09:37:21.5610385Z   }
2026-02-21T09:37:21.5610454Z #-}
2026-02-21T09:37:21.5610733Z /tmp/torchinductor_root/td/ctd7r2a7zhsond3v5i5kxjbytnvurxpyiaxknhm5ys67dxidw5k7.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:21.5611409Z /tmp/torchinductor_root/td/ctd7r2a7zhsond3v5i5kxjbytnvurxpyiaxknhm5ys67dxidw5k7.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:21.5611957Z [77s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:21.5612679Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 32], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:21.5613325Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:21.5613493Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:21.9189174Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:21.9192562Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:37:21.9193420Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:37:21.9194156Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:37:21.9194864Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:21.9195753Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:21.9196198Z #smem = #ttg.shared_memory
2026-02-21T09:37:21.9196762Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:21.9197914Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:21.9198794Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x128xf32, #mma>
2026-02-21T09:37:21.9199129Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:21.9199375Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:21.9199623Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:21.9199855Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:21.9200147Z     %cst_0 = arith.constant dense<0> : tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9200461Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:21.9200706Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:37:21.9200954Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:21.9201176Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:21.9201414Z     %c224_i32 = arith.constant 224 : i32
2026-02-21T09:37:21.9201643Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:21.9201881Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:21.9202279Z     %cst_1 = arith.constant dense<29352960> : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9202930Z     %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.9203382Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:21.9203916Z     %cst_4 = arith.constant dense<4> : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9204359Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:21.9204712Z     %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:21.9205057Z     %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:21.9205354Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:21.9205745Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:21.9206302Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:21.9206936Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.9207530Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:21.9208025Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.9208381Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:21.9208725Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9209166Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:21.9209762Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:21.9210342Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:21.9210705Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:21.9210998Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked>
2026-02-21T09:37:21.9211332Z     %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:21.9211609Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked>
2026-02-21T09:37:21.9211912Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:21.9212214Z     %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.9212612Z     %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:21.9213007Z     %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9213285Z     scf.for %arg3 = %0 to %c224_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:21.9213503Z       %19 = arith.divsi %arg3, %c224_i32 : i32
2026-02-21T09:37:21.9213677Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T09:37:21.9213857Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T09:37:21.9214024Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T09:37:21.9214200Z       %23 = arith.remsi %arg3, %c224_i32 : i32
2026-02-21T09:37:21.9214380Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:21.9214543Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:21.9214705Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:21.9214867Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T09:37:21.9215116Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:21.9215420Z       %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:21.9215729Z       %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:21.9216080Z       %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:21.9216319Z       %32 = arith.muli %26, %c128_i32 : i32
2026-02-21T09:37:21.9216625Z       %33 = tt.splat %32 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.9217014Z       %34 = tt.splat %32 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:21.9217368Z       %35 = arith.addi %33, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:21.9217700Z       %36 = arith.addi %34, %4 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:21.9218003Z       %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:21.9218283Z       %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:21.9218507Z       %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9218905Z       %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9219352Z       %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x128xf32, #mma>)  : i32 {
2026-02-21T09:37:21.9219601Z         %72 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:21.9219791Z         %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.9220040Z         %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.9220344Z         %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:21.9220656Z         %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9220879Z         %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9221103Z         %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9221375Z         %79 = tt.load %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:21.9221672Z         %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9222129Z         %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9222454Z         %82 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:21.9222656Z         %83 = tt.splat %82 : i32 -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9222918Z         %84 = arith.addi %83, %40 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9223269Z         %85 = tt.addptr %7, %84 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9223622Z         %86 = tt.load %85 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9223885Z         %87 = arith.shli %86, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9224151Z         %88 = arith.shrsi %87, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9224417Z         %89 = arith.shrsi %86, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9224745Z         %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:21.9225160Z         %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:21.9225484Z         %92 = tt.broadcast %90 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9225756Z         %93 = arith.select %12, %92, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9226030Z         %94 = tt.broadcast %91 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9226295Z         %95 = arith.select %14, %94, %93 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9226550Z         %96 = tt.reshape %95 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2>
2026-02-21T09:37:21.9226806Z         %97 = arith.sitofp %96 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2>
2026-02-21T09:37:21.9227090Z         %98 = ttg.local_alloc %97 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared, #smem>
2026-02-21T09:37:21.9227448Z         %99 = ttg.local_load %98 : !ttg.memdesc<2x128xf32, #shared, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9227935Z         %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:37:21.9228285Z         %101 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:21.9228416Z         %102 = arith.muli %101, %c2_i32 : i32
2026-02-21T09:37:21.9228589Z         %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.9228818Z         %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.9229100Z         %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:21.9229378Z         %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9229579Z         %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9229782Z         %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9230045Z         %109 = tt.load %108 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:21.9230315Z         %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9230715Z         %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9231001Z         %112 = arith.muli %101, %c7168_i32 : i32
2026-02-21T09:37:21.9231182Z         %113 = tt.splat %112 : i32 -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9231422Z         %114 = arith.addi %113, %40 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9231746Z         %115 = tt.addptr %7, %114 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9232069Z         %116 = tt.load %115 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9232311Z         %117 = arith.shli %116, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9232552Z         %118 = arith.shrsi %117, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9232799Z         %119 = arith.shrsi %116, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9233098Z         %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:21.9233473Z         %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:21.9233766Z         %122 = tt.broadcast %120 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9234021Z         %123 = arith.select %12, %122, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9234265Z         %124 = tt.broadcast %121 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9234512Z         %125 = arith.select %14, %124, %123 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9234749Z         %126 = tt.reshape %125 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2>
2026-02-21T09:37:21.9234986Z         %127 = arith.sitofp %126 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2>
2026-02-21T09:37:21.9235251Z         %128 = ttg.local_alloc %127 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared, #smem>
2026-02-21T09:37:21.9235582Z         %129 = ttg.local_load %128 : !ttg.memdesc<2x128xf32, #shared, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9236063Z         %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:37:21.9236417Z         %131 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:21.9236547Z         %132 = arith.muli %131, %c2_i32 : i32
2026-02-21T09:37:21.9236725Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.9236950Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:21.9237233Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:21.9237511Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9237717Z         %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9237956Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9238164Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:21.9238433Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9238834Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9239120Z         %142 = arith.muli %131, %c7168_i32 : i32
2026-02-21T09:37:21.9239306Z         %143 = tt.splat %142 : i32 -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9239540Z         %144 = arith.addi %143, %40 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9239859Z         %145 = tt.addptr %7, %144 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9240175Z         %146 = tt.load %145 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9240413Z         %147 = arith.shli %146, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9240656Z         %148 = arith.shrsi %147, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9240893Z         %149 = arith.shrsi %146, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9241189Z         %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:21.9241556Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:21.9241846Z         %152 = tt.broadcast %150 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9242097Z         %153 = arith.select %12, %152, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9242339Z         %154 = tt.broadcast %151 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9242615Z         %155 = arith.select %14, %154, %153 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9242854Z         %156 = tt.reshape %155 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2>
2026-02-21T09:37:21.9243086Z         %157 = arith.sitofp %156 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2>
2026-02-21T09:37:21.9243340Z         %158 = ttg.local_alloc %157 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared, #smem>
2026-02-21T09:37:21.9243674Z         %159 = ttg.local_load %158 : !ttg.memdesc<2x128xf32, #shared, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9244144Z         %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:37:21.9251639Z         scf.yield %160 : tensor<16x128xf32, #mma>
2026-02-21T09:37:21.9251772Z       } {tt.flatten}
2026-02-21T09:37:21.9251900Z       %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9252105Z       %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:21.9252307Z       %44 = tt.load %43 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:21.9252572Z       %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9252982Z       %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9253383Z       %47 = arith.addi %40, %cst_1 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9253694Z       %48 = tt.addptr %7, %47 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9254002Z       %49 = tt.load %48 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9254232Z       %50 = arith.shli %49, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9254460Z       %51 = arith.shrsi %50, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9254689Z       %52 = arith.shrsi %49, %cst_4 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:21.9254974Z       %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:21.9255310Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:21.9255591Z       %55 = tt.broadcast %53 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9255824Z       %56 = arith.select %12, %55, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9256058Z       %57 = tt.broadcast %54 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9256288Z       %58 = arith.select %14, %57, %56 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:21.9256513Z       %59 = tt.reshape %58 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2>
2026-02-21T09:37:21.9256785Z       %60 = arith.sitofp %59 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2>
2026-02-21T09:37:21.9257031Z       %61 = ttg.local_alloc %60 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared, #smem>
2026-02-21T09:37:21.9257357Z       %62 = ttg.local_load %61 : !ttg.memdesc<2x128xf32, #shared, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:21.9257820Z       %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:37:21.9258201Z       %64 = arith.truncf %63 : tensor<16x128xf32, #mma> to tensor<16x128xbf16, #mma>
2026-02-21T09:37:21.9258462Z       %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:21.9258695Z       %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:21.9258931Z       %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:37:21.9259194Z       %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x128xi32, #mma>
2026-02-21T09:37:21.9259391Z       %69 = tt.broadcast %67 : tensor<1x128xi32, #mma> -> tensor<16x128xi32, #mma>
2026-02-21T09:37:21.9259567Z       %70 = arith.addi %68, %69 : tensor<16x128xi32, #mma>
2026-02-21T09:37:21.9259751Z       %71 = tt.addptr %15, %70 : tensor<16x128x!tt.ptr<bf16>, #mma>, tensor<16x128xi32, #mma>
2026-02-21T09:37:21.9259945Z       tt.store %71, %64 : tensor<16x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:21.9260085Z     } {tt.num_stages = 2 : i32}
2026-02-21T09:37:21.9260185Z     tt.return
2026-02-21T09:37:21.9260269Z   }
2026-02-21T09:37:21.9260341Z }
2026-02-21T09:37:21.9260385Z 
2026-02-21T09:37:21.9260418Z {-#
2026-02-21T09:37:21.9260498Z   external_resources: {
2026-02-21T09:37:21.9260600Z     mlir_reproducer: {
2026-02-21T09:37:21.9261599Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:21.9262634Z       disable_threading: false,
2026-02-21T09:37:21.9262739Z       verify_each: true
2026-02-21T09:37:21.9262831Z     }
2026-02-21T09:37:21.9262901Z   }
2026-02-21T09:37:21.9262972Z #-}
2026-02-21T09:37:21.9263246Z /tmp/torchinductor_root/zz/czz3su6t24flt4g6nymr7ggtobvnfesbvw7h27sy3zis4is7fcfx.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:21.9263926Z /tmp/torchinductor_root/zz/czz3su6t24flt4g6nymr7ggtobvnfesbvw7h27sy3zis4is7fcfx.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:21.9264481Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:21.9265265Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:37:21.9265981Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:21.9266188Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:22.1683307Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:22.1686955Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}>
2026-02-21T09:37:22.1687470Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:37:22.1687887Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}>
2026-02-21T09:37:22.1688266Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:22.1688596Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:22.1688837Z #smem = #ttg.shared_memory
2026-02-21T09:37:22.1689136Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:22.1689755Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:22.1690278Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T09:37:22.1690491Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:22.1690645Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:22.1690798Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:22.1690942Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:22.1691130Z     %cst_0 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1691326Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:22.1691483Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:37:22.1691632Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:22.1691928Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:22.1692074Z     %c112_i32 = arith.constant 112 : i32
2026-02-21T09:37:22.1692223Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:22.1692375Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:22.1692630Z     %cst_1 = arith.constant dense<29352960> : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1692987Z     %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.1693277Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.1693558Z     %cst_4 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1693841Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.1694064Z     %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.1694285Z     %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.1694477Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:22.1694733Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.1695088Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.1695525Z     %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:22.1695932Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.1696276Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.1696656Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.1696969Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1697357Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:22.1697808Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:22.1698243Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.1698519Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.1698752Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:22.1698972Z     %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.1699175Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:22.1699407Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.1699639Z     %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.1699942Z     %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.1700238Z     %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1700447Z     scf.for %arg3 = %0 to %c112_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:22.1700611Z       %19 = arith.divsi %arg3, %c112_i32 : i32
2026-02-21T09:37:22.1700745Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T09:37:22.1700874Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T09:37:22.1701002Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T09:37:22.1701138Z       %23 = arith.remsi %arg3, %c112_i32 : i32
2026-02-21T09:37:22.1701270Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:22.1701445Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:22.1701569Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:22.1701690Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T09:37:22.1701874Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.1702106Z       %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.1702335Z       %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.1702564Z       %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.1702818Z       %32 = arith.muli %26, %c256_i32 : i32
2026-02-21T09:37:22.1703047Z       %33 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:22.1703324Z       %34 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.1703611Z       %35 = arith.addi %33, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:22.1703887Z       %36 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.1704178Z       %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.1704450Z       %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.1704657Z       %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1705048Z       %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1705544Z       %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T09:37:22.1705784Z         %72 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:22.1705972Z         %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.1706215Z         %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.1706513Z         %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.1706816Z         %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1707024Z         %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1707241Z         %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1707447Z         %79 = tt.load %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.1707714Z         %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1708118Z         %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1708402Z         %82 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:22.1708579Z         %83 = tt.splat %82 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1708807Z         %84 = arith.addi %83, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1709115Z         %85 = tt.addptr %7, %84 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1709430Z         %86 = tt.load %85 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1709655Z         %87 = arith.shli %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1709927Z         %88 = arith.shrsi %87, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1710158Z         %89 = arith.shrsi %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1710438Z         %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.1710774Z         %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.1711052Z         %92 = tt.broadcast %90 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1711291Z         %93 = arith.select %12, %92, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1711525Z         %94 = tt.broadcast %91 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1711755Z         %95 = arith.select %14, %94, %93 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1711981Z         %96 = tt.reshape %95 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.1712197Z         %97 = arith.sitofp %96 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.1712447Z         %98 = ttg.local_alloc %97 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.1712771Z         %99 = ttg.local_load %98 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1713273Z         %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.1713622Z         %101 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:22.1713748Z         %102 = arith.muli %101, %c2_i32 : i32
2026-02-21T09:37:22.1713919Z         %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.1714142Z         %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.1714416Z         %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.1714692Z         %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1714883Z         %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1715083Z         %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1715293Z         %109 = tt.load %108 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.1715562Z         %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1715969Z         %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1716252Z         %112 = arith.muli %101, %c7168_i32 : i32
2026-02-21T09:37:22.1716432Z         %113 = tt.splat %112 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1716666Z         %114 = arith.addi %113, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1716985Z         %115 = tt.addptr %7, %114 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1717302Z         %116 = tt.load %115 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1717539Z         %117 = arith.shli %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1717810Z         %118 = arith.shrsi %117, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1718047Z         %119 = arith.shrsi %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1718411Z         %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.1718752Z         %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.1719044Z         %122 = tt.broadcast %120 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1719288Z         %123 = arith.select %12, %122, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1719529Z         %124 = tt.broadcast %121 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1719767Z         %125 = arith.select %14, %124, %123 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1719998Z         %126 = tt.reshape %125 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.1720223Z         %127 = arith.sitofp %126 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.1720473Z         %128 = ttg.local_alloc %127 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.1720796Z         %129 = ttg.local_load %128 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1721301Z         %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.1721645Z         %131 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:22.1721769Z         %132 = arith.muli %131, %c2_i32 : i32
2026-02-21T09:37:22.1721939Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.1722163Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.1722439Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.1722752Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1722948Z         %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1723144Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1723351Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.1723611Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1724011Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1724288Z         %142 = arith.muli %131, %c7168_i32 : i32
2026-02-21T09:37:22.1724461Z         %143 = tt.splat %142 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1724691Z         %144 = arith.addi %143, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1725000Z         %145 = tt.addptr %7, %144 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1725307Z         %146 = tt.load %145 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1725582Z         %147 = arith.shli %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1725814Z         %148 = arith.shrsi %147, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1726049Z         %149 = arith.shrsi %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1726340Z         %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.1726673Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.1726961Z         %152 = tt.broadcast %150 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1727203Z         %153 = arith.select %12, %152, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1727442Z         %154 = tt.broadcast %151 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1727682Z         %155 = arith.select %14, %154, %153 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1727910Z         %156 = tt.reshape %155 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.1728135Z         %157 = arith.sitofp %156 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.1728381Z         %158 = ttg.local_alloc %157 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.1728702Z         %159 = ttg.local_load %158 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1729205Z         %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.1729552Z         scf.yield %160 : tensor<16x256xf32, #mma>
2026-02-21T09:37:22.1729676Z       } {tt.flatten}
2026-02-21T09:37:22.1729794Z       %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1729989Z       %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.1730191Z       %44 = tt.load %43 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.1730448Z       %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1730842Z       %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1731167Z       %47 = arith.addi %40, %cst_1 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1731478Z       %48 = tt.addptr %7, %47 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1731782Z       %49 = tt.load %48 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1732007Z       %50 = arith.shli %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1732235Z       %51 = arith.shrsi %50, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1732463Z       %52 = arith.shrsi %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.1732746Z       %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.1733079Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.1733356Z       %55 = tt.broadcast %53 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1733635Z       %56 = arith.select %12, %55, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1733937Z       %57 = tt.broadcast %54 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1734164Z       %58 = arith.select %14, %57, %56 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.1734389Z       %59 = tt.reshape %58 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.1734605Z       %60 = arith.sitofp %59 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.1734855Z       %61 = ttg.local_alloc %60 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.1735173Z       %62 = ttg.local_load %61 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.1735627Z       %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.1736009Z       %64 = arith.truncf %63 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T09:37:22.1736268Z       %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:22.1736510Z       %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.1736743Z       %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:37:22.1736999Z       %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:22.1737239Z       %69 = tt.broadcast %67 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:22.1737415Z       %70 = arith.addi %68, %69 : tensor<16x256xi32, #mma>
2026-02-21T09:37:22.1737600Z       %71 = tt.addptr %15, %70 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi32, #mma>
2026-02-21T09:37:22.1737793Z       tt.store %71, %64 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.1737929Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:37:22.1738035Z     tt.return
2026-02-21T09:37:22.1738115Z   }
2026-02-21T09:37:22.1738193Z }
2026-02-21T09:37:22.1738237Z 
2026-02-21T09:37:22.1738269Z {-#
2026-02-21T09:37:22.1738350Z   external_resources: {
2026-02-21T09:37:22.1738449Z     mlir_reproducer: {
2026-02-21T09:37:22.1739452Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:22.1740437Z       disable_threading: false,
2026-02-21T09:37:22.1740546Z       verify_each: true
2026-02-21T09:37:22.1740635Z     }
2026-02-21T09:37:22.1740709Z   }
2026-02-21T09:37:22.1740780Z #-}
2026-02-21T09:37:22.1741055Z /tmp/torchinductor_root/5l/c5lkmmdndf4o3wrzqyrbnjortfj7bf4m24krxczkgzkiw6fnwv6l.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:22.1741730Z /tmp/torchinductor_root/5l/c5lkmmdndf4o3wrzqyrbnjortfj7bf4m24krxczkgzkiw6fnwv6l.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:22.1742281Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:22.1743102Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[1, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:37:22.1743813Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:22.1743980Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:22.3142701Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:22.3146565Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}>
2026-02-21T09:37:22.3147449Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:37:22.3147927Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:37:22.3148296Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}>
2026-02-21T09:37:22.3148646Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:22.3148968Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:22.3149195Z #smem = #ttg.shared_memory
2026-02-21T09:37:22.3149577Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:22.3150183Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:22.3150663Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.3150888Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.3151107Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.3151337Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T09:37:22.3151573Z     %cst_3 = arith.constant dense<7168> : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:22.3151796Z     %cst_4 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2>
2026-02-21T09:37:22.3151987Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:22.3152138Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:37:22.3152292Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:22.3152437Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:22.3152580Z     %c112_i32 = arith.constant 112 : i32
2026-02-21T09:37:22.3152732Z     %c4092_i32 = arith.constant 4092 : i32
2026-02-21T09:37:22.3152878Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:37:22.3153058Z     %cst_5 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3153245Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:37:22.3153391Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:22.3153530Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:22.3153764Z     %cst_6 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3154003Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:22.3154250Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:22.3154595Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.3154989Z     %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.3155331Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.3155668Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.3155997Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.3156295Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:22.3156544Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:22.3156892Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:22.3157387Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:22.3157810Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.3158076Z     %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.3158279Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:37:22.3158486Z     %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.3158683Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:37:22.3158897Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.3164425Z     scf.for %arg3 = %0 to %c112_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:22.3164595Z       %17 = arith.divsi %arg3, %c112_i32 : i32
2026-02-21T09:37:22.3164732Z       %18 = arith.muli %17, %c4_i32 : i32
2026-02-21T09:37:22.3164856Z       %19 = arith.subi %c4_i32, %18 : i32
2026-02-21T09:37:22.3164982Z       %20 = arith.minsi %19, %c4_i32 : i32
2026-02-21T09:37:22.3165113Z       %21 = arith.remsi %arg3, %c112_i32 : i32
2026-02-21T09:37:22.3165237Z       %22 = arith.remsi %21, %20 : i32
2026-02-21T09:37:22.3165358Z       %23 = arith.addi %18, %22 : i32
2026-02-21T09:37:22.3165475Z       %24 = arith.divsi %21, %20 : i32
2026-02-21T09:37:22.3165595Z       %25 = arith.muli %23, %c16_i32 : i32
2026-02-21T09:37:22.3165773Z       %26 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:22.3166001Z       %27 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.3166230Z       %28 = arith.addi %26, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:22.3166449Z       %29 = arith.addi %27, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.3166625Z       %30 = arith.muli %24, %c256_i32 : i32
2026-02-21T09:37:22.3166799Z       %31 = tt.splat %30 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.3167023Z       %32 = tt.splat %30 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.3167242Z       %33 = arith.addi %31, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.3167478Z       %34 = arith.addi %32, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.3167744Z       %35 = tt.expand_dims %28 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T09:37:22.3167997Z       %36 = arith.muli %35, %cst_4 : tensor<16x1xi32, #blocked2>
2026-02-21T09:37:22.3168188Z       %37 = tt.broadcast %36 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3168474Z       %38 = tt.expand_dims %33 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:37:22.3168801Z       %39 = tt.broadcast %38 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3169072Z       %40 = scf.for %arg4 = %c0_i32 to %c4092_i32 step %c6_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T09:37:22.3169346Z         %50 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.3169568Z         %51 = arith.addi %50, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.3169744Z         %52 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:22.3169914Z         %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.3170130Z         %54 = arith.addi %53, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.3170403Z         %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:22.3170680Z         %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3170878Z         %57 = arith.addi %37, %56 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3171079Z         %58 = tt.addptr %7, %57 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3171280Z         %59 = tt.load %58 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:22.3171546Z         %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3171980Z         %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3172360Z         %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:22.3172608Z         %63 = arith.muli %62, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:22.3172803Z         %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3172998Z         %65 = arith.addi %64, %39 : tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3173193Z         %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3173392Z         %67 = tt.load %66 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:22.3173634Z         %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3173913Z         %69 = arith.shli %68, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3174151Z         %70 = arith.shrsi %69, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3174384Z         %71 = arith.shrsi %68, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3174674Z         %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:37:22.3175012Z         %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:37:22.3175294Z         %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3175536Z         %75 = arith.select %13, %74, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3175772Z         %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3176005Z         %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3176237Z         %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:37:22.3176501Z         %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:37:22.3176755Z         %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem>
2026-02-21T09:37:22.3177076Z         %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3177545Z         %82 = tt.dot %61, %81, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.3177889Z         %83 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:22.3178058Z         %84 = tt.splat %83 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.3178276Z         %85 = arith.addi %84, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.3178449Z         %86 = arith.muli %83, %c2_i32 : i32
2026-02-21T09:37:22.3178611Z         %87 = tt.splat %86 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.3178821Z         %88 = arith.addi %87, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.3179089Z         %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:22.3179358Z         %90 = tt.broadcast %89 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3179547Z         %91 = arith.addi %37, %90 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3179744Z         %92 = tt.addptr %7, %91 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3179975Z         %93 = tt.load %92 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:22.3180233Z         %94 = ttg.convert_layout %93 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3180625Z         %95 = arith.extf %94 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3180998Z         %96 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:22.3181238Z         %97 = arith.muli %96, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:22.3181427Z         %98 = tt.broadcast %97 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3181614Z         %99 = arith.addi %98, %39 : tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3181814Z         %100 = tt.addptr %8, %99 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3182020Z         %101 = tt.load %100 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:22.3182270Z         %102 = ttg.convert_layout %101 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3182553Z         %103 = arith.shli %102, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3182790Z         %104 = arith.shrsi %103, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3183027Z         %105 = arith.shrsi %102, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3183315Z         %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:37:22.3183658Z         %107 = tt.expand_dims %105 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:37:22.3183951Z         %108 = tt.broadcast %106 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3184228Z         %109 = arith.select %13, %108, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3184467Z         %110 = tt.broadcast %107 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3184702Z         %111 = arith.select %15, %110, %109 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3184937Z         %112 = tt.reshape %111 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:37:22.3185169Z         %113 = arith.sitofp %112 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:37:22.3185426Z         %114 = ttg.local_alloc %113 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem>
2026-02-21T09:37:22.3185755Z         %115 = ttg.local_load %114 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3186223Z         %116 = tt.dot %95, %115, %82, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.3186568Z         %117 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:37:22.3186744Z         %118 = tt.splat %117 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.3186966Z         %119 = arith.addi %118, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.3187144Z         %120 = arith.muli %117, %c2_i32 : i32
2026-02-21T09:37:22.3187313Z         %121 = tt.splat %120 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.3187532Z         %122 = arith.addi %121, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.3187848Z         %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:22.3188124Z         %124 = tt.broadcast %123 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3188319Z         %125 = arith.addi %37, %124 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3188520Z         %126 = tt.addptr %7, %125 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3188726Z         %127 = tt.load %126 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:22.3188992Z         %128 = ttg.convert_layout %127 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3189384Z         %129 = arith.extf %128 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3189769Z         %130 = tt.expand_dims %119 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:22.3190020Z         %131 = arith.muli %130, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:22.3190213Z         %132 = tt.broadcast %131 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3190408Z         %133 = arith.addi %132, %39 : tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3190605Z         %134 = tt.addptr %8, %133 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3190811Z         %135 = tt.load %134 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:22.3191056Z         %136 = ttg.convert_layout %135 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3191339Z         %137 = arith.shli %136, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3191579Z         %138 = arith.shrsi %137, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3191812Z         %139 = arith.shrsi %136, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3192138Z         %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:37:22.3192474Z         %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:37:22.3192759Z         %142 = tt.broadcast %140 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3193004Z         %143 = arith.select %13, %142, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3193242Z         %144 = tt.broadcast %141 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3193477Z         %145 = arith.select %15, %144, %143 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3193711Z         %146 = tt.reshape %145 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:37:22.3193939Z         %147 = arith.sitofp %146 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:37:22.3194192Z         %148 = ttg.local_alloc %147 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem>
2026-02-21T09:37:22.3194515Z         %149 = ttg.local_load %148 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3194985Z         %150 = tt.dot %129, %149, %116, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.3195332Z         scf.yield %150 : tensor<16x256xf32, #mma>
2026-02-21T09:37:22.3195450Z       }
2026-02-21T09:37:22.3195660Z       %41 = scf.for %arg4 = %c4092_i32 to %c4096_i32 step %c2_i32 iter_args(%arg5 = %40) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T09:37:22.3195926Z         %50 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.3196154Z         %51 = arith.addi %50, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.3196329Z         %52 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:22.3196493Z         %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.3196709Z         %54 = arith.addi %53, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.3196979Z         %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:22.3197253Z         %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3197446Z         %57 = arith.addi %37, %56 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3197642Z         %58 = tt.addptr %7, %57 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:22.3197844Z         %59 = tt.load %58 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:22.3198107Z         %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3198508Z         %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3198890Z         %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:22.3199132Z         %63 = arith.muli %62, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:22.3199322Z         %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3199512Z         %65 = arith.addi %64, %39 : tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3199705Z         %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:37:22.3199940Z         %67 = tt.load %66 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:22.3200179Z         %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3200455Z         %69 = arith.shli %68, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3200683Z         %70 = arith.shrsi %69, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3200915Z         %71 = arith.shrsi %68, %cst_6 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.3201200Z         %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:37:22.3201533Z         %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:37:22.3201823Z         %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3202057Z         %75 = arith.select %13, %74, %cst_5 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3202292Z         %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3202521Z         %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:37:22.3202769Z         %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:37:22.3202990Z         %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:37:22.3203402Z         %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared, #smem>
2026-02-21T09:37:22.3203724Z         %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.3204192Z         %82 = tt.dot %61, %81, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.3204533Z         scf.yield %82 : tensor<16x256xf32, #mma>
2026-02-21T09:37:22.3204660Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:37:22.3204816Z       %42 = arith.truncf %41 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T09:37:22.3205073Z       %43 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:22.3205302Z       %44 = arith.muli %43, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.3205533Z       %45 = tt.expand_dims %34 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:37:22.3205789Z       %46 = tt.broadcast %44 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:22.3205988Z       %47 = tt.broadcast %45 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:22.3206163Z       %48 = arith.addi %46, %47 : tensor<16x256xi32, #mma>
2026-02-21T09:37:22.3206347Z       %49 = tt.addptr %16, %48 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi32, #mma>
2026-02-21T09:37:22.3206534Z       tt.store %49, %42 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.3206671Z     } {tt.num_stages = 2 : i32}
2026-02-21T09:37:22.3206775Z     tt.return
2026-02-21T09:37:22.3206856Z   }
2026-02-21T09:37:22.3206927Z }
2026-02-21T09:37:22.3206971Z 
2026-02-21T09:37:22.3207001Z {-#
2026-02-21T09:37:22.3207082Z   external_resources: {
2026-02-21T09:37:22.3207179Z     mlir_reproducer: {
2026-02-21T09:37:22.3208175Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:22.3209193Z       disable_threading: false,
2026-02-21T09:37:22.3209299Z       verify_each: true
2026-02-21T09:37:22.3209390Z     }
2026-02-21T09:37:22.3209461Z   }
2026-02-21T09:37:22.3209532Z #-}
2026-02-21T09:37:22.3209808Z /tmp/torchinductor_root/4z/c4znhgsqfporwgfho3bxq3jlqbwbxoddlldndzgjqz64fqow6iws.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:22.3210498Z /tmp/torchinductor_root/4z/c4znhgsqfporwgfho3bxq3jlqbwbxoddlldndzgjqz64fqow6iws.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:22.3211055Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:22.3211835Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:37:22.3212545Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:22.3212750Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:22.5877601Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:22.5881261Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}>
2026-02-21T09:37:22.5882208Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:37:22.5883120Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}>
2026-02-21T09:37:22.5883896Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:22.5884600Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:22.5885109Z #smem = #ttg.shared_memory
2026-02-21T09:37:22.5885741Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:22.5887191Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:22.5887980Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.5888291Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.5888591Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.5888900Z     %cst_2 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.5889164Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:22.5889423Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T09:37:22.5889803Z     %cst_4 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.5890178Z     %cst_5 = arith.constant dense<29352960> : tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5890539Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:22.5890735Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:22.5890936Z     %c112_i32 = arith.constant 112 : i32
2026-02-21T09:37:22.5891130Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:22.5891318Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:37:22.5891504Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:22.5891697Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:37:22.5891893Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:22.5892138Z     %cst_6 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5892387Z     %c224_i32 = arith.constant 224 : i32
2026-02-21T09:37:22.5892577Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:22.5892764Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:22.5892947Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:22.5893255Z     %cst_7 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5893581Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:22.5893912Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.5894380Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.5894824Z     %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.5895285Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.5895743Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.5896222Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.5896559Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:22.5897022Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:22.5897605Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:22.5898109Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.5898426Z     %11 = arith.cmpi eq, %10, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.5898673Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:22.5898923Z     %13 = arith.cmpi eq, %10, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.5899160Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:22.5899423Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.5899687Z     %16 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.5900034Z     %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.5900372Z     %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5900618Z     scf.for %arg3 = %0 to %c112_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:22.5900804Z       %19 = arith.divsi %arg3, %c224_i32 : i32
2026-02-21T09:37:22.5900958Z       %20 = arith.muli %19, %c8_i32 : i32
2026-02-21T09:37:22.5901107Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T09:37:22.5901281Z       %22 = arith.minsi %21, %c8_i32 : i32
2026-02-21T09:37:22.5901432Z       %23 = arith.remsi %arg3, %c224_i32 : i32
2026-02-21T09:37:22.5901627Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:22.5901768Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:22.5901910Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:22.5902052Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T09:37:22.5902263Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.5902530Z       %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.5902794Z       %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.5903056Z       %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.5903259Z       %32 = arith.muli %26, %c256_i32 : i32
2026-02-21T09:37:22.5903462Z       %33 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.5903729Z       %34 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.5903998Z       %35 = arith.addi %33, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.5904262Z       %36 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.5904606Z       %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.5904923Z       %38 = arith.muli %37, %cst_2 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.5905163Z       %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5905513Z       %40 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5905958Z       %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst_3) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T09:37:22.5906234Z         %73 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:22.5906454Z         %74 = tt.splat %73 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.5906731Z         %75 = arith.addi %74, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.5907070Z         %76 = tt.expand_dims %75 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.5907415Z         %77 = tt.broadcast %76 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5907654Z         %78 = arith.addi %39, %77 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5907906Z         %79 = tt.addptr %6, %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5908146Z         %80 = tt.load %79 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.5908417Z         %81 = ttg.convert_layout %80 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5908824Z         %82 = arith.extf %81 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5909107Z         %83 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:22.5909254Z         %84 = tt.splat %83 : i32 -> tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5909414Z         %85 = arith.addi %84, %40 : tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5909608Z         %86 = tt.addptr %7, %85 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5909809Z         %87 = tt.load %86 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:22.5910053Z         %88 = ttg.convert_layout %87 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5910340Z         %89 = arith.shli %88, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5910576Z         %90 = arith.shrsi %89, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5910845Z         %91 = arith.shrsi %88, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5911139Z         %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.5911479Z         %93 = tt.expand_dims %91 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.5911769Z         %94 = tt.broadcast %92 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5912014Z         %95 = arith.select %12, %94, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5912254Z         %96 = tt.broadcast %93 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5912486Z         %97 = arith.select %14, %96, %95 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5912720Z         %98 = tt.reshape %97 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.5912944Z         %99 = arith.sitofp %98 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.5913201Z         %100 = ttg.local_alloc %99 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.5913533Z         %101 = ttg.local_load %100 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5914019Z         %102 = tt.dot %82, %101, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.5914416Z         %103 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:22.5914543Z         %104 = arith.muli %103, %c2_i32 : i32
2026-02-21T09:37:22.5914722Z         %105 = tt.splat %104 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.5914955Z         %106 = arith.addi %105, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.5915237Z         %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.5915514Z         %108 = tt.broadcast %107 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5915714Z         %109 = arith.addi %39, %108 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5915919Z         %110 = tt.addptr %6, %109 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5916128Z         %111 = tt.load %110 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.5916401Z         %112 = ttg.convert_layout %111 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5916807Z         %113 = arith.extf %112 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5917096Z         %114 = arith.muli %103, %c7168_i32 : i32
2026-02-21T09:37:22.5917247Z         %115 = tt.splat %114 : i32 -> tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5917405Z         %116 = arith.addi %115, %40 : tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5917601Z         %117 = tt.addptr %7, %116 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5917804Z         %118 = tt.load %117 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:22.5918047Z         %119 = ttg.convert_layout %118 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5918330Z         %120 = arith.shli %119, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5918596Z         %121 = arith.shrsi %120, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5918836Z         %122 = arith.shrsi %119, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5919125Z         %123 = tt.expand_dims %121 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.5919465Z         %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.5919754Z         %125 = tt.broadcast %123 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5919996Z         %126 = arith.select %12, %125, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5920241Z         %127 = tt.broadcast %124 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5920477Z         %128 = arith.select %14, %127, %126 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5920715Z         %129 = tt.reshape %128 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.5920943Z         %130 = arith.sitofp %129 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.5921197Z         %131 = ttg.local_alloc %130 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.5921524Z         %132 = ttg.local_load %131 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5922023Z         %133 = tt.dot %113, %132, %102, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.5922365Z         %134 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:22.5922489Z         %135 = arith.muli %134, %c2_i32 : i32
2026-02-21T09:37:22.5922701Z         %136 = tt.splat %135 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.5922925Z         %137 = arith.addi %136, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.5923202Z         %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.5923475Z         %139 = tt.broadcast %138 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5923671Z         %140 = arith.addi %39, %139 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5923868Z         %141 = tt.addptr %6, %140 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5924079Z         %142 = tt.load %141 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.5924342Z         %143 = ttg.convert_layout %142 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5924745Z         %144 = arith.extf %143 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5925030Z         %145 = arith.muli %134, %c7168_i32 : i32
2026-02-21T09:37:22.5925173Z         %146 = tt.splat %145 : i32 -> tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5925334Z         %147 = arith.addi %146, %40 : tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5925529Z         %148 = tt.addptr %7, %147 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5925735Z         %149 = tt.load %148 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:22.5925983Z         %150 = ttg.convert_layout %149 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5926265Z         %151 = arith.shli %150, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5926545Z         %152 = arith.shrsi %151, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5926779Z         %153 = arith.shrsi %150, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5927073Z         %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.5927412Z         %155 = tt.expand_dims %153 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.5927696Z         %156 = tt.broadcast %154 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5927940Z         %157 = arith.select %12, %156, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5928183Z         %158 = tt.broadcast %155 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5928420Z         %159 = arith.select %14, %158, %157 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5928653Z         %160 = tt.reshape %159 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.5928878Z         %161 = arith.sitofp %160 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.5929136Z         %162 = ttg.local_alloc %161 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.5929464Z         %163 = ttg.local_load %162 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5929970Z         %164 = tt.dot %144, %163, %133, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.5930318Z         scf.yield %164 : tensor<16x256xf32, #mma>
2026-02-21T09:37:22.5930443Z       } {tt.flatten}
2026-02-21T09:37:22.5930563Z       %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5930761Z       %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.5930961Z       %44 = tt.load %43 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.5931220Z       %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5931615Z       %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5931914Z       %47 = arith.addi %40, %cst_5 : tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5932114Z       %48 = tt.addptr %7, %47 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi32, #blocked2>
2026-02-21T09:37:22.5932310Z       %49 = tt.load %48 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:22.5932552Z       %50 = ttg.convert_layout %49 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5932829Z       %51 = arith.shli %50, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5933061Z       %52 = arith.shrsi %51, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5933291Z       %53 = arith.shrsi %50, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.5933573Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.5933910Z       %55 = tt.expand_dims %53 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.5934190Z       %56 = tt.broadcast %54 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5934457Z       %57 = arith.select %12, %56, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5934690Z       %58 = tt.broadcast %55 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5934916Z       %59 = arith.select %14, %58, %57 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.5935141Z       %60 = tt.reshape %59 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.5935358Z       %61 = arith.sitofp %60 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.5935603Z       %62 = ttg.local_alloc %61 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.5935923Z       %63 = ttg.local_load %62 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.5936378Z       %64 = tt.dot %46, %63, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.5936762Z       %65 = arith.truncf %64 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T09:37:22.5937021Z       %66 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:22.5937251Z       %67 = arith.muli %66, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.5937482Z       %68 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:37:22.5937737Z       %69 = tt.broadcast %67 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:22.5937966Z       %70 = tt.broadcast %68 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:22.5938138Z       %71 = arith.addi %69, %70 : tensor<16x256xi32, #mma>
2026-02-21T09:37:22.5938325Z       %72 = tt.addptr %15, %71 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi32, #mma>
2026-02-21T09:37:22.5938516Z       tt.store %72, %65 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.5938650Z     } {tt.num_stages = 2 : i32}
2026-02-21T09:37:22.5938756Z     tt.return
2026-02-21T09:37:22.5938835Z   }
2026-02-21T09:37:22.5938909Z }
2026-02-21T09:37:22.5938951Z 
2026-02-21T09:37:22.5938982Z {-#
2026-02-21T09:37:22.5939065Z   external_resources: {
2026-02-21T09:37:22.5939161Z     mlir_reproducer: {
2026-02-21T09:37:22.5940150Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:22.5941136Z       disable_threading: false,
2026-02-21T09:37:22.5941242Z       verify_each: true
2026-02-21T09:37:22.5941331Z     }
2026-02-21T09:37:22.5941405Z   }
2026-02-21T09:37:22.5941472Z #-}
2026-02-21T09:37:22.5941749Z /tmp/torchinductor_root/6p/c6poxudocrgxudtay7fxqjmqmqdswjjmsl6bp7kfxegfvwk56ksq.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:22.5942427Z /tmp/torchinductor_root/6p/c6poxudocrgxudtay7fxqjmqmqdswjjmsl6bp7kfxegfvwk56ksq.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:22.5942973Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:22.5943786Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:37:22.5944489Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:22.5944654Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:22.6585434Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:22.6589189Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}>
2026-02-21T09:37:22.6590128Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:37:22.6590967Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}>
2026-02-21T09:37:22.6591739Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:22.6592440Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:22.6592941Z #smem = #ttg.shared_memory
2026-02-21T09:37:22.6593571Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:22.6595038Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:22.6596100Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T09:37:22.6596550Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:22.6596880Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:22.6597198Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:22.6597503Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:22.6597891Z     %cst_0 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6598235Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:22.6598437Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:37:22.6598583Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:22.6598697Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:22.6598804Z     %c112_i32 = arith.constant 112 : i32
2026-02-21T09:37:22.6598920Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:22.6599037Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:22.6599226Z     %cst_1 = arith.constant dense<29352960> : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6599493Z     %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.6599713Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.6599925Z     %cst_4 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6600137Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.6600306Z     %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.6600473Z     %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.6600625Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:22.6600824Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.6601101Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.6601458Z     %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:22.6601774Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.6602047Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.6602290Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.6602533Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6602956Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:22.6603454Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:22.6603913Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.6610674Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.6610903Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:22.6611106Z     %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.6611294Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:22.6611503Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.6611796Z     %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.6612073Z     %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.6612349Z     %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6612542Z     scf.for %arg3 = %0 to %c112_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:22.6612690Z       %19 = arith.divsi %arg3, %c112_i32 : i32
2026-02-21T09:37:22.6612815Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T09:37:22.6612932Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T09:37:22.6613049Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T09:37:22.6613168Z       %23 = arith.remsi %arg3, %c112_i32 : i32
2026-02-21T09:37:22.6613290Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:22.6613402Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:22.6613513Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:22.6613630Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T09:37:22.6613797Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.6614016Z       %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.6614249Z       %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.6614457Z       %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.6614620Z       %32 = arith.muli %26, %c256_i32 : i32
2026-02-21T09:37:22.6614828Z       %33 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:22.6615075Z       %34 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.6615323Z       %35 = arith.addi %33, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:22.6615570Z       %36 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.6615877Z       %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.6616124Z       %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.6618175Z       %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6618582Z       %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6618984Z       %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T09:37:22.6619206Z         %72 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:22.6619385Z         %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.6619609Z         %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.6619888Z         %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.6620167Z         %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6620363Z         %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6620560Z         %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6620764Z         %79 = tt.load %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.6621028Z         %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6621488Z         %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6621775Z         %82 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:22.6621953Z         %83 = tt.splat %82 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6622182Z         %84 = arith.addi %83, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6622491Z         %85 = tt.addptr %7, %84 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6622799Z         %86 = tt.load %85 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6623028Z         %87 = arith.shli %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6623259Z         %88 = arith.shrsi %87, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6623491Z         %89 = arith.shrsi %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6623780Z         %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.6624117Z         %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.6624400Z         %92 = tt.broadcast %90 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6624637Z         %93 = arith.select %12, %92, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6624874Z         %94 = tt.broadcast %91 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6625103Z         %95 = arith.select %14, %94, %93 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6625334Z         %96 = tt.reshape %95 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.6625555Z         %97 = arith.sitofp %96 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.6625842Z         %98 = ttg.local_alloc %97 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.6626161Z         %99 = ttg.local_load %98 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6626630Z         %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.6626974Z         %101 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:22.6627099Z         %102 = arith.muli %101, %c2_i32 : i32
2026-02-21T09:37:22.6627271Z         %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.6627495Z         %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.6627775Z         %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.6628048Z         %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6628243Z         %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6628442Z         %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6628648Z         %109 = tt.load %108 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.6628914Z         %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6629340Z         %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6629627Z         %112 = arith.muli %101, %c7168_i32 : i32
2026-02-21T09:37:22.6629805Z         %113 = tt.splat %112 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6630039Z         %114 = arith.addi %113, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6630349Z         %115 = tt.addptr %7, %114 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6630657Z         %116 = tt.load %115 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6630892Z         %117 = arith.shli %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6631129Z         %118 = arith.shrsi %117, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6631367Z         %119 = arith.shrsi %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6631661Z         %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.6631998Z         %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.6632284Z         %122 = tt.broadcast %120 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6632526Z         %123 = arith.select %12, %122, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6632768Z         %124 = tt.broadcast %121 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6633006Z         %125 = arith.select %14, %124, %123 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6633240Z         %126 = tt.reshape %125 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.6633502Z         %127 = arith.sitofp %126 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.6633753Z         %128 = ttg.local_alloc %127 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.6634078Z         %129 = ttg.local_load %128 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6634549Z         %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.6634888Z         %131 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:22.6635012Z         %132 = arith.muli %131, %c2_i32 : i32
2026-02-21T09:37:22.6635183Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.6635406Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.6635685Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.6635957Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6636150Z         %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6636348Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6636552Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.6636817Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6637241Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6637522Z         %142 = arith.muli %131, %c7168_i32 : i32
2026-02-21T09:37:22.6637701Z         %143 = tt.splat %142 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6637930Z         %144 = arith.addi %143, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6638244Z         %145 = tt.addptr %7, %144 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6638556Z         %146 = tt.load %145 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6638792Z         %147 = arith.shli %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6639032Z         %148 = arith.shrsi %147, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6639266Z         %149 = arith.shrsi %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6639558Z         %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.6639893Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.6640178Z         %152 = tt.broadcast %150 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6640421Z         %153 = arith.select %12, %152, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6640660Z         %154 = tt.broadcast %151 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6640897Z         %155 = arith.select %14, %154, %153 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6641131Z         %156 = tt.reshape %155 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.6641396Z         %157 = arith.sitofp %156 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.6641649Z         %158 = ttg.local_alloc %157 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.6641973Z         %159 = ttg.local_load %158 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6642443Z         %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.6642866Z         scf.yield %160 : tensor<16x256xf32, #mma>
2026-02-21T09:37:22.6642987Z       } {tt.flatten}
2026-02-21T09:37:22.6643108Z       %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6643300Z       %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.6643505Z       %44 = tt.load %43 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.6643759Z       %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6644149Z       %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6644477Z       %47 = arith.addi %40, %cst_1 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6644784Z       %48 = tt.addptr %7, %47 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6645126Z       %49 = tt.load %48 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6645351Z       %50 = arith.shli %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6645580Z       %51 = arith.shrsi %50, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6645810Z       %52 = arith.shrsi %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.6646090Z       %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.6646419Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.6646696Z       %55 = tt.broadcast %53 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6646928Z       %56 = arith.select %12, %55, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6647162Z       %57 = tt.broadcast %54 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6647389Z       %58 = arith.select %14, %57, %56 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.6647615Z       %59 = tt.reshape %58 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.6647834Z       %60 = arith.sitofp %59 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.6648077Z       %61 = ttg.local_alloc %60 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.6648393Z       %62 = ttg.local_load %61 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.6648849Z       %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.6649227Z       %64 = arith.truncf %63 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T09:37:22.6649518Z       %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:22.6649748Z       %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.6649980Z       %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:37:22.6650238Z       %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:22.6650440Z       %69 = tt.broadcast %67 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:22.6650615Z       %70 = arith.addi %68, %69 : tensor<16x256xi32, #mma>
2026-02-21T09:37:22.6650795Z       %71 = tt.addptr %15, %70 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi32, #mma>
2026-02-21T09:37:22.6650989Z       tt.store %71, %64 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.6651127Z     } {tt.num_stages = 2 : i32}
2026-02-21T09:37:22.6651234Z     tt.return
2026-02-21T09:37:22.6651315Z   }
2026-02-21T09:37:22.6651389Z }
2026-02-21T09:37:22.6651431Z 
2026-02-21T09:37:22.6651460Z {-#
2026-02-21T09:37:22.6651539Z   external_resources: {
2026-02-21T09:37:22.6651635Z     mlir_reproducer: {
2026-02-21T09:37:22.6652653Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:22.6653630Z       disable_threading: false,
2026-02-21T09:37:22.6653734Z       verify_each: true
2026-02-21T09:37:22.6653826Z     }
2026-02-21T09:37:22.6653895Z   }
2026-02-21T09:37:22.6653964Z #-}
2026-02-21T09:37:22.6654243Z /tmp/torchinductor_root/xl/cxl247ga7zjukj7rwk2nqnku4ple2jgfgavadqlguhja5hgrbtdf.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:22.6654924Z /tmp/torchinductor_root/xl/cxl247ga7zjukj7rwk2nqnku4ple2jgfgavadqlguhja5hgrbtdf.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:22.6655467Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:22.6656238Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:37:22.6656939Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:22.6657103Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:22.8531631Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:22.8535166Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}>
2026-02-21T09:37:22.8536113Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:37:22.8536953Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}>
2026-02-21T09:37:22.8537688Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:22.8538030Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:22.8538266Z #smem = #ttg.shared_memory
2026-02-21T09:37:22.8538568Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:22.8539194Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:22.8539695Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T09:37:22.8539907Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:22.8540064Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:22.8540224Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:22.8540365Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:22.8540552Z     %cst_0 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8540751Z     %c7168_i64 = arith.constant 7168 : i64
2026-02-21T09:37:22.8540902Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T09:37:22.8541054Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:37:22.8541203Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:22.8541350Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:22.8541490Z     %c112_i32 = arith.constant 112 : i32
2026-02-21T09:37:22.8541639Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:22.8541787Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:22.8542096Z     %cst_1 = arith.constant dense<29352960> : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8542447Z     %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.8542785Z     %cst_3 = arith.constant dense<0> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8543115Z     %cst_4 = arith.constant dense<7168> : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8543442Z     %cst_5 = arith.constant dense<0> : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8543743Z     %cst_6 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.8544016Z     %cst_7 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8544287Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.8544505Z     %cst_9 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.8544728Z     %cst_10 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.8544917Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:22.8545169Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.8545523Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.8545923Z     %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:22.8546323Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.8546673Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.8546980Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.8547275Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8547699Z     %8 = arith.extsi %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:22.8548188Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:22.8548622Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:22.8549043Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.8549306Z     %12 = arith.cmpi eq, %11, %cst_8 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.8549514Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:22.8549722Z     %14 = arith.cmpi eq, %11, %cst_9 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.8549924Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:22.8550144Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.8550366Z     %17 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.8550658Z     %18 = tt.expand_dims %17 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.8550942Z     %19 = tt.broadcast %18 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8551145Z     scf.for %arg3 = %0 to %c112_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:22.8551301Z       %20 = arith.divsi %arg3, %c112_i32 : i32
2026-02-21T09:37:22.8551429Z       %21 = arith.muli %20, %c4_i32 : i32
2026-02-21T09:37:22.8551584Z       %22 = arith.subi %c4_i32, %21 : i32
2026-02-21T09:37:22.8551705Z       %23 = arith.minsi %22, %c4_i32 : i32
2026-02-21T09:37:22.8551837Z       %24 = arith.remsi %arg3, %c112_i32 : i32
2026-02-21T09:37:22.8551961Z       %25 = arith.remsi %24, %23 : i32
2026-02-21T09:37:22.8552099Z       %26 = arith.addi %21, %25 : i32
2026-02-21T09:37:22.8552215Z       %27 = arith.divsi %24, %23 : i32
2026-02-21T09:37:22.8552337Z       %28 = arith.muli %26, %c16_i32 : i32
2026-02-21T09:37:22.8552513Z       %29 = tt.splat %28 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.8552740Z       %30 = tt.splat %28 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.8552962Z       %31 = arith.addi %29, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.8553183Z       %32 = arith.addi %30, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.8553357Z       %33 = arith.muli %27, %c256_i32 : i32
2026-02-21T09:37:22.8553524Z       %34 = tt.splat %33 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.8553741Z       %35 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.8554022Z       %36 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.8554285Z       %37 = arith.muli %36, %cst_6 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.8554489Z       %38 = tt.broadcast %37 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8554668Z       %39 = arith.extsi %33 : i32 to i64
2026-02-21T09:37:22.8554887Z       %40 = tt.splat %39 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:22.8555201Z       %41 = arith.addi %40, %8 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:22.8555617Z       %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8556041Z       %43 = arith.cmpi sge, %42, %cst_5 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8556297Z       %44 = arith.cmpi slt, %42, %cst_4 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8556542Z       %45 = arith.andi %43, %44 : tensor<1x256xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8556820Z       %46 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T09:37:22.8557049Z         %77 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:22.8557227Z         %78 = tt.splat %77 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.8557444Z         %79 = arith.addi %78, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.8557714Z         %80 = tt.expand_dims %79 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.8557988Z         %81 = tt.broadcast %80 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8558177Z         %82 = arith.addi %38, %81 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8558373Z         %83 = tt.addptr %6, %82 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8558572Z         %84 = tt.load %83 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.8558834Z         %85 = ttg.convert_layout %84 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8559234Z         %86 = arith.extf %85 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8559552Z         %87 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:37:22.8559673Z         %88 = arith.muli %87, %c7168_i64 : i64
2026-02-21T09:37:22.8559847Z         %89 = tt.splat %88 : i64 -> tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8560072Z         %90 = arith.addi %89, %42 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8560383Z         %91 = tt.addptr %7, %90 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8560703Z         %92 = tt.load %91, %45, %cst_3 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8560944Z         %93 = arith.shli %92, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8561174Z         %94 = arith.shrsi %93, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8561407Z         %95 = arith.shrsi %92, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8561691Z         %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.8562026Z         %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.8562306Z         %98 = tt.broadcast %96 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8562542Z         %99 = arith.select %13, %98, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8562852Z         %100 = tt.broadcast %97 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8563087Z         %101 = arith.select %15, %100, %99 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8563322Z         %102 = tt.reshape %101 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.8563551Z         %103 = arith.sitofp %102 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.8563841Z         %104 = ttg.local_alloc %103 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.8564169Z         %105 = ttg.local_load %104 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8564641Z         %106 = tt.dot %86, %105, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.8564985Z         %107 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:22.8565107Z         %108 = arith.muli %107, %c2_i32 : i32
2026-02-21T09:37:22.8565275Z         %109 = tt.splat %108 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.8565498Z         %110 = arith.addi %109, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.8565777Z         %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.8566049Z         %112 = tt.broadcast %111 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8566241Z         %113 = arith.addi %38, %112 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8566438Z         %114 = tt.addptr %6, %113 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8566642Z         %115 = tt.load %114 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.8566905Z         %116 = ttg.convert_layout %115 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8567345Z         %117 = arith.extf %116 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8567621Z         %118 = arith.extsi %107 : i32 to i64
2026-02-21T09:37:22.8567743Z         %119 = arith.muli %118, %c7168_i64 : i64
2026-02-21T09:37:22.8567918Z         %120 = tt.splat %119 : i64 -> tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8568145Z         %121 = arith.addi %120, %42 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8568454Z         %122 = tt.addptr %7, %121 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8568776Z         %123 = tt.load %122, %45, %cst_3 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8569019Z         %124 = arith.shli %123, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8569255Z         %125 = arith.shrsi %124, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8569490Z         %126 = arith.shrsi %123, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8569779Z         %127 = tt.expand_dims %125 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.8570112Z         %128 = tt.expand_dims %126 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.8570396Z         %129 = tt.broadcast %127 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8570638Z         %130 = arith.select %13, %129, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8570876Z         %131 = tt.broadcast %128 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8571110Z         %132 = arith.select %15, %131, %130 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8571340Z         %133 = tt.reshape %132 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.8571598Z         %134 = arith.sitofp %133 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.8571849Z         %135 = ttg.local_alloc %134 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.8572169Z         %136 = ttg.local_load %135 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8572633Z         %137 = tt.dot %117, %136, %106, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.8572975Z         %138 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:22.8573094Z         %139 = arith.muli %138, %c2_i32 : i32
2026-02-21T09:37:22.8573264Z         %140 = tt.splat %139 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.8573488Z         %141 = arith.addi %140, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.8573760Z         %142 = tt.expand_dims %141 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.8574036Z         %143 = tt.broadcast %142 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8574226Z         %144 = arith.addi %38, %143 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8574422Z         %145 = tt.addptr %6, %144 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8574624Z         %146 = tt.load %145 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.8574923Z         %147 = ttg.convert_layout %146 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8575318Z         %148 = arith.extf %147 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8575595Z         %149 = arith.extsi %138 : i32 to i64
2026-02-21T09:37:22.8575715Z         %150 = arith.muli %149, %c7168_i64 : i64
2026-02-21T09:37:22.8575893Z         %151 = tt.splat %150 : i64 -> tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8576125Z         %152 = arith.addi %151, %42 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8576438Z         %153 = tt.addptr %7, %152 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8576704Z         %154 = arith.cmpi slt, %149, %c4096_i64 : i64
2026-02-21T09:37:22.8576885Z         %155 = tt.splat %154 : i1 -> tensor<1x256xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8577112Z         %156 = arith.andi %155, %45 : tensor<1x256xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8577357Z         %157 = tt.load %153, %156, %cst_3 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8577603Z         %158 = arith.shli %157, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8577837Z         %159 = arith.shrsi %158, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8578073Z         %160 = arith.shrsi %157, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8578363Z         %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.8578696Z         %162 = tt.expand_dims %160 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.8578983Z         %163 = tt.broadcast %161 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8579256Z         %164 = arith.select %13, %163, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8579499Z         %165 = tt.broadcast %162 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8579738Z         %166 = arith.select %15, %165, %164 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8579972Z         %167 = tt.reshape %166 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.8580198Z         %168 = arith.sitofp %167 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.8580448Z         %169 = ttg.local_alloc %168 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.8580774Z         %170 = ttg.local_load %169 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8581236Z         %171 = tt.dot %148, %170, %137, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.8581583Z         scf.yield %171 : tensor<16x256xf32, #mma>
2026-02-21T09:37:22.8581719Z       } {tt.disallow_acc_multi_buffer, tt.flatten}
2026-02-21T09:37:22.8581871Z       %47 = arith.addi %38, %19 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8582063Z       %48 = tt.addptr %6, %47 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.8582260Z       %49 = tt.load %48 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.8582514Z       %50 = ttg.convert_layout %49 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8582934Z       %51 = arith.extf %50 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8583264Z       %52 = arith.addi %42, %cst_1 : tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8583571Z       %53 = tt.addptr %7, %52 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8583884Z       %54 = tt.load %53, %45, %cst_3 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8584118Z       %55 = arith.shli %54, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8584345Z       %56 = arith.shrsi %55, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8584571Z       %57 = arith.shrsi %54, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.8584851Z       %58 = tt.expand_dims %56 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.8585179Z       %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:22.8585454Z       %60 = tt.broadcast %58 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8585687Z       %61 = arith.select %13, %60, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8585918Z       %62 = tt.broadcast %59 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8586142Z       %63 = arith.select %15, %62, %61 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:22.8586366Z       %64 = tt.reshape %63 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:22.8586584Z       %65 = arith.sitofp %64 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:22.8586830Z       %66 = ttg.local_alloc %65 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:22.8587180Z       %67 = ttg.local_load %66 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.8587637Z       %68 = tt.dot %51, %67, %46, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:22.8588012Z       %69 = arith.truncf %68 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T09:37:22.8588271Z       %70 = tt.expand_dims %32 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:22.8588500Z       %71 = arith.muli %70, %cst_10 : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.8588734Z       %72 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:37:22.8588987Z       %73 = tt.broadcast %71 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:22.8589188Z       %74 = tt.broadcast %72 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:22.8589363Z       %75 = arith.addi %73, %74 : tensor<16x256xi32, #mma>
2026-02-21T09:37:22.8589543Z       %76 = tt.addptr %16, %75 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi32, #mma>
2026-02-21T09:37:22.8589733Z       tt.store %76, %69 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.8589867Z     } {tt.num_stages = 2 : i32}
2026-02-21T09:37:22.8589970Z     tt.return
2026-02-21T09:37:22.8590048Z   }
2026-02-21T09:37:22.8590120Z }
2026-02-21T09:37:22.8590162Z 
2026-02-21T09:37:22.8590192Z {-#
2026-02-21T09:37:22.8590271Z   external_resources: {
2026-02-21T09:37:22.8590370Z     mlir_reproducer: {
2026-02-21T09:37:22.8591389Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:22.8592379Z       disable_threading: false,
2026-02-21T09:37:22.8592484Z       verify_each: true
2026-02-21T09:37:22.8592570Z     }
2026-02-21T09:37:22.8592643Z   }
2026-02-21T09:37:22.8592712Z #-}
2026-02-21T09:37:22.8592986Z /tmp/torchinductor_root/vq/cvqpt4p6b7rcrefty43lyx4zk3ehkrwcm5jbv3au3db4fydatzse.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:22.8593661Z /tmp/torchinductor_root/vq/cvqpt4p6b7rcrefty43lyx4zk3ehkrwcm5jbv3au3db4fydatzse.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:22.8594207Z [78s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:22.8594988Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:37:22.8595704Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:22.8595872Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:22.9857107Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:22.9859571Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:37:22.9860467Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:22.9861310Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:22.9862129Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:22.9862946Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:22.9863827Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:22.9865145Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:22.9866284Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.9866795Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.9867269Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.9867776Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x128xf32, #mma>
2026-02-21T09:37:22.9868278Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.9868673Z     %cst_4 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.9869149Z     %cst_5 = arith.constant dense<29352960> : tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9869419Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:22.9869629Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:22.9869836Z     %c224_i32 = arith.constant 224 : i32
2026-02-21T09:37:22.9870032Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:22.9870229Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:22.9870425Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:37:22.9870626Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:22.9870813Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:22.9871065Z     %cst_6 = arith.constant dense<0> : tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9871311Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:22.9871499Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:22.9871688Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:22.9872008Z     %cst_7 = arith.constant dense<4> : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9872326Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:22.9872667Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.9873140Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.9873608Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.9874077Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.9874535Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.9874951Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.9875296Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:22.9875759Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:22.9876525Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:22.9877220Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.9877625Z     %11 = arith.cmpi eq, %10, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.9877950Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked>
2026-02-21T09:37:22.9878208Z     %13 = arith.cmpi eq, %10, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:22.9878451Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked>
2026-02-21T09:37:22.9878722Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.9878991Z     %16 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.9879351Z     %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.9879700Z     %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9879949Z     scf.for %arg3 = %0 to %c224_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:22.9880134Z       %19 = arith.divsi %arg3, %c224_i32 : i32
2026-02-21T09:37:22.9880294Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T09:37:22.9880444Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T09:37:22.9880587Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T09:37:22.9880742Z       %23 = arith.remsi %arg3, %c224_i32 : i32
2026-02-21T09:37:22.9880895Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:22.9881084Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:22.9881230Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:22.9881382Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T09:37:22.9881596Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.9881865Z       %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.9882143Z       %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:22.9882410Z       %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:22.9882693Z       %32 = arith.muli %26, %c128_i32 : i32
2026-02-21T09:37:22.9882911Z       %33 = tt.splat %32 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.9883181Z       %34 = tt.splat %32 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.9883455Z       %35 = arith.addi %33, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:22.9883733Z       %36 = arith.addi %34, %4 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:22.9884081Z       %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.9884406Z       %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:22.9884653Z       %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9885013Z       %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9885434Z       %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x128xf32, #mma>)  : i32 {
2026-02-21T09:37:22.9885713Z         %72 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:22.9885941Z         %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.9886212Z         %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.9886606Z         %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.9886909Z         %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9887103Z         %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9887308Z         %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9887514Z         %79 = tt.load %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.9887788Z         %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9888202Z         %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9888498Z         %82 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:22.9888642Z         %83 = tt.splat %82 : i32 -> tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9888800Z         %84 = arith.addi %83, %40 : tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9888999Z         %85 = tt.addptr %7, %84 : tensor<1x128x!tt.ptr<i8>, #blocked2>, tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9889201Z         %86 = tt.load %85 : tensor<1x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:22.9889449Z         %87 = ttg.convert_layout %86 : tensor<1x128xi8, #blocked2> -> tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9889741Z         %88 = arith.shli %87, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9890019Z         %89 = arith.shrsi %88, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9890259Z         %90 = arith.shrsi %87, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9890556Z         %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:22.9890899Z         %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:22.9891188Z         %93 = tt.broadcast %91 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9891432Z         %94 = arith.select %12, %93, %cst_6 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9891674Z         %95 = tt.broadcast %92 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9891907Z         %96 = arith.select %14, %95, %94 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9892143Z         %97 = tt.reshape %96 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked3>
2026-02-21T09:37:22.9892372Z         %98 = arith.sitofp %97 : tensor<2x128xi8, #blocked3> to tensor<2x128xf32, #blocked3>
2026-02-21T09:37:22.9892687Z         %99 = ttg.convert_layout %98 : tensor<2x128xf32, #blocked3> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9893167Z         %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:37:22.9893527Z         %101 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:22.9893651Z         %102 = arith.muli %101, %c2_i32 : i32
2026-02-21T09:37:22.9893828Z         %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.9894061Z         %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.9894344Z         %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.9894687Z         %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9894886Z         %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9895094Z         %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9895311Z         %109 = tt.load %108 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.9895585Z         %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9895998Z         %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9896292Z         %112 = arith.muli %101, %c7168_i32 : i32
2026-02-21T09:37:22.9896441Z         %113 = tt.splat %112 : i32 -> tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9896611Z         %114 = arith.addi %113, %40 : tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9896813Z         %115 = tt.addptr %7, %114 : tensor<1x128x!tt.ptr<i8>, #blocked2>, tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9897023Z         %116 = tt.load %115 : tensor<1x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:22.9897273Z         %117 = ttg.convert_layout %116 : tensor<1x128xi8, #blocked2> -> tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9897564Z         %118 = arith.shli %117, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9897803Z         %119 = arith.shrsi %118, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9898068Z         %120 = arith.shrsi %117, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9898359Z         %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:22.9898697Z         %122 = tt.expand_dims %120 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:22.9898985Z         %123 = tt.broadcast %121 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9899230Z         %124 = arith.select %12, %123, %cst_6 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9899474Z         %125 = tt.broadcast %122 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9899713Z         %126 = arith.select %14, %125, %124 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9899944Z         %127 = tt.reshape %126 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked3>
2026-02-21T09:37:22.9900175Z         %128 = arith.sitofp %127 : tensor<2x128xi8, #blocked3> to tensor<2x128xf32, #blocked3>
2026-02-21T09:37:22.9900478Z         %129 = ttg.convert_layout %128 : tensor<2x128xf32, #blocked3> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9900939Z         %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:37:22.9901281Z         %131 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:22.9901401Z         %132 = arith.muli %131, %c2_i32 : i32
2026-02-21T09:37:22.9901572Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.9901796Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:22.9902074Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:22.9902386Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9902577Z         %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9902775Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9902977Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.9903244Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9903640Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9903919Z         %142 = arith.muli %131, %c7168_i32 : i32
2026-02-21T09:37:22.9904063Z         %143 = tt.splat %142 : i32 -> tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9904221Z         %144 = arith.addi %143, %40 : tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9904418Z         %145 = tt.addptr %7, %144 : tensor<1x128x!tt.ptr<i8>, #blocked2>, tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9904618Z         %146 = tt.load %145 : tensor<1x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:22.9904862Z         %147 = ttg.convert_layout %146 : tensor<1x128xi8, #blocked2> -> tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9905145Z         %148 = arith.shli %147, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9905381Z         %149 = arith.shrsi %148, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9905616Z         %150 = arith.shrsi %147, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9905937Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:22.9906276Z         %152 = tt.expand_dims %150 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:22.9906561Z         %153 = tt.broadcast %151 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9906804Z         %154 = arith.select %12, %153, %cst_6 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9907044Z         %155 = tt.broadcast %152 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9907282Z         %156 = arith.select %14, %155, %154 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9907517Z         %157 = tt.reshape %156 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked3>
2026-02-21T09:37:22.9907742Z         %158 = arith.sitofp %157 : tensor<2x128xi8, #blocked3> to tensor<2x128xf32, #blocked3>
2026-02-21T09:37:22.9908039Z         %159 = ttg.convert_layout %158 : tensor<2x128xf32, #blocked3> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9908505Z         %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:37:22.9908851Z         scf.yield %160 : tensor<16x128xf32, #mma>
2026-02-21T09:37:22.9908973Z       } {tt.flatten}
2026-02-21T09:37:22.9909088Z       %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9909281Z       %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:22.9909477Z       %44 = tt.load %43 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:22.9909732Z       %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9910117Z       %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9910441Z       %47 = arith.addi %40, %cst_5 : tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9910637Z       %48 = tt.addptr %7, %47 : tensor<1x128x!tt.ptr<i8>, #blocked2>, tensor<1x128xi32, #blocked2>
2026-02-21T09:37:22.9910831Z       %49 = tt.load %48 : tensor<1x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:22.9911068Z       %50 = ttg.convert_layout %49 : tensor<1x128xi8, #blocked2> -> tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9911341Z       %51 = arith.shli %50, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9911570Z       %52 = arith.shrsi %51, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9911800Z       %53 = arith.shrsi %50, %cst_7 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:22.9912082Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:22.9912415Z       %55 = tt.expand_dims %53 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:37:22.9912691Z       %56 = tt.broadcast %54 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9912922Z       %57 = arith.select %12, %56, %cst_6 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9913154Z       %58 = tt.broadcast %55 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9913379Z       %59 = arith.select %14, %58, %57 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:37:22.9913600Z       %60 = tt.reshape %59 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked3>
2026-02-21T09:37:22.9913863Z       %61 = arith.sitofp %60 : tensor<2x128xi8, #blocked3> to tensor<2x128xf32, #blocked3>
2026-02-21T09:37:22.9914152Z       %62 = ttg.convert_layout %61 : tensor<2x128xf32, #blocked3> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:22.9914604Z       %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:37:22.9914980Z       %64 = arith.truncf %63 : tensor<16x128xf32, #mma> to tensor<16x128xbf16, #mma>
2026-02-21T09:37:22.9915243Z       %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:22.9915481Z       %66 = arith.muli %65, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:37:22.9915711Z       %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:37:22.9915969Z       %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x128xi32, #mma>
2026-02-21T09:37:22.9916169Z       %69 = tt.broadcast %67 : tensor<1x128xi32, #mma> -> tensor<16x128xi32, #mma>
2026-02-21T09:37:22.9916347Z       %70 = arith.addi %68, %69 : tensor<16x128xi32, #mma>
2026-02-21T09:37:22.9916530Z       %71 = tt.addptr %15, %70 : tensor<16x128x!tt.ptr<bf16>, #mma>, tensor<16x128xi32, #mma>
2026-02-21T09:37:22.9916719Z       tt.store %71, %64 : tensor<16x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:22.9916881Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 2 : i32}
2026-02-21T09:37:22.9917015Z     tt.return
2026-02-21T09:37:22.9917098Z   }
2026-02-21T09:37:22.9917170Z }
2026-02-21T09:37:22.9917216Z 
2026-02-21T09:37:22.9917247Z {-#
2026-02-21T09:37:22.9917326Z   external_resources: {
2026-02-21T09:37:22.9917429Z     mlir_reproducer: {
2026-02-21T09:37:22.9918424Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:22.9919443Z       disable_threading: false,
2026-02-21T09:37:22.9919547Z       verify_each: true
2026-02-21T09:37:22.9919641Z     }
2026-02-21T09:37:22.9919713Z   }
2026-02-21T09:37:22.9919785Z #-}
2026-02-21T09:37:22.9920065Z /tmp/torchinductor_root/am/cambokeaynklkiwh66dt4xp3c3u5b7f7n3f6v6blvel4ay7c5o5p.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:22.9920748Z /tmp/torchinductor_root/am/cambokeaynklkiwh66dt4xp3c3u5b7f7n3f6v6blvel4ay7c5o5p.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:22.9921307Z [79s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:22.9922111Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:37:22.9922867Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:22.9923090Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:23.1257569Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:23.1260979Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:37:23.1261937Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:23.1262791Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:37:23.1263599Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:37:23.1264359Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:23.1265065Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:23.1265565Z #smem = #ttg.shared_memory
2026-02-21T09:37:23.1266212Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:23.1267520Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:23.1268577Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:23.1268955Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:23.1269297Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:23.1269641Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T09:37:23.1270015Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:23.1270431Z     %cst_4 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.1271104Z     %cst_5 = arith.constant dense<29352960> : tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1271409Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:23.1271639Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:23.1271875Z     %c112_i32 = arith.constant 112 : i32
2026-02-21T09:37:23.1272095Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:23.1272317Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:23.1272535Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:37:23.1272750Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:23.1272966Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:23.1273251Z     %cst_6 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1273531Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:23.1273753Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:23.1273968Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:23.1274319Z     %cst_7 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1274689Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:23.1275065Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:23.1275599Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:23.1276120Z     %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:23.1276639Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:23.1277152Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.1277652Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:23.1277949Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:23.1278330Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:23.1278903Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:23.1279468Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:23.1279821Z     %11 = arith.cmpi eq, %10, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:23.1280109Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:23.1280383Z     %13 = arith.cmpi eq, %10, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:23.1280643Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:23.1280945Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:23.1281245Z     %16 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.1281631Z     %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:23.1282014Z     %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1282285Z     scf.for %arg3 = %0 to %c112_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:23.1282495Z       %19 = arith.divsi %arg3, %c112_i32 : i32
2026-02-21T09:37:23.1282738Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T09:37:23.1282907Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T09:37:23.1283076Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T09:37:23.1283242Z       %23 = arith.remsi %arg3, %c112_i32 : i32
2026-02-21T09:37:23.1283464Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:23.1283625Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:23.1283787Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:23.1283947Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T09:37:23.1284180Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:23.1284479Z       %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:23.1284776Z       %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:23.1285091Z       %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:23.1285317Z       %32 = arith.muli %26, %c256_i32 : i32
2026-02-21T09:37:23.1285551Z       %33 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:23.1285850Z       %34 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:23.1286151Z       %35 = arith.addi %33, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:23.1286445Z       %36 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:23.1286803Z       %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:23.1287078Z       %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:23.1287292Z       %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1287599Z       %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1288116Z       %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T09:37:23.1288363Z         %73 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:23.1288554Z         %74 = tt.splat %73 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.1288796Z         %75 = arith.addi %74, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.1289092Z         %76 = tt.expand_dims %75 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:23.1289394Z         %77 = tt.broadcast %76 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1289608Z         %78 = arith.addi %39, %77 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1289824Z         %79 = tt.addptr %6, %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1290049Z         %80 = tt.load %79 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:23.1290339Z         %81 = ttg.convert_layout %80 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1290786Z         %82 = arith.extf %81 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1291101Z         %83 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:23.1291261Z         %84 = tt.splat %83 : i32 -> tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1291434Z         %85 = arith.addi %84, %40 : tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1291644Z         %86 = tt.addptr %7, %85 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1291860Z         %87 = tt.load %86 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:23.1292122Z         %88 = ttg.convert_layout %87 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1292434Z         %89 = arith.shli %88, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1292691Z         %90 = arith.shrsi %89, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1292982Z         %91 = arith.shrsi %88, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1293304Z         %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.1293677Z         %93 = tt.expand_dims %91 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.1293986Z         %94 = tt.broadcast %92 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1294255Z         %95 = arith.select %12, %94, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1294516Z         %96 = tt.broadcast %93 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1294772Z         %97 = arith.select %14, %96, %95 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1295033Z         %98 = tt.reshape %97 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3>
2026-02-21T09:37:23.1295277Z         %99 = arith.sitofp %98 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3>
2026-02-21T09:37:23.1295559Z         %100 = ttg.local_alloc %99 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:23.1295922Z         %101 = ttg.local_load %100 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1296453Z         %102 = tt.dot %82, %101, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:23.1296886Z         %103 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:23.1297024Z         %104 = arith.muli %103, %c2_i32 : i32
2026-02-21T09:37:23.1297219Z         %105 = tt.splat %104 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.1297443Z         %106 = arith.addi %105, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.1297725Z         %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:23.1298007Z         %108 = tt.broadcast %107 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1298202Z         %109 = arith.addi %39, %108 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1298403Z         %110 = tt.addptr %6, %109 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1298610Z         %111 = tt.load %110 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:23.1298883Z         %112 = ttg.convert_layout %111 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1299285Z         %113 = arith.extf %112 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1299563Z         %114 = arith.muli %103, %c7168_i32 : i32
2026-02-21T09:37:23.1299710Z         %115 = tt.splat %114 : i32 -> tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1299870Z         %116 = arith.addi %115, %40 : tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1300069Z         %117 = tt.addptr %7, %116 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1300274Z         %118 = tt.load %117 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:23.1300517Z         %119 = ttg.convert_layout %118 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1300811Z         %120 = arith.shli %119, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1301094Z         %121 = arith.shrsi %120, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1301336Z         %122 = arith.shrsi %119, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1301629Z         %123 = tt.expand_dims %121 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.1301965Z         %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.1302255Z         %125 = tt.broadcast %123 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1302498Z         %126 = arith.select %12, %125, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1302742Z         %127 = tt.broadcast %124 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1302977Z         %128 = arith.select %14, %127, %126 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1303207Z         %129 = tt.reshape %128 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3>
2026-02-21T09:37:23.1303431Z         %130 = arith.sitofp %129 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3>
2026-02-21T09:37:23.1303708Z         %131 = ttg.local_alloc %130 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:23.1304123Z         %132 = ttg.local_load %131 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1304647Z         %133 = tt.dot %113, %132, %102, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:23.1305027Z         %134 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:23.1305193Z         %135 = arith.muli %134, %c2_i32 : i32
2026-02-21T09:37:23.1305389Z         %136 = tt.splat %135 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.1305640Z         %137 = arith.addi %136, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.1305939Z         %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:23.1312190Z         %139 = tt.broadcast %138 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1312416Z         %140 = arith.addi %39, %139 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1312621Z         %141 = tt.addptr %6, %140 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1312832Z         %142 = tt.load %141 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:23.1313106Z         %143 = ttg.convert_layout %142 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1313512Z         %144 = arith.extf %143 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1313799Z         %145 = arith.muli %134, %c7168_i32 : i32
2026-02-21T09:37:23.1313946Z         %146 = tt.splat %145 : i32 -> tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1314107Z         %147 = arith.addi %146, %40 : tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1314304Z         %148 = tt.addptr %7, %147 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1314508Z         %149 = tt.load %148 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:23.1314756Z         %150 = ttg.convert_layout %149 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1315036Z         %151 = arith.shli %150, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1315337Z         %152 = arith.shrsi %151, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1315573Z         %153 = arith.shrsi %150, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1315864Z         %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.1316205Z         %155 = tt.expand_dims %153 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.1316492Z         %156 = tt.broadcast %154 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1316739Z         %157 = arith.select %12, %156, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1316982Z         %158 = tt.broadcast %155 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1317219Z         %159 = arith.select %14, %158, %157 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1317450Z         %160 = tt.reshape %159 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3>
2026-02-21T09:37:23.1317675Z         %161 = arith.sitofp %160 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3>
2026-02-21T09:37:23.1317931Z         %162 = ttg.local_alloc %161 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:23.1318255Z         %163 = ttg.local_load %162 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1318753Z         %164 = tt.dot %144, %163, %133, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:23.1319104Z         scf.yield %164 : tensor<16x256xf32, #mma>
2026-02-21T09:37:23.1319228Z       } {tt.flatten}
2026-02-21T09:37:23.1319347Z       %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1319542Z       %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.1319739Z       %44 = tt.load %43 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:23.1319998Z       %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1320389Z       %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1320685Z       %47 = arith.addi %40, %cst_5 : tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1320884Z       %48 = tt.addptr %7, %47 : tensor<1x256x!tt.ptr<i8>, #blocked2>, tensor<1x256xi32, #blocked2>
2026-02-21T09:37:23.1321078Z       %49 = tt.load %48 : tensor<1x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:23.1321316Z       %50 = ttg.convert_layout %49 : tensor<1x256xi8, #blocked2> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1321589Z       %51 = arith.shli %50, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1321821Z       %52 = arith.shrsi %51, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1322050Z       %53 = arith.shrsi %50, %cst_7 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.1322333Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.1322719Z       %55 = tt.expand_dims %53 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.1323001Z       %56 = tt.broadcast %54 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1323396Z       %57 = arith.select %12, %56, %cst_6 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1323627Z       %58 = tt.broadcast %55 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1323853Z       %59 = arith.select %14, %58, %57 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.1324080Z       %60 = tt.reshape %59 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3>
2026-02-21T09:37:23.1324296Z       %61 = arith.sitofp %60 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3>
2026-02-21T09:37:23.1324546Z       %62 = ttg.local_alloc %61 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:23.1324866Z       %63 = ttg.local_load %62 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.1325323Z       %64 = tt.dot %46, %63, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:23.1325704Z       %65 = arith.truncf %64 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T09:37:23.1325961Z       %66 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:23.1326193Z       %67 = arith.muli %66, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:37:23.1326424Z       %68 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:37:23.1326680Z       %69 = tt.broadcast %67 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:23.1326921Z       %70 = tt.broadcast %68 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:23.1327093Z       %71 = arith.addi %69, %70 : tensor<16x256xi32, #mma>
2026-02-21T09:37:23.1327277Z       %72 = tt.addptr %15, %71 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi32, #mma>
2026-02-21T09:37:23.1327465Z       tt.store %72, %65 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:23.1327601Z     } {tt.num_stages = 2 : i32}
2026-02-21T09:37:23.1327702Z     tt.return
2026-02-21T09:37:23.1327781Z   }
2026-02-21T09:37:23.1327853Z }
2026-02-21T09:37:23.1327895Z 
2026-02-21T09:37:23.1327925Z {-#
2026-02-21T09:37:23.1327967Z   external_resources: {
2026-02-21T09:37:23.1328004Z     mlir_reproducer: {
2026-02-21T09:37:23.1328945Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:23.1328989Z       disable_threading: false,
2026-02-21T09:37:23.1329025Z       verify_each: true
2026-02-21T09:37:23.1329056Z     }
2026-02-21T09:37:23.1329089Z   }
2026-02-21T09:37:23.1329119Z #-}
2026-02-21T09:37:23.1329354Z /tmp/torchinductor_root/75/c75e3kbo34qoroitpjojd5lcsamuqtto3niepbqlxidsoepxczrm.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:23.1329767Z /tmp/torchinductor_root/75/c75e3kbo34qoroitpjojd5lcsamuqtto3niepbqlxidsoepxczrm.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:23.1329880Z [79s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:23.1330504Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:37:23.1330595Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:23.1330676Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:23.3520446Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:23.3524388Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}>
2026-02-21T09:37:23.3525336Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:37:23.3526175Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}>
2026-02-21T09:37:23.3526960Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 8], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:23.3527523Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:23.3527868Z #smem = #ttg.shared_memory
2026-02-21T09:37:23.3528241Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:23.3529108Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:23.3529756Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T09:37:23.3530008Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:23.3530197Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:23.3530378Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:23.3530550Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:23.3530770Z     %cst_0 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3531006Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:23.3531191Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:37:23.3531375Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:23.3531551Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:23.3531722Z     %c112_i32 = arith.constant 112 : i32
2026-02-21T09:37:23.3531903Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:23.3532086Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:23.3532389Z     %cst_1 = arith.constant dense<29352960> : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3532812Z     %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.3533152Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:23.3533482Z     %cst_4 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3533812Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:23.3534079Z     %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:23.3534340Z     %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:23.3534568Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:23.3534874Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:23.3535300Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:23.3535847Z     %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:23.3536333Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:23.3536749Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.3537121Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:23.3537411Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3537778Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:23.3538272Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:23.3538762Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:23.3539065Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:23.3539298Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:23.3539535Z     %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:23.3539758Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:37:23.3540009Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:23.3542646Z     %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.3542989Z     %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:23.3543318Z     %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3543549Z     scf.for %arg3 = %0 to %c112_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:23.3543725Z       %19 = arith.divsi %arg3, %c112_i32 : i32
2026-02-21T09:37:23.3543871Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T09:37:23.3544009Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T09:37:23.3544146Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T09:37:23.3544290Z       %23 = arith.remsi %arg3, %c112_i32 : i32
2026-02-21T09:37:23.3544431Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:23.3544564Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:23.3544696Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:23.3544832Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T09:37:23.3545034Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:23.3545294Z       %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:23.3545544Z       %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:23.3545795Z       %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:23.3545987Z       %32 = arith.muli %26, %c256_i32 : i32
2026-02-21T09:37:23.3546236Z       %33 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:23.3546544Z       %34 = tt.splat %32 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:23.3546848Z       %35 = arith.addi %33, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:23.3547155Z       %36 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:23.3547543Z       %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:23.3547844Z       %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:23.3548048Z       %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3548401Z       %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3548794Z       %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T09:37:23.3549008Z         %72 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:23.3549181Z         %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.3549396Z         %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.3549670Z         %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:23.3549936Z         %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3550130Z         %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3550324Z         %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3550522Z         %79 = tt.load %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:23.3550783Z         %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3551208Z         %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3551493Z         %82 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:23.3551665Z         %83 = tt.splat %82 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3551887Z         %84 = arith.addi %83, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3552195Z         %85 = tt.addptr %7, %84 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3552496Z         %86 = tt.load %85 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3552723Z         %87 = arith.shli %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3552953Z         %88 = arith.shrsi %87, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3553184Z         %89 = arith.shrsi %86, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3553471Z         %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.3553799Z         %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.3554082Z         %92 = tt.broadcast %90 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3554316Z         %93 = arith.select %12, %92, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3554547Z         %94 = tt.broadcast %91 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3554772Z         %95 = arith.select %14, %94, %93 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3554996Z         %96 = tt.reshape %95 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:23.3555214Z         %97 = arith.sitofp %96 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:23.3555512Z         %98 = ttg.local_alloc %97 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:23.3555832Z         %99 = ttg.local_load %98 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3556300Z         %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:23.3556642Z         %101 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:23.3556762Z         %102 = arith.muli %101, %c2_i32 : i32
2026-02-21T09:37:23.3556934Z         %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.3557153Z         %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.3557429Z         %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:23.3557700Z         %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3557892Z         %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3558087Z         %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3558289Z         %109 = tt.load %108 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:23.3558554Z         %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3558979Z         %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3559261Z         %112 = arith.muli %101, %c7168_i32 : i32
2026-02-21T09:37:23.3559437Z         %113 = tt.splat %112 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3559665Z         %114 = arith.addi %113, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3559974Z         %115 = tt.addptr %7, %114 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3560280Z         %116 = tt.load %115 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3560510Z         %117 = arith.shli %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3560747Z         %118 = arith.shrsi %117, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3560985Z         %119 = arith.shrsi %116, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3561276Z         %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.3561609Z         %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.3561898Z         %122 = tt.broadcast %120 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3562142Z         %123 = arith.select %12, %122, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3562381Z         %124 = tt.broadcast %121 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3562684Z         %125 = arith.select %14, %124, %123 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3562916Z         %126 = tt.reshape %125 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:23.3563179Z         %127 = arith.sitofp %126 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:23.3563431Z         %128 = ttg.local_alloc %127 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:23.3563757Z         %129 = ttg.local_load %128 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3564229Z         %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:23.3564572Z         %131 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:23.3564693Z         %132 = arith.muli %131, %c2_i32 : i32
2026-02-21T09:37:23.3564865Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.3565084Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:23.3565368Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:23.3565644Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3565838Z         %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3566035Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3566237Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:23.3566501Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3566934Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3567211Z         %142 = arith.muli %131, %c7168_i32 : i32
2026-02-21T09:37:23.3567388Z         %143 = tt.splat %142 : i32 -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3567615Z         %144 = arith.addi %143, %40 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3567923Z         %145 = tt.addptr %7, %144 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3568231Z         %146 = tt.load %145 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3568461Z         %147 = arith.shli %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3568701Z         %148 = arith.shrsi %147, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3568938Z         %149 = arith.shrsi %146, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3569232Z         %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.3569570Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.3569858Z         %152 = tt.broadcast %150 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3570102Z         %153 = arith.select %12, %152, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3570341Z         %154 = tt.broadcast %151 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3570582Z         %155 = arith.select %14, %154, %153 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3570816Z         %156 = tt.reshape %155 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:23.3571072Z         %157 = arith.sitofp %156 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:23.3571327Z         %158 = ttg.local_alloc %157 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:23.3571648Z         %159 = ttg.local_load %158 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3572118Z         %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:23.3572462Z         scf.yield %160 : tensor<16x256xf32, #mma>
2026-02-21T09:37:23.3572585Z       } {tt.flatten}
2026-02-21T09:37:23.3572705Z       %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3572894Z       %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:23.3573097Z       %44 = tt.load %43 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:23.3573355Z       %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3573741Z       %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3574066Z       %47 = arith.addi %40, %cst_1 : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3574374Z       %48 = tt.addptr %7, %47 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3574702Z       %49 = tt.load %48 : tensor<1x256x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3574926Z       %50 = arith.shli %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3575158Z       %51 = arith.shrsi %50, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3575387Z       %52 = arith.shrsi %49, %cst_4 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:23.3575672Z       %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.3576003Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:37:23.3576281Z       %55 = tt.broadcast %53 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3576513Z       %56 = arith.select %12, %55, %cst_0 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3576750Z       %57 = tt.broadcast %54 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3576982Z       %58 = arith.select %14, %57, %56 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:37:23.3577205Z       %59 = tt.reshape %58 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked2>
2026-02-21T09:37:23.3577426Z       %60 = arith.sitofp %59 : tensor<2x256xi8, #blocked2> to tensor<2x256xf32, #blocked2>
2026-02-21T09:37:23.3577672Z       %61 = ttg.local_alloc %60 : (tensor<2x256xf32, #blocked2>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:37:23.3577992Z       %62 = ttg.local_load %61 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:23.3578454Z       %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:37:23.3578831Z       %64 = arith.truncf %63 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T09:37:23.3579131Z       %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:23.3579360Z       %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:23.3579591Z       %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:37:23.3579846Z       %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:23.3580043Z       %69 = tt.broadcast %67 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:37:23.3580218Z       %70 = arith.addi %68, %69 : tensor<16x256xi32, #mma>
2026-02-21T09:37:23.3580400Z       %71 = tt.addptr %15, %70 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi32, #mma>
2026-02-21T09:37:23.3580592Z       tt.store %71, %64 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:23.3580731Z     } {tt.num_stages = 2 : i32}
2026-02-21T09:37:23.3580837Z     tt.return
2026-02-21T09:37:23.3580917Z   }
2026-02-21T09:37:23.3580988Z }
2026-02-21T09:37:23.3581031Z 
2026-02-21T09:37:23.3581065Z {-#
2026-02-21T09:37:23.3581143Z   external_resources: {
2026-02-21T09:37:23.3581246Z     mlir_reproducer: {
2026-02-21T09:37:23.3582273Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:23.3583258Z       disable_threading: false,
2026-02-21T09:37:23.3583361Z       verify_each: true
2026-02-21T09:37:23.3583454Z     }
2026-02-21T09:37:23.3583527Z   }
2026-02-21T09:37:23.3583596Z #-}
2026-02-21T09:37:23.3583869Z /tmp/torchinductor_root/fy/cfyx5plztc5dsvzazhgqd5rvw2taewki2ywtdqugmslksmnddieo.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:23.3584548Z /tmp/torchinductor_root/fy/cfyx5plztc5dsvzazhgqd5rvw2taewki2ywtdqugmslksmnddieo.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:23.3585097Z [79s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:23.3585873Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:37:23.3586583Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:23.3586754Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:24.8588522Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 104/104 16.4 configs/s
2026-02-21T09:37:27.1489159Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 353.2 configs/s
2026-02-21T09:37:27.7292300Z [83s] Generation 1 complete: 
2026-02-21T09:37:27.7292668Z error=15
2026-02-21T09:37:27.7292873Z ok=91
2026-02-21T09:37:27.7293087Z min=0.1501
2026-02-21T09:37:27.7293324Z mid=0.5532
2026-02-21T09:37:27.7293530Z max=8.1947
2026-02-21T09:37:27.7293766Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:37:27.7294443Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:37:27.7294804Z  'l2_groupings': [1],
2026-02-21T09:37:27.7295079Z  'load_eviction_policies': ['', ''],
2026-02-21T09:37:27.7295406Z  'loop_orders': [[0, 1]],
2026-02-21T09:37:27.7295682Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:37:27.7295967Z  'num_stages': 2,
2026-02-21T09:37:27.7296193Z  'num_warps': 4,
2026-02-21T09:37:27.7296437Z  'pid_type': 'flat',
2026-02-21T09:37:27.7296703Z  'range_flattens': [None, None],
2026-02-21T09:37:27.7297013Z  'range_multi_buffers': [None, True],
2026-02-21T09:37:27.7297334Z  'range_num_stages': [0, 1],
2026-02-21T09:37:27.7297579Z  'range_unroll_factors': [0, 1],
2026-02-21T09:37:27.7297845Z  'range_warp_specializes': [],
2026-02-21T09:37:27.7298094Z  'waves_per_eu': 3}
2026-02-21T09:37:27.7450794Z [83s] Fitting surrogate: 206 points, 206 targets
2026-02-21T09:37:29.3756379Z [85s] Generation 2 starting: 97 neighbors, 5 active search path(s)
2026-02-21T09:37:45.1001155Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 5.6 configs/s
2026-02-21T09:37:48.3940166Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:48.3943913Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:37:48.3946613Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:48.3947501Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:48.3948252Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:48.3949549Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:48.3950859Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:48.3951863Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:48.3952344Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.3952785Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.3953246Z     %cst_2 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:48.3953734Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T09:37:48.3954315Z     %cst_4 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.3954842Z     %cst_5 = arith.constant dense<29352960> : tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3955160Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:48.3955395Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:48.3955631Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:48.3955848Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:48.3956063Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:48.3956290Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:37:48.3956511Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:48.3956802Z     %cst_6 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3957089Z     %c224_i32 = arith.constant 224 : i32
2026-02-21T09:37:48.3957319Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:48.3957536Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:48.3957888Z     %cst_7 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3958269Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:48.3958496Z     %1 = arith.divsi %0, %c224_i32 : i32
2026-02-21T09:37:48.3958889Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:37:48.3959109Z     %3 = arith.subi %c4_i32, %2 : i32
2026-02-21T09:37:48.3959325Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:37:48.3959541Z     %5 = arith.remsi %0, %c224_i32 : i32
2026-02-21T09:37:48.3959766Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:37:48.3959983Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:37:48.3960189Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:37:48.3960399Z     %9 = arith.muli %7, %c16_i32 : i32
2026-02-21T09:37:48.3960796Z     %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:48.3961338Z     %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:48.3961832Z     %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:48.3962247Z     %13 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:48.3962865Z     %14 = arith.addi %12, %10 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:48.3963278Z     %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:48.3963599Z     %16 = arith.muli %8, %c64_i32 : i32
2026-02-21T09:37:48.3963970Z     %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:48.3964493Z     %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:48.3964888Z     %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:48.3965176Z     %20 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:48.3965531Z     %21 = arith.addi %19, %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:48.3965819Z     %22 = arith.addi %20, %18 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:48.3966165Z     %23 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.3966594Z     %24 = tt.expand_dims %14 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:48.3966945Z     %25 = arith.muli %24, %cst_2 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:48.3967210Z     %26 = tt.broadcast %25 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3967511Z     %27 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.3967881Z     %28 = tt.expand_dims %22 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3968256Z     %29 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:48.3968638Z     %30 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:48.3969226Z     %31 = tt.expand_dims %30 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:48.3969787Z     %32 = tt.expand_dims %31 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.3970139Z     %33 = arith.cmpi eq, %32, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.3970414Z     %34 = tt.broadcast %33 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:48.3970686Z     %35 = arith.cmpi eq, %32, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.3970953Z     %36 = tt.broadcast %35 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:48.3971321Z     %37 = scf.for %arg3 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg4 = %cst_3) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:48.3971676Z       %72 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:37:48.3971915Z       %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.3972218Z       %74 = arith.addi %73, %23 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.3972593Z       %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.3972969Z       %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3973235Z       %77 = arith.addi %26, %76 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3973534Z       %78 = tt.addptr %27, %77 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3973820Z       %79 = tt.load %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.3974190Z       %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.3974700Z       %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.3975014Z       %82 = arith.muli %arg3, %c7168_i32 : i32
2026-02-21T09:37:48.3975169Z       %83 = tt.splat %82 : i32 -> tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3975336Z       %84 = arith.addi %83, %28 : tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3975544Z       %85 = tt.addptr %29, %84 : tensor<1x64x!tt.ptr<i8>, #blocked2>, tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3975757Z       %86 = tt.load %85 : tensor<1x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:48.3976050Z       %87 = ttg.convert_layout %86 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3976349Z       %88 = arith.shli %87, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3976602Z       %89 = arith.shrsi %88, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3976851Z       %90 = arith.shrsi %87, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3977161Z       %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.3977521Z       %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.3977820Z       %93 = tt.broadcast %91 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3978073Z       %94 = arith.select %34, %93, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3978325Z       %95 = tt.broadcast %92 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3978573Z       %96 = arith.select %36, %95, %94 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3978815Z       %97 = tt.reshape %96 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.3979049Z       %98 = arith.sitofp %97 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.3979362Z       %99 = ttg.convert_layout %98 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.3979871Z       %100 = tt.dot %81, %99, %arg4, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.3980250Z       %101 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:37:48.3980388Z       %102 = arith.muli %101, %c2_i32 : i32
2026-02-21T09:37:48.3980573Z       %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.3980866Z       %104 = arith.addi %103, %23 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.3981168Z       %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.3981474Z       %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3981687Z       %107 = arith.addi %26, %106 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3981905Z       %108 = tt.addptr %27, %107 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3982133Z       %109 = tt.load %108 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.3982426Z       %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.3982883Z       %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.3983198Z       %112 = arith.muli %101, %c7168_i32 : i32
2026-02-21T09:37:48.3983352Z       %113 = tt.splat %112 : i32 -> tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3983525Z       %114 = arith.addi %113, %28 : tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3983736Z       %115 = tt.addptr %29, %114 : tensor<1x64x!tt.ptr<i8>, #blocked2>, tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3983956Z       %116 = tt.load %115 : tensor<1x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:48.3984221Z       %117 = ttg.convert_layout %116 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3984526Z       %118 = arith.shli %117, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3984797Z       %119 = arith.shrsi %118, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3985027Z       %120 = arith.shrsi %117, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3985312Z       %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.3985640Z       %122 = tt.expand_dims %120 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.3985918Z       %123 = tt.broadcast %121 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3986155Z       %124 = arith.select %34, %123, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3986390Z       %125 = tt.broadcast %122 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3986623Z       %126 = arith.select %36, %125, %124 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3986849Z       %127 = tt.reshape %126 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.3987065Z       %128 = arith.sitofp %127 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.3987356Z       %129 = ttg.convert_layout %128 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.3987810Z       %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.3988148Z       %131 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:37:48.3988269Z       %132 = arith.muli %131, %c2_i32 : i32
2026-02-21T09:37:48.3988435Z       %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.3988658Z       %134 = arith.addi %133, %23 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.3988930Z       %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.3989252Z       %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3989441Z       %137 = arith.addi %26, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3989635Z       %138 = tt.addptr %27, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3989842Z       %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.3990102Z       %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.3990498Z       %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.3990776Z       %142 = arith.muli %131, %c7168_i32 : i32
2026-02-21T09:37:48.3990915Z       %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3991070Z       %144 = arith.addi %143, %28 : tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3991259Z       %145 = tt.addptr %29, %144 : tensor<1x64x!tt.ptr<i8>, #blocked2>, tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3991456Z       %146 = tt.load %145 : tensor<1x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:48.3991692Z       %147 = ttg.convert_layout %146 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3991968Z       %148 = arith.shli %147, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3992201Z       %149 = arith.shrsi %148, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3992465Z       %150 = arith.shrsi %147, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3992748Z       %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.3993081Z       %152 = tt.expand_dims %150 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.3993359Z       %153 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3993594Z       %154 = arith.select %34, %153, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3993824Z       %155 = tt.broadcast %152 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3994054Z       %156 = arith.select %36, %155, %154 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.3994279Z       %157 = tt.reshape %156 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.3994495Z       %158 = arith.sitofp %157 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.3994784Z       %159 = ttg.convert_layout %158 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.3995242Z       %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.3995582Z       scf.yield %160 : tensor<16x64xf32, #mma>
2026-02-21T09:37:48.3995700Z     } {tt.flatten}
2026-02-21T09:37:48.3995843Z     %38 = arith.addi %23, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.3996117Z     %39 = tt.expand_dims %38 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.3996380Z     %40 = tt.broadcast %39 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3996572Z     %41 = arith.addi %26, %40 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3996764Z     %42 = tt.addptr %27, %41 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.3997003Z     %43 = tt.load %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.3997257Z     %44 = ttg.convert_layout %43 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.3997640Z     %45 = arith.extf %44 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.3997932Z     %46 = arith.addi %28, %cst_5 : tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3998127Z     %47 = tt.addptr %29, %46 : tensor<1x64x!tt.ptr<i8>, #blocked2>, tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.3998315Z     %48 = tt.load %47 : tensor<1x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:48.3998547Z     %49 = ttg.convert_layout %48 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3998821Z     %50 = arith.shli %49, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3999046Z     %51 = arith.shrsi %50, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3999268Z     %52 = arith.shrsi %49, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.3999540Z     %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.3999863Z     %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.4000129Z     %55 = tt.broadcast %53 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.4000389Z     %56 = arith.select %34, %55, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.4000615Z     %57 = tt.broadcast %54 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.4000835Z     %58 = arith.select %36, %57, %56 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.4001051Z     %59 = tt.reshape %58 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.4001259Z     %60 = arith.sitofp %59 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.4001540Z     %61 = ttg.convert_layout %60 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.4001982Z     %62 = tt.dot %45, %61, %37, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.4002348Z     %63 = arith.truncf %62 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:48.4002661Z     %64 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:48.4002892Z     %65 = arith.muli %64, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:37:48.4003112Z     %66 = tt.expand_dims %21 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:48.4003360Z     %67 = tt.broadcast %65 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:48.4003552Z     %68 = tt.broadcast %66 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:48.4003723Z     %69 = arith.addi %67, %68 : tensor<16x64xi32, #mma>
2026-02-21T09:37:48.4003888Z     %70 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:48.4004093Z     %71 = tt.addptr %70, %69 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:48.4004278Z     tt.store %71, %63 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:48.4004406Z     tt.return
2026-02-21T09:37:48.4004487Z   }
2026-02-21T09:37:48.4004561Z }
2026-02-21T09:37:48.4004644Z 
2026-02-21T09:37:48.4004678Z {-#
2026-02-21T09:37:48.4004758Z   external_resources: {
2026-02-21T09:37:48.4004857Z     mlir_reproducer: {
2026-02-21T09:37:48.4005842Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:48.4006828Z       disable_threading: false,
2026-02-21T09:37:48.4006937Z       verify_each: true
2026-02-21T09:37:48.4007024Z     }
2026-02-21T09:37:48.4007099Z   }
2026-02-21T09:37:48.4007167Z #-}
2026-02-21T09:37:48.4007448Z /tmp/torchinductor_root/ub/cubzumuvy6exska34ne3anqev2uijxs3c2fdfge754rtqpbtslar.py:12:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:48.4008130Z /tmp/torchinductor_root/ub/cubzumuvy6exska34ne3anqev2uijxs3c2fdfge754rtqpbtslar.py:12:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:48.4008677Z [104s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:48.4009430Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:48.4010080Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:48.4010246Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:48.5875554Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:48.5878819Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:37:48.5879749Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:48.5880599Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:48.5881380Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:48.5882275Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:48.5883691Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:48.5884685Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:48.5885034Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.5885299Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.5885564Z     %cst_2 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:48.5885797Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:48.5886029Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T09:37:48.5886473Z     %cst_4 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.5886860Z     %cst_5 = arith.constant dense<29352960> : tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5887094Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:48.5887276Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:48.5887455Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T09:37:48.5887634Z     %c112_i32 = arith.constant 112 : i32
2026-02-21T09:37:48.5887809Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:48.5887977Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:37:48.5888148Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:48.5888323Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:48.5888539Z     %cst_6 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5888755Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:48.5888920Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:48.5889083Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:48.5889363Z     %cst_7 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5889644Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:48.5889924Z     %1 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:48.5890333Z     %2 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:48.5890739Z     %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:48.5891135Z     %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:48.5891599Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.5891957Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.5892258Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:48.5892663Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:48.5893282Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:48.5893884Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.5894259Z     %11 = arith.cmpi eq, %10, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.5894517Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:48.5894740Z     %13 = arith.cmpi eq, %10, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.5894959Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:48.5895196Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:48.5895436Z     %16 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.5895755Z     %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.5896064Z     %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5896287Z     scf.for %arg3 = %0 to %c448_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:48.5896457Z       %19 = arith.divsi %arg3, %c16_i32 : i32
2026-02-21T09:37:48.5896595Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T09:37:48.5896737Z       %21 = arith.subi %c112_i32, %20 : i32
2026-02-21T09:37:48.5896869Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T09:37:48.5897005Z       %23 = arith.remsi %arg3, %c16_i32 : i32
2026-02-21T09:37:48.5897178Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:48.5897310Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:48.5897435Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:48.5897567Z       %27 = arith.muli %25, %c64_i32 : i32
2026-02-21T09:37:48.5897752Z       %28 = tt.splat %27 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:48.5897998Z       %29 = tt.splat %27 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:48.5898245Z       %30 = arith.addi %28, %1 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:48.5898485Z       %31 = arith.addi %29, %2 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:48.5898680Z       %32 = arith.muli %26, %c16_i32 : i32
2026-02-21T09:37:48.5898871Z       %33 = tt.splat %32 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:48.5899111Z       %34 = tt.splat %32 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:48.5899356Z       %35 = arith.addi %33, %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:48.5899616Z       %36 = arith.addi %34, %4 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:48.5899922Z       %37 = tt.expand_dims %35 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:48.5900211Z       %38 = arith.muli %37, %cst_2 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:48.5900436Z       %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5900751Z       %40 = tt.expand_dims %31 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5901165Z       %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst_3) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:48.5901422Z         %72 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:48.5901620Z         %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.5901875Z         %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.5902185Z         %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.5902499Z         %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5902716Z         %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5902946Z         %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5903182Z         %79 = tt.load %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.5903487Z         %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5903959Z         %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5904288Z         %82 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:48.5904452Z         %83 = tt.splat %82 : i32 -> tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5904609Z         %84 = arith.addi %83, %40 : tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5904798Z         %85 = tt.addptr %7, %84 : tensor<1x64x!tt.ptr<i8>, #blocked2>, tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5904993Z         %86 = tt.load %85 : tensor<1x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:48.5905229Z         %87 = ttg.convert_layout %86 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5905509Z         %88 = arith.shli %87, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5905772Z         %89 = arith.shrsi %88, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5906000Z         %90 = arith.shrsi %87, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5906282Z         %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.5906607Z         %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.5906882Z         %93 = tt.broadcast %91 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5907112Z         %94 = arith.select %12, %93, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5907342Z         %95 = tt.broadcast %92 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5907566Z         %96 = arith.select %14, %95, %94 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5907794Z         %97 = tt.reshape %96 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.5908010Z         %98 = arith.sitofp %97 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.5908296Z         %99 = ttg.convert_layout %98 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5908756Z         %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.5909101Z         %101 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:48.5909224Z         %102 = arith.muli %101, %c2_i32 : i32
2026-02-21T09:37:48.5909426Z         %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.5909656Z         %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.5909932Z         %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.5910209Z         %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5910401Z         %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5910600Z         %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5910806Z         %109 = tt.load %108 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.5911071Z         %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5911468Z         %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5911752Z         %112 = arith.muli %101, %c7168_i32 : i32
2026-02-21T09:37:48.5911896Z         %113 = tt.splat %112 : i32 -> tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5912054Z         %114 = arith.addi %113, %40 : tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5912245Z         %115 = tt.addptr %7, %114 : tensor<1x64x!tt.ptr<i8>, #blocked2>, tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5912442Z         %116 = tt.load %115 : tensor<1x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:48.5912680Z         %117 = ttg.convert_layout %116 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5912961Z         %118 = arith.shli %117, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5913199Z         %119 = arith.shrsi %118, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5913463Z         %120 = arith.shrsi %117, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5913753Z         %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.5914084Z         %122 = tt.expand_dims %120 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.5914370Z         %123 = tt.broadcast %121 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5914608Z         %124 = arith.select %12, %123, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5914843Z         %125 = tt.broadcast %122 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5915078Z         %126 = arith.select %14, %125, %124 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5915308Z         %127 = tt.reshape %126 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.5915536Z         %128 = arith.sitofp %127 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.5915832Z         %129 = ttg.convert_layout %128 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5916298Z         %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.5916644Z         %131 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:48.5916769Z         %132 = arith.muli %131, %c2_i32 : i32
2026-02-21T09:37:48.5916938Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.5917196Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.5917471Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.5917749Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5917939Z         %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5918140Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5918344Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.5918607Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5919010Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5919290Z         %142 = arith.muli %131, %c7168_i32 : i32
2026-02-21T09:37:48.5919432Z         %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5919590Z         %144 = arith.addi %143, %40 : tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5919781Z         %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr<i8>, #blocked2>, tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5919979Z         %146 = tt.load %145 : tensor<1x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:48.5920217Z         %147 = ttg.convert_layout %146 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5920495Z         %148 = arith.shli %147, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5920728Z         %149 = arith.shrsi %148, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5920963Z         %150 = arith.shrsi %147, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5921251Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.5921626Z         %152 = tt.expand_dims %150 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.5921908Z         %153 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5922145Z         %154 = arith.select %12, %153, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5922378Z         %155 = tt.broadcast %152 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5922648Z         %156 = arith.select %14, %155, %154 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5922877Z         %157 = tt.reshape %156 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.5923100Z         %158 = arith.sitofp %157 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.5923397Z         %159 = ttg.convert_layout %158 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5923852Z         %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.5924195Z         scf.yield %160 : tensor<16x64xf32, #mma>
2026-02-21T09:37:48.5924315Z       } {tt.flatten}
2026-02-21T09:37:48.5924430Z       %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5924624Z       %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.5924865Z       %44 = tt.load %43 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.5925122Z       %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5925516Z       %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5925814Z       %47 = arith.addi %40, %cst_5 : tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5926014Z       %48 = tt.addptr %7, %47 : tensor<1x64x!tt.ptr<i8>, #blocked2>, tensor<1x64xi32, #blocked2>
2026-02-21T09:37:48.5926208Z       %49 = tt.load %48 : tensor<1x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:37:48.5926448Z       %50 = ttg.convert_layout %49 : tensor<1x64xi8, #blocked2> -> tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5926721Z       %51 = arith.shli %50, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5926957Z       %52 = arith.shrsi %51, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5927189Z       %53 = arith.shrsi %50, %cst_7 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.5927471Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.5927798Z       %55 = tt.expand_dims %53 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.5928071Z       %56 = tt.broadcast %54 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5928298Z       %57 = arith.select %12, %56, %cst_6 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5928525Z       %58 = tt.broadcast %55 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5928746Z       %59 = arith.select %14, %58, %57 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.5928966Z       %60 = tt.reshape %59 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.5929215Z       %61 = arith.sitofp %60 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.5929501Z       %62 = ttg.convert_layout %61 : tensor<2x64xf32, #blocked2> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.5929950Z       %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.5930326Z       %64 = arith.truncf %63 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:48.5930583Z       %65 = tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:48.5930934Z       %66 = arith.muli %65, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:37:48.5931184Z       %67 = tt.expand_dims %30 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:48.5931547Z       %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:48.5931759Z       %69 = tt.broadcast %67 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:48.5931961Z       %70 = arith.addi %68, %69 : tensor<16x64xi32, #mma>
2026-02-21T09:37:48.5932182Z       %71 = tt.addptr %15, %70 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:48.5942858Z       tt.store %71, %64 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:48.5943045Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:37:48.5943188Z     tt.return
2026-02-21T09:37:48.5943269Z   }
2026-02-21T09:37:48.5943348Z }
2026-02-21T09:37:48.5943392Z 
2026-02-21T09:37:48.5943424Z {-#
2026-02-21T09:37:48.5943504Z   external_resources: {
2026-02-21T09:37:48.5943666Z     mlir_reproducer: {
2026-02-21T09:37:48.5944659Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:48.5945653Z       disable_threading: false,
2026-02-21T09:37:48.5945759Z       verify_each: true
2026-02-21T09:37:48.5945847Z     }
2026-02-21T09:37:48.5945919Z   }
2026-02-21T09:37:48.5945991Z #-}
2026-02-21T09:37:48.5946273Z /tmp/torchinductor_root/3g/c3gtwuhxllsqoypqhlqwz2aa4bkm5b5sdacguxpaftodwwlmfw7h.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:48.5946965Z /tmp/torchinductor_root/3g/c3gtwuhxllsqoypqhlqwz2aa4bkm5b5sdacguxpaftodwwlmfw7h.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:48.5947522Z [104s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:48.5948311Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[2, 0], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:48.5949020Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:48.5949188Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:48.7794514Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:48.7798084Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:37:48.7799033Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:48.7799868Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:48.7800652Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:48.7801370Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:48.7801870Z #smem = #ttg.shared_memory
2026-02-21T09:37:48.7802520Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:48.7803937Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:48.7805029Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T09:37:48.7805396Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:48.7805639Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:48.7805821Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:48.7805958Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:48.7806135Z     %cst_0 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7806445Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:48.7806593Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:37:48.7806735Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:48.7806876Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:48.7807015Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T09:37:48.7807158Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:48.7807301Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:48.7807536Z     %cst_1 = arith.constant dense<29352960> : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7807871Z     %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.7808143Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:48.7808415Z     %cst_4 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7808689Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.7808899Z     %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.7809112Z     %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:48.7809291Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:48.7809536Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:48.7809879Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:48.7810262Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:48.7810648Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:48.7810974Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.7811280Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.7811642Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7812016Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:48.7812531Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:48.7813024Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.7813340Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.7813582Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:48.7813822Z     %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:48.7814056Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:48.7814312Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:48.7814573Z     %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.7814912Z     %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.7815242Z     %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7815465Z     scf.for %arg3 = %0 to %c448_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:48.7815613Z       %19 = arith.divsi %arg3, %c448_i32 : i32
2026-02-21T09:37:48.7815738Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T09:37:48.7815854Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T09:37:48.7816002Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T09:37:48.7816122Z       %23 = arith.remsi %arg3, %c448_i32 : i32
2026-02-21T09:37:48.7816241Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:48.7816355Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:48.7816465Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:48.7816578Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T09:37:48.7816746Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:48.7816957Z       %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:48.7817167Z       %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:48.7817373Z       %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:48.7817535Z       %32 = arith.muli %26, %c64_i32 : i32
2026-02-21T09:37:48.7817739Z       %33 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:48.7817988Z       %34 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:48.7818236Z       %35 = arith.addi %33, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:48.7818480Z       %36 = arith.addi %34, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:48.7818743Z       %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:48.7818989Z       %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:48.7819179Z       %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7819546Z       %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7819937Z       %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:48.7820184Z         %72 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:48.7820356Z         %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.7820576Z         %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.7820844Z         %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.7821117Z         %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7821307Z         %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7821510Z         %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7821710Z         %79 = tt.load %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.7821977Z         %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7822383Z         %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7822664Z         %82 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:48.7822842Z         %83 = tt.splat %82 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7823067Z         %84 = arith.addi %83, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7823366Z         %85 = tt.addptr %7, %84 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7823699Z         %86 = tt.load %85 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7823929Z         %87 = arith.shli %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7824162Z         %88 = arith.shrsi %87, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7824393Z         %89 = arith.shrsi %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7824674Z         %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.7825000Z         %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.7825274Z         %92 = tt.broadcast %90 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7825509Z         %93 = arith.select %12, %92, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7825740Z         %94 = tt.broadcast %91 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7825970Z         %95 = arith.select %14, %94, %93 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7826190Z         %96 = tt.reshape %95 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.7826403Z         %97 = arith.sitofp %96 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.7826648Z         %98 = ttg.local_alloc %97 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:48.7826965Z         %99 = ttg.local_load %98 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7827429Z         %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.7827774Z         %101 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:48.7827927Z         %102 = arith.muli %101, %c2_i32 : i32
2026-02-21T09:37:48.7828098Z         %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.7828322Z         %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.7828594Z         %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.7828870Z         %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7829063Z         %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7829264Z         %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7829477Z         %109 = tt.load %108 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.7829742Z         %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7830151Z         %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7830432Z         %112 = arith.muli %101, %c7168_i32 : i32
2026-02-21T09:37:48.7830610Z         %113 = tt.splat %112 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7830839Z         %114 = arith.addi %113, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7831146Z         %115 = tt.addptr %7, %114 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7831483Z         %116 = tt.load %115 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7831712Z         %117 = arith.shli %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7831950Z         %118 = arith.shrsi %117, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7832186Z         %119 = arith.shrsi %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7832473Z         %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.7832810Z         %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.7833092Z         %122 = tt.broadcast %120 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7833329Z         %123 = arith.select %12, %122, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7833566Z         %124 = tt.broadcast %121 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7833799Z         %125 = arith.select %14, %124, %123 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7834034Z         %126 = tt.reshape %125 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.7834256Z         %127 = arith.sitofp %126 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.7834506Z         %128 = ttg.local_alloc %127 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:48.7834832Z         %129 = ttg.local_load %128 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7835302Z         %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.7835678Z         %131 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:48.7835802Z         %132 = arith.muli %131, %c2_i32 : i32
2026-02-21T09:37:48.7835970Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.7836192Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:48.7836465Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:48.7836742Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7836937Z         %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7837134Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7837343Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.7837606Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7838013Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7838295Z         %142 = arith.muli %131, %c7168_i32 : i32
2026-02-21T09:37:48.7838467Z         %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7838695Z         %144 = arith.addi %143, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7839000Z         %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7839357Z         %146 = tt.load %145 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7839584Z         %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7839820Z         %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7840053Z         %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7840336Z         %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.7840667Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.7840943Z         %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7841184Z         %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7841419Z         %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7841647Z         %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7841871Z         %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.7842087Z         %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.7842335Z         %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:48.7842703Z         %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7843168Z         %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.7843549Z         scf.yield %160 : tensor<16x64xf32, #mma>
2026-02-21T09:37:48.7843675Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:37:48.7843814Z       %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7844005Z       %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:48.7844199Z       %44 = tt.load %43 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:48.7844453Z       %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7844837Z       %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7845161Z       %47 = arith.addi %40, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7845466Z       %48 = tt.addptr %7, %47 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7845768Z       %49 = tt.load %48 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7845990Z       %50 = arith.shli %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7846215Z       %51 = arith.shrsi %50, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7846438Z       %52 = arith.shrsi %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:48.7846714Z       %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.7847070Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:48.7847341Z       %55 = tt.broadcast %53 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7847573Z       %56 = arith.select %12, %55, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7847798Z       %57 = tt.broadcast %54 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7848019Z       %58 = arith.select %14, %57, %56 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:48.7848232Z       %59 = tt.reshape %58 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:48.7848441Z       %60 = arith.sitofp %59 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:48.7848680Z       %61 = ttg.local_alloc %60 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:48.7848992Z       %62 = ttg.local_load %61 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:48.7849445Z       %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:48.7849815Z       %64 = arith.truncf %63 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:48.7850069Z       %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:48.7850302Z       %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:48.7850526Z       %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:48.7850773Z       %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:48.7850967Z       %69 = tt.broadcast %67 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:48.7851138Z       %70 = arith.addi %68, %69 : tensor<16x64xi32, #mma>
2026-02-21T09:37:48.7851348Z       %71 = tt.addptr %15, %70 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:48.7851533Z       tt.store %71, %64 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:48.7851692Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:37:48.7851827Z     tt.return
2026-02-21T09:37:48.7851904Z   }
2026-02-21T09:37:48.7851972Z }
2026-02-21T09:37:48.7852016Z 
2026-02-21T09:37:48.7852045Z {-#
2026-02-21T09:37:48.7852121Z   external_resources: {
2026-02-21T09:37:48.7852220Z     mlir_reproducer: {
2026-02-21T09:37:48.7853215Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:48.7854204Z       disable_threading: false,
2026-02-21T09:37:48.7854306Z       verify_each: true
2026-02-21T09:37:48.7854395Z     }
2026-02-21T09:37:48.7854464Z   }
2026-02-21T09:37:48.7854530Z #-}
2026-02-21T09:37:48.7854805Z /tmp/torchinductor_root/et/cetnhmlxe7uqgvdferbhtte42dogre5ejn2acagrtopbdcr5d64g.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:48.7855594Z /tmp/torchinductor_root/et/cetnhmlxe7uqgvdferbhtte42dogre5ejn2acagrtopbdcr5d64g.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:48.7856144Z [104s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:48.7856923Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 1], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:48.7857631Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:48.7857798Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:49.0188011Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:49.0191618Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:37:49.0192589Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:49.0193432Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:49.0194265Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:49.0195020Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:49.0195752Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:49.0196832Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:49.0198051Z     %cst = arith.constant dense<7168> : tensor<1x64xi64, #mma>
2026-02-21T09:37:49.0198458Z     %cst_0 = arith.constant dense<0> : tensor<1x64xi64, #mma>
2026-02-21T09:37:49.0198820Z     %cst_1 = arith.constant dense<64> : tensor<16x1xi64, #mma>
2026-02-21T09:37:49.0199176Z     %cst_2 = arith.constant dense<0> : tensor<16x1xi64, #mma>
2026-02-21T09:37:49.0199541Z     %cst_3 = arith.constant dense<7168> : tensor<16x1xi64, #mma>
2026-02-21T09:37:49.0199933Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.0200316Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.0200716Z     %cst_6 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.0201128Z     %cst_7 = arith.constant dense<7168> : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.0201527Z     %cst_8 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2>
2026-02-21T09:37:49.0201859Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:49.0202123Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:37:49.0202367Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:49.0202698Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:49.0202941Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T09:37:49.0203203Z     %c4092_i32 = arith.constant 4092 : i32
2026-02-21T09:37:49.0203454Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:37:49.0203758Z     %cst_9 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0204081Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:37:49.0204330Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:49.0204541Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:49.0204906Z     %cst_10 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0205289Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:49.0205596Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:49.0206032Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.0206449Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.0206859Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.0207270Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.0207682Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.0208058Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:49.0208372Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:49.0208791Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:49.0209455Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:49.0210085Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.0210482Z     %12 = arith.cmpi eq, %11, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.0210791Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:37:49.0211094Z     %14 = arith.cmpi eq, %11, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.0211390Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:37:49.0211713Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.0212188Z     %17 = arith.extsi %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.0212704Z     %18 = arith.extsi %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.0213094Z     scf.for %arg3 = %0 to %c448_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:49.0213323Z       %19 = arith.divsi %arg3, %c448_i32 : i32
2026-02-21T09:37:49.0213513Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T09:37:49.0213694Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T09:37:49.0213874Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T09:37:49.0214062Z       %23 = arith.remsi %arg3, %c448_i32 : i32
2026-02-21T09:37:49.0214243Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:49.0214415Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:49.0214566Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:49.0214721Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T09:37:49.0214924Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:49.0215209Z       %29 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:49.0215411Z       %30 = arith.muli %26, %c64_i32 : i32
2026-02-21T09:37:49.0215604Z       %31 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.0215859Z       %32 = arith.addi %31, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.0216182Z       %33 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T09:37:49.0216476Z       %34 = arith.muli %33, %cst_8 : tensor<16x1xi32, #blocked2>
2026-02-21T09:37:49.0216753Z       %35 = tt.broadcast %34 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0217082Z       %36 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1>
2026-02-21T09:37:49.0217410Z       %37 = tt.broadcast %36 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0217726Z       %38 = scf.for %arg4 = %c0_i32 to %c4092_i32 step %c6_i32 iter_args(%arg5 = %cst_6) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:49.0218048Z         %63 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.0218312Z         %64 = arith.addi %63, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.0218515Z         %65 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:49.0218714Z         %66 = tt.splat %65 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.0218970Z         %67 = arith.addi %66, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.0219291Z         %68 = tt.expand_dims %67 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:49.0219619Z         %69 = tt.broadcast %68 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0219846Z         %70 = arith.addi %35, %69 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0220085Z         %71 = tt.addptr %7, %70 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0220323Z         %72 = tt.load %71 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:49.0220640Z         %73 = ttg.convert_layout %72 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0221129Z         %74 = arith.extf %73 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0221578Z         %75 = tt.expand_dims %64 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.0221908Z         %76 = arith.muli %75, %cst_7 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.0222130Z         %77 = tt.broadcast %76 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0222354Z         %78 = arith.addi %77, %37 : tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0222583Z         %79 = tt.addptr %8, %78 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0222811Z         %80 = tt.load %79 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:49.0223094Z         %81 = ttg.convert_layout %80 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0223423Z         %82 = arith.shli %81, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0223703Z         %83 = arith.shrsi %82, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0223983Z         %84 = arith.shrsi %81, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0224321Z         %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.0224665Z         %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.0224937Z         %87 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0225167Z         %88 = arith.select %13, %87, %cst_9 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0225398Z         %89 = tt.broadcast %86 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0225652Z         %90 = arith.select %15, %89, %88 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0225871Z         %91 = tt.reshape %90 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:37:49.0226087Z         %92 = arith.sitofp %91 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:37:49.0226374Z         %93 = ttg.convert_layout %92 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0226837Z         %94 = tt.dot %74, %93, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.0227177Z         %95 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:49.0227345Z         %96 = tt.splat %95 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.0227558Z         %97 = arith.addi %96, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.0227727Z         %98 = arith.muli %95, %c2_i32 : i32
2026-02-21T09:37:49.0227891Z         %99 = tt.splat %98 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.0228101Z         %100 = arith.addi %99, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.0228376Z         %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:49.0228650Z         %102 = tt.broadcast %101 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0228842Z         %103 = arith.addi %35, %102 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0229039Z         %104 = tt.addptr %7, %103 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0229240Z         %105 = tt.load %104 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:49.0229505Z         %106 = ttg.convert_layout %105 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0229937Z         %107 = arith.extf %106 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0230315Z         %108 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.0230560Z         %109 = arith.muli %108, %cst_7 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.0230747Z         %110 = tt.broadcast %109 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0230936Z         %111 = arith.addi %110, %37 : tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0231128Z         %112 = tt.addptr %8, %111 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0231323Z         %113 = tt.load %112 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:49.0231563Z         %114 = ttg.convert_layout %113 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0231843Z         %115 = arith.shli %114, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0232077Z         %116 = arith.shrsi %115, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0232310Z         %117 = arith.shrsi %114, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0232600Z         %118 = tt.expand_dims %116 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.0232937Z         %119 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.0233214Z         %120 = tt.broadcast %118 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0233494Z         %121 = arith.select %13, %120, %cst_9 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0233732Z         %122 = tt.broadcast %119 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0233966Z         %123 = arith.select %15, %122, %121 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0234196Z         %124 = tt.reshape %123 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:37:49.0234414Z         %125 = arith.sitofp %124 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:37:49.0234708Z         %126 = ttg.convert_layout %125 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0235167Z         %127 = tt.dot %107, %126, %94, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.0235505Z         %128 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:37:49.0235679Z         %129 = tt.splat %128 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.0235898Z         %130 = arith.addi %129, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.0236070Z         %131 = arith.muli %128, %c2_i32 : i32
2026-02-21T09:37:49.0236238Z         %132 = tt.splat %131 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.0236461Z         %133 = arith.addi %132, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.0236737Z         %134 = tt.expand_dims %133 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:49.0237012Z         %135 = tt.broadcast %134 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0237208Z         %136 = arith.addi %35, %135 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0237401Z         %137 = tt.addptr %7, %136 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0237730Z         %138 = tt.load %137 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:49.0237993Z         %139 = ttg.convert_layout %138 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0238387Z         %140 = arith.extf %139 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0238764Z         %141 = tt.expand_dims %130 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.0239011Z         %142 = arith.muli %141, %cst_7 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.0239202Z         %143 = tt.broadcast %142 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0239393Z         %144 = arith.addi %143, %37 : tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0239586Z         %145 = tt.addptr %8, %144 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0239782Z         %146 = tt.load %145 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:49.0240019Z         %147 = ttg.convert_layout %146 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0240297Z         %148 = arith.shli %147, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0240533Z         %149 = arith.shrsi %148, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0240766Z         %150 = arith.shrsi %147, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0241086Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.0241419Z         %152 = tt.expand_dims %150 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.0241698Z         %153 = tt.broadcast %151 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0241934Z         %154 = arith.select %13, %153, %cst_9 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0242168Z         %155 = tt.broadcast %152 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0242401Z         %156 = arith.select %15, %155, %154 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0242679Z         %157 = tt.reshape %156 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:37:49.0242900Z         %158 = arith.sitofp %157 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:37:49.0243199Z         %159 = ttg.convert_layout %158 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0243667Z         %160 = tt.dot %140, %159, %127, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.0244014Z         scf.yield %160 : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.0244134Z       } {tt.flatten}
2026-02-21T09:37:49.0244325Z       %39 = scf.for %arg4 = %c4092_i32 to %c4096_i32 step %c2_i32 iter_args(%arg5 = %38) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:49.0244595Z         %63 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.0244816Z         %64 = arith.addi %63, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.0244991Z         %65 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:49.0245157Z         %66 = tt.splat %65 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.0245367Z         %67 = arith.addi %66, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.0245677Z         %68 = tt.expand_dims %67 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:49.0245945Z         %69 = tt.broadcast %68 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0246131Z         %70 = arith.addi %35, %69 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0246322Z         %71 = tt.addptr %7, %70 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.0246518Z         %72 = tt.load %71 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:49.0246775Z         %73 = ttg.convert_layout %72 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0247167Z         %74 = arith.extf %73 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0247548Z         %75 = tt.expand_dims %64 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.0247789Z         %76 = arith.muli %75, %cst_7 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.0247973Z         %77 = tt.broadcast %76 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0248160Z         %78 = arith.addi %77, %37 : tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0248347Z         %79 = tt.addptr %8, %78 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.0248540Z         %80 = tt.load %79 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:49.0248775Z         %81 = ttg.convert_layout %80 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0249086Z         %82 = arith.shli %81, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0249317Z         %83 = arith.shrsi %82, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0249545Z         %84 = arith.shrsi %81, %cst_10 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.0249829Z         %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.0250159Z         %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.0250431Z         %87 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0250662Z         %88 = arith.select %13, %87, %cst_9 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0250890Z         %89 = tt.broadcast %86 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0251111Z         %90 = arith.select %15, %89, %88 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.0251333Z         %91 = tt.reshape %90 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:37:49.0251544Z         %92 = arith.sitofp %91 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:37:49.0251836Z         %93 = ttg.convert_layout %92 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.0252288Z         %94 = tt.dot %74, %93, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.0252631Z         scf.yield %94 : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.0252762Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:37:49.0252928Z       %40 = arith.truncf %39 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:49.0253092Z       %41 = arith.extsi %27 : i32 to i64
2026-02-21T09:37:49.0253239Z       %42 = arith.extsi %30 : i32 to i64
2026-02-21T09:37:49.0253396Z       %43 = tt.splat %41 : i64 -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.0253599Z       %44 = arith.addi %43, %17 : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.0253857Z       %45 = tt.expand_dims %44 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi64, #mma>
2026-02-21T09:37:49.0254090Z       %46 = arith.muli %45, %cst_3 : tensor<16x1xi64, #mma>
2026-02-21T09:37:49.0254259Z       %47 = tt.broadcast %46 : tensor<16x1xi64, #mma> -> tensor<16x64xi64, #mma>
2026-02-21T09:37:49.0254455Z       %48 = tt.splat %42 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.0254654Z       %49 = arith.addi %48, %18 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.0254907Z       %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi64, #mma>
2026-02-21T09:37:49.0255161Z       %51 = tt.broadcast %50 : tensor<1x64xi64, #mma> -> tensor<16x64xi64, #mma>
2026-02-21T09:37:49.0255333Z       %52 = arith.addi %47, %51 : tensor<16x64xi64, #mma>
2026-02-21T09:37:49.0255509Z       %53 = tt.addptr %16, %52 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi64, #mma>
2026-02-21T09:37:49.0255698Z       %54 = arith.cmpi sge, %45, %cst_2 : tensor<16x1xi64, #mma>
2026-02-21T09:37:49.0255857Z       %55 = arith.cmpi slt, %45, %cst_1 : tensor<16x1xi64, #mma>
2026-02-21T09:37:49.0256007Z       %56 = arith.andi %54, %55 : tensor<16x1xi1, #mma>
2026-02-21T09:37:49.0256170Z       %57 = tt.broadcast %56 : tensor<16x1xi1, #mma> -> tensor<16x64xi1, #mma>
2026-02-21T09:37:49.0256345Z       %58 = arith.cmpi sge, %50, %cst_0 : tensor<1x64xi64, #mma>
2026-02-21T09:37:49.0256532Z       %59 = arith.cmpi slt, %50, %cst : tensor<1x64xi64, #mma>
2026-02-21T09:37:49.0256678Z       %60 = arith.andi %58, %59 : tensor<1x64xi1, #mma>
2026-02-21T09:37:49.0256841Z       %61 = tt.broadcast %60 : tensor<1x64xi1, #mma> -> tensor<16x64xi1, #mma>
2026-02-21T09:37:49.0257008Z       %62 = arith.andi %57, %61 : tensor<16x64xi1, #mma>
2026-02-21T09:37:49.0257156Z       tt.store %53, %40, %62 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.0257315Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:37:49.0257449Z     tt.return
2026-02-21T09:37:49.0257528Z   }
2026-02-21T09:37:49.0257602Z }
2026-02-21T09:37:49.0257646Z 
2026-02-21T09:37:49.0257676Z {-#
2026-02-21T09:37:49.0257755Z   external_resources: {
2026-02-21T09:37:49.0257851Z     mlir_reproducer: {
2026-02-21T09:37:49.0258846Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:49.0259837Z       disable_threading: false,
2026-02-21T09:37:49.0259942Z       verify_each: true
2026-02-21T09:37:49.0260030Z     }
2026-02-21T09:37:49.0260101Z   }
2026-02-21T09:37:49.0260169Z #-}
2026-02-21T09:37:49.0260446Z /tmp/torchinductor_root/wo/cwoivxvx4gcd3oojv5o2u53akbi3va4uyq3akf2pkujxur7bsnzy.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:49.0261127Z /tmp/torchinductor_root/wo/cwoivxvx4gcd3oojv5o2u53akbi3va4uyq3akf2pkujxur7bsnzy.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:49.0261708Z [105s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:49.0262485Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 64], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:49.0263190Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:49.0263353Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:49.2710205Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:49.2717123Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:37:49.2717691Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:37:49.2718207Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:37:49.2718686Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:49.2719109Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:49.2719418Z #smem = #ttg.shared_memory
2026-02-21T09:37:49.2724124Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:49.2724867Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:49.2725431Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2725663Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:49.2725835Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:49.2726004Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:49.2726161Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:49.2726364Z     %cst_0 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2726575Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:49.2726744Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:37:49.2726902Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:49.2727063Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:49.2727228Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T09:37:49.2727393Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:49.2727559Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:49.2727715Z     %c4864_i32 = arith.constant 4864 : i32
2026-02-21T09:37:49.2727994Z     %cst_1 = arith.constant dense<29352960> : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2728380Z     %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2728652Z     %c2879_i32 = arith.constant 2879 : i32
2026-02-21T09:37:49.2728861Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.2729166Z     %cst_4 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2729474Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.2729715Z     %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.2729962Z     %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:49.2730167Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:49.2730508Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.2730901Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.2731339Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.2731776Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.2732152Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2732494Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2732837Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2733277Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:49.2733873Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:49.2734428Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.2734710Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.2734960Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:49.2735175Z     %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.2735422Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:49.2735652Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.2735835Z     %16 = arith.subi %c2879_i32, %0 : i32
2026-02-21T09:37:49.2735970Z     %17 = arith.divui %16, %c2432_i32 : i32
2026-02-21T09:37:49.2736098Z     %18 = arith.remsi %17, %c2_i32 : i32
2026-02-21T09:37:49.2736225Z     %19 = arith.subi %17, %18 : i32
2026-02-21T09:37:49.2736355Z     %20 = arith.muli %19, %c2432_i32 : i32
2026-02-21T09:37:49.2736485Z     %21 = arith.addi %0, %20 : i32
2026-02-21T09:37:49.2736664Z     %22 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2736973Z     %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.2737275Z     %24 = tt.broadcast %23 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2737485Z     scf.for %arg3 = %0 to %21 step %c4864_i32  : i32 {
2026-02-21T09:37:49.2737639Z       %25 = arith.divsi %arg3, %c448_i32 : i32
2026-02-21T09:37:49.2737774Z       %26 = arith.muli %25, %c4_i32 : i32
2026-02-21T09:37:49.2737902Z       %27 = arith.subi %c4_i32, %26 : i32
2026-02-21T09:37:49.2738027Z       %28 = arith.minsi %27, %c4_i32 : i32
2026-02-21T09:37:49.2738160Z       %29 = arith.remsi %arg3, %c448_i32 : i32
2026-02-21T09:37:49.2738291Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:37:49.2738413Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:37:49.2738537Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:37:49.2738661Z       %33 = arith.muli %31, %c16_i32 : i32
2026-02-21T09:37:49.2738848Z       %34 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.2739080Z       %35 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.2739317Z       %36 = arith.addi %34, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.2739550Z       %37 = arith.addi %35, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.2739772Z       %38 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:37:49.2739997Z       %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.2740270Z       %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.2740546Z       %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.2740819Z       %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.2741111Z       %43 = tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.2741391Z       %44 = arith.muli %43, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.2741604Z       %45 = tt.broadcast %44 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2741999Z       %46 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2742436Z       %47 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:49.2742675Z         %132 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:49.2742868Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2743120Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2743433Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.2743786Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2744012Z         %137 = arith.addi %45, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2744240Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2744456Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2744727Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2745132Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2745414Z         %142 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:49.2745597Z         %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2745830Z         %144 = arith.addi %143, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2746145Z         %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2746462Z         %146 = tt.load %145 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2746695Z         %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2746938Z         %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2747178Z         %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2747478Z         %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2747825Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2748143Z         %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2748386Z         %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2748629Z         %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2748863Z         %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2749098Z         %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2749320Z         %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2749576Z         %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2749906Z         %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2750384Z         %160 = tt.dot %141, %159, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2750738Z         %161 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:49.2750866Z         %162 = arith.muli %161, %c2_i32 : i32
2026-02-21T09:37:49.2751044Z         %163 = tt.splat %162 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2751273Z         %164 = arith.addi %163, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2751549Z         %165 = tt.expand_dims %164 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.2751878Z         %166 = tt.broadcast %165 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2752077Z         %167 = arith.addi %45, %166 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2752284Z         %168 = tt.addptr %6, %167 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2752496Z         %169 = tt.load %168 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2752763Z         %170 = ttg.convert_layout %169 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2753168Z         %171 = arith.extf %170 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2753453Z         %172 = arith.muli %161, %c7168_i32 : i32
2026-02-21T09:37:49.2753634Z         %173 = tt.splat %172 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2753869Z         %174 = arith.addi %173, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2754178Z         %175 = tt.addptr %7, %174 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2754491Z         %176 = tt.load %175 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2754723Z         %177 = arith.shli %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2754965Z         %178 = arith.shrsi %177, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2755205Z         %179 = arith.shrsi %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2755493Z         %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2755833Z         %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2756153Z         %182 = tt.broadcast %180 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2756397Z         %183 = arith.select %12, %182, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2756639Z         %184 = tt.broadcast %181 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2756871Z         %185 = arith.select %14, %184, %183 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2757105Z         %186 = tt.reshape %185 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2757327Z         %187 = arith.sitofp %186 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2757581Z         %188 = ttg.local_alloc %187 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2757911Z         %189 = ttg.local_load %188 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2758382Z         %190 = tt.dot %171, %189, %160, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2758731Z         %191 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:49.2758860Z         %192 = arith.muli %191, %c2_i32 : i32
2026-02-21T09:37:49.2759032Z         %193 = tt.splat %192 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2759263Z         %194 = arith.addi %193, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2759546Z         %195 = tt.expand_dims %194 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.2759867Z         %196 = tt.broadcast %195 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2760066Z         %197 = arith.addi %45, %196 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2760266Z         %198 = tt.addptr %6, %197 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2760478Z         %199 = tt.load %198 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2760744Z         %200 = ttg.convert_layout %199 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2761149Z         %201 = arith.extf %200 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2761438Z         %202 = arith.muli %191, %c7168_i32 : i32
2026-02-21T09:37:49.2761614Z         %203 = tt.splat %202 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2761849Z         %204 = arith.addi %203, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2762157Z         %205 = tt.addptr %7, %204 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2762472Z         %206 = tt.load %205 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2762754Z         %207 = arith.shli %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2762989Z         %208 = arith.shrsi %207, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2763229Z         %209 = arith.shrsi %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2763518Z         %210 = tt.expand_dims %208 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2763860Z         %211 = tt.expand_dims %209 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2764182Z         %212 = tt.broadcast %210 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2764419Z         %213 = arith.select %12, %212, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2764662Z         %214 = tt.broadcast %211 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2764893Z         %215 = arith.select %14, %214, %213 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2765127Z         %216 = tt.reshape %215 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2765353Z         %217 = arith.sitofp %216 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2765606Z         %218 = ttg.local_alloc %217 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2765936Z         %219 = ttg.local_load %218 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2766406Z         %220 = tt.dot %201, %219, %190, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2766756Z         scf.yield %220 : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2766883Z       } {tt.flatten}
2026-02-21T09:37:49.2767004Z       %48 = arith.addi %45, %24 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2767204Z       %49 = tt.addptr %6, %48 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2767405Z       %50 = tt.load %49 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2767706Z       %51 = ttg.convert_layout %50 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2768104Z       %52 = arith.extf %51 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2768439Z       %53 = arith.addi %46, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2768752Z       %54 = tt.addptr %7, %53 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2769054Z       %55 = tt.load %54 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2769285Z       %56 = arith.shli %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2769521Z       %57 = arith.shrsi %56, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2769753Z       %58 = arith.shrsi %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2770042Z       %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2770377Z       %60 = tt.expand_dims %58 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2770653Z       %61 = tt.broadcast %59 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2770891Z       %62 = arith.select %12, %61, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2771119Z       %63 = tt.broadcast %60 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2771344Z       %64 = arith.select %14, %63, %62 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2771569Z       %65 = tt.reshape %64 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2771785Z       %66 = arith.sitofp %65 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2772032Z       %67 = ttg.local_alloc %66 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2772379Z       %68 = ttg.local_load %67 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2772837Z       %69 = tt.dot %52, %68, %47, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2773215Z       %70 = arith.truncf %69 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:49.2773472Z       %71 = tt.expand_dims %37 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:49.2773711Z       %72 = arith.muli %71, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:49.2773942Z       %73 = tt.expand_dims %42 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:49.2774201Z       %74 = tt.broadcast %72 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2774402Z       %75 = tt.broadcast %73 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2774577Z       %76 = arith.addi %74, %75 : tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2774762Z       %77 = tt.addptr %15, %76 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2774951Z       tt.store %77, %70 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.2775097Z       %78 = arith.addi %arg3, %c2432_i32 : i32
2026-02-21T09:37:49.2775222Z       %79 = arith.divsi %78, %c448_i32 : i32
2026-02-21T09:37:49.2775346Z       %80 = arith.muli %79, %c4_i32 : i32
2026-02-21T09:37:49.2775468Z       %81 = arith.subi %c4_i32, %80 : i32
2026-02-21T09:37:49.2775586Z       %82 = arith.minsi %81, %c4_i32 : i32
2026-02-21T09:37:49.2775745Z       %83 = arith.remsi %78, %c448_i32 : i32
2026-02-21T09:37:49.2775864Z       %84 = arith.remsi %83, %82 : i32
2026-02-21T09:37:49.2775989Z       %85 = arith.addi %80, %84 : i32
2026-02-21T09:37:49.2776104Z       %86 = arith.divsi %83, %82 : i32
2026-02-21T09:37:49.2776222Z       %87 = arith.muli %85, %c16_i32 : i32
2026-02-21T09:37:49.2776389Z       %88 = tt.splat %87 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.2776605Z       %89 = tt.splat %87 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.2776820Z       %90 = arith.addi %88, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.2777030Z       %91 = arith.addi %89, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.2777197Z       %92 = arith.muli %86, %c64_i32 : i32
2026-02-21T09:37:49.2777400Z       %93 = tt.splat %92 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.2777654Z       %94 = tt.splat %92 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.2777908Z       %95 = arith.addi %93, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.2778154Z       %96 = arith.addi %94, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.2778420Z       %97 = tt.expand_dims %90 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.2778665Z       %98 = arith.muli %97, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.2778858Z       %99 = tt.broadcast %98 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2779212Z       %100 = tt.expand_dims %95 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2779606Z       %101 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:49.2779861Z         %132 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:49.2780038Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2780269Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2780553Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.2780832Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2781033Z         %137 = arith.addi %99, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2781236Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2781449Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2781724Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2782133Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2782425Z         %142 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:49.2782605Z         %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2782839Z         %144 = arith.addi %143, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2783154Z         %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2783499Z         %146 = tt.load %145 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2783736Z         %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2783977Z         %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2784213Z         %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2784507Z         %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2784840Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2785127Z         %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2785370Z         %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2785605Z         %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2785837Z         %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2786068Z         %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2786291Z         %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2786545Z         %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2786864Z         %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2787335Z         %160 = tt.dot %141, %159, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2787682Z         %161 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:49.2787838Z         %162 = arith.muli %161, %c2_i32 : i32
2026-02-21T09:37:49.2795709Z         %163 = tt.splat %162 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2795947Z         %164 = arith.addi %163, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2796229Z         %165 = tt.expand_dims %164 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.2796510Z         %166 = tt.broadcast %165 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2796705Z         %167 = arith.addi %99, %166 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2796905Z         %168 = tt.addptr %6, %167 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2797112Z         %169 = tt.load %168 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2797384Z         %170 = ttg.convert_layout %169 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2797794Z         %171 = arith.extf %170 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2798079Z         %172 = arith.muli %161, %c7168_i32 : i32
2026-02-21T09:37:49.2798256Z         %173 = tt.splat %172 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2798486Z         %174 = arith.addi %173, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2798800Z         %175 = tt.addptr %7, %174 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2799169Z         %176 = tt.load %175 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2799402Z         %177 = arith.shli %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2799641Z         %178 = arith.shrsi %177, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2799876Z         %179 = arith.shrsi %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2800167Z         %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2800506Z         %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2800789Z         %182 = tt.broadcast %180 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2801034Z         %183 = arith.select %12, %182, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2801272Z         %184 = tt.broadcast %181 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2801505Z         %185 = arith.select %14, %184, %183 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2801734Z         %186 = tt.reshape %185 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2801954Z         %187 = arith.sitofp %186 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2802204Z         %188 = ttg.local_alloc %187 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2802525Z         %189 = ttg.local_load %188 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2803043Z         %190 = tt.dot %171, %189, %160, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2803389Z         %191 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:49.2803550Z         %192 = arith.muli %191, %c2_i32 : i32
2026-02-21T09:37:49.2803722Z         %193 = tt.splat %192 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2803942Z         %194 = arith.addi %193, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2804215Z         %195 = tt.expand_dims %194 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.2804490Z         %196 = tt.broadcast %195 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2804684Z         %197 = arith.addi %99, %196 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2804880Z         %198 = tt.addptr %6, %197 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2805087Z         %199 = tt.load %198 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2805353Z         %200 = ttg.convert_layout %199 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2805759Z         %201 = arith.extf %200 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2806042Z         %202 = arith.muli %191, %c7168_i32 : i32
2026-02-21T09:37:49.2806217Z         %203 = tt.splat %202 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2806445Z         %204 = arith.addi %203, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2806756Z         %205 = tt.addptr %7, %204 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2807100Z         %206 = tt.load %205 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2807330Z         %207 = arith.shli %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2807568Z         %208 = arith.shrsi %207, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2807805Z         %209 = arith.shrsi %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2808093Z         %210 = tt.expand_dims %208 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2808429Z         %211 = tt.expand_dims %209 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2808712Z         %212 = tt.broadcast %210 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2808952Z         %213 = arith.select %12, %212, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2809187Z         %214 = tt.broadcast %211 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2809418Z         %215 = arith.select %14, %214, %213 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2809645Z         %216 = tt.reshape %215 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2809864Z         %217 = arith.sitofp %216 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2810113Z         %218 = ttg.local_alloc %217 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2810438Z         %219 = ttg.local_load %218 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2810909Z         %220 = tt.dot %201, %219, %190, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2811259Z         scf.yield %220 : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2811433Z       } {tt.flatten}
2026-02-21T09:37:49.2811551Z       %102 = arith.addi %99, %24 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2811754Z       %103 = tt.addptr %6, %102 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2811959Z       %104 = tt.load %103 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2812225Z       %105 = ttg.convert_layout %104 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2812626Z       %106 = arith.extf %105 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2812959Z       %107 = arith.addi %100, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2813274Z       %108 = tt.addptr %7, %107 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2813585Z       %109 = tt.load %108 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2813814Z       %110 = arith.shli %109, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2814046Z       %111 = arith.shrsi %110, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2814276Z       %112 = arith.shrsi %109, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2814560Z       %113 = tt.expand_dims %111 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2814929Z       %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2815210Z       %115 = tt.broadcast %113 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2815448Z       %116 = arith.select %12, %115, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2815680Z       %117 = tt.broadcast %114 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2815912Z       %118 = arith.select %14, %117, %116 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2816137Z       %119 = tt.reshape %118 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2816358Z       %120 = arith.sitofp %119 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2816606Z       %121 = ttg.local_alloc %120 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2816926Z       %122 = ttg.local_load %121 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2817393Z       %123 = tt.dot %106, %122, %101, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2817777Z       %124 = arith.truncf %123 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:49.2818041Z       %125 = tt.expand_dims %91 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:49.2818279Z       %126 = arith.muli %125, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:49.2818506Z       %127 = tt.expand_dims %96 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:49.2818761Z       %128 = tt.broadcast %126 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2818963Z       %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2819137Z       %130 = arith.addi %128, %129 : tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2819357Z       %131 = tt.addptr %15, %130 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2819547Z       tt.store %131, %124 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.2819683Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:37:49.2819813Z     scf.for %arg3 = %21 to %c448_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:49.2819955Z       %25 = arith.divsi %arg3, %c448_i32 : i32
2026-02-21T09:37:49.2820076Z       %26 = arith.muli %25, %c4_i32 : i32
2026-02-21T09:37:49.2820190Z       %27 = arith.subi %c4_i32, %26 : i32
2026-02-21T09:37:49.2820304Z       %28 = arith.minsi %27, %c4_i32 : i32
2026-02-21T09:37:49.2820424Z       %29 = arith.remsi %arg3, %c448_i32 : i32
2026-02-21T09:37:49.2820541Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:37:49.2820650Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:37:49.2820762Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:37:49.2820871Z       %33 = arith.muli %31, %c16_i32 : i32
2026-02-21T09:37:49.2821041Z       %34 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.2821253Z       %35 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.2821460Z       %36 = arith.addi %34, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.2821667Z       %37 = arith.addi %35, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.2821826Z       %38 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:37:49.2822030Z       %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.2822279Z       %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.2822692Z       %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.2822938Z       %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.2823204Z       %43 = tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.2823454Z       %44 = arith.muli %43, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.2823644Z       %45 = tt.broadcast %44 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2823993Z       %46 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2824383Z       %47 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:49.2824597Z         %78 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:49.2824768Z         %79 = tt.splat %78 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2824988Z         %80 = arith.addi %79, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2825260Z         %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.2825532Z         %82 = tt.broadcast %81 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2825721Z         %83 = arith.addi %45, %82 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2825917Z         %84 = tt.addptr %6, %83 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2826119Z         %85 = tt.load %84 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2826379Z         %86 = ttg.convert_layout %85 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2826780Z         %87 = arith.extf %86 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2827091Z         %88 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:49.2827266Z         %89 = tt.splat %88 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2827489Z         %90 = arith.addi %89, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2827792Z         %91 = tt.addptr %7, %90 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2828097Z         %92 = tt.load %91 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2828321Z         %93 = arith.shli %92, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2828554Z         %94 = arith.shrsi %93, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2828785Z         %95 = arith.shrsi %92, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2829072Z         %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2829403Z         %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2829678Z         %98 = tt.broadcast %96 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2829908Z         %99 = arith.select %12, %98, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2830139Z         %100 = tt.broadcast %97 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2830382Z         %101 = arith.select %14, %100, %99 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2830638Z         %102 = tt.reshape %101 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2830861Z         %103 = arith.sitofp %102 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2831109Z         %104 = ttg.local_alloc %103 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2831429Z         %105 = ttg.local_load %104 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2831896Z         %106 = tt.dot %87, %105, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2832241Z         %107 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:49.2832363Z         %108 = arith.muli %107, %c2_i32 : i32
2026-02-21T09:37:49.2832534Z         %109 = tt.splat %108 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2832757Z         %110 = arith.addi %109, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2833036Z         %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.2833310Z         %112 = tt.broadcast %111 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2833503Z         %113 = arith.addi %45, %112 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2833699Z         %114 = tt.addptr %6, %113 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2833906Z         %115 = tt.load %114 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2834173Z         %116 = ttg.convert_layout %115 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2834575Z         %117 = arith.extf %116 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2834891Z         %118 = arith.muli %107, %c7168_i32 : i32
2026-02-21T09:37:49.2835065Z         %119 = tt.splat %118 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2835293Z         %120 = arith.addi %119, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2835601Z         %121 = tt.addptr %7, %120 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2835906Z         %122 = tt.load %121 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2836139Z         %123 = arith.shli %122, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2836376Z         %124 = arith.shrsi %123, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2836612Z         %125 = arith.shrsi %122, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2836904Z         %126 = tt.expand_dims %124 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2837235Z         %127 = tt.expand_dims %125 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2837521Z         %128 = tt.broadcast %126 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2837757Z         %129 = arith.select %12, %128, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2837993Z         %130 = tt.broadcast %127 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2838226Z         %131 = arith.select %14, %130, %129 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2838487Z         %132 = tt.reshape %131 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2838714Z         %133 = arith.sitofp %132 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2838964Z         %134 = ttg.local_alloc %133 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2839287Z         %135 = ttg.local_load %134 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2839757Z         %136 = tt.dot %117, %135, %106, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2840102Z         %137 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:49.2840226Z         %138 = arith.muli %137, %c2_i32 : i32
2026-02-21T09:37:49.2840404Z         %139 = tt.splat %138 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2840627Z         %140 = arith.addi %139, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.2840909Z         %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.2841188Z         %142 = tt.broadcast %141 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2841383Z         %143 = arith.addi %45, %142 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2841579Z         %144 = tt.addptr %6, %143 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2841786Z         %145 = tt.load %144 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2842053Z         %146 = ttg.convert_layout %145 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2842462Z         %147 = arith.extf %146 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2842804Z         %148 = arith.muli %137, %c7168_i32 : i32
2026-02-21T09:37:49.2842978Z         %149 = tt.splat %148 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2843205Z         %150 = arith.addi %149, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2843512Z         %151 = tt.addptr %7, %150 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2843817Z         %152 = tt.load %151 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2844047Z         %153 = arith.shli %152, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2844280Z         %154 = arith.shrsi %153, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2844516Z         %155 = arith.shrsi %152, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2844810Z         %156 = tt.expand_dims %154 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2845143Z         %157 = tt.expand_dims %155 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2845425Z         %158 = tt.broadcast %156 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2845662Z         %159 = arith.select %12, %158, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2845897Z         %160 = tt.broadcast %157 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2846127Z         %161 = arith.select %14, %160, %159 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2846386Z         %162 = tt.reshape %161 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2846611Z         %163 = arith.sitofp %162 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2846858Z         %164 = ttg.local_alloc %163 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2847181Z         %165 = ttg.local_load %164 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2847646Z         %166 = tt.dot %147, %165, %136, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2847990Z         scf.yield %166 : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2848106Z       } {tt.flatten}
2026-02-21T09:37:49.2848223Z       %48 = arith.addi %45, %24 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2848415Z       %49 = tt.addptr %6, %48 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.2848615Z       %50 = tt.load %49 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.2848872Z       %51 = ttg.convert_layout %50 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2849268Z       %52 = arith.extf %51 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2849596Z       %53 = arith.addi %46, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2849905Z       %54 = tt.addptr %7, %53 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2850210Z       %55 = tt.load %54 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2850432Z       %56 = arith.shli %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2850693Z       %57 = arith.shrsi %56, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2850918Z       %58 = arith.shrsi %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.2851197Z       %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2851525Z       %60 = tt.expand_dims %58 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.2851797Z       %61 = tt.broadcast %59 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2852026Z       %62 = arith.select %12, %61, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2852257Z       %63 = tt.broadcast %60 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2852477Z       %64 = arith.select %14, %63, %62 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.2852699Z       %65 = tt.reshape %64 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.2852909Z       %66 = arith.sitofp %65 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.2853150Z       %67 = ttg.local_alloc %66 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.2853464Z       %68 = ttg.local_load %67 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.2853919Z       %69 = tt.dot %52, %68, %47, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.2854325Z       %70 = arith.truncf %69 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:49.2854581Z       %71 = tt.expand_dims %37 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:49.2854815Z       %72 = arith.muli %71, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:49.2855046Z       %73 = tt.expand_dims %42 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:49.2855296Z       %74 = tt.broadcast %72 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2855491Z       %75 = tt.broadcast %73 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2855661Z       %76 = arith.addi %74, %75 : tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2855840Z       %77 = tt.addptr %15, %76 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:49.2856027Z       tt.store %77, %70 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.2856162Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:37:49.2856265Z     tt.return
2026-02-21T09:37:49.2856342Z   }
2026-02-21T09:37:49.2856414Z }
2026-02-21T09:37:49.2856456Z 
2026-02-21T09:37:49.2856487Z {-#
2026-02-21T09:37:49.2856567Z   external_resources: {
2026-02-21T09:37:49.2856665Z     mlir_reproducer: {
2026-02-21T09:37:49.2857680Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:49.2858684Z       disable_threading: false,
2026-02-21T09:37:49.2858786Z       verify_each: true
2026-02-21T09:37:49.2858877Z     }
2026-02-21T09:37:49.2858976Z   }
2026-02-21T09:37:49.2859043Z #-}
2026-02-21T09:37:49.2859318Z /tmp/torchinductor_root/g6/cg6tqsj2ilamzoxlnt4q3ufxel6xqyijdituwathk7todtheq5a5.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:49.2860005Z /tmp/torchinductor_root/g6/cg6tqsj2ilamzoxlnt4q3ufxel6xqyijdituwathk7todtheq5a5.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:49.2860557Z [105s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:49.2861343Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:49.2862056Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:49.2862224Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:49.4136056Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:49.4139098Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:37:49.4140033Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:49.4141105Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:49.4141928Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:37:49.4142701Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:49.4143580Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:49.4144901Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:49.4145952Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:49.4146249Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.4146471Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.4146687Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.4146886Z     %cst_3 = arith.constant dense<7168> : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.4147067Z     %cst_4 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2>
2026-02-21T09:37:49.4147217Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:49.4147340Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:37:49.4147463Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:49.4147583Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:49.4147705Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T09:37:49.4147828Z     %c4092_i32 = arith.constant 4092 : i32
2026-02-21T09:37:49.4147947Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:37:49.4148093Z     %cst_5 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4148247Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:37:49.4148362Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:49.4148472Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:49.4148699Z     %cst_6 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4148890Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:49.4149087Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:49.4149359Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.4149628Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.4149889Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.4150155Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.4150417Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.4150658Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:49.4150859Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:49.4151123Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:49.4151543Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:49.4151951Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.4152240Z     %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.4152435Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:37:49.4152635Z     %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.4152818Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:37:49.4153022Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.4153202Z     scf.for %arg3 = %0 to %c448_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:49.4153351Z       %17 = arith.divsi %arg3, %c448_i32 : i32
2026-02-21T09:37:49.4153473Z       %18 = arith.muli %17, %c4_i32 : i32
2026-02-21T09:37:49.4153590Z       %19 = arith.subi %c4_i32, %18 : i32
2026-02-21T09:37:49.4153704Z       %20 = arith.minsi %19, %c4_i32 : i32
2026-02-21T09:37:49.4153823Z       %21 = arith.remsi %arg3, %c448_i32 : i32
2026-02-21T09:37:49.4153943Z       %22 = arith.remsi %21, %20 : i32
2026-02-21T09:37:49.4154057Z       %23 = arith.addi %18, %22 : i32
2026-02-21T09:37:49.4154185Z       %24 = arith.divsi %21, %20 : i32
2026-02-21T09:37:49.4154299Z       %25 = arith.muli %23, %c16_i32 : i32
2026-02-21T09:37:49.4154467Z       %26 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:49.4154680Z       %27 = tt.splat %25 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.4154893Z       %28 = arith.addi %26, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:49.4155103Z       %29 = arith.addi %27, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.4155265Z       %30 = arith.muli %24, %c64_i32 : i32
2026-02-21T09:37:49.4155426Z       %31 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.4155633Z       %32 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.4155843Z       %33 = arith.addi %31, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.4156050Z       %34 = arith.addi %32, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.4156348Z       %35 = tt.expand_dims %28 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T09:37:49.4156597Z       %36 = arith.muli %35, %cst_4 : tensor<16x1xi32, #blocked2>
2026-02-21T09:37:49.4156788Z       %37 = tt.broadcast %36 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4157064Z       %38 = tt.expand_dims %33 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1>
2026-02-21T09:37:49.4157334Z       %39 = tt.broadcast %38 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4157601Z       %40 = scf.for %arg4 = %c0_i32 to %c4092_i32 step %c6_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:49.4157870Z         %50 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.4158093Z         %51 = arith.addi %50, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.4158269Z         %52 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:49.4158433Z         %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.4158647Z         %54 = arith.addi %53, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.4158917Z         %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:49.4159189Z         %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4159378Z         %57 = arith.addi %37, %56 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4159576Z         %58 = tt.addptr %7, %57 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4159808Z         %59 = tt.load %58 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:49.4160071Z         %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4160480Z         %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4160859Z         %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.4161102Z         %63 = arith.muli %62, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.4161288Z         %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4161475Z         %65 = arith.addi %64, %39 : tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4161669Z         %66 = tt.addptr %8, %65 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4161864Z         %67 = tt.load %66 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:49.4162105Z         %68 = ttg.convert_layout %67 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4162384Z         %69 = arith.shli %68, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4162673Z         %70 = arith.shrsi %69, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4162906Z         %71 = arith.shrsi %68, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4163185Z         %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.4163514Z         %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.4163794Z         %74 = tt.broadcast %72 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4164064Z         %75 = arith.select %13, %74, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4164296Z         %76 = tt.broadcast %73 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4164522Z         %77 = arith.select %15, %76, %75 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4164744Z         %78 = tt.reshape %77 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:37:49.4164963Z         %79 = arith.sitofp %78 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:37:49.4165250Z         %80 = ttg.convert_layout %79 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4165720Z         %81 = tt.dot %61, %80, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.4166070Z         %82 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:49.4166241Z         %83 = tt.splat %82 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.4166459Z         %84 = arith.addi %83, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.4166626Z         %85 = arith.muli %82, %c2_i32 : i32
2026-02-21T09:37:49.4166790Z         %86 = tt.splat %85 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.4167004Z         %87 = arith.addi %86, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.4167272Z         %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:49.4167589Z         %89 = tt.broadcast %88 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4167775Z         %90 = arith.addi %37, %89 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4167973Z         %91 = tt.addptr %7, %90 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4168175Z         %92 = tt.load %91 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:49.4168435Z         %93 = ttg.convert_layout %92 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4168833Z         %94 = arith.extf %93 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4169211Z         %95 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.4169456Z         %96 = arith.muli %95, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.4169646Z         %97 = tt.broadcast %96 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4169830Z         %98 = arith.addi %97, %39 : tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4170021Z         %99 = tt.addptr %8, %98 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4170216Z         %100 = tt.load %99 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:49.4170458Z         %101 = ttg.convert_layout %100 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4170742Z         %102 = arith.shli %101, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4170976Z         %103 = arith.shrsi %102, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4171212Z         %104 = arith.shrsi %101, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4171505Z         %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.4171842Z         %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.4172158Z         %107 = tt.broadcast %105 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4172393Z         %108 = arith.select %13, %107, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4172628Z         %109 = tt.broadcast %106 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4172859Z         %110 = arith.select %15, %109, %108 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4173087Z         %111 = tt.reshape %110 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:37:49.4173307Z         %112 = arith.sitofp %111 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:37:49.4173601Z         %113 = ttg.convert_layout %112 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4174065Z         %114 = tt.dot %94, %113, %81, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.4174406Z         %115 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:37:49.4174579Z         %116 = tt.splat %115 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.4174803Z         %117 = arith.addi %116, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.4174975Z         %118 = arith.muli %115, %c2_i32 : i32
2026-02-21T09:37:49.4175144Z         %119 = tt.splat %118 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.4175393Z         %120 = arith.addi %119, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.4175670Z         %121 = tt.expand_dims %120 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:49.4175947Z         %122 = tt.broadcast %121 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4176137Z         %123 = arith.addi %37, %122 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4176335Z         %124 = tt.addptr %7, %123 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4176538Z         %125 = tt.load %124 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:49.4176806Z         %126 = ttg.convert_layout %125 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4177211Z         %127 = arith.extf %126 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4177594Z         %128 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.4177844Z         %129 = arith.muli %128, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.4178033Z         %130 = tt.broadcast %129 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4178224Z         %131 = arith.addi %130, %39 : tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4178419Z         %132 = tt.addptr %8, %131 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4178615Z         %133 = tt.load %132 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:49.4178856Z         %134 = ttg.convert_layout %133 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4179134Z         %135 = arith.shli %134, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4179370Z         %136 = arith.shrsi %135, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4179637Z         %137 = arith.shrsi %134, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4179925Z         %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.4180262Z         %139 = tt.expand_dims %137 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.4180543Z         %140 = tt.broadcast %138 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4180783Z         %141 = arith.select %13, %140, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4181021Z         %142 = tt.broadcast %139 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4181252Z         %143 = arith.select %15, %142, %141 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4181481Z         %144 = tt.reshape %143 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:37:49.4181704Z         %145 = arith.sitofp %144 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:37:49.4182002Z         %146 = ttg.convert_layout %145 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4182466Z         %147 = tt.dot %127, %146, %114, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.4182811Z         scf.yield %147 : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.4182931Z       } {tt.flatten}
2026-02-21T09:37:49.4183151Z       %41 = scf.for %arg4 = %c4092_i32 to %c4096_i32 step %c2_i32 iter_args(%arg5 = %40) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:49.4183422Z         %50 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.4183645Z         %51 = arith.addi %50, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.4183816Z         %52 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:49.4183981Z         %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.4184192Z         %54 = arith.addi %53, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:49.4184462Z         %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:37:49.4184733Z         %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4184919Z         %57 = arith.addi %37, %56 : tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4185115Z         %58 = tt.addptr %7, %57 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:37:49.4185309Z         %59 = tt.load %58 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:49.4185571Z         %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4185966Z         %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4186344Z         %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.4186586Z         %63 = arith.muli %62, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:37:49.4186772Z         %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4186958Z         %65 = arith.addi %64, %39 : tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4187148Z         %66 = tt.addptr %8, %65 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:37:49.4187367Z         %67 = tt.load %66 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:49.4187601Z         %68 = ttg.convert_layout %67 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4187874Z         %69 = arith.shli %68, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4188103Z         %70 = arith.shrsi %69, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4188331Z         %71 = arith.shrsi %68, %cst_6 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.4188612Z         %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.4188944Z         %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:37:49.4189219Z         %74 = tt.broadcast %72 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4189454Z         %75 = arith.select %13, %74, %cst_5 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4189686Z         %76 = tt.broadcast %73 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4189908Z         %77 = arith.select %15, %76, %75 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:37:49.4190129Z         %78 = tt.reshape %77 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:37:49.4190341Z         %79 = arith.sitofp %78 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:37:49.4190626Z         %80 = ttg.convert_layout %79 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.4191131Z         %81 = tt.dot %61, %80, %arg5, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.4191475Z         scf.yield %81 : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.4191603Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:37:49.4191765Z       %42 = arith.truncf %41 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:49.4192022Z       %43 = tt.expand_dims %29 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:49.4192253Z       %44 = arith.muli %43, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:37:49.4192472Z       %45 = tt.expand_dims %34 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:49.4192720Z       %46 = tt.broadcast %44 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.4192917Z       %47 = tt.broadcast %45 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.4193085Z       %48 = arith.addi %46, %47 : tensor<16x64xi32, #mma>
2026-02-21T09:37:49.4193267Z       %49 = tt.addptr %16, %48 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:49.4193451Z       tt.store %49, %42 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.4193608Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:37:49.4193741Z     tt.return
2026-02-21T09:37:49.4193818Z   }
2026-02-21T09:37:49.4193886Z }
2026-02-21T09:37:49.4193929Z 
2026-02-21T09:37:49.4193958Z {-#
2026-02-21T09:37:49.4194037Z   external_resources: {
2026-02-21T09:37:49.4194134Z     mlir_reproducer: {
2026-02-21T09:37:49.4195136Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:49.4196167Z       disable_threading: false,
2026-02-21T09:37:49.4196268Z       verify_each: true
2026-02-21T09:37:49.4196357Z     }
2026-02-21T09:37:49.4196426Z   }
2026-02-21T09:37:49.4196491Z #-}
2026-02-21T09:37:49.4196767Z /tmp/torchinductor_root/fk/cfkxb2bdtdprrt3gfdetbnk47ke6tjz4fd3xdafarz2lhml4efpd.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:49.4197452Z /tmp/torchinductor_root/fk/cfkxb2bdtdprrt3gfdetbnk47ke6tjz4fd3xdafarz2lhml4efpd.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:49.4198009Z [105s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:49.4198801Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:49.4199509Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:49.4199673Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:49.7338742Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:49.7345881Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:37:49.7346446Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:49.7346966Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:49.7347432Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:49.7347862Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:49.7348173Z #smem = #ttg.shared_memory
2026-02-21T09:37:49.7348543Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:49.7349322Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:49.7349954Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7350210Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:49.7350404Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:49.7350589Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:49.7350766Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:49.7350941Z     %c896_i32 = arith.constant 896 : i32
2026-02-21T09:37:49.7351179Z     %cst_0 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7351418Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:49.7351613Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:37:49.7351795Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:49.7351978Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:37:49.7352152Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:49.7352327Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T09:37:49.7352602Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:49.7352784Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:49.7352960Z     %c4864_i32 = arith.constant 4864 : i32
2026-02-21T09:37:49.7353276Z     %cst_1 = arith.constant dense<29352960> : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7353712Z     %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7354019Z     %c2879_i32 = arith.constant 2879 : i32
2026-02-21T09:37:49.7354252Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.7354567Z     %cst_4 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7354866Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.7355103Z     %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.7355332Z     %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:49.7355535Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:49.7355807Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.7356182Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.7356612Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.7357036Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.7357407Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7357789Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7358115Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7358542Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:49.7359117Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:49.7359678Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.7360029Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.7360297Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:49.7360570Z     %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:49.7360828Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:49.7361118Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.7361341Z     %16 = arith.subi %c2879_i32, %0 : i32
2026-02-21T09:37:49.7361505Z     %17 = arith.divui %16, %c2432_i32 : i32
2026-02-21T09:37:49.7361664Z     %18 = arith.remsi %17, %c2_i32 : i32
2026-02-21T09:37:49.7361823Z     %19 = arith.subi %17, %18 : i32
2026-02-21T09:37:49.7362015Z     %20 = arith.muli %19, %c2432_i32 : i32
2026-02-21T09:37:49.7362176Z     %21 = arith.addi %0, %20 : i32
2026-02-21T09:37:49.7362400Z     %22 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7362865Z     %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.7363245Z     %24 = tt.broadcast %23 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7363503Z     scf.for %arg3 = %0 to %21 step %c4864_i32  : i32 {
2026-02-21T09:37:49.7363752Z       %25 = arith.divsi %arg3, %c896_i32 : i32
2026-02-21T09:37:49.7363917Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:37:49.7364075Z       %27 = arith.subi %c4_i32, %26 : i32
2026-02-21T09:37:49.7364230Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:37:49.7364392Z       %29 = arith.remsi %arg3, %c896_i32 : i32
2026-02-21T09:37:49.7364554Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:37:49.7364707Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:37:49.7364859Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:37:49.7365010Z       %33 = arith.muli %31, %c16_i32 : i32
2026-02-21T09:37:49.7365209Z       %34 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.7365441Z       %35 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.7365669Z       %36 = arith.addi %34, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.7365900Z       %37 = arith.addi %35, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.7366076Z       %38 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:37:49.7366294Z       %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.7366561Z       %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.7366830Z       %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.7367096Z       %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.7367383Z       %43 = tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.7367695Z       %44 = arith.muli %43, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.7367899Z       %45 = tt.broadcast %44 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7368289Z       %46 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7368715Z       %47 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:49.7368949Z         %132 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:49.7369141Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7369387Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7369690Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.7369999Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7370214Z         %137 = arith.addi %45, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7370434Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7370659Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7370954Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7371397Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7371710Z         %142 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:49.7371908Z         %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7372160Z         %144 = arith.addi %143, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7372536Z         %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7372871Z         %146 = tt.load %145 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7373126Z         %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7373385Z         %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7373643Z         %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7373964Z         %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7374330Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7374619Z         %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7374859Z         %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7375094Z         %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7375325Z         %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7375554Z         %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7375774Z         %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7376170Z         %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7376493Z         %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7376972Z         %160 = tt.dot %141, %159, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7377321Z         %161 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:49.7377442Z         %162 = arith.muli %161, %c2_i32 : i32
2026-02-21T09:37:49.7377611Z         %163 = tt.splat %162 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7377833Z         %164 = arith.addi %163, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7378110Z         %165 = tt.expand_dims %164 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.7378387Z         %166 = tt.broadcast %165 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7378581Z         %167 = arith.addi %45, %166 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7378777Z         %168 = tt.addptr %6, %167 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7378981Z         %169 = tt.load %168 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7379246Z         %170 = ttg.convert_layout %169 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7379648Z         %171 = arith.extf %170 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7379929Z         %172 = arith.muli %161, %c7168_i32 : i32
2026-02-21T09:37:49.7380106Z         %173 = tt.splat %172 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7380330Z         %174 = arith.addi %173, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7380680Z         %175 = tt.addptr %7, %174 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7380987Z         %176 = tt.load %175 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7381217Z         %177 = arith.shli %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7381450Z         %178 = arith.shrsi %177, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7381684Z         %179 = arith.shrsi %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7381975Z         %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7382310Z         %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7382595Z         %182 = tt.broadcast %180 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7382833Z         %183 = arith.select %12, %182, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7383068Z         %184 = tt.broadcast %181 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7383297Z         %185 = arith.select %14, %184, %183 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7383525Z         %186 = tt.reshape %185 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7383744Z         %187 = arith.sitofp %186 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7384019Z         %188 = ttg.local_alloc %187 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7384343Z         %189 = ttg.local_load %188 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7384814Z         %190 = tt.dot %171, %189, %160, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7385157Z         %191 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:49.7385278Z         %192 = arith.muli %191, %c2_i32 : i32
2026-02-21T09:37:49.7385449Z         %193 = tt.splat %192 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7385676Z         %194 = arith.addi %193, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7385957Z         %195 = tt.expand_dims %194 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.7386234Z         %196 = tt.broadcast %195 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7386429Z         %197 = arith.addi %45, %196 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7386628Z         %198 = tt.addptr %6, %197 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7386833Z         %199 = tt.load %198 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7387097Z         %200 = ttg.convert_layout %199 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7387500Z         %201 = arith.extf %200 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7387782Z         %202 = arith.muli %191, %c7168_i32 : i32
2026-02-21T09:37:49.7387960Z         %203 = tt.splat %202 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7388188Z         %204 = arith.addi %203, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7388527Z         %205 = tt.addptr %7, %204 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7388834Z         %206 = tt.load %205 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7389063Z         %207 = arith.shli %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7389299Z         %208 = arith.shrsi %207, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7389534Z         %209 = arith.shrsi %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7389824Z         %210 = tt.expand_dims %208 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7390157Z         %211 = tt.expand_dims %209 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7390443Z         %212 = tt.broadcast %210 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7390680Z         %213 = arith.select %12, %212, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7390916Z         %214 = tt.broadcast %211 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7391147Z         %215 = arith.select %14, %214, %213 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7391377Z         %216 = tt.reshape %215 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7391597Z         %217 = arith.sitofp %216 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7391880Z         %218 = ttg.local_alloc %217 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7392206Z         %219 = ttg.local_load %218 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7392668Z         %220 = tt.dot %201, %219, %190, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7393012Z         scf.yield %220 : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7393129Z       } {tt.flatten}
2026-02-21T09:37:49.7393244Z       %48 = arith.addi %45, %24 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7393438Z       %49 = tt.addptr %6, %48 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7393635Z       %50 = tt.load %49 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7393898Z       %51 = ttg.convert_layout %50 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7394295Z       %52 = arith.extf %51 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7394624Z       %53 = arith.addi %46, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7394930Z       %54 = tt.addptr %7, %53 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7395227Z       %55 = tt.load %54 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7395449Z       %56 = arith.shli %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7395676Z       %57 = arith.shrsi %56, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7395904Z       %58 = arith.shrsi %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7396217Z       %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7396542Z       %60 = tt.expand_dims %58 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7396817Z       %61 = tt.broadcast %59 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7397046Z       %62 = arith.select %12, %61, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7397274Z       %63 = tt.broadcast %60 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7397497Z       %64 = arith.select %14, %63, %62 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7397715Z       %65 = tt.reshape %64 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7397929Z       %66 = arith.sitofp %65 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7398176Z       %67 = ttg.local_alloc %66 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7398492Z       %68 = ttg.local_load %67 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7398951Z       %69 = tt.dot %52, %68, %47, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7399327Z       %70 = arith.truncf %69 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:49.7399586Z       %71 = tt.expand_dims %37 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:49.7399851Z       %72 = arith.muli %71, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:49.7400074Z       %73 = tt.expand_dims %42 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:49.7400323Z       %74 = tt.broadcast %72 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7400514Z       %75 = tt.broadcast %73 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7400686Z       %76 = arith.addi %74, %75 : tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7400862Z       %77 = tt.addptr %15, %76 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7401047Z       tt.store %77, %70 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.7401186Z       %78 = arith.addi %arg3, %c2432_i32 : i32
2026-02-21T09:37:49.7401307Z       %79 = arith.divsi %78, %c896_i32 : i32
2026-02-21T09:37:49.7401422Z       %80 = arith.muli %79, %c8_i32 : i32
2026-02-21T09:37:49.7401536Z       %81 = arith.subi %c4_i32, %80 : i32
2026-02-21T09:37:49.7401653Z       %82 = arith.minsi %81, %c8_i32 : i32
2026-02-21T09:37:49.7401769Z       %83 = arith.remsi %78, %c896_i32 : i32
2026-02-21T09:37:49.7401883Z       %84 = arith.remsi %83, %82 : i32
2026-02-21T09:37:49.7401997Z       %85 = arith.addi %80, %84 : i32
2026-02-21T09:37:49.7402105Z       %86 = arith.divsi %83, %82 : i32
2026-02-21T09:37:49.7402214Z       %87 = arith.muli %85, %c16_i32 : i32
2026-02-21T09:37:49.7402377Z       %88 = tt.splat %87 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.7402622Z       %89 = tt.splat %87 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.7402832Z       %90 = arith.addi %88, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.7403037Z       %91 = arith.addi %89, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.7403198Z       %92 = arith.muli %86, %c64_i32 : i32
2026-02-21T09:37:49.7403399Z       %93 = tt.splat %92 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.7403647Z       %94 = tt.splat %92 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.7403935Z       %95 = arith.addi %93, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.7404180Z       %96 = arith.addi %94, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.7404444Z       %97 = tt.expand_dims %90 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.7404690Z       %98 = arith.muli %97, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.7404878Z       %99 = tt.broadcast %98 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7405227Z       %100 = tt.expand_dims %95 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7405618Z       %101 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:49.7405839Z         %132 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:49.7406014Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7406238Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7406514Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.7406789Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7406984Z         %137 = arith.addi %99, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7407230Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7407436Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7407706Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7408107Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7408393Z         %142 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:49.7408569Z         %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7408798Z         %144 = arith.addi %143, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7409112Z         %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7409422Z         %146 = tt.load %145 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7409656Z         %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7409889Z         %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7410125Z         %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7410415Z         %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7410748Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7411035Z         %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7411275Z         %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7411551Z         %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7411784Z         %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7412013Z         %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7412236Z         %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7412488Z         %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7412815Z         %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7413286Z         %160 = tt.dot %141, %159, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7413639Z         %161 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:49.7413762Z         %162 = arith.muli %161, %c2_i32 : i32
2026-02-21T09:37:49.7413934Z         %163 = tt.splat %162 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7414160Z         %164 = arith.addi %163, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7414438Z         %165 = tt.expand_dims %164 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.7414713Z         %166 = tt.broadcast %165 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7414909Z         %167 = arith.addi %99, %166 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7415139Z         %168 = tt.addptr %6, %167 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7415346Z         %169 = tt.load %168 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7423307Z         %170 = ttg.convert_layout %169 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7423727Z         %171 = arith.extf %170 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7424016Z         %172 = arith.muli %161, %c7168_i32 : i32
2026-02-21T09:37:49.7424195Z         %173 = tt.splat %172 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7424423Z         %174 = arith.addi %173, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7424737Z         %175 = tt.addptr %7, %174 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7425043Z         %176 = tt.load %175 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7425278Z         %177 = arith.shli %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7425513Z         %178 = arith.shrsi %177, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7425748Z         %179 = arith.shrsi %176, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7426036Z         %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7426370Z         %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7426653Z         %182 = tt.broadcast %180 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7426893Z         %183 = arith.select %12, %182, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7427183Z         %184 = tt.broadcast %181 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7427414Z         %185 = arith.select %14, %184, %183 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7427639Z         %186 = tt.reshape %185 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7427859Z         %187 = arith.sitofp %186 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7428106Z         %188 = ttg.local_alloc %187 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7428423Z         %189 = ttg.local_load %188 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7428891Z         %190 = tt.dot %171, %189, %160, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7429240Z         %191 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:49.7429363Z         %192 = arith.muli %191, %c2_i32 : i32
2026-02-21T09:37:49.7429534Z         %193 = tt.splat %192 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7429754Z         %194 = arith.addi %193, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7430028Z         %195 = tt.expand_dims %194 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.7430302Z         %196 = tt.broadcast %195 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7430493Z         %197 = arith.addi %99, %196 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7430726Z         %198 = tt.addptr %6, %197 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7430929Z         %199 = tt.load %198 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7431196Z         %200 = ttg.convert_layout %199 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7431594Z         %201 = arith.extf %200 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7431875Z         %202 = arith.muli %191, %c7168_i32 : i32
2026-02-21T09:37:49.7432048Z         %203 = tt.splat %202 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7432274Z         %204 = arith.addi %203, %100 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7432583Z         %205 = tt.addptr %7, %204 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7432887Z         %206 = tt.load %205 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7433116Z         %207 = arith.shli %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7433349Z         %208 = arith.shrsi %207, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7433580Z         %209 = arith.shrsi %206, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7433864Z         %210 = tt.expand_dims %208 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7434195Z         %211 = tt.expand_dims %209 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7434472Z         %212 = tt.broadcast %210 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7434707Z         %213 = arith.select %12, %212, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7434971Z         %214 = tt.broadcast %211 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7435198Z         %215 = arith.select %14, %214, %213 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7435423Z         %216 = tt.reshape %215 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7435640Z         %217 = arith.sitofp %216 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7435885Z         %218 = ttg.local_alloc %217 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7436199Z         %219 = ttg.local_load %218 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7436670Z         %220 = tt.dot %201, %219, %190, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7437018Z         scf.yield %220 : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7437136Z       } {tt.flatten}
2026-02-21T09:37:49.7437254Z       %102 = arith.addi %99, %24 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7437449Z       %103 = tt.addptr %6, %102 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7437651Z       %104 = tt.load %103 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7437913Z       %105 = ttg.convert_layout %104 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7438303Z       %106 = arith.extf %105 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7438664Z       %107 = arith.addi %100, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7438980Z       %108 = tt.addptr %7, %107 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7439282Z       %109 = tt.load %108 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7439509Z       %110 = arith.shli %109, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7439737Z       %111 = arith.shrsi %110, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7439967Z       %112 = arith.shrsi %109, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7440248Z       %113 = tt.expand_dims %111 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7440578Z       %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7440856Z       %115 = tt.broadcast %113 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7441090Z       %116 = arith.select %12, %115, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7441322Z       %117 = tt.broadcast %114 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7441549Z       %118 = arith.select %14, %117, %116 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7441772Z       %119 = tt.reshape %118 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7441989Z       %120 = arith.sitofp %119 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7442233Z       %121 = ttg.local_alloc %120 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7442548Z       %122 = ttg.local_load %121 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7443077Z       %123 = tt.dot %106, %122, %101, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7443454Z       %124 = arith.truncf %123 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:49.7443714Z       %125 = tt.expand_dims %91 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:49.7443948Z       %126 = arith.muli %125, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:49.7444175Z       %127 = tt.expand_dims %96 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:49.7444430Z       %128 = tt.broadcast %126 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7444627Z       %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7444807Z       %130 = arith.addi %128, %129 : tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7444990Z       %131 = tt.addptr %15, %130 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7445181Z       tt.store %131, %124 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.7445317Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:37:49.7445446Z     scf.for %arg3 = %21 to %c448_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:49.7445591Z       %25 = arith.divsi %arg3, %c896_i32 : i32
2026-02-21T09:37:49.7445710Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:37:49.7445825Z       %27 = arith.subi %c4_i32, %26 : i32
2026-02-21T09:37:49.7445937Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:37:49.7446055Z       %29 = arith.remsi %arg3, %c896_i32 : i32
2026-02-21T09:37:49.7446170Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:37:49.7446329Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:37:49.7446438Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:37:49.7446550Z       %33 = arith.muli %31, %c16_i32 : i32
2026-02-21T09:37:49.7446714Z       %34 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.7446921Z       %35 = tt.splat %33 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.7447127Z       %36 = arith.addi %34, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:49.7447332Z       %37 = arith.addi %35, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:49.7447490Z       %38 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:37:49.7447690Z       %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.7447935Z       %40 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.7448180Z       %41 = arith.addi %39, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:49.7448424Z       %42 = arith.addi %40, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:49.7448686Z       %43 = tt.expand_dims %36 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.7448930Z       %44 = arith.muli %43, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:49.7449117Z       %45 = tt.broadcast %44 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7449465Z       %46 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7449854Z       %47 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:49.7450066Z         %78 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:49.7450233Z         %79 = tt.splat %78 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7450484Z         %80 = arith.addi %79, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7450767Z         %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.7451034Z         %82 = tt.broadcast %81 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7451221Z         %83 = arith.addi %45, %82 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7451413Z         %84 = tt.addptr %6, %83 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7451611Z         %85 = tt.load %84 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7451870Z         %86 = ttg.convert_layout %85 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7452263Z         %87 = arith.extf %86 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7452548Z         %88 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:49.7452718Z         %89 = tt.splat %88 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7452937Z         %90 = arith.addi %89, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7453235Z         %91 = tt.addptr %7, %90 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7453533Z         %92 = tt.load %91 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7453754Z         %93 = arith.shli %92, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7454013Z         %94 = arith.shrsi %93, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7454246Z         %95 = arith.shrsi %92, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7454523Z         %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7454848Z         %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7455121Z         %98 = tt.broadcast %96 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7455347Z         %99 = arith.select %12, %98, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7455575Z         %100 = tt.broadcast %97 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7455803Z         %101 = arith.select %14, %100, %99 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7456027Z         %102 = tt.reshape %101 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7456249Z         %103 = arith.sitofp %102 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7456494Z         %104 = ttg.local_alloc %103 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7456813Z         %105 = ttg.local_load %104 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7457276Z         %106 = tt.dot %87, %105, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7457615Z         %107 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:49.7457739Z         %108 = arith.muli %107, %c2_i32 : i32
2026-02-21T09:37:49.7457905Z         %109 = tt.splat %108 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7458158Z         %110 = arith.addi %109, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7458432Z         %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.7458704Z         %112 = tt.broadcast %111 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7458895Z         %113 = arith.addi %45, %112 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7459091Z         %114 = tt.addptr %6, %113 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7459293Z         %115 = tt.load %114 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7459559Z         %116 = ttg.convert_layout %115 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7459955Z         %117 = arith.extf %116 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7460238Z         %118 = arith.muli %107, %c7168_i32 : i32
2026-02-21T09:37:49.7460409Z         %119 = tt.splat %118 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7460634Z         %120 = arith.addi %119, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7460940Z         %121 = tt.addptr %7, %120 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7461247Z         %122 = tt.load %121 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7461477Z         %123 = arith.shli %122, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7461741Z         %124 = arith.shrsi %123, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7461978Z         %125 = arith.shrsi %122, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7462263Z         %126 = tt.expand_dims %124 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7462594Z         %127 = tt.expand_dims %125 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7462876Z         %128 = tt.broadcast %126 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7463109Z         %129 = arith.select %12, %128, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7463341Z         %130 = tt.broadcast %127 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7463573Z         %131 = arith.select %14, %130, %129 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7463797Z         %132 = tt.reshape %131 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7464017Z         %133 = arith.sitofp %132 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7464261Z         %134 = ttg.local_alloc %133 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7464579Z         %135 = ttg.local_load %134 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7465039Z         %136 = tt.dot %117, %135, %106, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7465377Z         %137 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:49.7465500Z         %138 = arith.muli %137, %c2_i32 : i32
2026-02-21T09:37:49.7465667Z         %139 = tt.splat %138 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7465918Z         %140 = arith.addi %139, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:49.7466189Z         %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:49.7466460Z         %142 = tt.broadcast %141 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7466651Z         %143 = arith.addi %45, %142 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7466846Z         %144 = tt.addptr %6, %143 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7467046Z         %145 = tt.load %144 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7467311Z         %146 = ttg.convert_layout %145 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7467705Z         %147 = arith.extf %146 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7467991Z         %148 = arith.muli %137, %c7168_i32 : i32
2026-02-21T09:37:49.7468162Z         %149 = tt.splat %148 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7468385Z         %150 = arith.addi %149, %46 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7468692Z         %151 = tt.addptr %7, %150 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7468999Z         %152 = tt.load %151 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7469228Z         %153 = arith.shli %152, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7469488Z         %154 = arith.shrsi %153, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7469725Z         %155 = arith.shrsi %152, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7470008Z         %156 = tt.expand_dims %154 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7470337Z         %157 = tt.expand_dims %155 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7470615Z         %158 = tt.broadcast %156 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7470849Z         %159 = arith.select %12, %158, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7471080Z         %160 = tt.broadcast %157 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7471309Z         %161 = arith.select %14, %160, %159 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7471536Z         %162 = tt.reshape %161 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7471754Z         %163 = arith.sitofp %162 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7472003Z         %164 = ttg.local_alloc %163 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7472323Z         %165 = ttg.local_load %164 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7472785Z         %166 = tt.dot %147, %165, %136, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7473126Z         scf.yield %166 : tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7473243Z       } {tt.flatten}
2026-02-21T09:37:49.7473358Z       %48 = arith.addi %45, %24 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7473584Z       %49 = tt.addptr %6, %48 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:49.7473781Z       %50 = tt.load %49 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:49.7474037Z       %51 = ttg.convert_layout %50 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7474428Z       %52 = arith.extf %51 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7474752Z       %53 = arith.addi %46, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7475057Z       %54 = tt.addptr %7, %53 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7475356Z       %55 = tt.load %54 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7475581Z       %56 = arith.shli %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7475807Z       %57 = arith.shrsi %56, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7476030Z       %58 = arith.shrsi %55, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:49.7476305Z       %59 = tt.expand_dims %57 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7476631Z       %60 = tt.expand_dims %58 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:49.7476902Z       %61 = tt.broadcast %59 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7477158Z       %62 = arith.select %12, %61, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7477382Z       %63 = tt.broadcast %60 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7477607Z       %64 = arith.select %14, %63, %62 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:49.7477821Z       %65 = tt.reshape %64 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:49.7478030Z       %66 = arith.sitofp %65 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:49.7478269Z       %67 = ttg.local_alloc %66 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:49.7478580Z       %68 = ttg.local_load %67 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:49.7479033Z       %69 = tt.dot %52, %68, %47, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:49.7479404Z       %70 = arith.truncf %69 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:49.7479660Z       %71 = tt.expand_dims %37 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:49.7479889Z       %72 = arith.muli %71, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:49.7480111Z       %73 = tt.expand_dims %42 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:49.7480355Z       %74 = tt.broadcast %72 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7480549Z       %75 = tt.broadcast %73 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7480719Z       %76 = arith.addi %74, %75 : tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7480893Z       %77 = tt.addptr %15, %76 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:49.7481082Z       tt.store %77, %70 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:49.7481213Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:37:49.7481346Z     tt.return
2026-02-21T09:37:49.7481422Z   }
2026-02-21T09:37:49.7481493Z }
2026-02-21T09:37:49.7481535Z 
2026-02-21T09:37:49.7481564Z {-#
2026-02-21T09:37:49.7481642Z   external_resources: {
2026-02-21T09:37:49.7481739Z     mlir_reproducer: {
2026-02-21T09:37:49.7482767Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:49.7483755Z       disable_threading: false,
2026-02-21T09:37:49.7483863Z       verify_each: true
2026-02-21T09:37:49.7483948Z     }
2026-02-21T09:37:49.7484018Z   }
2026-02-21T09:37:49.7484084Z #-}
2026-02-21T09:37:49.7484358Z /tmp/torchinductor_root/vo/cvoepzha374qpppib5bjjzr2722y5wpm227pikwbnla374as2rrd.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:49.7485035Z /tmp/torchinductor_root/vo/cvoepzha374qpppib5bjjzr2722y5wpm227pikwbnla374as2rrd.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:49.7485578Z [105s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:49.7486389Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:49.7487097Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:49.7487261Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:50.0111373Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:50.0114878Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:37:50.0115680Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:50.0116432Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:50.0117125Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:50.0117715Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:50.0118137Z #smem = #ttg.shared_memory
2026-02-21T09:37:50.0118679Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:50.0119784Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:50.0120683Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T09:37:50.0121050Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:50.0121331Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:50.0121848Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:50.0122103Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:50.0122431Z     %cst_0 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0122884Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:50.0123154Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:37:50.0123409Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:50.0123665Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:50.0123920Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T09:37:50.0124191Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:50.0124408Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:50.0124760Z     %cst_1 = arith.constant dense<29352960> : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0125198Z     %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.0125558Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:50.0125915Z     %cst_4 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0126263Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.0126542Z     %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.0126814Z     %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:50.0127044Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:50.0127368Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.0127803Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.0128377Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:50.0128878Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.0129310Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.0129700Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:50.0130087Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0130586Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:50.0131267Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:50.0131922Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.0132339Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.0132652Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:50.0132968Z     %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.0133273Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:50.0133601Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:50.0133946Z     %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.0134392Z     %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:50.0134767Z     %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0135003Z     scf.for %arg3 = %0 to %c448_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:50.0135229Z       %19 = arith.divsi %arg3, %c448_i32 : i32
2026-02-21T09:37:50.0135380Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T09:37:50.0135522Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T09:37:50.0135663Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T09:37:50.0135829Z       %23 = arith.remsi %arg3, %c448_i32 : i32
2026-02-21T09:37:50.0135972Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:50.0136107Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:50.0136239Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:50.0136374Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T09:37:50.0136579Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.0136841Z       %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.0137098Z       %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.0137358Z       %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.0137555Z       %32 = arith.muli %26, %c64_i32 : i32
2026-02-21T09:37:50.0137804Z       %33 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:50.0138107Z       %34 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.0138411Z       %35 = arith.addi %33, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:50.0138713Z       %36 = arith.addi %34, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.0139083Z       %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:50.0139391Z       %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:50.0139628Z       %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0140060Z       %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0140541Z       %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:50.0140807Z         %72 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:50.0141015Z         %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.0141282Z         %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.0141620Z         %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:50.0141955Z         %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0142192Z         %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0142435Z         %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0142681Z         %79 = tt.load %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:50.0143005Z         %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0143503Z         %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0143855Z         %82 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:50.0144073Z         %83 = tt.splat %82 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0144344Z         %84 = arith.addi %83, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0144763Z         %85 = tt.addptr %7, %84 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0145130Z         %86 = tt.load %85 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0145359Z         %87 = arith.shli %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0145590Z         %88 = arith.shrsi %87, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0145818Z         %89 = arith.shrsi %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0146101Z         %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.0146426Z         %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.0146704Z         %92 = tt.broadcast %90 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0146935Z         %93 = arith.select %12, %92, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0147162Z         %94 = tt.broadcast %91 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0147383Z         %95 = arith.select %14, %94, %93 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0147603Z         %96 = tt.reshape %95 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:50.0147817Z         %97 = arith.sitofp %96 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:50.0148098Z         %98 = ttg.local_alloc %97 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:50.0148413Z         %99 = ttg.local_load %98 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0148881Z         %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.0149223Z         %101 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:50.0149346Z         %102 = arith.muli %101, %c2_i32 : i32
2026-02-21T09:37:50.0149516Z         %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.0149737Z         %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.0150014Z         %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:50.0150289Z         %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0150485Z         %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0150682Z         %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0150884Z         %109 = tt.load %108 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:50.0151149Z         %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0151548Z         %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0151827Z         %112 = arith.muli %101, %c7168_i32 : i32
2026-02-21T09:37:50.0152004Z         %113 = tt.splat %112 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0152228Z         %114 = arith.addi %113, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0152584Z         %115 = tt.addptr %7, %114 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0152890Z         %116 = tt.load %115 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0153118Z         %117 = arith.shli %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0153351Z         %118 = arith.shrsi %117, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0153584Z         %119 = arith.shrsi %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0153873Z         %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.0154206Z         %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.0154492Z         %122 = tt.broadcast %120 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0154728Z         %123 = arith.select %12, %122, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0154959Z         %124 = tt.broadcast %121 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0155186Z         %125 = arith.select %14, %124, %123 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0155411Z         %126 = tt.reshape %125 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:50.0155626Z         %127 = arith.sitofp %126 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:50.0155904Z         %128 = ttg.local_alloc %127 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:50.0156219Z         %129 = ttg.local_load %128 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0156680Z         %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.0157018Z         %131 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:50.0157137Z         %132 = arith.muli %131, %c2_i32 : i32
2026-02-21T09:37:50.0157303Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.0157521Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.0157792Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:50.0158065Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0158256Z         %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0158453Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0158652Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:50.0158913Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0159305Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0159579Z         %142 = arith.muli %131, %c7168_i32 : i32
2026-02-21T09:37:50.0159751Z         %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0159970Z         %144 = arith.addi %143, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0160308Z         %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0160609Z         %146 = tt.load %145 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0160835Z         %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0161065Z         %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0161296Z         %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0161581Z         %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.0161911Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.0162191Z         %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0162422Z         %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0162694Z         %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0162920Z         %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0163146Z         %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:50.0163364Z         %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:50.0163660Z         %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:50.0163975Z         %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0164434Z         %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.0164774Z         scf.yield %160 : tensor<16x64xf32, #mma>
2026-02-21T09:37:50.0164889Z       } {tt.flatten}
2026-02-21T09:37:50.0165004Z       %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0165193Z       %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.0165386Z       %44 = tt.load %43 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:50.0165645Z       %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0166028Z       %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0166354Z       %47 = arith.addi %40, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0166657Z       %48 = tt.addptr %7, %47 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0166952Z       %49 = tt.load %48 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0167170Z       %50 = arith.shli %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0167391Z       %51 = arith.shrsi %50, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0167618Z       %52 = arith.shrsi %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.0167896Z       %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.0168252Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.0168520Z       %55 = tt.broadcast %53 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0168744Z       %56 = arith.select %12, %55, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0168968Z       %57 = tt.broadcast %54 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0169187Z       %58 = arith.select %14, %57, %56 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.0169401Z       %59 = tt.reshape %58 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:50.0169611Z       %60 = arith.sitofp %59 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:50.0169850Z       %61 = ttg.local_alloc %60 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:50.0170158Z       %62 = ttg.local_load %61 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.0170605Z       %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.0170970Z       %64 = arith.truncf %63 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:50.0171223Z       %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:50.0171485Z       %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:50.0171708Z       %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:50.0171959Z       %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.0172148Z       %69 = tt.broadcast %67 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.0172317Z       %70 = arith.addi %68, %69 : tensor<16x64xi32, #mma>
2026-02-21T09:37:50.0172491Z       %71 = tt.addptr %15, %70 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:50.0172677Z       tt.store %71, %64 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:50.0172835Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:37:50.0172969Z     tt.return
2026-02-21T09:37:50.0173047Z   }
2026-02-21T09:37:50.0173117Z }
2026-02-21T09:37:50.0173159Z 
2026-02-21T09:37:50.0173187Z {-#
2026-02-21T09:37:50.0173264Z   external_resources: {
2026-02-21T09:37:50.0173363Z     mlir_reproducer: {
2026-02-21T09:37:50.0174354Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:50.0175339Z       disable_threading: false,
2026-02-21T09:37:50.0175440Z       verify_each: true
2026-02-21T09:37:50.0175528Z     }
2026-02-21T09:37:50.0175595Z   }
2026-02-21T09:37:50.0175661Z #-}
2026-02-21T09:37:50.0175933Z /tmp/torchinductor_root/7s/c7sb5sb6s3beeqyxvqjiy2qgqbn2owvyq5rq233iflctmhwzatng.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:50.0176600Z /tmp/torchinductor_root/7s/c7sb5sb6s3beeqyxvqjiy2qgqbn2owvyq5rq233iflctmhwzatng.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:50.0177176Z [106s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:50.0177942Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:50.0178643Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:50.0178805Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:50.1414156Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:50.1418851Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:37:50.1419274Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:50.1419650Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:50.1420004Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:37:50.1420430Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:50.1420657Z #smem = #ttg.shared_memory
2026-02-21T09:37:50.1420950Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:50.1421535Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:50.1422019Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T09:37:50.1422218Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:50.1422368Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:37:50.1422517Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:50.1422654Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:37:50.1422832Z     %cst_0 = arith.constant dense<0> : tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1423018Z     %c7168_i32 = arith.constant 7168 : i32
2026-02-21T09:37:50.1423166Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:37:50.1423309Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:50.1423446Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:50.1423587Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T09:37:50.1423730Z     %c4095_i32 = arith.constant 4095 : i32
2026-02-21T09:37:50.1423876Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:50.1424116Z     %cst_1 = arith.constant dense<29352960> : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1424440Z     %cst_2 = arith.constant dense<8190> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.1424702Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:50.1424965Z     %cst_4 = arith.constant dense<4> : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1425223Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.1425421Z     %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.1425668Z     %cst_7 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:50.1425834Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:50.1426064Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.1426382Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.1426735Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:50.1427090Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.1427399Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.1427682Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:50.1427961Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1428316Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:50.1428797Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:50.1429261Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.1429555Z     %11 = arith.cmpi eq, %10, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.1429819Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:50.1430044Z     %13 = arith.cmpi eq, %10, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.1430267Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x64xi1, #blocked>
2026-02-21T09:37:50.1430505Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:50.1430752Z     %16 = arith.addi %5, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.1431075Z     %17 = tt.expand_dims %16 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:50.1431385Z     %18 = tt.broadcast %17 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1431608Z     scf.for %arg3 = %0 to %c448_i32 step %c2432_i32  : i32 {
2026-02-21T09:37:50.1431801Z       %19 = arith.divsi %arg3, %c448_i32 : i32
2026-02-21T09:37:50.1431944Z       %20 = arith.muli %19, %c4_i32 : i32
2026-02-21T09:37:50.1432085Z       %21 = arith.subi %c4_i32, %20 : i32
2026-02-21T09:37:50.1432132Z       %22 = arith.minsi %21, %c4_i32 : i32
2026-02-21T09:37:50.1432187Z       %23 = arith.remsi %arg3, %c448_i32 : i32
2026-02-21T09:37:50.1432236Z       %24 = arith.remsi %23, %22 : i32
2026-02-21T09:37:50.1432282Z       %25 = arith.addi %20, %24 : i32
2026-02-21T09:37:50.1432327Z       %26 = arith.divsi %23, %22 : i32
2026-02-21T09:37:50.1432374Z       %27 = arith.muli %25, %c16_i32 : i32
2026-02-21T09:37:50.1432483Z       %28 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.1432577Z       %29 = tt.splat %27 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.1432684Z       %30 = arith.addi %28, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.1432778Z       %31 = arith.addi %29, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.1432824Z       %32 = arith.muli %26, %c64_i32 : i32
2026-02-21T09:37:50.1432977Z       %33 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:50.1433116Z       %34 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.1433263Z       %35 = arith.addi %33, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:37:50.1433359Z       %36 = arith.addi %34, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.1433526Z       %37 = tt.expand_dims %30 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:37:50.1433599Z       %38 = arith.muli %37, %cst_3 : tensor<16x1xi32, #blocked1>
2026-02-21T09:37:50.1433707Z       %39 = tt.broadcast %38 : tensor<16x1xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1433966Z       %40 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1434121Z       %41 = scf.for %arg4 = %c0_i32 to %c4095_i32 step %c3_i32 iter_args(%arg5 = %cst) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:50.1434178Z         %72 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:50.1434280Z         %73 = tt.splat %72 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.1434367Z         %74 = arith.addi %73, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.1434507Z         %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:50.1434595Z         %76 = tt.broadcast %75 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1434654Z         %77 = arith.addi %39, %76 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1434785Z         %78 = tt.addptr %6, %77 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1434847Z         %79 = tt.load %78 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:50.1435011Z         %80 = ttg.convert_layout %79 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1435210Z         %81 = arith.extf %80 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1435256Z         %82 = arith.muli %arg4, %c7168_i32 : i32
2026-02-21T09:37:50.1435346Z         %83 = tt.splat %82 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1435437Z         %84 = arith.addi %83, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1435608Z         %85 = tt.addptr %7, %84 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1435702Z         %86 = tt.load %85 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1435801Z         %87 = arith.shli %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1435896Z         %88 = arith.shrsi %87, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1435988Z         %89 = arith.shrsi %86, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1436132Z         %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.1436274Z         %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.1436364Z         %92 = tt.broadcast %90 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1436464Z         %93 = arith.select %12, %92, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1436552Z         %94 = tt.broadcast %91 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1436676Z         %95 = arith.select %14, %94, %93 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1436760Z         %96 = tt.reshape %95 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:50.1436846Z         %97 = arith.sitofp %96 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:50.1436958Z         %98 = ttg.local_alloc %97 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:50.1437121Z         %99 = ttg.local_load %98 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1437387Z         %100 = tt.dot %81, %99, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.1437435Z         %101 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:37:50.1437478Z         %102 = arith.muli %101, %c2_i32 : i32
2026-02-21T09:37:50.1437570Z         %103 = tt.splat %102 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.1437663Z         %104 = arith.addi %103, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.1437808Z         %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:50.1437899Z         %106 = tt.broadcast %105 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1437962Z         %107 = arith.addi %39, %106 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1438061Z         %108 = tt.addptr %6, %107 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1438153Z         %109 = tt.load %108 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:50.1438318Z         %110 = ttg.convert_layout %109 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1438514Z         %111 = arith.extf %110 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1438562Z         %112 = arith.muli %101, %c7168_i32 : i32
2026-02-21T09:37:50.1438655Z         %113 = tt.splat %112 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1438747Z         %114 = arith.addi %113, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1438922Z         %115 = tt.addptr %7, %114 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1439018Z         %116 = tt.load %115 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1439116Z         %117 = arith.shli %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1439215Z         %118 = arith.shrsi %117, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1439310Z         %119 = arith.shrsi %116, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1439456Z         %120 = tt.expand_dims %118 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.1439601Z         %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.1439694Z         %122 = tt.broadcast %120 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1439798Z         %123 = arith.select %12, %122, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1439892Z         %124 = tt.broadcast %121 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1440024Z         %125 = arith.select %14, %124, %123 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1440111Z         %126 = tt.reshape %125 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:50.1440204Z         %127 = arith.sitofp %126 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:50.1440320Z         %128 = ttg.local_alloc %127 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:50.1440485Z         %129 = ttg.local_load %128 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1440747Z         %130 = tt.dot %111, %129, %100, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.1440793Z         %131 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:37:50.1440837Z         %132 = arith.muli %131, %c2_i32 : i32
2026-02-21T09:37:50.1440930Z         %133 = tt.splat %132 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.1441021Z         %134 = arith.addi %133, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.1441164Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:37:50.1441255Z         %136 = tt.broadcast %135 : tensor<1x2xi32, #blocked1> -> tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1441317Z         %137 = arith.addi %39, %136 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1441443Z         %138 = tt.addptr %6, %137 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1441507Z         %139 = tt.load %138 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:50.1441672Z         %140 = ttg.convert_layout %139 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1441864Z         %141 = arith.extf %140 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1441909Z         %142 = arith.muli %131, %c7168_i32 : i32
2026-02-21T09:37:50.1442002Z         %143 = tt.splat %142 : i32 -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1442092Z         %144 = arith.addi %143, %40 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1442266Z         %145 = tt.addptr %7, %144 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1442360Z         %146 = tt.load %145 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1442456Z         %147 = arith.shli %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1442554Z         %148 = arith.shrsi %147, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1442688Z         %149 = arith.shrsi %146, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1442833Z         %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.1442980Z         %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.1443071Z         %152 = tt.broadcast %150 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1443172Z         %153 = arith.select %12, %152, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1443260Z         %154 = tt.broadcast %151 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1443401Z         %155 = arith.select %14, %154, %153 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1443488Z         %156 = tt.reshape %155 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:50.1443578Z         %157 = arith.sitofp %156 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:50.1443696Z         %158 = ttg.local_alloc %157 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:50.1443860Z         %159 = ttg.local_load %158 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1444122Z         %160 = tt.dot %141, %159, %130, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.1444173Z         scf.yield %160 : tensor<16x64xf32, #mma>
2026-02-21T09:37:50.1444218Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:37:50.1444278Z       %42 = arith.addi %39, %18 : tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1444378Z       %43 = tt.addptr %6, %42 : tensor<16x2x!tt.ptr<bf16>, #blocked1>, tensor<16x2xi32, #blocked1>
2026-02-21T09:37:50.1444438Z       %44 = tt.load %43 : tensor<16x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:37:50.1444600Z       %45 = ttg.convert_layout %44 : tensor<16x2xbf16, #blocked1> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1444798Z       %46 = arith.extf %45 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1444924Z       %47 = arith.addi %40, %cst_1 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1445093Z       %48 = tt.addptr %7, %47 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1446668Z       %49 = tt.load %48 : tensor<1x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1446763Z       %50 = arith.shli %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1446857Z       %51 = arith.shrsi %50, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1446948Z       %52 = arith.shrsi %49, %cst_4 : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.1447093Z       %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.1447237Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x64xi8, #blocked>
2026-02-21T09:37:50.1447327Z       %55 = tt.broadcast %53 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1447428Z       %56 = arith.select %12, %55, %cst_0 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1447527Z       %57 = tt.broadcast %54 : tensor<1x1x64xi8, #blocked> -> tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1447621Z       %58 = arith.select %14, %57, %56 : tensor<1x2x64xi1, #blocked>, tensor<1x2x64xi8, #blocked>
2026-02-21T09:37:50.1447708Z       %59 = tt.reshape %58 : tensor<1x2x64xi8, #blocked> -> tensor<2x64xi8, #blocked2>
2026-02-21T09:37:50.1447794Z       %60 = arith.sitofp %59 : tensor<2x64xi8, #blocked2> to tensor<2x64xf32, #blocked2>
2026-02-21T09:37:50.1447905Z       %61 = ttg.local_alloc %60 : (tensor<2x64xf32, #blocked2>) -> !ttg.memdesc<2x64xf32, #shared, #smem>
2026-02-21T09:37:50.1448068Z       %62 = ttg.local_load %61 : !ttg.memdesc<2x64xf32, #shared, #smem> -> tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.1448325Z       %63 = tt.dot %46, %62, %41, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.1448429Z       %64 = arith.truncf %63 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:50.1448563Z       %65 = tt.expand_dims %31 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:50.1448620Z       %66 = arith.muli %65, %cst_7 : tensor<16x1xi32, #mma>
2026-02-21T09:37:50.1448749Z       %67 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:50.1448825Z       %68 = tt.broadcast %66 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.1448903Z       %69 = tt.broadcast %67 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.1448958Z       %70 = arith.addi %68, %69 : tensor<16x64xi32, #mma>
2026-02-21T09:37:50.1449048Z       %71 = tt.addptr %15, %70 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:50.1449106Z       tt.store %71, %64 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:50.1449172Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:37:50.1449207Z     tt.return
2026-02-21T09:37:50.1449238Z   }
2026-02-21T09:37:50.1449269Z }
2026-02-21T09:37:50.1449273Z 
2026-02-21T09:37:50.1449301Z {-#
2026-02-21T09:37:50.1449341Z   external_resources: {
2026-02-21T09:37:50.1449377Z     mlir_reproducer: {
2026-02-21T09:37:50.1450349Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:50.1450456Z       disable_threading: false,
2026-02-21T09:37:50.1450491Z       verify_each: true
2026-02-21T09:37:50.1450520Z     }
2026-02-21T09:37:50.1450549Z   }
2026-02-21T09:37:50.1450577Z #-}
2026-02-21T09:37:50.1450813Z /tmp/torchinductor_root/qr/cqrrueayeq7jpuik2iuudslnn32l6itvmutzhyd7ctjkz356xuau.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:50.1451224Z /tmp/torchinductor_root/qr/cqrrueayeq7jpuik2iuudslnn32l6itvmutzhyd7ctjkz356xuau.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:50.1451334Z [106s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:50.1451965Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 1], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:50.1452020Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:50.1452099Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:50.4682119Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:37:50.4687468Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:37:50.4687931Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:50.4688235Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:50.4688530Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:37:50.4688807Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:37:50.4689057Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:37:50.4689286Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:37:50.4689467Z #smem = #ttg.shared_memory
2026-02-21T09:37:50.4689695Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:37:50.4690162Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:37:50.4690628Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:37:50.4690801Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.4690969Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.4691148Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x64xf32, #mma>
2026-02-21T09:37:50.4691305Z     %c5311_i32 = arith.constant 5311 : i32
2026-02-21T09:37:50.4691456Z     %cst_3 = arith.constant dense<7168> : tensor<4x1xi32, #blocked1>
2026-02-21T09:37:50.4691697Z     %cst_4 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2>
2026-02-21T09:37:50.4691845Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:37:50.4691963Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:37:50.4692075Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:37:50.4692231Z     %c112_i32 = arith.constant 112 : i32
2026-02-21T09:37:50.4692345Z     %c448_i32 = arith.constant 448 : i32
2026-02-21T09:37:50.4692458Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:37:50.4692571Z     %c14592_i32 = arith.constant 14592 : i32
2026-02-21T09:37:50.4692692Z     %c9728_i32 = arith.constant 9728 : i32
2026-02-21T09:37:50.4692833Z     %cst_5 = arith.constant dense<0> : tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4692976Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:37:50.4693084Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:37:50.4693193Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:37:50.4693307Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:37:50.4693418Z     %c4864_i32 = arith.constant 4864 : i32
2026-02-21T09:37:50.4693602Z     %cst_6 = arith.constant dense<4> : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4693807Z     %0 = tt.get_program_id x : i32
2026-02-21T09:37:50.4693999Z     %1 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.4694273Z     %2 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.4694535Z     %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:50.4694792Z     %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.4695048Z     %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.4695312Z     %6 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:50.4695546Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:50.4718019Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:50.4718290Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:37:50.4718702Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:37:50.4719099Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.4719347Z     %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.4719547Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked>
2026-02-21T09:37:50.4719738Z     %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:37:50.4719923Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked>
2026-02-21T09:37:50.4720127Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:50.4720284Z     %17 = arith.subi %c5311_i32, %0 : i32
2026-02-21T09:37:50.4720405Z     %18 = arith.divui %17, %c4864_i32 : i32
2026-02-21T09:37:50.4720519Z     %19 = arith.remsi %18, %c3_i32 : i32
2026-02-21T09:37:50.4720631Z     %20 = arith.subi %18, %19 : i32
2026-02-21T09:37:50.4720740Z     %21 = arith.muli %20, %c4864_i32 : i32
2026-02-21T09:37:50.4720854Z     %22 = arith.addi %0, %21 : i32
2026-02-21T09:37:50.4720977Z     scf.for %arg3 = %0 to %22 step %c14592_i32  : i32 {
2026-02-21T09:37:50.4721112Z       %23 = arith.divsi %arg3, %c8_i32 : i32
2026-02-21T09:37:50.4721229Z       %24 = arith.muli %23, %c2_i32 : i32
2026-02-21T09:37:50.4721392Z       %25 = arith.subi %c112_i32, %24 : i32
2026-02-21T09:37:50.4721506Z       %26 = arith.minsi %25, %c2_i32 : i32
2026-02-21T09:37:50.4721624Z       %27 = arith.remsi %arg3, %c8_i32 : i32
2026-02-21T09:37:50.4721739Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:37:50.4721867Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:37:50.4721979Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:37:50.4722088Z       %31 = arith.muli %29, %c64_i32 : i32
2026-02-21T09:37:50.4722258Z       %32 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.4722469Z       %33 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.4722725Z       %34 = arith.addi %32, %1 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.4722931Z       %35 = arith.addi %33, %2 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.4723087Z       %36 = arith.muli %30, %c16_i32 : i32
2026-02-21T09:37:50.4723255Z       %37 = tt.splat %36 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:50.4723464Z       %38 = tt.splat %36 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.4723668Z       %39 = arith.addi %37, %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:50.4723874Z       %40 = arith.addi %38, %4 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.4724134Z       %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T09:37:50.4724382Z       %42 = arith.muli %41, %cst_4 : tensor<16x1xi32, #blocked2>
2026-02-21T09:37:50.4724572Z       %43 = tt.broadcast %42 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4724843Z       %44 = tt.expand_dims %34 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1>
2026-02-21T09:37:50.4725115Z       %45 = tt.broadcast %44 : tensor<1x64xi32, #blocked1> -> tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4725396Z       %46 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:50.4725669Z         %121 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.4725896Z         %122 = arith.addi %121, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.4726068Z         %123 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:50.4726236Z         %124 = tt.splat %123 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:50.4726452Z         %125 = arith.addi %124, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:50.4726726Z         %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:37:50.4727003Z         %127 = tt.broadcast %126 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4727198Z         %128 = arith.addi %43, %127 : tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4727400Z         %129 = tt.addptr %7, %128 : tensor<16x8x!tt.ptr<bf16>, #blocked2>, tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4727607Z         %130 = tt.load %129 : tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:50.4727828Z         %131 = ttg.local_alloc %130 : (tensor<16x8xbf16, #blocked2>) -> !ttg.memdesc<16x8xbf16, #shared, #smem>
2026-02-21T09:37:50.4728159Z         %132 = ttg.local_load %131 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4728564Z         %133 = arith.extf %132 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4728985Z         %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:37:50.4729232Z         %135 = arith.muli %134, %cst_3 : tensor<4x1xi32, #blocked1>
2026-02-21T09:37:50.4729425Z         %136 = tt.broadcast %135 : tensor<4x1xi32, #blocked1> -> tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4729640Z         %137 = arith.addi %136, %45 : tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4729835Z         %138 = tt.addptr %8, %137 : tensor<4x64x!tt.ptr<i8>, #blocked1>, tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4730037Z         %139 = tt.load %138 : tensor<4x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:50.4730276Z         %140 = ttg.convert_layout %139 : tensor<4x64xi8, #blocked1> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4730556Z         %141 = arith.shli %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4730796Z         %142 = arith.shrsi %141, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4731029Z         %143 = arith.shrsi %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4731318Z         %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:37:50.4731651Z         %145 = tt.expand_dims %143 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:37:50.4731933Z         %146 = tt.broadcast %144 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4732172Z         %147 = arith.select %13, %146, %cst_5 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4732408Z         %148 = tt.broadcast %145 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4732641Z         %149 = arith.select %15, %148, %147 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4732870Z         %150 = tt.reshape %149 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:37:50.4733110Z         %151 = arith.sitofp %150 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:37:50.4733359Z         %152 = ttg.local_alloc %151 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:37:50.4733679Z         %153 = ttg.local_load %152 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4734152Z         %154 = tt.dot %133, %153, %arg5, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.4734504Z         scf.yield %154 : tensor<16x64xf32, #mma>
2026-02-21T09:37:50.4734633Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:37:50.4734798Z       %47 = arith.truncf %46 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:50.4735053Z       %48 = tt.expand_dims %40 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:50.4735286Z       %49 = arith.muli %48, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:37:50.4735505Z       %50 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:50.4735752Z       %51 = tt.broadcast %49 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4735945Z       %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4736112Z       %53 = arith.addi %51, %52 : tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4736293Z       %54 = tt.addptr %16, %53 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4736478Z       tt.store %54, %47 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:50.4736663Z       %55 = arith.addi %arg3, %c4864_i32 : i32
2026-02-21T09:37:50.4736784Z       %56 = arith.divsi %55, %c8_i32 : i32
2026-02-21T09:37:50.4736902Z       %57 = arith.muli %56, %c2_i32 : i32
2026-02-21T09:37:50.4737017Z       %58 = arith.subi %c112_i32, %57 : i32
2026-02-21T09:37:50.4737152Z       %59 = arith.minsi %58, %c2_i32 : i32
2026-02-21T09:37:50.4737265Z       %60 = arith.remsi %55, %c8_i32 : i32
2026-02-21T09:37:50.4737375Z       %61 = arith.remsi %60, %59 : i32
2026-02-21T09:37:50.4737487Z       %62 = arith.addi %57, %61 : i32
2026-02-21T09:37:50.4737594Z       %63 = arith.divsi %60, %59 : i32
2026-02-21T09:37:50.4737704Z       %64 = arith.muli %62, %c64_i32 : i32
2026-02-21T09:37:50.4737867Z       %65 = tt.splat %64 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.4738077Z       %66 = tt.splat %64 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.4738290Z       %67 = arith.addi %65, %1 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.4738495Z       %68 = arith.addi %66, %2 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.4738657Z       %69 = arith.muli %63, %c16_i32 : i32
2026-02-21T09:37:50.4738817Z       %70 = tt.splat %69 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:50.4739025Z       %71 = tt.splat %69 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.4739230Z       %72 = arith.addi %70, %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:50.4739434Z       %73 = arith.addi %71, %4 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.4739696Z       %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T09:37:50.4739940Z       %75 = arith.muli %74, %cst_4 : tensor<16x1xi32, #blocked2>
2026-02-21T09:37:50.4740129Z       %76 = tt.broadcast %75 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4740400Z       %77 = tt.expand_dims %67 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1>
2026-02-21T09:37:50.4740689Z       %78 = tt.broadcast %77 : tensor<1x64xi32, #blocked1> -> tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4740954Z       %79 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:50.4741218Z         %121 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.4741445Z         %122 = arith.addi %121, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.4741619Z         %123 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:50.4741788Z         %124 = tt.splat %123 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:50.4742009Z         %125 = arith.addi %124, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:50.4742279Z         %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:37:50.4749282Z         %127 = tt.broadcast %126 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4749507Z         %128 = arith.addi %76, %127 : tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4749714Z         %129 = tt.addptr %7, %128 : tensor<16x8x!tt.ptr<bf16>, #blocked2>, tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4749922Z         %130 = tt.load %129 : tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:50.4750143Z         %131 = ttg.local_alloc %130 : (tensor<16x8xbf16, #blocked2>) -> !ttg.memdesc<16x8xbf16, #shared, #smem>
2026-02-21T09:37:50.4750473Z         %132 = ttg.local_load %131 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4750941Z         %133 = arith.extf %132 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4751328Z         %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:37:50.4751608Z         %135 = arith.muli %134, %cst_3 : tensor<4x1xi32, #blocked1>
2026-02-21T09:37:50.4751799Z         %136 = tt.broadcast %135 : tensor<4x1xi32, #blocked1> -> tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4751992Z         %137 = arith.addi %136, %78 : tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4752185Z         %138 = tt.addptr %8, %137 : tensor<4x64x!tt.ptr<i8>, #blocked1>, tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4752386Z         %139 = tt.load %138 : tensor<4x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:50.4752629Z         %140 = ttg.convert_layout %139 : tensor<4x64xi8, #blocked1> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4752912Z         %141 = arith.shli %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4753147Z         %142 = arith.shrsi %141, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4753378Z         %143 = arith.shrsi %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4753665Z         %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:37:50.4753999Z         %145 = tt.expand_dims %143 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:37:50.4754277Z         %146 = tt.broadcast %144 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4754512Z         %147 = arith.select %13, %146, %cst_5 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4754771Z         %148 = tt.broadcast %145 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4755000Z         %149 = arith.select %15, %148, %147 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4755258Z         %150 = tt.reshape %149 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:37:50.4755480Z         %151 = arith.sitofp %150 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:37:50.4755727Z         %152 = ttg.local_alloc %151 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:37:50.4756049Z         %153 = ttg.local_load %152 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4756520Z         %154 = tt.dot %133, %153, %arg5, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.4756871Z         scf.yield %154 : tensor<16x64xf32, #mma>
2026-02-21T09:37:50.4757001Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:37:50.4757165Z       %80 = arith.truncf %79 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:50.4757418Z       %81 = tt.expand_dims %73 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:50.4757646Z       %82 = arith.muli %81, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:37:50.4757866Z       %83 = tt.expand_dims %68 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:50.4758113Z       %84 = tt.broadcast %82 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4758305Z       %85 = tt.broadcast %83 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4758473Z       %86 = arith.addi %84, %85 : tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4758685Z       %87 = tt.addptr %16, %86 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4758870Z       tt.store %87, %80 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:50.4759013Z       %88 = arith.addi %arg3, %c9728_i32 : i32
2026-02-21T09:37:50.4759133Z       %89 = arith.divsi %88, %c8_i32 : i32
2026-02-21T09:37:50.4759265Z       %90 = arith.muli %89, %c2_i32 : i32
2026-02-21T09:37:50.4759381Z       %91 = arith.subi %c112_i32, %90 : i32
2026-02-21T09:37:50.4759497Z       %92 = arith.minsi %91, %c2_i32 : i32
2026-02-21T09:37:50.4759610Z       %93 = arith.remsi %88, %c8_i32 : i32
2026-02-21T09:37:50.4759721Z       %94 = arith.remsi %93, %92 : i32
2026-02-21T09:37:50.4759833Z       %95 = arith.addi %90, %94 : i32
2026-02-21T09:37:50.4759938Z       %96 = arith.divsi %93, %92 : i32
2026-02-21T09:37:50.4760048Z       %97 = arith.muli %95, %c64_i32 : i32
2026-02-21T09:37:50.4760209Z       %98 = tt.splat %97 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.4760421Z       %99 = tt.splat %97 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.4760632Z       %100 = arith.addi %98, %1 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.4760840Z       %101 = arith.addi %99, %2 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.4761004Z       %102 = arith.muli %96, %c16_i32 : i32
2026-02-21T09:37:50.4761172Z       %103 = tt.splat %102 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:50.4761386Z       %104 = tt.splat %102 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.4761605Z       %105 = arith.addi %103, %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:50.4761814Z       %106 = arith.addi %104, %4 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.4762088Z       %107 = tt.expand_dims %105 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T09:37:50.4762339Z       %108 = arith.muli %107, %cst_4 : tensor<16x1xi32, #blocked2>
2026-02-21T09:37:50.4762552Z       %109 = tt.broadcast %108 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4762862Z       %110 = tt.expand_dims %100 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1>
2026-02-21T09:37:50.4763138Z       %111 = tt.broadcast %110 : tensor<1x64xi32, #blocked1> -> tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4763406Z       %112 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:50.4763671Z         %121 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.4763895Z         %122 = arith.addi %121, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.4764069Z         %123 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:50.4764241Z         %124 = tt.splat %123 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:50.4764461Z         %125 = arith.addi %124, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:50.4764731Z         %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:37:50.4765006Z         %127 = tt.broadcast %126 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4765197Z         %128 = arith.addi %109, %127 : tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4765396Z         %129 = tt.addptr %7, %128 : tensor<16x8x!tt.ptr<bf16>, #blocked2>, tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4765603Z         %130 = tt.load %129 : tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:50.4765819Z         %131 = ttg.local_alloc %130 : (tensor<16x8xbf16, #blocked2>) -> !ttg.memdesc<16x8xbf16, #shared, #smem>
2026-02-21T09:37:50.4766186Z         %132 = ttg.local_load %131 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4766593Z         %133 = arith.extf %132 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4766987Z         %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:37:50.4767234Z         %135 = arith.muli %134, %cst_3 : tensor<4x1xi32, #blocked1>
2026-02-21T09:37:50.4767424Z         %136 = tt.broadcast %135 : tensor<4x1xi32, #blocked1> -> tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4767618Z         %137 = arith.addi %136, %111 : tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4767812Z         %138 = tt.addptr %8, %137 : tensor<4x64x!tt.ptr<i8>, #blocked1>, tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4768011Z         %139 = tt.load %138 : tensor<4x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:50.4768254Z         %140 = ttg.convert_layout %139 : tensor<4x64xi8, #blocked1> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4768531Z         %141 = arith.shli %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4768766Z         %142 = arith.shrsi %141, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4768999Z         %143 = arith.shrsi %140, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4769284Z         %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:37:50.4769616Z         %145 = tt.expand_dims %143 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:37:50.4769901Z         %146 = tt.broadcast %144 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4770138Z         %147 = arith.select %13, %146, %cst_5 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4770400Z         %148 = tt.broadcast %145 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4770626Z         %149 = arith.select %15, %148, %147 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4770851Z         %150 = tt.reshape %149 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:37:50.4771069Z         %151 = arith.sitofp %150 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:37:50.4771319Z         %152 = ttg.local_alloc %151 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:37:50.4771644Z         %153 = ttg.local_load %152 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4772111Z         %154 = tt.dot %133, %153, %arg5, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.4772462Z         scf.yield %154 : tensor<16x64xf32, #mma>
2026-02-21T09:37:50.4772591Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:37:50.4772755Z       %113 = arith.truncf %112 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:50.4773017Z       %114 = tt.expand_dims %106 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:50.4773249Z       %115 = arith.muli %114, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:37:50.4773476Z       %116 = tt.expand_dims %101 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:50.4773726Z       %117 = tt.broadcast %115 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4773957Z       %118 = tt.broadcast %116 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4774135Z       %119 = arith.addi %117, %118 : tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4774320Z       %120 = tt.addptr %16, %119 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4774526Z       tt.store %120, %113 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:50.4774681Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:37:50.4774838Z     scf.for %arg3 = %22 to %c448_i32 step %c4864_i32  : i32 {
2026-02-21T09:37:50.4774976Z       %23 = arith.divsi %arg3, %c8_i32 : i32
2026-02-21T09:37:50.4775095Z       %24 = arith.muli %23, %c2_i32 : i32
2026-02-21T09:37:50.4775211Z       %25 = arith.subi %c112_i32, %24 : i32
2026-02-21T09:37:50.4775326Z       %26 = arith.minsi %25, %c2_i32 : i32
2026-02-21T09:37:50.4775446Z       %27 = arith.remsi %arg3, %c8_i32 : i32
2026-02-21T09:37:50.4775558Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:37:50.4775672Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:37:50.4775778Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:37:50.4775891Z       %31 = arith.muli %29, %c64_i32 : i32
2026-02-21T09:37:50.4776052Z       %32 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.4776261Z       %33 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.4776469Z       %34 = arith.addi %32, %1 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:37:50.4776672Z       %35 = arith.addi %33, %2 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:37:50.4776830Z       %36 = arith.muli %30, %c16_i32 : i32
2026-02-21T09:37:50.4776989Z       %37 = tt.splat %36 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:50.4777196Z       %38 = tt.splat %36 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.4777402Z       %39 = arith.addi %37, %3 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:37:50.4777603Z       %40 = arith.addi %38, %4 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:37:50.4777885Z       %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T09:37:50.4778128Z       %42 = arith.muli %41, %cst_4 : tensor<16x1xi32, #blocked2>
2026-02-21T09:37:50.4778316Z       %43 = tt.broadcast %42 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4778588Z       %44 = tt.expand_dims %34 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1>
2026-02-21T09:37:50.4778855Z       %45 = tt.broadcast %44 : tensor<1x64xi32, #blocked1> -> tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4779116Z       %46 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x64xf32, #mma>)  : i32 {
2026-02-21T09:37:50.4779382Z         %55 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.4779602Z         %56 = arith.addi %55, %5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:37:50.4779772Z         %57 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:37:50.4779933Z         %58 = tt.splat %57 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:50.4780145Z         %59 = arith.addi %58, %6 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:37:50.4780410Z         %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:37:50.4780678Z         %61 = tt.broadcast %60 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4780865Z         %62 = arith.addi %43, %61 : tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4781089Z         %63 = tt.addptr %7, %62 : tensor<16x8x!tt.ptr<bf16>, #blocked2>, tensor<16x8xi32, #blocked2>
2026-02-21T09:37:50.4781288Z         %64 = tt.load %63 : tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:37:50.4781503Z         %65 = ttg.local_alloc %64 : (tensor<16x8xbf16, #blocked2>) -> !ttg.memdesc<16x8xbf16, #shared, #smem>
2026-02-21T09:37:50.4781842Z         %66 = ttg.local_load %65 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4782239Z         %67 = arith.extf %66 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4782613Z         %68 = tt.expand_dims %56 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:37:50.4782856Z         %69 = arith.muli %68, %cst_3 : tensor<4x1xi32, #blocked1>
2026-02-21T09:37:50.4783040Z         %70 = tt.broadcast %69 : tensor<4x1xi32, #blocked1> -> tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4783228Z         %71 = arith.addi %70, %45 : tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4783421Z         %72 = tt.addptr %8, %71 : tensor<4x64x!tt.ptr<i8>, #blocked1>, tensor<4x64xi32, #blocked1>
2026-02-21T09:37:50.4783612Z         %73 = tt.load %72 : tensor<4x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:37:50.4783848Z         %74 = ttg.convert_layout %73 : tensor<4x64xi8, #blocked1> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4784119Z         %75 = arith.shli %74, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4784349Z         %76 = arith.shrsi %75, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4784577Z         %77 = arith.shrsi %74, %cst_6 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:37:50.4784851Z         %78 = tt.expand_dims %76 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:37:50.4785179Z         %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:37:50.4785467Z         %80 = tt.broadcast %78 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4785699Z         %81 = arith.select %13, %80, %cst_5 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4785927Z         %82 = tt.broadcast %79 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4786146Z         %83 = arith.select %15, %82, %81 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:37:50.4786364Z         %84 = tt.reshape %83 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:37:50.4786575Z         %85 = arith.sitofp %84 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:37:50.4786820Z         %86 = ttg.local_alloc %85 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:37:50.4787135Z         %87 = ttg.local_load %86 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:37:50.4787596Z         %88 = tt.dot %67, %87, %arg5, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x64xf32, #mma>
2026-02-21T09:37:50.4787937Z         scf.yield %88 : tensor<16x64xf32, #mma>
2026-02-21T09:37:50.4788066Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:37:50.4788226Z       %47 = arith.truncf %46 : tensor<16x64xf32, #mma> to tensor<16x64xbf16, #mma>
2026-02-21T09:37:50.4788480Z       %48 = tt.expand_dims %40 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:37:50.4788707Z       %49 = arith.muli %48, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:37:50.4788960Z       %50 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:37:50.4789205Z       %51 = tt.broadcast %49 : tensor<16x1xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4789399Z       %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4789581Z       %53 = arith.addi %51, %52 : tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4789755Z       %54 = tt.addptr %16, %53 : tensor<16x64x!tt.ptr<bf16>, #mma>, tensor<16x64xi32, #mma>
2026-02-21T09:37:50.4789938Z       tt.store %54, %47 : tensor<16x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:37:50.4790092Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:37:50.4790225Z     tt.return
2026-02-21T09:37:50.4790301Z   }
2026-02-21T09:37:50.4790376Z }
2026-02-21T09:37:50.4790419Z 
2026-02-21T09:37:50.4790451Z {-#
2026-02-21T09:37:50.4790529Z   external_resources: {
2026-02-21T09:37:50.4790628Z     mlir_reproducer: {
2026-02-21T09:37:50.4791620Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:37:50.4792612Z       disable_threading: false,
2026-02-21T09:37:50.4792718Z       verify_each: true
2026-02-21T09:37:50.4792804Z     }
2026-02-21T09:37:50.4792878Z   }
2026-02-21T09:37:50.4792945Z #-}
2026-02-21T09:37:50.4793216Z /tmp/torchinductor_root/ht/chtj3uegjhd2go2a3wm2vl7tyhopnj23oo6r55fq36bsvovs2hug.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:37:50.4793888Z /tmp/torchinductor_root/ht/chtj3uegjhd2go2a3wm2vl7tyhopnj23oo6r55fq36bsvovs2hug.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:37:50.4794452Z [106s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:37:50.4795221Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[1, 3], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:37:50.4795927Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:37:50.4796093Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:37:51.4110865Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 16.0 configs/s
2026-02-21T09:37:55.3789541Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 230.1 configs/s
2026-02-21T09:37:55.9644558Z [112s] Generation 2 complete: 
2026-02-21T09:37:55.9644945Z error=15
2026-02-21T09:37:55.9645151Z ok=87
2026-02-21T09:37:55.9645359Z min=0.1117
2026-02-21T09:37:55.9645568Z mid=0.2379
2026-02-21T09:37:55.9645762Z max=8.9027
2026-02-21T09:37:55.9645988Z best={'block_sizes': [64, 64, 16],
2026-02-21T09:37:55.9646355Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:37:55.9646705Z  'l2_groupings': [1],
2026-02-21T09:37:55.9646981Z  'load_eviction_policies': ['', ''],
2026-02-21T09:37:55.9647290Z  'loop_orders': [[0, 1]],
2026-02-21T09:37:55.9647996Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:37:55.9648266Z  'num_stages': 2,
2026-02-21T09:37:55.9648493Z  'num_warps': 4,
2026-02-21T09:37:55.9648741Z  'pid_type': 'flat',
2026-02-21T09:37:55.9649003Z  'range_flattens': [None, None],
2026-02-21T09:37:55.9649304Z  'range_multi_buffers': [None, None],
2026-02-21T09:37:55.9649720Z  'range_num_stages': [0, 0],
2026-02-21T09:37:55.9649996Z  'range_unroll_factors': [0, 1],
2026-02-21T09:37:55.9650290Z  'range_warp_specializes': [],
2026-02-21T09:37:55.9650568Z  'waves_per_eu': 2}
2026-02-21T09:37:55.9984414Z [112s] Fitting surrogate: 308 points, 308 targets
2026-02-21T09:37:56.9687166Z [113s] Generation 3 starting: 98 neighbors, 5 active search path(s)
2026-02-21T09:38:13.1621123Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 7.5 configs/s
2026-02-21T09:38:18.4554075Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:38:18.4555601Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:38:18.4558104Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:38:18.4559068Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:38:18.4559866Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:38:18.4560581Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:38:18.4561216Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:38:18.4561676Z #smem = #ttg.shared_memory
2026-02-21T09:38:18.4562035Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:38:18.4562836Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:38:18.4563713Z     %cst = arith.constant dense<0.000000e+00> : tensor<16x32xf32, #mma>
2026-02-21T09:38:18.4563957Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:38:18.4564137Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:38:18.4564326Z     %c224_i32 = arith.constant 224 : i32
2026-02-21T09:38:18.4564499Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:38:18.4564722Z     %cst_0 = arith.constant dense<0> : tensor<4x2x32xi8, #blocked>
2026-02-21T09:38:18.4564940Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:38:18.4565114Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:38:18.4565294Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:38:18.4565511Z     %cst_1 = arith.constant dense<8192> : tensor<16x1xi32, #blocked1>
2026-02-21T09:38:18.4565849Z     %cst_2 = arith.constant dense<7168> : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4566235Z     %cst_3 = arith.constant dense<4> : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4566564Z     %cst_4 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:38:18.4566821Z     %cst_5 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:38:18.4567078Z     %cst_6 = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:38:18.4567298Z     %0 = tt.get_program_id x : i32
2026-02-21T09:38:18.4567470Z     %1 = arith.remsi %0, %c224_i32 : i32
2026-02-21T09:38:18.4567653Z     %2 = arith.divsi %0, %c224_i32 : i32
2026-02-21T09:38:18.4567821Z     %3 = arith.muli %1, %c32_i32 : i32
2026-02-21T09:38:18.4568187Z     %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:38:18.4568872Z     %5 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:38:18.4569297Z     %6 = tt.splat %3 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:38:18.4569743Z     %7 = tt.splat %3 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:38:18.4570111Z     %8 = arith.addi %6, %4 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:38:18.4570480Z     %9 = arith.addi %7, %5 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:38:18.4570724Z     %10 = arith.muli %2, %c16_i32 : i32
2026-02-21T09:38:18.4571026Z     %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:38:18.4571427Z     %12 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:38:18.4571717Z     %13 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:38:18.4572022Z     %14 = tt.splat %10 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:38:18.4572266Z     %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:38:18.4572513Z     %16 = arith.addi %14, %12 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:38:18.4572835Z     %17 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:38:18.4573194Z     %18 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:38:18.4573551Z     %19 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<16x1xi32, #blocked1>
2026-02-21T09:38:18.4573845Z     %20 = arith.muli %19, %cst_1 : tensor<16x1xi32, #blocked1>
2026-02-21T09:38:18.4574070Z     %21 = tt.broadcast %20 : tensor<16x1xi32, #blocked1> -> tensor<16x8xi32, #blocked1>
2026-02-21T09:38:18.4574350Z     %22 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:38:18.4574741Z     %23 = tt.expand_dims %8 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4575237Z     %24 = tt.broadcast %23 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4575614Z     %25 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4575972Z     %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:38:18.4576453Z     %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:38:18.4576916Z     %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:38:18.4577219Z     %29 = arith.cmpi eq, %28, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:38:18.4577449Z     %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked>
2026-02-21T09:38:18.4577680Z     %31 = arith.cmpi eq, %28, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:38:18.4577904Z     %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked>
2026-02-21T09:38:18.4578205Z     %33 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg4 = %cst) -> (tensor<16x32xf32, #mma>)  : i32 {
2026-02-21T09:38:18.4578661Z       %43 = tt.splat %arg3 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:38:18.4579017Z       %44 = arith.addi %43, %17 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:38:18.4579269Z       %45 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:38:18.4579490Z       %46 = tt.splat %45 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:38:18.4579742Z       %47 = arith.addi %46, %18 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:38:18.4580060Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:38:18.4580379Z       %49 = tt.broadcast %48 : tensor<1x8xi32, #blocked1> -> tensor<16x8xi32, #blocked1>
2026-02-21T09:38:18.4580603Z       %50 = arith.addi %21, %49 : tensor<16x8xi32, #blocked1>
2026-02-21T09:38:18.4580833Z       %51 = tt.addptr %22, %50 : tensor<16x8x!tt.ptr<bf16>, #blocked1>, tensor<16x8xi32, #blocked1>
2026-02-21T09:38:18.4581076Z       %52 = tt.load %51 : tensor<16x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:38:18.4581325Z       %53 = ttg.local_alloc %52 : (tensor<16x8xbf16, #blocked1>) -> !ttg.memdesc<16x8xbf16, #shared, #smem>
2026-02-21T09:38:18.4581676Z       %54 = ttg.local_load %53 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:38:18.4582077Z       %55 = arith.extf %54 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:38:18.4582532Z       %56 = tt.expand_dims %44 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4582886Z       %57 = arith.muli %56, %cst_2 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4583191Z       %58 = tt.broadcast %57 : tensor<4x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4583492Z       %59 = arith.addi %58, %24 : tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4583826Z       %60 = tt.addptr %25, %59 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4584133Z       %61 = tt.load %60 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4584360Z       %62 = arith.shli %61, %cst_3 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4584594Z       %63 = arith.shrsi %62, %cst_3 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4584828Z       %64 = arith.shrsi %61, %cst_3 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:38:18.4585116Z       %65 = tt.expand_dims %63 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:38:18.4585444Z       %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:38:18.4585725Z       %67 = tt.broadcast %65 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:38:18.4585957Z       %68 = arith.select %30, %67, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:38:18.4586190Z       %69 = tt.broadcast %66 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:38:18.4586420Z       %70 = arith.select %32, %69, %68 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:38:18.4586639Z       %71 = tt.reshape %70 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:38:18.4586858Z       %72 = arith.sitofp %71 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:38:18.4587129Z       %73 = ttg.local_alloc %72 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:38:18.4587448Z       %74 = ttg.local_load %73 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:38:18.4587918Z       %75 = tt.dot %55, %74, %arg4, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma>
2026-02-21T09:38:18.4588281Z       scf.yield %75 : tensor<16x32xf32, #mma>
2026-02-21T09:38:18.4588415Z     } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:38:18.4588578Z     %34 = arith.truncf %33 : tensor<16x32xf32, #mma> to tensor<16x32xbf16, #mma>
2026-02-21T09:38:18.4588837Z     %35 = tt.expand_dims %16 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:38:18.4589073Z     %36 = arith.muli %35, %cst_6 : tensor<16x1xi32, #mma>
2026-02-21T09:38:18.4589297Z     %37 = tt.expand_dims %9 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma>
2026-02-21T09:38:18.4589547Z     %38 = tt.broadcast %36 : tensor<16x1xi32, #mma> -> tensor<16x32xi32, #mma>
2026-02-21T09:38:18.4589740Z     %39 = tt.broadcast %37 : tensor<1x32xi32, #mma> -> tensor<16x32xi32, #mma>
2026-02-21T09:38:18.4589916Z     %40 = arith.addi %38, %39 : tensor<16x32xi32, #mma>
2026-02-21T09:38:18.4590087Z     %41 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:38:18.4590294Z     %42 = tt.addptr %41, %40 : tensor<16x32x!tt.ptr<bf16>, #mma>, tensor<16x32xi32, #mma>
2026-02-21T09:38:18.4590490Z     tt.store %42, %34 : tensor<16x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:38:18.4590619Z     tt.return
2026-02-21T09:38:18.4590707Z   }
2026-02-21T09:38:18.4590786Z }
2026-02-21T09:38:18.4590835Z 
2026-02-21T09:38:18.4590868Z {-#
2026-02-21T09:38:18.4590953Z   external_resources: {
2026-02-21T09:38:18.4591058Z     mlir_reproducer: {
2026-02-21T09:38:18.4592054Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:38:18.4593073Z       disable_threading: false,
2026-02-21T09:38:18.4593182Z       verify_each: true
2026-02-21T09:38:18.4593277Z     }
2026-02-21T09:38:18.4593354Z   }
2026-02-21T09:38:18.4593433Z #-}
2026-02-21T09:38:18.4593713Z /tmp/torchinductor_root/da/cdattooczxmlgvhivkfyxauwjzd73sq74blzs7vbmlnrihpx52hr.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:38:18.4594401Z /tmp/torchinductor_root/da/cdattooczxmlgvhivkfyxauwjzd73sq74blzs7vbmlnrihpx52hr.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:38:18.4594960Z [134s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:38:18.4595670Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:38:18.4596326Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:38:18.4596531Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:38:19.5143755Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 15.8 configs/s
2026-02-21T09:38:26.2784685Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 155.3 configs/s
2026-02-21T09:38:26.8504467Z [142s] Generation 3 complete: 
2026-02-21T09:38:26.8504825Z error=2
2026-02-21T09:38:26.8505034Z ok=101
2026-02-21T09:38:26.8505243Z min=0.1116
2026-02-21T09:38:26.8505453Z mid=0.2058
2026-02-21T09:38:26.8505661Z max=12.0261
2026-02-21T09:38:26.8505903Z best={'block_sizes': [64, 64, 16],
2026-02-21T09:38:26.8506272Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:38:26.8506630Z  'l2_groupings': [1],
2026-02-21T09:38:26.8506900Z  'load_eviction_policies': ['', ''],
2026-02-21T09:38:26.8507219Z  'loop_orders': [[0, 1]],
2026-02-21T09:38:26.8507503Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:38:26.8507807Z  'num_stages': 2,
2026-02-21T09:38:26.8508029Z  'num_warps': 4,
2026-02-21T09:38:26.8508273Z  'pid_type': 'flat',
2026-02-21T09:38:26.8508549Z  'range_flattens': [None, True],
2026-02-21T09:38:26.8508861Z  'range_multi_buffers': [None, None],
2026-02-21T09:38:26.8509177Z  'range_num_stages': [0, 0],
2026-02-21T09:38:26.8509453Z  'range_unroll_factors': [0, 1],
2026-02-21T09:38:26.8509751Z  'range_warp_specializes': [],
2026-02-21T09:38:26.8510022Z  'waves_per_eu': 2}
2026-02-21T09:38:26.9175240Z [143s] Fitting surrogate: 411 points, 411 targets
2026-02-21T09:38:27.8925056Z [143s] Generation 4 starting: 93 neighbors, 5 active search path(s)
2026-02-21T09:38:45.0472271Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 3.1 configs/s
2026-02-21T09:38:50.9074607Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 16.4 configs/s
2026-02-21T09:38:56.6565832Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 164.2 configs/s
2026-02-21T09:38:57.2563058Z [173s] Generation 4 complete: 
2026-02-21T09:38:57.2564014Z error=7
2026-02-21T09:38:57.2564254Z ok=91
2026-02-21T09:38:57.2564452Z min=0.1092
2026-02-21T09:38:57.2564682Z mid=0.2270
2026-02-21T09:38:57.2564886Z max=8.0735
2026-02-21T09:38:57.2565198Z best={'block_sizes': [64, 64, 16],
2026-02-21T09:38:57.2565574Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:38:57.2565942Z  'l2_groupings': [1],
2026-02-21T09:38:57.2566199Z  'load_eviction_policies': ['', ''],
2026-02-21T09:38:57.2566473Z  'loop_orders': [[0, 1]],
2026-02-21T09:38:57.2566722Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:38:57.2566961Z  'num_stages': 1,
2026-02-21T09:38:57.2567429Z  'num_warps': 4,
2026-02-21T09:38:57.2567726Z  'pid_type': 'flat',
2026-02-21T09:38:57.2567982Z  'range_flattens': [None, True],
2026-02-21T09:38:57.2568266Z  'range_multi_buffers': [None, None],
2026-02-21T09:38:57.2568617Z  'range_num_stages': [0, 0],
2026-02-21T09:38:57.2568873Z  'range_unroll_factors': [0, 1],
2026-02-21T09:38:57.2569135Z  'range_warp_specializes': [],
2026-02-21T09:38:57.2569391Z  'waves_per_eu': 3}
2026-02-21T09:38:57.3202466Z [173s] Fitting surrogate: 509 points, 509 targets
2026-02-21T09:38:58.2904321Z [174s] Generation 5 starting: 91 neighbors, 5 active search path(s)
2026-02-21T09:39:17.7830277Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 2.2 configs/s
2026-02-21T09:39:23.3461630Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 17.0 configs/s
2026-02-21T09:39:31.4165172Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 119.2 configs/s
2026-02-21T09:39:32.1046830Z [208s] Generation 5 complete: 
2026-02-21T09:39:32.1047212Z error=7
2026-02-21T09:39:32.1047406Z ok=89
2026-02-21T09:39:32.1047604Z min=0.1086
2026-02-21T09:39:32.1047800Z mid=0.1562
2026-02-21T09:39:32.1048433Z max=5.9452
2026-02-21T09:39:32.1048653Z best={'block_sizes': [64, 64, 16],
2026-02-21T09:39:32.1049025Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:39:32.1049367Z  'l2_groupings': [1],
2026-02-21T09:39:32.1049632Z  'load_eviction_policies': ['', ''],
2026-02-21T09:39:32.1050045Z  'loop_orders': [[0, 1]],
2026-02-21T09:39:32.1050309Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:39:32.1050568Z  'num_stages': 1,
2026-02-21T09:39:32.1050784Z  'num_warps': 4,
2026-02-21T09:39:32.1051009Z  'pid_type': 'flat',
2026-02-21T09:39:32.1051255Z  'range_flattens': [None, True],
2026-02-21T09:39:32.1051545Z  'range_multi_buffers': [None, None],
2026-02-21T09:39:32.1051836Z  'range_num_stages': [0, 0],
2026-02-21T09:39:32.1052105Z  'range_unroll_factors': [0, 1],
2026-02-21T09:39:32.1052398Z  'range_warp_specializes': [],
2026-02-21T09:39:32.1052671Z  'waves_per_eu': 3}
2026-02-21T09:39:32.1978945Z [208s] Fitting surrogate: 605 points, 605 targets
2026-02-21T09:39:33.0933917Z [209s] Generation 6 starting: 85 neighbors, 5 active search path(s)
2026-02-21T09:39:52.5166883Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0.9 configs/s
2026-02-21T09:39:56.6908658Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:39:56.6917439Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:39:56.6918347Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:39:56.6919188Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:39:56.6919963Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:39:56.6920872Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:39:56.6921376Z #smem = #ttg.shared_memory
2026-02-21T09:39:56.6921969Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:39:56.6923293Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:39:56.6924334Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:39:56.6924761Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:39:56.6925262Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:39:56.6925726Z     %cst_2 = arith.constant dense<7168> : tensor<2x1xi32, #blocked1>
2026-02-21T09:39:56.6926746Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2>
2026-02-21T09:39:56.6927122Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:39:56.6927602Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<16x128xf32, #mma>
2026-02-21T09:39:56.6927968Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:39:56.6928358Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:39:56.6928673Z     %c224_i32 = arith.constant 224 : i32
2026-02-21T09:39:56.6929066Z     %cst_5 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6929509Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:39:56.6929842Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:39:56.6930169Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:39:56.6930526Z     %c4092_i32 = arith.constant 4092 : i32
2026-02-21T09:39:56.6930857Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:39:56.6931405Z     %cst_6 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6931942Z     %0 = tt.get_program_id x : i32
2026-02-21T09:39:56.6932211Z     %1 = arith.divsi %0, %c224_i32 : i32
2026-02-21T09:39:56.6932452Z     %2 = arith.muli %1, %c4_i32 : i32
2026-02-21T09:39:56.6932696Z     %3 = arith.subi %c4_i32, %2 : i32
2026-02-21T09:39:56.6932963Z     %4 = arith.minsi %3, %c4_i32 : i32
2026-02-21T09:39:56.6933349Z     %5 = arith.remsi %0, %c224_i32 : i32
2026-02-21T09:39:56.6933613Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:39:56.6933860Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:39:56.6934098Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:39:56.6934282Z     %9 = arith.muli %7, %c16_i32 : i32
2026-02-21T09:39:56.6934578Z     %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:39:56.6934984Z     %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:39:56.6935400Z     %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:39:56.6935717Z     %13 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:39:56.6936026Z     %14 = arith.addi %12, %10 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:39:56.6936410Z     %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:39:56.6936652Z     %16 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:39:56.6936933Z     %17 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:39:56.6937342Z     %18 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:39:56.6937709Z     %19 = tt.splat %16 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:39:56.6938057Z     %20 = tt.splat %16 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:39:56.6938376Z     %21 = arith.addi %19, %17 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:39:56.6938680Z     %22 = arith.addi %20, %18 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:39:56.6939044Z     %23 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:39:56.6939440Z     %24 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:39:56.6939887Z     %25 = tt.expand_dims %14 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T09:39:56.6940254Z     %26 = arith.muli %25, %cst_3 : tensor<16x1xi32, #blocked2>
2026-02-21T09:39:56.6940534Z     %27 = tt.broadcast %26 : tensor<16x1xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6940849Z     %28 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:39:56.6941294Z     %29 = tt.expand_dims %22 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:39:56.6941710Z     %30 = tt.broadcast %29 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6942026Z     %31 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:39:56.6942400Z     %32 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:39:56.6942874Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:39:56.6943326Z     %34 = tt.expand_dims %33 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:39:56.6943614Z     %35 = arith.cmpi eq, %34, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:39:56.6943841Z     %36 = tt.broadcast %35 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:39:56.6944059Z     %37 = arith.cmpi eq, %34, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:39:56.6944276Z     %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:39:56.6944575Z     %39 = scf.for %arg3 = %c0_i32 to %c4092_i32 step %c6_i32 iter_args(%arg4 = %cst_4) -> (tensor<16x128xf32, #mma>)  : i32 {
2026-02-21T09:39:56.6944901Z       %50 = tt.splat %arg3 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:39:56.6945157Z       %51 = arith.addi %50, %23 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:39:56.6945352Z       %52 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:39:56.6945539Z       %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:39:56.6945787Z       %54 = arith.addi %53, %24 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:39:56.6946105Z       %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:39:56.6946436Z       %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6946651Z       %57 = arith.addi %27, %56 : tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6946877Z       %58 = tt.addptr %28, %57 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6947110Z       %59 = tt.load %58 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:39:56.6947411Z       %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6947870Z       %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6948300Z       %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:39:56.6948577Z       %63 = arith.muli %62, %cst_2 : tensor<2x1xi32, #blocked1>
2026-02-21T09:39:56.6948794Z       %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6949016Z       %65 = arith.addi %64, %30 : tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6949237Z       %66 = tt.addptr %31, %65 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6949461Z       %67 = tt.load %66 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:39:56.6949737Z       %68 = ttg.convert_layout %67 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6950053Z       %69 = arith.shli %68, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6950318Z       %70 = arith.shrsi %69, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6950619Z       %71 = arith.shrsi %68, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6950940Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:39:56.6951321Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:39:56.6951641Z       %74 = tt.broadcast %72 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6951916Z       %75 = arith.select %36, %74, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6952182Z       %76 = tt.broadcast %73 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6952427Z       %77 = arith.select %38, %76, %75 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6952652Z       %78 = tt.reshape %77 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked1>
2026-02-21T09:39:56.6952866Z       %79 = arith.sitofp %78 : tensor<4x128xi8, #blocked1> to tensor<4x128xf32, #blocked1>
2026-02-21T09:39:56.6953113Z       %80 = ttg.local_alloc %79 : (tensor<4x128xf32, #blocked1>) -> !ttg.memdesc<4x128xf32, #shared, #smem>
2026-02-21T09:39:56.6953447Z       %81 = ttg.local_load %80 : !ttg.memdesc<4x128xf32, #shared, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6953913Z       %82 = tt.dot %61, %81, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:39:56.6954256Z       %83 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:39:56.6954421Z       %84 = tt.splat %83 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:39:56.6954644Z       %85 = arith.addi %84, %23 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:39:56.6954814Z       %86 = arith.muli %83, %c2_i32 : i32
2026-02-21T09:39:56.6954973Z       %87 = tt.splat %86 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:39:56.6955204Z       %88 = arith.addi %87, %24 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:39:56.6955470Z       %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:39:56.6955738Z       %90 = tt.broadcast %89 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6955923Z       %91 = arith.addi %27, %90 : tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6956114Z       %92 = tt.addptr %28, %91 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6956313Z       %93 = tt.load %92 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:39:56.6956572Z       %94 = ttg.convert_layout %93 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6956963Z       %95 = arith.extf %94 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6957342Z       %96 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:39:56.6957581Z       %97 = arith.muli %96, %cst_2 : tensor<2x1xi32, #blocked1>
2026-02-21T09:39:56.6957770Z       %98 = tt.broadcast %97 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6957956Z       %99 = arith.addi %98, %30 : tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6958177Z       %100 = tt.addptr %31, %99 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6958390Z       %101 = tt.load %100 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:39:56.6958669Z       %102 = ttg.convert_layout %101 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6958957Z       %103 = arith.shli %102, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6959197Z       %104 = arith.shrsi %103, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6959439Z       %105 = arith.shrsi %102, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6959735Z       %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:39:56.6960075Z       %107 = tt.expand_dims %105 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:39:56.6960364Z       %108 = tt.broadcast %106 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6960606Z       %109 = arith.select %36, %108, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6969505Z       %110 = tt.broadcast %107 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6969781Z       %111 = arith.select %38, %110, %109 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6970064Z       %112 = tt.reshape %111 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked1>
2026-02-21T09:39:56.6970294Z       %113 = arith.sitofp %112 : tensor<4x128xi8, #blocked1> to tensor<4x128xf32, #blocked1>
2026-02-21T09:39:56.6970548Z       %114 = ttg.local_alloc %113 : (tensor<4x128xf32, #blocked1>) -> !ttg.memdesc<4x128xf32, #shared, #smem>
2026-02-21T09:39:56.6970873Z       %115 = ttg.local_load %114 : !ttg.memdesc<4x128xf32, #shared, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6971352Z       %116 = tt.dot %95, %115, %82, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:39:56.6971695Z       %117 = arith.addi %arg3, %c4_i32 : i32
2026-02-21T09:39:56.6971890Z       %118 = tt.splat %117 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:39:56.6972116Z       %119 = arith.addi %118, %23 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:39:56.6972288Z       %120 = arith.muli %117, %c2_i32 : i32
2026-02-21T09:39:56.6972457Z       %121 = tt.splat %120 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:39:56.6972673Z       %122 = arith.addi %121, %24 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:39:56.6972950Z       %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:39:56.6973228Z       %124 = tt.broadcast %123 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6973424Z       %125 = arith.addi %27, %124 : tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6973627Z       %126 = tt.addptr %28, %125 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6973833Z       %127 = tt.load %126 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:39:56.6974099Z       %128 = ttg.convert_layout %127 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6974494Z       %129 = arith.extf %128 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6974871Z       %130 = tt.expand_dims %119 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:39:56.6975118Z       %131 = arith.muli %130, %cst_2 : tensor<2x1xi32, #blocked1>
2026-02-21T09:39:56.6975344Z       %132 = tt.broadcast %131 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6975539Z       %133 = arith.addi %132, %30 : tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6975743Z       %134 = tt.addptr %31, %133 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6975945Z       %135 = tt.load %134 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:39:56.6976189Z       %136 = ttg.convert_layout %135 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6976467Z       %137 = arith.shli %136, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6976704Z       %138 = arith.shrsi %137, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6976939Z       %139 = arith.shrsi %136, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6977228Z       %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:39:56.6977567Z       %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:39:56.6977856Z       %142 = tt.broadcast %140 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6978109Z       %143 = arith.select %36, %142, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6978348Z       %144 = tt.broadcast %141 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6978580Z       %145 = arith.select %38, %144, %143 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6978814Z       %146 = tt.reshape %145 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked1>
2026-02-21T09:39:56.6979036Z       %147 = arith.sitofp %146 : tensor<4x128xi8, #blocked1> to tensor<4x128xf32, #blocked1>
2026-02-21T09:39:56.6979293Z       %148 = ttg.local_alloc %147 : (tensor<4x128xf32, #blocked1>) -> !ttg.memdesc<4x128xf32, #shared, #smem>
2026-02-21T09:39:56.6979636Z       %149 = ttg.local_load %148 : !ttg.memdesc<4x128xf32, #shared, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6980104Z       %150 = tt.dot %129, %149, %116, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:39:56.6980457Z       scf.yield %150 : tensor<16x128xf32, #mma>
2026-02-21T09:39:56.6980591Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:39:56.6980806Z     %40 = scf.for %arg3 = %c4092_i32 to %c4096_i32 step %c2_i32 iter_args(%arg4 = %39) -> (tensor<16x128xf32, #mma>)  : i32 {
2026-02-21T09:39:56.6981076Z       %50 = tt.splat %arg3 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:39:56.6981298Z       %51 = arith.addi %50, %23 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:39:56.6981470Z       %52 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:39:56.6981634Z       %53 = tt.splat %52 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:39:56.6981845Z       %54 = arith.addi %53, %24 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:39:56.6982121Z       %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:39:56.6982391Z       %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6982584Z       %57 = arith.addi %27, %56 : tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6982777Z       %58 = tt.addptr %28, %57 : tensor<16x4x!tt.ptr<bf16>, #blocked2>, tensor<16x4xi32, #blocked2>
2026-02-21T09:39:56.6982978Z       %59 = tt.load %58 : tensor<16x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:39:56.6983270Z       %60 = ttg.convert_layout %59 : tensor<16x4xbf16, #blocked2> -> tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6983671Z       %61 = arith.extf %60 : tensor<16x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6984043Z       %62 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:39:56.6984283Z       %63 = arith.muli %62, %cst_2 : tensor<2x1xi32, #blocked1>
2026-02-21T09:39:56.6984470Z       %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6984661Z       %65 = arith.addi %64, %30 : tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6984852Z       %66 = tt.addptr %31, %65 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:39:56.6985051Z       %67 = tt.load %66 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:39:56.6985294Z       %68 = ttg.convert_layout %67 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6985566Z       %69 = arith.shli %68, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6985816Z       %70 = arith.shrsi %69, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6986043Z       %71 = arith.shrsi %68, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:39:56.6986329Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:39:56.6986663Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:39:56.6986944Z       %74 = tt.broadcast %72 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6987183Z       %75 = arith.select %36, %74, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6987416Z       %76 = tt.broadcast %73 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6987661Z       %77 = arith.select %38, %76, %75 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:39:56.6987886Z       %78 = tt.reshape %77 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked1>
2026-02-21T09:39:56.6988102Z       %79 = arith.sitofp %78 : tensor<4x128xi8, #blocked1> to tensor<4x128xf32, #blocked1>
2026-02-21T09:39:56.6988350Z       %80 = ttg.local_alloc %79 : (tensor<4x128xf32, #blocked1>) -> !ttg.memdesc<4x128xf32, #shared, #smem>
2026-02-21T09:39:56.6988666Z       %81 = ttg.local_load %80 : !ttg.memdesc<4x128xf32, #shared, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:39:56.6989135Z       %82 = tt.dot %61, %81, %arg4, inputPrecision = tf32 : tensor<16x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x128xf32, #mma>
2026-02-21T09:39:56.6989478Z       scf.yield %82 : tensor<16x128xf32, #mma>
2026-02-21T09:39:56.6989606Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:39:56.6989774Z     %41 = arith.truncf %40 : tensor<16x128xf32, #mma> to tensor<16x128xbf16, #mma>
2026-02-21T09:39:56.6990030Z     %42 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:39:56.6990260Z     %43 = arith.muli %42, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:39:56.6990488Z     %44 = tt.expand_dims %21 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:39:56.6990741Z     %45 = tt.broadcast %43 : tensor<16x1xi32, #mma> -> tensor<16x128xi32, #mma>
2026-02-21T09:39:56.6990938Z     %46 = tt.broadcast %44 : tensor<1x128xi32, #mma> -> tensor<16x128xi32, #mma>
2026-02-21T09:39:56.6991138Z     %47 = arith.addi %45, %46 : tensor<16x128xi32, #mma>
2026-02-21T09:39:56.6991308Z     %48 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:39:56.6991520Z     %49 = tt.addptr %48, %47 : tensor<16x128x!tt.ptr<bf16>, #mma>, tensor<16x128xi32, #mma>
2026-02-21T09:39:56.6991707Z     tt.store %49, %41 : tensor<16x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:39:56.6991839Z     tt.return
2026-02-21T09:39:56.6991919Z   }
2026-02-21T09:39:56.6991996Z }
2026-02-21T09:39:56.6992042Z 
2026-02-21T09:39:56.6992073Z {-#
2026-02-21T09:39:56.6992158Z   external_resources: {
2026-02-21T09:39:56.6992256Z     mlir_reproducer: {
2026-02-21T09:39:56.6993256Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:39:56.6994275Z       disable_threading: false,
2026-02-21T09:39:56.6994383Z       verify_each: true
2026-02-21T09:39:56.6994471Z     }
2026-02-21T09:39:56.6994546Z   }
2026-02-21T09:39:56.6994614Z #-}
2026-02-21T09:39:56.6994892Z /tmp/torchinductor_root/iu/ciu33d2kqmchnrze6gaykjz6ozjy4nn6ut2tq2xu35tct73cvs4a.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:39:56.6995575Z /tmp/torchinductor_root/iu/ciu33d2kqmchnrze6gaykjz6ozjy4nn6ut2tq2xu35tct73cvs4a.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:39:56.6996120Z [232s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:39:56.6996853Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 16, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:39:56.6997503Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:39:56.6997670Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:39:58.0283015Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 16.2 configs/s
2026-02-21T09:40:02.8317985Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 195.9 configs/s
2026-02-21T09:40:03.4209266Z [239s] Generation 6 complete: 
2026-02-21T09:40:03.4209637Z error=5
2026-02-21T09:40:03.4209833Z ok=85
2026-02-21T09:40:03.4210020Z min=0.1021
2026-02-21T09:40:03.4210217Z mid=0.2053
2026-02-21T09:40:03.4210418Z max=12.5282
2026-02-21T09:40:03.4210637Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:40:03.4210987Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:40:03.4211324Z  'l2_groupings': [1],
2026-02-21T09:40:03.4211580Z  'load_eviction_policies': ['', ''],
2026-02-21T09:40:03.4211874Z  'loop_orders': [[0, 1]],
2026-02-21T09:40:03.4212141Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:40:03.4212393Z  'num_stages': 2,
2026-02-21T09:40:03.4212613Z  'num_warps': 8,
2026-02-21T09:40:03.4212828Z  'pid_type': 'flat',
2026-02-21T09:40:03.4213076Z  'range_flattens': [None, False],
2026-02-21T09:40:03.4213366Z  'range_multi_buffers': [None, False],
2026-02-21T09:40:03.4213656Z  'range_num_stages': [0, 0],
2026-02-21T09:40:03.4214288Z  'range_unroll_factors': [0, 1],
2026-02-21T09:40:03.4214582Z  'range_warp_specializes': [],
2026-02-21T09:40:03.4214843Z  'waves_per_eu': 2}
2026-02-21T09:40:03.4898227Z [239s] Fitting surrogate: 695 points, 695 targets
2026-02-21T09:40:05.0373543Z [241s] Generation 7 starting: 55 neighbors, 3 active search path(s)
2026-02-21T09:40:15.7569388Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 3.7 configs/s
2026-02-21T09:40:19.1154771Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 17.4 configs/s
2026-02-21T09:40:23.1271952Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 231.2 configs/s
2026-02-21T09:40:23.6336524Z [259s] Generation 7 complete: 
2026-02-21T09:40:23.6869679Z error=4
2026-02-21T09:40:23.6869903Z ok=54
2026-02-21T09:40:23.6870112Z min=0.1020
2026-02-21T09:40:23.6870320Z mid=0.1978
2026-02-21T09:40:23.6870515Z max=7.7511
2026-02-21T09:40:23.6870785Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:40:23.6871164Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:40:23.6871538Z  'l2_groupings': [1],
2026-02-21T09:40:23.6871816Z  'load_eviction_policies': ['', ''],
2026-02-21T09:40:23.6872121Z  'loop_orders': [[0, 1]],
2026-02-21T09:40:23.6872642Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:40:23.6872906Z  'num_stages': 2,
2026-02-21T09:40:23.6873140Z  'num_warps': 8,
2026-02-21T09:40:23.6873368Z  'pid_type': 'flat',
2026-02-21T09:40:23.6873626Z  'range_flattens': [None, False],
2026-02-21T09:40:23.6873937Z  'range_multi_buffers': [None, False],
2026-02-21T09:40:23.6874249Z  'range_num_stages': [0, 0],
2026-02-21T09:40:23.6874534Z  'range_unroll_factors': [0, 1],
2026-02-21T09:40:23.6874833Z  'range_warp_specializes': [],
2026-02-21T09:40:23.6875106Z  'waves_per_eu': 2}
2026-02-21T09:40:23.6875390Z [259s] Fitting surrogate: 753 points, 753 targets
2026-02-21T09:40:24.2932538Z [260s] Generation 8 starting: 46 neighbors, 3 active search path(s)
2026-02-21T09:40:32.6712518Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 9.6 configs/s
2026-02-21T09:40:35.5866545Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 17.0 configs/s
2026-02-21T09:40:38.7600339Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 288.1 configs/s
2026-02-21T09:40:39.2439725Z [275s] Generation 8 complete: 
2026-02-21T09:40:39.2440065Z error=3
2026-02-21T09:40:39.2440293Z ok=46
2026-02-21T09:40:39.2440501Z min=0.1019
2026-02-21T09:40:39.2440713Z mid=0.2245
2026-02-21T09:40:39.2440912Z max=6.0647
2026-02-21T09:40:39.2441140Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:40:39.2444036Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:40:39.2444484Z  'l2_groupings': [2],
2026-02-21T09:40:39.2444771Z  'load_eviction_policies': ['', ''],
2026-02-21T09:40:39.2445091Z  'loop_orders': [[1, 0]],
2026-02-21T09:40:39.2445410Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:40:39.2445688Z  'num_stages': 3,
2026-02-21T09:40:39.2445943Z  'num_warps': 8,
2026-02-21T09:40:39.2446182Z  'pid_type': 'flat',
2026-02-21T09:40:39.2446446Z  'range_flattens': [None, True],
2026-02-21T09:40:39.2446769Z  'range_multi_buffers': [None, None],
2026-02-21T09:40:39.2447088Z  'range_num_stages': [0, 3],
2026-02-21T09:40:39.2447375Z  'range_unroll_factors': [0, 1],
2026-02-21T09:40:39.2447679Z  'range_warp_specializes': [],
2026-02-21T09:40:39.2447967Z  'waves_per_eu': 4}
2026-02-21T09:40:39.2850521Z [275s] Fitting surrogate: 802 points, 802 targets
2026-02-21T09:40:39.7338037Z [275s] Generation 9 starting: 33 neighbors, 2 active search path(s)
2026-02-21T09:40:45.7168716Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 9.0 configs/s
2026-02-21T09:40:47.9534131Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 16.2 configs/s
2026-02-21T09:40:49.3855858Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 568.5 configs/s
2026-02-21T09:40:49.8218649Z [285s] Generation 9 complete: 
2026-02-21T09:40:49.8219000Z error=1
2026-02-21T09:40:49.8219209Z ok=34
2026-02-21T09:40:49.8219435Z min=0.1019
2026-02-21T09:40:49.8219646Z mid=0.6240
2026-02-21T09:40:49.8219837Z max=4.6029
2026-02-21T09:40:49.8220064Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:40:49.8220429Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:40:49.8220780Z  'l2_groupings': [2],
2026-02-21T09:40:49.8221049Z  'load_eviction_policies': ['', ''],
2026-02-21T09:40:49.8221357Z  'loop_orders': [[1, 0]],
2026-02-21T09:40:49.8221641Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:40:49.8221907Z  'num_stages': 3,
2026-02-21T09:40:49.8222134Z  'num_warps': 8,
2026-02-21T09:40:49.8222370Z  'pid_type': 'flat',
2026-02-21T09:40:49.8222632Z  'range_flattens': [None, True],
2026-02-21T09:40:49.8222946Z  'range_multi_buffers': [None, None],
2026-02-21T09:40:49.8223254Z  'range_num_stages': [0, 3],
2026-02-21T09:40:49.8223531Z  'range_unroll_factors': [0, 1],
2026-02-21T09:40:49.8223791Z  'range_warp_specializes': [],
2026-02-21T09:40:49.8224016Z  'waves_per_eu': 4}
2026-02-21T09:40:49.8414581Z [285s] Fitting surrogate: 837 points, 837 targets
2026-02-21T09:40:50.3307219Z [286s] Generation 10 starting: 34 neighbors, 2 active search path(s)
2026-02-21T09:40:57.4232151Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 7.3 configs/s
2026-02-21T09:40:59.7069124Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 15.9 configs/s
2026-02-21T09:41:01.0895848Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 590.8 configs/s
2026-02-21T09:41:01.4663604Z [297s] Generation 10 complete: 
2026-02-21T09:41:01.4663964Z ok=36
2026-02-21T09:41:01.4664178Z min=0.1020
2026-02-21T09:41:01.4664430Z mid=1.0527
2026-02-21T09:41:01.4664632Z max=4.3885
2026-02-21T09:41:01.4664856Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:41:01.4665563Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:41:01.4665911Z  'l2_groupings': [2],
2026-02-21T09:41:01.4666183Z  'load_eviction_policies': ['', ''],
2026-02-21T09:41:01.4666510Z  'loop_orders': [[1, 0]],
2026-02-21T09:41:01.4666780Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:41:01.4667045Z  'num_stages': 3,
2026-02-21T09:41:01.4667281Z  'num_warps': 8,
2026-02-21T09:41:01.4667507Z  'pid_type': 'flat',
2026-02-21T09:41:01.4667761Z  'range_flattens': [None, True],
2026-02-21T09:41:01.4668068Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:01.4668370Z  'range_num_stages': [0, 3],
2026-02-21T09:41:01.4668652Z  'range_unroll_factors': [0, 1],
2026-02-21T09:41:01.4668938Z  'range_warp_specializes': [],
2026-02-21T09:41:01.4669223Z  'waves_per_eu': 4}
2026-02-21T09:41:01.4845355Z [297s] Fitting surrogate: 873 points, 873 targets
2026-02-21T09:41:01.8221879Z [297s] Generation 11 starting: 22 neighbors, 1 active search path(s)
2026-02-21T09:41:06.6282818Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 3.7 configs/s
2026-02-21T09:41:08.2147684Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 23/23 15.9 configs/s
2026-02-21T09:41:08.2153645Z [304s] Generation 11 complete: 
2026-02-21T09:41:08.2153906Z ok=24
2026-02-21T09:41:08.2154078Z min=0.1020
2026-02-21T09:41:08.2154249Z mid=1.1884
2026-02-21T09:41:08.2154416Z max=4.5023
2026-02-21T09:41:08.2154600Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:41:08.2154901Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:41:08.2155189Z  'l2_groupings': [2],
2026-02-21T09:41:08.2155415Z  'load_eviction_policies': ['', ''],
2026-02-21T09:41:08.2155666Z  'loop_orders': [[1, 0]],
2026-02-21T09:41:08.2155896Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:41:08.2156117Z  'num_stages': 3,
2026-02-21T09:41:08.2156301Z  'num_warps': 8,
2026-02-21T09:41:08.2156503Z  'pid_type': 'flat',
2026-02-21T09:41:08.2157082Z  'range_flattens': [None, True],
2026-02-21T09:41:08.2157330Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:08.2157585Z  'range_num_stages': [0, 3],
2026-02-21T09:41:08.2157813Z  'range_unroll_factors': [0, 1],
2026-02-21T09:41:08.2158129Z  'range_warp_specializes': [],
2026-02-21T09:41:08.2158356Z  'waves_per_eu': 4}
2026-02-21T09:41:08.2186019Z [304s] Fitting surrogate: 897 points, 897 targets
2026-02-21T09:41:08.5209851Z [304s] Generation 12 starting: 18 neighbors, 1 active search path(s)
2026-02-21T09:41:12.6201249Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 2.5 configs/s
2026-02-21T09:41:14.0426640Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 14.8 configs/s
2026-02-21T09:41:14.0431234Z [310s] Generation 12 complete: 
2026-02-21T09:41:14.0431536Z ok=20
2026-02-21T09:41:14.0431720Z min=0.1020
2026-02-21T09:41:14.0431905Z mid=1.2441
2026-02-21T09:41:14.0432084Z max=14.1591
2026-02-21T09:41:14.0432313Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:41:14.0432641Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:41:14.0432967Z  'l2_groupings': [2],
2026-02-21T09:41:14.0433204Z  'load_eviction_policies': ['', ''],
2026-02-21T09:41:14.0433474Z  'loop_orders': [[1, 0]],
2026-02-21T09:41:14.0433724Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:41:14.0433961Z  'num_stages': 3,
2026-02-21T09:41:14.0434159Z  'num_warps': 8,
2026-02-21T09:41:14.0434362Z  'pid_type': 'flat',
2026-02-21T09:41:14.0434587Z  'range_flattens': [None, True],
2026-02-21T09:41:14.0434850Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:14.0435114Z  'range_num_stages': [0, 3],
2026-02-21T09:41:14.0435356Z  'range_unroll_factors': [0, 1],
2026-02-21T09:41:14.0435612Z  'range_warp_specializes': [],
2026-02-21T09:41:14.0435852Z  'waves_per_eu': 4}
2026-02-21T09:41:14.0463975Z [310s] Fitting surrogate: 917 points, 917 targets
2026-02-21T09:41:14.3503957Z [310s] Generation 13 starting: 19 neighbors, 1 active search path(s)
2026-02-21T09:41:18.7773174Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 3.0 configs/s
2026-02-21T09:41:20.2201798Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 15.3 configs/s
2026-02-21T09:41:20.2206781Z [316s] Generation 13 complete: 
2026-02-21T09:41:20.2207139Z ok=21
2026-02-21T09:41:20.2207349Z min=0.1020
2026-02-21T09:41:20.2207565Z mid=1.3551
2026-02-21T09:41:20.2207764Z max=5.4281
2026-02-21T09:41:20.2207989Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:41:20.2208348Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:41:20.2208695Z  'l2_groupings': [2],
2026-02-21T09:41:20.2208967Z  'load_eviction_policies': ['', ''],
2026-02-21T09:41:20.2209279Z  'loop_orders': [[1, 0]],
2026-02-21T09:41:20.2209555Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:41:20.2209824Z  'num_stages': 3,
2026-02-21T09:41:20.2210048Z  'num_warps': 8,
2026-02-21T09:41:20.2210279Z  'pid_type': 'flat',
2026-02-21T09:41:20.2210536Z  'range_flattens': [None, True],
2026-02-21T09:41:20.2210814Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:20.2211043Z  'range_num_stages': [0, 3],
2026-02-21T09:41:20.2211243Z  'range_unroll_factors': [0, 1],
2026-02-21T09:41:20.2211466Z  'range_warp_specializes': [],
2026-02-21T09:41:20.2211667Z  'waves_per_eu': 4}
2026-02-21T09:41:20.2240165Z [316s] Fitting surrogate: 938 points, 938 targets
2026-02-21T09:41:20.5299422Z [316s] Generation 14 starting: 19 neighbors, 1 active search path(s)
2026-02-21T09:41:23.3771512Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 7.1 configs/s
2026-02-21T09:41:24.7722968Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 15.9 configs/s
2026-02-21T09:41:24.7728431Z [320s] Generation 14 complete: 
2026-02-21T09:41:24.7728781Z ok=21
2026-02-21T09:41:24.7729003Z min=0.1020
2026-02-21T09:41:24.7729216Z mid=1.0311
2026-02-21T09:41:24.7729420Z max=4.5522
2026-02-21T09:41:24.7729648Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:41:24.7730012Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:41:24.7730679Z  'l2_groupings': [2],
2026-02-21T09:41:24.7730914Z  'load_eviction_policies': ['', ''],
2026-02-21T09:41:24.7731196Z  'loop_orders': [[1, 0]],
2026-02-21T09:41:24.7731433Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:41:24.7731664Z  'num_stages': 3,
2026-02-21T09:41:24.7731952Z  'num_warps': 8,
2026-02-21T09:41:24.7732151Z  'pid_type': 'flat',
2026-02-21T09:41:24.7732372Z  'range_flattens': [None, True],
2026-02-21T09:41:24.7732635Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:24.7732897Z  'range_num_stages': [0, 3],
2026-02-21T09:41:24.7733132Z  'range_unroll_factors': [0, 1],
2026-02-21T09:41:24.7733386Z  'range_warp_specializes': [],
2026-02-21T09:41:24.7733622Z  'waves_per_eu': 4}
2026-02-21T09:41:24.7760912Z [320s] Fitting surrogate: 959 points, 959 targets
2026-02-21T09:41:25.0758397Z [321s] Generation 15 starting: 19 neighbors, 1 active search path(s)
2026-02-21T09:41:56.4376421Z [352s] Timeout after 30s compiling Config(block_sizes=[128, 2, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:41:56.4398424Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 0.4 configs/s
2026-02-21T09:41:57.8901928Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 14.1 configs/s
2026-02-21T09:41:57.8913593Z [353s] Generation 15 complete: 
2026-02-21T09:41:57.8916739Z timeout=1
2026-02-21T09:41:57.8917028Z ok=20
2026-02-21T09:41:57.8917233Z min=0.1020
2026-02-21T09:41:57.8917431Z mid=1.8204
2026-02-21T09:41:57.8917625Z max=12.7759
2026-02-21T09:41:57.8917846Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:41:57.8918232Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:41:57.8918560Z  'l2_groupings': [2],
2026-02-21T09:41:57.8918855Z  'load_eviction_policies': ['', ''],
2026-02-21T09:41:57.8919153Z  'loop_orders': [[1, 0]],
2026-02-21T09:41:57.8919418Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:41:57.8919697Z  'num_stages': 3,
2026-02-21T09:41:57.8919923Z  'num_warps': 8,
2026-02-21T09:41:57.8920147Z  'pid_type': 'flat',
2026-02-21T09:41:57.8920401Z  'range_flattens': [None, True],
2026-02-21T09:41:57.8920696Z  'range_multi_buffers': [None, None],
2026-02-21T09:41:57.8920992Z  'range_num_stages': [0, 3],
2026-02-21T09:41:57.8921256Z  'range_unroll_factors': [0, 1],
2026-02-21T09:41:57.8921529Z  'range_warp_specializes': [],
2026-02-21T09:41:57.8921804Z  'waves_per_eu': 4}
2026-02-21T09:41:57.8942078Z [353s] Fitting surrogate: 980 points, 980 targets
2026-02-21T09:41:58.1979866Z [354s] Generation 16 starting: 19 neighbors, 1 active search path(s)
2026-02-21T09:42:13.1923274Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 0.9 configs/s
2026-02-21T09:42:14.8820719Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 12.8 configs/s
2026-02-21T09:42:14.8823596Z [370s] Generation 16 complete: 
2026-02-21T09:42:14.8823884Z ok=21
2026-02-21T09:42:14.8824063Z min=0.1020
2026-02-21T09:42:14.8824248Z mid=2.2351
2026-02-21T09:42:14.8824407Z max=12.5435
2026-02-21T09:42:14.8824595Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:42:14.8824886Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:42:14.8825172Z  'l2_groupings': [2],
2026-02-21T09:42:14.8825387Z  'load_eviction_policies': ['', ''],
2026-02-21T09:42:14.8825631Z  'loop_orders': [[1, 0]],
2026-02-21T09:42:14.8825845Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:42:14.8826052Z  'num_stages': 3,
2026-02-21T09:42:14.8826242Z  'num_warps': 8,
2026-02-21T09:42:14.8826416Z  'pid_type': 'flat',
2026-02-21T09:42:14.8826622Z  'range_flattens': [None, True],
2026-02-21T09:42:14.8826861Z  'range_multi_buffers': [None, None],
2026-02-21T09:42:14.8827101Z  'range_num_stages': [0, 3],
2026-02-21T09:42:14.8827322Z  'range_unroll_factors': [0, 1],
2026-02-21T09:42:14.8827551Z  'range_warp_specializes': [],
2026-02-21T09:42:14.8827777Z  'waves_per_eu': 4}
2026-02-21T09:42:14.8857462Z [370s] Fitting surrogate: 1001 points, 1001 targets
2026-02-21T09:42:15.1735515Z [371s] Generation 17 starting: 19 neighbors, 1 active search path(s)
2026-02-21T09:42:23.3247920Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 1.0 configs/s
2026-02-21T09:42:25.0272483Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 12.7 configs/s
2026-02-21T09:42:25.0277145Z [381s] Generation 17 complete: 
2026-02-21T09:42:25.0277475Z ok=21
2026-02-21T09:42:25.0277714Z min=0.1020
2026-02-21T09:42:25.0277926Z mid=1.0149
2026-02-21T09:42:25.0278133Z max=32.8611
2026-02-21T09:42:25.0278370Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:42:25.0278749Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:42:25.0279102Z  'l2_groupings': [2],
2026-02-21T09:42:25.0279379Z  'load_eviction_policies': ['', ''],
2026-02-21T09:42:25.0279694Z  'loop_orders': [[1, 0]],
2026-02-21T09:42:25.0280002Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:42:25.0280280Z  'num_stages': 3,
2026-02-21T09:42:25.0280829Z  'num_warps': 8,
2026-02-21T09:42:25.0281063Z  'pid_type': 'flat',
2026-02-21T09:42:25.0281322Z  'range_flattens': [None, True],
2026-02-21T09:42:25.0281637Z  'range_multi_buffers': [None, None],
2026-02-21T09:42:25.0281942Z  'range_num_stages': [0, 3],
2026-02-21T09:42:25.0282223Z  'range_unroll_factors': [0, 1],
2026-02-21T09:42:25.0282521Z  'range_warp_specializes': [],
2026-02-21T09:42:25.0282882Z  'waves_per_eu': 4}
2026-02-21T09:42:25.0314424Z [381s] Fitting surrogate: 1022 points, 1022 targets
2026-02-21T09:42:25.3215097Z [381s] Generation 18 starting: 19 neighbors, 1 active search path(s)
2026-02-21T09:42:29.4404382Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 4.7 configs/s
2026-02-21T09:42:31.0335240Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 13.7 configs/s
2026-02-21T09:42:31.0339427Z [387s] Generation 18 complete: 
2026-02-21T09:42:31.0339965Z ok=21
2026-02-21T09:42:31.0340256Z min=0.1020
2026-02-21T09:42:31.0340473Z mid=1.9155
2026-02-21T09:42:31.0340705Z max=10.0905
2026-02-21T09:42:31.0340945Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:42:31.0341320Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:42:31.0341689Z  'l2_groupings': [2],
2026-02-21T09:42:31.0341981Z  'load_eviction_policies': ['', ''],
2026-02-21T09:42:31.0342306Z  'loop_orders': [[1, 0]],
2026-02-21T09:42:31.0342591Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:42:31.0342870Z  'num_stages': 3,
2026-02-21T09:42:31.0343102Z  'num_warps': 8,
2026-02-21T09:42:31.0343344Z  'pid_type': 'flat',
2026-02-21T09:42:31.0343604Z  'range_flattens': [None, True],
2026-02-21T09:42:31.0343871Z  'range_multi_buffers': [None, None],
2026-02-21T09:42:31.0343997Z  'range_num_stages': [0, 3],
2026-02-21T09:42:31.0344114Z  'range_unroll_factors': [0, 1],
2026-02-21T09:42:31.0344232Z  'range_warp_specializes': [],
2026-02-21T09:42:31.0344347Z  'waves_per_eu': 4}
2026-02-21T09:42:31.0372960Z [387s] Fitting surrogate: 1043 points, 1043 targets
2026-02-21T09:42:31.3370005Z [387s] Generation 19 starting: 21 neighbors, 1 active search path(s)
2026-02-21T09:43:03.2891508Z [419s] Timeout after 30s compiling Config(block_sizes=[64, 4, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:43:03.2911816Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 0.2 configs/s
2026-02-21T09:43:05.3938835Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 22/22 11.1 configs/s
2026-02-21T09:43:05.3942900Z [421s] Generation 19 complete: 
2026-02-21T09:43:05.3945849Z timeout=1
2026-02-21T09:43:05.3946388Z ok=22
2026-02-21T09:43:05.3946679Z min=0.1020
2026-02-21T09:43:05.3946947Z mid=3.7060
2026-02-21T09:43:05.3949870Z max=21.7618
2026-02-21T09:43:05.3950105Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:43:05.3950488Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:43:05.3950813Z  'l2_groupings': [2],
2026-02-21T09:43:05.3951378Z  'load_eviction_policies': ['', ''],
2026-02-21T09:43:05.3951668Z  'loop_orders': [[1, 0]],
2026-02-21T09:43:05.3951924Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:43:05.3952169Z  'num_stages': 3,
2026-02-21T09:43:05.3952384Z  'num_warps': 8,
2026-02-21T09:43:05.3952601Z  'pid_type': 'flat',
2026-02-21T09:43:05.3952832Z  'range_flattens': [None, True],
2026-02-21T09:43:05.3953116Z  'range_multi_buffers': [None, None],
2026-02-21T09:43:05.3953397Z  'range_num_stages': [0, 3],
2026-02-21T09:43:05.3953647Z  'range_unroll_factors': [0, 1],
2026-02-21T09:43:05.3953921Z  'range_warp_specializes': [],
2026-02-21T09:43:05.3954172Z  'waves_per_eu': 4}
2026-02-21T09:43:05.3976606Z [421s] Fitting surrogate: 1066 points, 1066 targets
2026-02-21T09:43:06.4657220Z [422s] Generation 20 starting: 18 neighbors, 1 active search path(s)
2026-02-21T09:43:09.8016655Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 6.6 configs/s
2026-02-21T09:43:10.5429606Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:43:10.5432472Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 4], order = [2, 1, 0]}>
2026-02-21T09:43:10.5433246Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 4], order = [1, 0]}>
2026-02-21T09:43:10.5433950Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:43:10.5434607Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 16], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:43:10.5435204Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:43:10.5435779Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:43:10.5436200Z #smem = #ttg.shared_memory
2026-02-21T09:43:10.5436679Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:43:10.5437590Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:43:10.5438429Z     %cst = arith.constant dense<7168> : tensor<16x1xi32, #mma>
2026-02-21T09:43:10.5438759Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:43:10.5439087Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:43:10.5439722Z     %cst_2 = arith.constant dense<7168> : tensor<4x1xi32, #blocked1>
2026-02-21T09:43:10.5440059Z     %cst_3 = arith.constant dense<8192> : tensor<16x1xi32, #blocked2>
2026-02-21T09:43:10.5440417Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T09:43:10.5440717Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:43:10.5440947Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:43:10.5441168Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:43:10.5441377Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:43:10.5441645Z     %cst_5 = arith.constant dense<0> : tensor<4x2x256xi8, #blocked>
2026-02-21T09:43:10.5441933Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:43:10.5442153Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:43:10.5442364Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:43:10.5442676Z     %c28_i32 = arith.constant 28 : i32
2026-02-21T09:43:10.5443025Z     %cst_6 = arith.constant dense<4> : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:43:10.5443384Z     %0 = tt.get_program_id x : i32
2026-02-21T09:43:10.5443590Z     %1 = arith.divsi %0, %c8_i32 : i32
2026-02-21T09:43:10.5443805Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:43:10.5444017Z     %3 = arith.subi %c28_i32, %2 : i32
2026-02-21T09:43:10.5444299Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:43:10.5444504Z     %5 = arith.remsi %0, %c8_i32 : i32
2026-02-21T09:43:10.5444711Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:43:10.5444917Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:43:10.5445114Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:43:10.5445320Z     %9 = arith.muli %7, %c256_i32 : i32
2026-02-21T09:43:10.5445687Z     %10 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:43:10.5446223Z     %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:43:10.5446660Z     %12 = tt.splat %9 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:43:10.5446956Z     %13 = tt.splat %9 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:43:10.5447281Z     %14 = arith.addi %12, %10 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:43:10.5447576Z     %15 = arith.addi %13, %11 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:43:10.5447807Z     %16 = arith.muli %8, %c16_i32 : i32
2026-02-21T09:43:10.5448080Z     %17 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:43:10.5448448Z     %18 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:43:10.5448770Z     %19 = tt.splat %16 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:43:10.5449062Z     %20 = tt.splat %16 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:43:10.5449351Z     %21 = arith.addi %19, %17 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:43:10.5449634Z     %22 = arith.addi %20, %18 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:43:10.5449962Z     %23 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:43:10.5450339Z     %24 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:43:10.5450749Z     %25 = tt.expand_dims %21 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T09:43:10.5451090Z     %26 = arith.muli %25, %cst_3 : tensor<16x1xi32, #blocked2>
2026-02-21T09:43:10.5451353Z     %27 = tt.broadcast %26 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:43:10.5451637Z     %28 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:43:10.5452061Z     %29 = tt.expand_dims %15 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:43:10.5452445Z     %30 = tt.broadcast %29 : tensor<1x256xi32, #blocked1> -> tensor<4x256xi32, #blocked1>
2026-02-21T09:43:10.5452733Z     %31 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:43:10.5453108Z     %32 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:43:10.5453670Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:43:10.5454231Z     %34 = tt.expand_dims %33 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:43:10.5454572Z     %35 = arith.cmpi eq, %34, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:43:10.5454845Z     %36 = tt.broadcast %35 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x256xi1, #blocked>
2026-02-21T09:43:10.5455114Z     %37 = arith.cmpi eq, %34, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:43:10.5455371Z     %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x256xi1, #blocked>
2026-02-21T09:43:10.5455760Z     %39 = scf.for %arg3 = %c0_i32 to %c4096_i32 step %c4_i32 iter_args(%arg4 = %cst_4) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T09:43:10.5456127Z       %49 = tt.splat %arg3 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:43:10.5456415Z       %50 = arith.addi %49, %23 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:43:10.5456604Z       %51 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:43:10.5456781Z       %52 = tt.splat %51 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:43:10.5457016Z       %53 = arith.addi %52, %24 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:43:10.5457311Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:43:10.5457623Z       %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:43:10.5457830Z       %56 = arith.addi %27, %55 : tensor<16x8xi32, #blocked2>
2026-02-21T09:43:10.5458037Z       %57 = tt.addptr %28, %56 : tensor<16x8x!tt.ptr<bf16>, #blocked2>, tensor<16x8xi32, #blocked2>
2026-02-21T09:43:10.5458257Z       %58 = tt.load %57 : tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:43:10.5458494Z       %59 = ttg.local_alloc %58 : (tensor<16x8xbf16, #blocked2>) -> !ttg.memdesc<16x8xbf16, #shared, #smem>
2026-02-21T09:43:10.5458844Z       %60 = ttg.local_load %59 : !ttg.memdesc<16x8xbf16, #shared, #smem> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:43:10.5459279Z       %61 = arith.extf %60 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:43:10.5459685Z       %62 = tt.expand_dims %50 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:43:10.5459950Z       %63 = arith.muli %62, %cst_2 : tensor<4x1xi32, #blocked1>
2026-02-21T09:43:10.5460158Z       %64 = tt.broadcast %63 : tensor<4x1xi32, #blocked1> -> tensor<4x256xi32, #blocked1>
2026-02-21T09:43:10.5460362Z       %65 = arith.addi %64, %30 : tensor<4x256xi32, #blocked1>
2026-02-21T09:43:10.5460578Z       %66 = tt.addptr %31, %65 : tensor<4x256x!tt.ptr<i8>, #blocked1>, tensor<4x256xi32, #blocked1>
2026-02-21T09:43:10.5460788Z       %67 = tt.load %66 : tensor<4x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:43:10.5461048Z       %68 = ttg.convert_layout %67 : tensor<4x256xi8, #blocked1> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:43:10.5461382Z       %69 = arith.shli %68, %cst_6 : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:43:10.5461632Z       %70 = arith.shrsi %69, %cst_6 : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:43:10.5461882Z       %71 = arith.shrsi %68, %cst_6 : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:43:10.5462192Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x256xi8, #blocked>
2026-02-21T09:43:10.5462556Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x256xi8, #blocked>
2026-02-21T09:43:10.5462866Z       %74 = tt.broadcast %72 : tensor<4x1x256xi8, #blocked> -> tensor<4x2x256xi8, #blocked>
2026-02-21T09:43:10.5463124Z       %75 = arith.select %36, %74, %cst_5 : tensor<4x2x256xi1, #blocked>, tensor<4x2x256xi8, #blocked>
2026-02-21T09:43:10.5463378Z       %76 = tt.broadcast %73 : tensor<4x1x256xi8, #blocked> -> tensor<4x2x256xi8, #blocked>
2026-02-21T09:43:10.5463630Z       %77 = arith.select %38, %76, %75 : tensor<4x2x256xi1, #blocked>, tensor<4x2x256xi8, #blocked>
2026-02-21T09:43:10.5463879Z       %78 = tt.reshape %77 : tensor<4x2x256xi8, #blocked> -> tensor<8x256xi8, #blocked1>
2026-02-21T09:43:10.5464119Z       %79 = arith.sitofp %78 : tensor<8x256xi8, #blocked1> to tensor<8x256xf32, #blocked1>
2026-02-21T09:43:10.5464401Z       %80 = ttg.local_alloc %79 : (tensor<8x256xf32, #blocked1>) -> !ttg.memdesc<8x256xf32, #shared1, #smem>
2026-02-21T09:43:10.5464750Z       %81 = ttg.local_load %80 : !ttg.memdesc<8x256xf32, #shared1, #smem> -> tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:43:10.5465261Z       %82 = tt.dot %61, %81, %arg4, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:43:10.5465636Z       scf.yield %82 : tensor<16x256xf32, #mma>
2026-02-21T09:43:10.5465849Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32}
2026-02-21T09:43:10.5466118Z     %40 = arith.truncf %39 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T09:43:10.5466396Z     %41 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:43:10.5466648Z     %42 = arith.muli %41, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:43:10.5466885Z     %43 = tt.expand_dims %14 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:43:10.5467139Z     %44 = tt.broadcast %42 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:43:10.5467335Z     %45 = tt.broadcast %43 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:43:10.5467509Z     %46 = arith.addi %44, %45 : tensor<16x256xi32, #mma>
2026-02-21T09:43:10.5467678Z     %47 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:43:10.5467888Z     %48 = tt.addptr %47, %46 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi32, #mma>
2026-02-21T09:43:10.5468078Z     tt.store %48, %40 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:43:10.5468204Z     tt.return
2026-02-21T09:43:10.5468288Z   }
2026-02-21T09:43:10.5468362Z }
2026-02-21T09:43:10.5468407Z 
2026-02-21T09:43:10.5468437Z {-#
2026-02-21T09:43:10.5468518Z   external_resources: {
2026-02-21T09:43:10.5468615Z     mlir_reproducer: {
2026-02-21T09:43:10.5469655Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:43:10.5470639Z       disable_threading: false,
2026-02-21T09:43:10.5470744Z       verify_each: true
2026-02-21T09:43:10.5470833Z     }
2026-02-21T09:43:10.5470905Z   }
2026-02-21T09:43:10.5470974Z #-}
2026-02-21T09:43:10.5471245Z /tmp/torchinductor_root/ys/cysxx4khj5nqdldlywplcst5qbz6wuczzqt4j7dvs6oooswg7jlh.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:43:10.5471922Z /tmp/torchinductor_root/ys/cysxx4khj5nqdldlywplcst5qbz6wuczzqt4j7dvs6oooswg7jlh.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:43:10.5472471Z [426s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:43:10.5473194Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:43:10.5473864Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:43:10.5474033Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:43:10.7915235Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 21.6 configs/s
2026-02-21T09:43:10.7918863Z [426s] Generation 20 complete: 
2026-02-21T09:43:10.7919204Z error=5
2026-02-21T09:43:10.7919410Z ok=15
2026-02-21T09:43:10.7919610Z min=0.1020
2026-02-21T09:43:10.7919818Z mid=0.4212
2026-02-21T09:43:10.7920018Z max=2.3786
2026-02-21T09:43:10.7920245Z best={'block_sizes': [64, 64, 32],
2026-02-21T09:43:10.7920630Z  'indexing': ['pointer', 'pointer', 'pointer'],
2026-02-21T09:43:10.7920985Z  'l2_groupings': [2],
2026-02-21T09:43:10.7921455Z  'load_eviction_policies': ['', ''],
2026-02-21T09:43:10.7921764Z  'loop_orders': [[1, 0]],
2026-02-21T09:43:10.7922049Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:43:10.7922326Z  'num_stages': 3,
2026-02-21T09:43:10.7922559Z  'num_warps': 8,
2026-02-21T09:43:10.7922895Z  'pid_type': 'flat',
2026-02-21T09:43:10.7923161Z  'range_flattens': [None, True],
2026-02-21T09:43:10.7923461Z  'range_multi_buffers': [None, None],
2026-02-21T09:43:10.7923770Z  'range_num_stages': [0, 3],
2026-02-21T09:43:10.7924050Z  'range_unroll_factors': [0, 1],
2026-02-21T09:43:10.7924345Z  'range_warp_specializes': [],
2026-02-21T09:43:10.7924630Z  'waves_per_eu': 4}
2026-02-21T09:43:10.7951464Z [426s] Fitting surrogate: 1086 points, 1086 targets
2026-02-21T09:43:10.9203210Z [427s] Autotuning complete in 427.0s after searching 1045 configs.
2026-02-21T09:43:10.9203798Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:43:10.9205649Z     @helion.kernel(config=helion.Config(block_sizes=[64, 64, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:43:10.9207107Z 
2026-02-21T09:43:10.9207424Z [427s] Code of selected kernel: /tmp/torchinductor_root/f4/cf4x2z7r3h3z2xnybclxecodxrnqfexakhkt2hd4zmrekctdjdps.py
2026-02-21T09:43:10.9363182Z from __future__ import annotations
2026-02-21T09:43:10.9363440Z 
2026-02-21T09:43:10.9363788Z import torch
2026-02-21T09:43:10.9364076Z import triton
2026-02-21T09:43:10.9364335Z import triton.language as tl
2026-02-21T09:43:10.9364757Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:43:10.9365090Z 
2026-02-21T09:43:10.9365589Z _BLOCK_SIZE_2 = tl.constexpr(32)
2026-02-21T09:43:10.9365887Z _BLOCK_SIZE_1 = tl.constexpr(64)
2026-02-21T09:43:10.9366175Z _BLOCK_SIZE_0 = tl.constexpr(64)
2026-02-21T09:43:10.9366354Z 
2026-02-21T09:43:10.9366418Z @triton.jit
2026-02-21T09:43:10.9366673Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr):
2026-02-21T09:43:10.9367025Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:43:10.9367259Z     num_pid_m = tl.cdiv(7168, _BLOCK_SIZE_2)
2026-02-21T09:43:10.9367443Z     num_pid_n = tl.cdiv(64, _BLOCK_SIZE_1)
2026-02-21T09:43:10.9367631Z     inner_2d_pid = tl.program_id(0)
2026-02-21T09:43:10.9367798Z     num_pid_in_group = 2 * num_pid_n
2026-02-21T09:43:10.9367996Z     group_id = inner_2d_pid // num_pid_in_group
2026-02-21T09:43:10.9368185Z     first_pid_m = group_id * 2
2026-02-21T09:43:10.9368366Z     group_size_m = min(num_pid_m - first_pid_m, 2)
2026-02-21T09:43:10.9368627Z     pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m
2026-02-21T09:43:10.9368906Z     pid_1 = inner_2d_pid % num_pid_in_group // group_size_m
2026-02-21T09:43:10.9369117Z     offset_2 = pid_0 * _BLOCK_SIZE_2
2026-02-21T09:43:10.9369334Z     indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32)
2026-02-21T09:43:10.9369600Z     offset_1 = pid_1 * _BLOCK_SIZE_1
2026-02-21T09:43:10.9369812Z     indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32)
2026-02-21T09:43:10.9370104Z     # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:43:10.9370400Z     acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32)
2026-02-21T09:43:10.9370712Z     # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:43:10.9371107Z     # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:43:10.9371492Z     # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:43:10.9371779Z     # src[int4_gemm.py:60-89]: ...
2026-02-21T09:43:10.9372105Z     for offset_3 in tl.range(0, 4096, _BLOCK_SIZE_0, loop_unroll_factor=1, num_stages=3, flatten=True):
2026-02-21T09:43:10.9372453Z         indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
2026-02-21T09:43:10.9372664Z         acc_copy = acc
2026-02-21T09:43:10.9372814Z         acc_copy_0 = acc_copy
2026-02-21T09:43:10.9373018Z         # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2
2026-02-21T09:43:10.9373232Z         mul = 2 * offset_3
2026-02-21T09:43:10.9373477Z         # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:43:10.9373746Z         iota = mul + tl.arange(0, mul_1)
2026-02-21T09:43:10.9374132Z         load = tl.load(A + (indices_1[:, None] * 8192 + iota[None, :] * 1), None)
2026-02-21T09:43:10.9374510Z         # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:43:10.9374942Z         # src[int4_gemm.py:66]:     torch.float32
2026-02-21T09:43:10.9375198Z         # src[int4_gemm.py:67]: )  # [BLOCK_SIZE_M, BLOCK_SIZE_K]
2026-02-21T09:43:10.9375466Z         v_0 = tl.cast(load, tl.float32)
2026-02-21T09:43:10.9375775Z         # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n]  # [BLOCK_SIZE_K//2, BLOCK_SIZE_N]
2026-02-21T09:43:10.9383369Z         b_tile = tl.load(B + (indices_3[:, None] * 7168 + indices_2[None, :] * 1), None)
2026-02-21T09:43:10.9383640Z         # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8)  # Sign-extend low 4 bits
2026-02-21T09:43:10.9383855Z         v_1 = tl.full([], 4, tl.int8)
2026-02-21T09:43:10.9383998Z         v_2 = b_tile << v_1
2026-02-21T09:43:10.9384120Z         v_3 = tl.full([], 4, tl.int8)
2026-02-21T09:43:10.9384245Z         v_4 = v_2 >> v_3
2026-02-21T09:43:10.9384418Z         # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8)  # Sign-extend high 4 bits
2026-02-21T09:43:10.9384691Z         v_5 = tl.full([], 4, tl.int8)
2026-02-21T09:43:10.9384813Z         v_6 = b_tile >> v_5
2026-02-21T09:43:10.9384975Z         # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1)
2026-02-21T09:43:10.9385157Z         stack_idx = tl.arange(0, 2)
2026-02-21T09:43:10.9385304Z         broadcast_idx = stack_idx[None, :, None]
2026-02-21T09:43:10.9385455Z         expanded_0 = tl.expand_dims(v_4, 1)
2026-02-21T09:43:10.9385593Z         expanded_1 = tl.expand_dims(v_6, 1)
2026-02-21T09:43:10.9385745Z         stacked_result = tl.zeros_like(expanded_0)
2026-02-21T09:43:10.9385894Z         mask_0 = broadcast_idx == 0
2026-02-21T09:43:10.9386070Z         stacked_result = tl.where(mask_0, expanded_0, stacked_result)
2026-02-21T09:43:10.9386242Z         mask_1 = broadcast_idx == 1
2026-02-21T09:43:10.9386399Z         stacked_result = tl.where(mask_1, expanded_1, stacked_result)
2026-02-21T09:43:10.9386572Z         # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:43:10.9386759Z         # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:43:10.9386930Z         # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:43:10.9387088Z         view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2])
2026-02-21T09:43:10.9387258Z         v_7 = tl.cast(view, tl.float32)
2026-02-21T09:43:10.9387427Z         # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2)  # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1]
2026-02-21T09:43:10.9387602Z         a_tile_1 = v_0[:, :, None]
2026-02-21T09:43:10.9387739Z         # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0)
2026-02-21T09:43:10.9387888Z         b_unpacked_1 = v_7[None, :, :]
2026-02-21T09:43:10.9388072Z         # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1)  # [BLOCK_SIZE_M, BLOCK_SIZE_N]
2026-02-21T09:43:10.9388260Z         v_8 = a_tile_1 * b_unpacked_1
2026-02-21T09:43:10.9388383Z         sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32)
2026-02-21T09:43:10.9388513Z         acc = acc_copy_0 + sum_1
2026-02-21T09:43:10.9388654Z     # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16)
2026-02-21T09:43:10.9388820Z     v_10 = tl.cast(acc, tl.bfloat16)
2026-02-21T09:43:10.9388975Z     tl.store(C + (indices_1[:, None] * 7168 + indices_2[None, :] * 1), v_10, None)
2026-02-21T09:43:10.9389099Z 
2026-02-21T09:43:10.9389185Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher):
2026-02-21T09:43:10.9389339Z     """
2026-02-21T09:43:10.9389448Z     BFloat16 x INT4 General Matrix Multiplication (GEMM).
2026-02-21T09:43:10.9389548Z 
2026-02-21T09:43:10.9389609Z     This kernel performs matrix multiplication where:
2026-02-21T09:43:10.9389756Z     - A is a bfloat16 matrix of shape [M, K]
2026-02-21T09:43:10.9389915Z     - B is an int8 matrix of shape [K//2, N] containing packed int4 values
2026-02-21T09:43:10.9390081Z       (two 4-bit values packed into each int8)
2026-02-21T09:43:10.9390162Z 
2026-02-21T09:43:10.9390198Z     Args:
2026-02-21T09:43:10.9390314Z         A (Tensor): Input tensor of shape [M, K] in bfloat16 format.
2026-02-21T09:43:10.9390492Z         B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format.
2026-02-21T09:43:10.9390607Z 
2026-02-21T09:43:10.9390640Z     Returns:
2026-02-21T09:43:10.9390755Z         Tensor: Output tensor of shape [M, N] in bfloat16 format.
2026-02-21T09:43:10.9390888Z     """
2026-02-21T09:43:10.9390975Z     # src[int4_gemm.py:50]: M, K = A.shape
2026-02-21T09:43:10.9391087Z     M, K = A.shape
2026-02-21T09:43:10.9391181Z     # src[int4_gemm.py:51]: _, N = B.shape
2026-02-21T09:43:10.9391288Z     _, N = B.shape
2026-02-21T09:43:10.9391429Z     # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:43:10.9391629Z     C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:43:10.9391800Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:43:10.9391941Z     _BLOCK_SIZE_2 = 32
2026-02-21T09:43:10.9392036Z     _BLOCK_SIZE_1 = 64
2026-02-21T09:43:10.9392229Z     # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:43:10.9392494Z     # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:43:10.9392746Z     # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:43:10.9392921Z     # src[int4_gemm.py:60-89]: ...
2026-02-21T09:43:10.9393024Z     _BLOCK_SIZE_0 = 64
2026-02-21T09:43:10.9393144Z     # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:43:10.9393321Z     # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:43:10.9393486Z     # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:43:10.9393612Z     _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0
2026-02-21T09:43:10.9393751Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:43:10.9393942Z     # src[int4_gemm.py:58]:     acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:43:10.9394104Z     # src[int4_gemm.py:57-91]: ...
2026-02-21T09:43:10.9394240Z     _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0)
2026-02-21T09:43:10.9394607Z     _launcher(_helion_matmul_bf16_int4, (triton.cdiv(7168, _BLOCK_SIZE_2) * triton.cdiv(64, _BLOCK_SIZE_1),), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=8, num_stages=3, waves_per_eu=4, matrix_instr_nonkdim=16)
2026-02-21T09:43:10.9394957Z     # src[int4_gemm.py:93]: return C
2026-02-21T09:43:10.9395065Z     return C
2026-02-21T09:43:11.9568276Z WARNING:tritonbench.utils.triton_op:Completed input ID 14:
2026-02-21T09:43:11.9568716Z x_val
2026-02-21T09:43:11.9568932Z -------------------
2026-02-21T09:43:11.9569171Z (64, 1, 7168, 8192)
2026-02-21T09:43:11.9569311Z 
2026-02-21T09:43:11.9581722Z  50%|█████     | 5/10 [40:33<41:27, 497.48s/it]WARNING:tritonbench.utils.triton_op:Running input ID 17:
2026-02-21T09:43:11.9582134Z x_val
2026-02-21T09:43:11.9582309Z ---------------------
2026-02-21T09:43:11.9582526Z (1, 4096, 8192, 1024)
2026-02-21T09:43:11.9584267Z INFO:tritonbench.utils.triton_op:Took 0.14ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T09:43:13.1274108Z INFO:tritonbench.utils.triton_op:Took 3.74ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T09:43:15.3164920Z INFO:tritonbench.utils.triton_op:Took 0.20ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T09:43:15.3183015Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:43:15.3183314Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:43:15.3183568Z               'dtype': 'torch.bfloat16',
2026-02-21T09:43:15.3183813Z               'shape': (1, 4096, 1024),
2026-02-21T09:43:15.3184056Z               'stride': (4194304, 1024, 1)},
2026-02-21T09:43:15.3184293Z             { 'device': 'cuda:0',
2026-02-21T09:43:15.3184519Z               'dtype': 'torch.int32',
2026-02-21T09:43:15.3184745Z               'shape': (1024, 8192),
2026-02-21T09:43:15.3184992Z               'stride': (8192, 1)}),
2026-02-21T09:43:15.3185208Z   'kwargs': {}}
2026-02-21T09:43:15.3221240Z INFO:tritonbench.utils.triton_op:Took 3.97ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T09:43:15.5115292Z [0s] Autotune random seed: 2138032649
2026-02-21T09:43:15.5335758Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:43:48.9946486Z [33s] Timeout after 30s compiling Config(block_sizes=[64, 256, 2], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T09:44:00.1565328Z [44s] Timeout after 30s compiling Config(block_sizes=[8, 4096, 4], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[2, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:44:00.1587259Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.3 configs/s
2026-02-21T09:44:04.4013718Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:44:04.4017443Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:44:04.4019243Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:44:04.4019624Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:44:04.4019942Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:44:04.4020461Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:44:04.4020717Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:44:04.4020897Z #smem = #ttg.shared_memory
2026-02-21T09:44:04.4021129Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:44:04.4024750Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:44:04.4025149Z     %cst = arith.constant dense<8192> : tensor<1x512xi64, #mma>
2026-02-21T09:44:04.4025324Z     %cst_0 = arith.constant dense<0> : tensor<1x512xi64, #mma>
2026-02-21T09:44:04.4025580Z     %cst_1 = arith.constant dense<4096> : tensor<32x1xi64, #mma>
2026-02-21T09:44:04.4025743Z     %cst_2 = arith.constant dense<0> : tensor<32x1xi64, #mma>
2026-02-21T09:44:04.4025904Z     %cst_3 = arith.constant dense<8192> : tensor<32x1xi64, #mma>
2026-02-21T09:44:04.4026073Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:04.4026246Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:04.4026424Z     %cst_6 = arith.constant dense<0.000000e+00> : tensor<32x512xf32, #mma>
2026-02-21T09:44:04.4026610Z     %cst_7 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1>
2026-02-21T09:44:04.4026759Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:44:04.4026879Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:44:04.4027018Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:44:04.4027134Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:44:04.4027249Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:44:04.4027366Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:44:04.4027473Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:44:04.4027590Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:44:04.4027737Z     %cst_8 = arith.constant dense<0> : tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4027888Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:44:04.4028000Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:44:04.4028110Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:44:04.4028288Z     %cst_9 = arith.constant dense<4> : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4028477Z     %0 = tt.get_program_id x : i32
2026-02-21T09:44:04.4028591Z     %1 = arith.divsi %0, %c1024_i32 : i32
2026-02-21T09:44:04.4028705Z     %2 = arith.muli %1, %c8_i32 : i32
2026-02-21T09:44:04.4028817Z     %3 = arith.subi %c16_i32, %2 : i32
2026-02-21T09:44:04.4029011Z     %4 = arith.minsi %3, %c8_i32 : i32
2026-02-21T09:44:04.4029123Z     %5 = arith.remsi %0, %c1024_i32 : i32
2026-02-21T09:44:04.4029239Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:44:04.4029348Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:44:04.4029452Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:44:04.4029585Z     %9 = arith.muli %7, %c512_i32 : i32
2026-02-21T09:44:04.4029790Z     %10 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:44:04.4030070Z     %11 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:04.4030320Z     %12 = tt.splat %9 : i32 -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:44:04.4030541Z     %13 = arith.addi %12, %10 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:44:04.4030714Z     %14 = arith.muli %8, %c32_i32 : i32
2026-02-21T09:44:04.4030912Z     %15 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:04.4031179Z     %16 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:04.4031435Z     %17 = tt.splat %14 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:04.4031664Z     %18 = arith.addi %17, %15 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:04.4031905Z     %19 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:04.4032212Z     %20 = tt.expand_dims %18 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:44:04.4032461Z     %21 = arith.muli %20, %cst_7 : tensor<32x1xi32, #blocked1>
2026-02-21T09:44:04.4032714Z     %22 = tt.broadcast %21 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4032931Z     %23 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:04.4033217Z     %24 = tt.expand_dims %13 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4033482Z     %25 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T09:44:04.4033775Z     %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:44:04.4034184Z     %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:44:04.4034581Z     %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:04.4034832Z     %29 = arith.cmpi eq, %28, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:04.4035026Z     %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked>
2026-02-21T09:44:04.4035224Z     %31 = arith.cmpi eq, %28, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:04.4035411Z     %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked>
2026-02-21T09:44:04.4035675Z     %33 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst_6) -> (tensor<32x512xf32, #mma>)  : i32 {
2026-02-21T09:44:04.4035889Z       %60 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:44:04.4036061Z       %61 = tt.splat %60 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:04.4036275Z       %62 = arith.addi %61, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:04.4036545Z       %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:04.4036832Z       %64 = tt.broadcast %63 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4037019Z       %65 = arith.addi %22, %64 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4037216Z       %66 = tt.addptr %23, %65 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4037420Z       %67 = tt.load %66 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:04.4037680Z       %68 = ttg.convert_layout %67 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4038075Z       %69 = arith.extf %68 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4038355Z       %70 = arith.muli %arg3, %c8192_i32 : i32
2026-02-21T09:44:04.4038497Z       %71 = tt.splat %70 : i32 -> tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4038651Z       %72 = arith.addi %71, %24 : tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4038841Z       %73 = tt.addptr %25, %72 : tensor<1x512x!tt.ptr<i8>, #blocked2>, tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4039040Z       %74 = tt.load %73 : tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T09:44:04.4039276Z       %75 = ttg.convert_layout %74 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4039568Z       %76 = arith.shli %75, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4039798Z       %77 = arith.shrsi %76, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4040025Z       %78 = arith.shrsi %75, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4040308Z       %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:44:04.4040666Z       %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:44:04.4040948Z       %81 = tt.broadcast %79 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4041206Z       %82 = arith.select %30, %81, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4041439Z       %83 = tt.broadcast %80 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4041669Z       %84 = arith.select %32, %83, %82 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4041893Z       %85 = tt.reshape %84 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3>
2026-02-21T09:44:04.4042114Z       %86 = arith.sitofp %85 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3>
2026-02-21T09:44:04.4042363Z       %87 = ttg.local_alloc %86 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem>
2026-02-21T09:44:04.4042754Z       %88 = ttg.local_load %87 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4043224Z       %89 = tt.dot %69, %88, %arg4, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma>
2026-02-21T09:44:04.4043567Z       %90 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:44:04.4043690Z       %91 = arith.muli %90, %c2_i32 : i32
2026-02-21T09:44:04.4043857Z       %92 = tt.splat %91 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:04.4044072Z       %93 = arith.addi %92, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:04.4044342Z       %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:04.4044632Z       %95 = tt.broadcast %94 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4044820Z       %96 = arith.addi %22, %95 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4045016Z       %97 = tt.addptr %23, %96 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4045213Z       %98 = tt.load %97 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:04.4045470Z       %99 = ttg.convert_layout %98 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4045867Z       %100 = arith.extf %99 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4046153Z       %101 = arith.muli %90, %c8192_i32 : i32
2026-02-21T09:44:04.4046296Z       %102 = tt.splat %101 : i32 -> tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4046453Z       %103 = arith.addi %102, %24 : tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4046656Z       %104 = tt.addptr %25, %103 : tensor<1x512x!tt.ptr<i8>, #blocked2>, tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4046856Z       %105 = tt.load %104 : tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T09:44:04.4047103Z       %106 = ttg.convert_layout %105 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4047407Z       %107 = arith.shli %106, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4047641Z       %108 = arith.shrsi %107, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4047878Z       %109 = arith.shrsi %106, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4048165Z       %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:44:04.4048527Z       %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:44:04.4048817Z       %112 = tt.broadcast %110 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4049088Z       %113 = arith.select %30, %112, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4049331Z       %114 = tt.broadcast %111 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4049565Z       %115 = arith.select %32, %114, %113 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4049797Z       %116 = tt.reshape %115 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3>
2026-02-21T09:44:04.4050022Z       %117 = arith.sitofp %116 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3>
2026-02-21T09:44:04.4050277Z       %118 = ttg.local_alloc %117 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem>
2026-02-21T09:44:04.4050604Z       %119 = ttg.local_load %118 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4051073Z       %120 = tt.dot %100, %119, %89, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma>
2026-02-21T09:44:04.4051422Z       %121 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:44:04.4051544Z       %122 = arith.muli %121, %c2_i32 : i32
2026-02-21T09:44:04.4051711Z       %123 = tt.splat %122 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:04.4051935Z       %124 = arith.addi %123, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:04.4052211Z       %125 = tt.expand_dims %124 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:04.4052490Z       %126 = tt.broadcast %125 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4052705Z       %127 = arith.addi %22, %126 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4052904Z       %128 = tt.addptr %23, %127 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4053110Z       %129 = tt.load %128 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:04.4053370Z       %130 = ttg.convert_layout %129 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4053766Z       %131 = arith.extf %130 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4054049Z       %132 = arith.muli %121, %c8192_i32 : i32
2026-02-21T09:44:04.4054189Z       %133 = tt.splat %132 : i32 -> tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4054346Z       %134 = arith.addi %133, %24 : tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4054543Z       %135 = tt.addptr %25, %134 : tensor<1x512x!tt.ptr<i8>, #blocked2>, tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4054747Z       %136 = tt.load %135 : tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T09:44:04.4054992Z       %137 = ttg.convert_layout %136 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4055290Z       %138 = arith.shli %137, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4055527Z       %139 = arith.shrsi %138, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4055761Z       %140 = arith.shrsi %137, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4056051Z       %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:44:04.4056407Z       %142 = tt.expand_dims %140 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:44:04.4056695Z       %143 = tt.broadcast %141 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4056964Z       %144 = arith.select %30, %143, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4057201Z       %145 = tt.broadcast %142 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4057435Z       %146 = arith.select %32, %145, %144 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4057667Z       %147 = tt.reshape %146 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3>
2026-02-21T09:44:04.4057894Z       %148 = arith.sitofp %147 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3>
2026-02-21T09:44:04.4058155Z       %149 = ttg.local_alloc %148 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem>
2026-02-21T09:44:04.4058485Z       %150 = ttg.local_load %149 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4058967Z       %151 = tt.dot %131, %150, %120, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma>
2026-02-21T09:44:04.4059318Z       %152 = arith.addi %arg3, %c3_i32 : i32
2026-02-21T09:44:04.4059443Z       %153 = arith.muli %152, %c2_i32 : i32
2026-02-21T09:44:04.4059610Z       %154 = tt.splat %153 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:04.4059838Z       %155 = arith.addi %154, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:04.4060114Z       %156 = tt.expand_dims %155 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:04.4060392Z       %157 = tt.broadcast %156 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4060599Z       %158 = arith.addi %22, %157 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4060797Z       %159 = tt.addptr %23, %158 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:04.4061002Z       %160 = tt.load %159 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:04.4061262Z       %161 = ttg.convert_layout %160 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4061659Z       %162 = arith.extf %161 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4061939Z       %163 = arith.muli %152, %c8192_i32 : i32
2026-02-21T09:44:04.4062078Z       %164 = tt.splat %163 : i32 -> tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4062234Z       %165 = arith.addi %164, %24 : tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4062430Z       %166 = tt.addptr %25, %165 : tensor<1x512x!tt.ptr<i8>, #blocked2>, tensor<1x512xi32, #blocked2>
2026-02-21T09:44:04.4062634Z       %167 = tt.load %166 : tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T09:44:04.4062876Z       %168 = ttg.convert_layout %167 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4063172Z       %169 = arith.shli %168, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4063408Z       %170 = arith.shrsi %169, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4063643Z       %171 = arith.shrsi %168, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:04.4063931Z       %172 = tt.expand_dims %170 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:44:04.4064288Z       %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:44:04.4064573Z       %174 = tt.broadcast %172 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4064833Z       %175 = arith.select %30, %174, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4065089Z       %176 = tt.broadcast %173 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4065327Z       %177 = arith.select %32, %176, %175 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:44:04.4065560Z       %178 = tt.reshape %177 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3>
2026-02-21T09:44:04.4065785Z       %179 = arith.sitofp %178 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3>
2026-02-21T09:44:04.4066043Z       %180 = ttg.local_alloc %179 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem>
2026-02-21T09:44:04.4066371Z       %181 = ttg.local_load %180 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:04.4066840Z       %182 = tt.dot %162, %181, %151, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma>
2026-02-21T09:44:04.4067193Z       scf.yield %182 : tensor<32x512xf32, #mma>
2026-02-21T09:44:04.4067319Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:44:04.4067473Z     %34 = arith.truncf %33 : tensor<32x512xf32, #mma> to tensor<32x512xbf16, #mma>
2026-02-21T09:44:04.4067637Z     %35 = arith.extsi %14 : i32 to i64
2026-02-21T09:44:04.4067756Z     %36 = arith.extsi %9 : i32 to i64
2026-02-21T09:44:04.4067907Z     %37 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:04.4068109Z     %38 = tt.splat %35 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:04.4068400Z     %39 = arith.extsi %16 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:04.4068667Z     %40 = arith.addi %38, %39 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:04.4068925Z     %41 = tt.expand_dims %40 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:44:04.4069157Z     %42 = arith.muli %41, %cst_3 : tensor<32x1xi64, #mma>
2026-02-21T09:44:04.4069330Z     %43 = tt.broadcast %42 : tensor<32x1xi64, #mma> -> tensor<32x512xi64, #mma>
2026-02-21T09:44:04.4069533Z     %44 = tt.splat %36 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:04.4069806Z     %45 = arith.extsi %11 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:04.4070075Z     %46 = arith.addi %44, %45 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:04.4070334Z     %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:44:04.4070589Z     %48 = tt.broadcast %47 : tensor<1x512xi64, #mma> -> tensor<32x512xi64, #mma>
2026-02-21T09:44:04.4070767Z     %49 = arith.addi %43, %48 : tensor<32x512xi64, #mma>
2026-02-21T09:44:04.4070947Z     %50 = tt.addptr %37, %49 : tensor<32x512x!tt.ptr<bf16>, #mma>, tensor<32x512xi64, #mma>
2026-02-21T09:44:04.4092106Z     %51 = arith.cmpi sge, %41, %cst_2 : tensor<32x1xi64, #mma>
2026-02-21T09:44:04.4092265Z     %52 = arith.cmpi slt, %41, %cst_1 : tensor<32x1xi64, #mma>
2026-02-21T09:44:04.4092420Z     %53 = arith.andi %51, %52 : tensor<32x1xi1, #mma>
2026-02-21T09:44:04.4092587Z     %54 = tt.broadcast %53 : tensor<32x1xi1, #mma> -> tensor<32x512xi1, #mma>
2026-02-21T09:44:04.4092766Z     %55 = arith.cmpi sge, %47, %cst_0 : tensor<1x512xi64, #mma>
2026-02-21T09:44:04.4092928Z     %56 = arith.cmpi slt, %47, %cst : tensor<1x512xi64, #mma>
2026-02-21T09:44:04.4093107Z     %57 = arith.andi %55, %56 : tensor<1x512xi1, #mma>
2026-02-21T09:44:04.4093277Z     %58 = tt.broadcast %57 : tensor<1x512xi1, #mma> -> tensor<32x512xi1, #mma>
2026-02-21T09:44:04.4093447Z     %59 = arith.andi %54, %58 : tensor<32x512xi1, #mma>
2026-02-21T09:44:04.4093621Z     tt.store %50, %34, %59 : tensor<32x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:04.4093757Z     tt.return
2026-02-21T09:44:04.4093837Z   }
2026-02-21T09:44:04.4093914Z }
2026-02-21T09:44:04.4093958Z 
2026-02-21T09:44:04.4093989Z {-#
2026-02-21T09:44:04.4094071Z   external_resources: {
2026-02-21T09:44:04.4094169Z     mlir_reproducer: {
2026-02-21T09:44:04.4095182Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:44:04.4096179Z       disable_threading: false,
2026-02-21T09:44:04.4096285Z       verify_each: true
2026-02-21T09:44:04.4096373Z     }
2026-02-21T09:44:04.4096447Z   }
2026-02-21T09:44:04.4096515Z #-}
2026-02-21T09:44:04.4096797Z /tmp/torchinductor_root/ar/cardaviyeotv76douoeiukvte6fo47v6lv74hfry5ejrxtjxtc3t.py:12:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:44:04.4097479Z /tmp/torchinductor_root/ar/cardaviyeotv76douoeiukvte6fo47v6lv74hfry5ejrxtjxtc3t.py:12:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:44:04.4098027Z [48s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:44:04.4098782Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 512], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:44:04.4099439Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:44:04.4099607Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:44:05.0240671Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:44:05.0248499Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:44:05.0248962Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:44:05.0249396Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:44:05.0249939Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [8, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:44:05.0250380Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:44:05.0251032Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:44:05.0251571Z     %cst = arith.constant dense<0.000000e+00> : tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0251843Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:44:05.0252004Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:44:05.0252163Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:44:05.0252418Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T09:44:05.0252581Z     %c54_i32 = arith.constant 54 : i32
2026-02-21T09:44:05.0252737Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:44:05.0252884Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:44:05.0253037Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:44:05.0253228Z     %cst_0 = arith.constant dense<0> : tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0253424Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:44:05.0253567Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:44:05.0253722Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:44:05.0253882Z     %c255_i32 = arith.constant 255 : i32
2026-02-21T09:44:05.0254123Z     %cst_1 = arith.constant dense<0> : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0254419Z     %cst_2 = arith.constant dense<0> : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0254615Z     %c-1_i32 = arith.constant -1 : i32
2026-02-21T09:44:05.0254815Z     %cst_3 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1>
2026-02-21T09:44:05.0255095Z     %cst_4 = arith.constant dense<4> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0255379Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:05.0255604Z     %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:05.0255827Z     %cst_7 = arith.constant dense<8192> : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0256049Z     %cst_8 = arith.constant dense<0> : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0256266Z     %cst_9 = arith.constant dense<4096> : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0256484Z     %cst_10 = arith.constant dense<0> : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0256703Z     %cst_11 = arith.constant dense<8192> : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0256989Z     %0 = tt.get_program_id x : i32
2026-02-21T09:44:05.0257141Z     %1 = arith.muli %0, %c54_i32 : i32
2026-02-21T09:44:05.0257293Z     %2 = arith.addi %1, %c54_i32 : i32
2026-02-21T09:44:05.0257449Z     %3 = arith.minsi %2, %c32768_i32 : i32
2026-02-21T09:44:05.0257710Z     %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.0258072Z     %5 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0258475Z     %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.0258870Z     %7 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0259217Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0259564Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.0259872Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0260285Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:44:05.0260848Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:44:05.0261374Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:05.0261711Z     %14 = arith.cmpi eq, %13, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:05.0271543Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked>
2026-02-21T09:44:05.0271762Z     %16 = arith.cmpi eq, %13, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:05.0271955Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked>
2026-02-21T09:44:05.0272193Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:05.0272469Z     %19 = arith.extsi %5 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0272799Z     %20 = arith.extsi %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0273029Z     %21 = arith.subi %3, %1 : i32
2026-02-21T09:44:05.0273144Z     %22 = arith.remsi %21, %c4_i32 : i32
2026-02-21T09:44:05.0273269Z     %23 = arith.subi %21, %22 : i32
2026-02-21T09:44:05.0273386Z     %24 = arith.addi %1, %23 : i32
2026-02-21T09:44:05.0273515Z     scf.for %arg3 = %1 to %24 step %c4_i32  : i32 {
2026-02-21T09:44:05.0273661Z       %29 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:44:05.0273785Z       %30 = arith.muli %29, %c2_i32 : i32
2026-02-21T09:44:05.0273910Z       %31 = arith.subi %c256_i32, %30 : i32
2026-02-21T09:44:05.0274032Z       %32 = arith.minsi %31, %c2_i32 : i32
2026-02-21T09:44:05.0274158Z       %33 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:44:05.0274278Z       %34 = arith.remsi %33, %32 : i32
2026-02-21T09:44:05.0274397Z       %35 = arith.addi %30, %34 : i32
2026-02-21T09:44:05.0274512Z       %36 = arith.divsi %33, %32 : i32
2026-02-21T09:44:05.0274626Z       %37 = arith.muli %35, %c32_i32 : i32
2026-02-21T09:44:05.0274838Z       %38 = tt.splat %37 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.0275133Z       %39 = arith.addi %38, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.0275347Z       %40 = arith.muli %36, %c32_i32 : i32
2026-02-21T09:44:05.0275538Z       %41 = tt.splat %40 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.0275759Z       %42 = arith.addi %41, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.0276042Z       %43 = tt.expand_dims %42 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:44:05.0276293Z       %44 = arith.muli %43, %cst_3 : tensor<32x1xi32, #blocked1>
2026-02-21T09:44:05.0276489Z       %45 = tt.broadcast %44 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0276841Z       %46 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0277232Z       %47 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T09:44:05.0277452Z         %200 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:44:05.0277631Z         %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0277863Z         %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0278146Z         %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:05.0278464Z         %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0278659Z         %205 = arith.addi %45, %204 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0278857Z         %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0279067Z         %207 = tt.load %206 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.0279354Z         %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0279754Z         %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0280062Z         %210 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:44:05.0280240Z         %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0280469Z         %212 = arith.addi %211, %46 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0280782Z         %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0281091Z         %214 = tt.load %213 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0281326Z         %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0281562Z         %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0281802Z         %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0282094Z         %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0282427Z         %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0282789Z         %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0283026Z         %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0283259Z         %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0283517Z         %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0283743Z         %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:44:05.0283967Z         %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:44:05.0284264Z         %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0284724Z         %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0285075Z         %228 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:44:05.0285198Z         %229 = arith.muli %228, %c2_i32 : i32
2026-02-21T09:44:05.0285416Z         %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0285639Z         %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0285913Z         %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:05.0286206Z         %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0286398Z         %234 = arith.addi %45, %233 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0286599Z         %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0286802Z         %236 = tt.load %235 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.0287066Z         %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0287484Z         %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0287768Z         %239 = arith.muli %228, %c8192_i32 : i32
2026-02-21T09:44:05.0287979Z         %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0288207Z         %241 = arith.addi %240, %46 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0288519Z         %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0288823Z         %243 = tt.load %242 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0289055Z         %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0289292Z         %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0289528Z         %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0289815Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0290145Z         %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0290427Z         %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0290664Z         %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0290895Z         %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0291128Z         %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0291371Z         %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:44:05.0291594Z         %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:44:05.0291894Z         %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0292351Z         %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0292696Z         scf.yield %256 : tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0292819Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:44:05.0292975Z       %48 = arith.truncf %47 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T09:44:05.0293139Z       %49 = arith.extsi %40 : i32 to i64
2026-02-21T09:44:05.0293255Z       %50 = arith.extsi %37 : i32 to i64
2026-02-21T09:44:05.0293414Z       %51 = tt.splat %49 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0293616Z       %52 = arith.addi %51, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0293872Z       %53 = tt.expand_dims %52 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0294123Z       %54 = arith.muli %53, %cst_7 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0294297Z       %55 = tt.broadcast %54 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0294497Z       %56 = tt.splat %50 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0294699Z       %57 = arith.addi %56, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0294949Z       %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0295210Z       %59 = tt.broadcast %58 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0295383Z       %60 = arith.addi %55, %59 : tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0295580Z       %61 = tt.addptr %18, %60 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0295769Z       %62 = arith.cmpi sge, %53, %cst_8 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0295931Z       %63 = arith.cmpi slt, %53, %cst_9 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0296082Z       %64 = arith.andi %62, %63 : tensor<32x1xi1, #mma>
2026-02-21T09:44:05.0296244Z       %65 = tt.broadcast %64 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0296426Z       %66 = arith.cmpi sge, %58, %cst_10 : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0296582Z       %67 = arith.cmpi slt, %58, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0296734Z       %68 = arith.andi %66, %67 : tensor<1x32xi1, #mma>
2026-02-21T09:44:05.0296894Z       %69 = tt.broadcast %68 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0297068Z       %70 = arith.andi %65, %69 : tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0297222Z       tt.store %61, %48, %70 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:05.0297367Z       %71 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:44:05.0297489Z       %72 = arith.divsi %71, %c256_i32 : i32
2026-02-21T09:44:05.0297606Z       %73 = arith.muli %72, %c2_i32 : i32
2026-02-21T09:44:05.0297724Z       %74 = arith.subi %c256_i32, %73 : i32
2026-02-21T09:44:05.0297839Z       %75 = arith.minsi %74, %c2_i32 : i32
2026-02-21T09:44:05.0297957Z       %76 = arith.remsi %71, %c256_i32 : i32
2026-02-21T09:44:05.0298071Z       %77 = arith.remsi %76, %75 : i32
2026-02-21T09:44:05.0298189Z       %78 = arith.addi %73, %77 : i32
2026-02-21T09:44:05.0298298Z       %79 = arith.divsi %76, %75 : i32
2026-02-21T09:44:05.0298411Z       %80 = arith.muli %78, %c32_i32 : i32
2026-02-21T09:44:05.0298617Z       %81 = tt.splat %80 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.0298936Z       %82 = arith.addi %81, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.0299145Z       %83 = arith.muli %79, %c32_i32 : i32
2026-02-21T09:44:05.0299308Z       %84 = tt.splat %83 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.0299526Z       %85 = arith.addi %84, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.0299800Z       %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:44:05.0300046Z       %87 = arith.muli %86, %cst_3 : tensor<32x1xi32, #blocked1>
2026-02-21T09:44:05.0300241Z       %88 = tt.broadcast %87 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0300589Z       %89 = tt.expand_dims %82 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0300975Z       %90 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T09:44:05.0301191Z         %200 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:44:05.0301382Z         %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0301607Z         %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0301880Z         %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:05.0302160Z         %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0302356Z         %205 = arith.addi %88, %204 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0302568Z         %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0302774Z         %207 = tt.load %206 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.0303056Z         %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0303462Z         %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0303746Z         %210 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:44:05.0303922Z         %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0304148Z         %212 = arith.addi %211, %89 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0304457Z         %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0304765Z         %214 = tt.load %213 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0305003Z         %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0305236Z         %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0305476Z         %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0305760Z         %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0306093Z         %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0306377Z         %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0306632Z         %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0306870Z         %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0307104Z         %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0307335Z         %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:44:05.0307557Z         %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:44:05.0307849Z         %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0308319Z         %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0308667Z         %228 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:44:05.0308791Z         %229 = arith.muli %228, %c2_i32 : i32
2026-02-21T09:44:05.0308964Z         %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0309204Z         %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0309482Z         %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:05.0309760Z         %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0309951Z         %234 = arith.addi %88, %233 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0310153Z         %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0310377Z         %236 = tt.load %235 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.0310645Z         %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0311064Z         %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0311343Z         %239 = arith.muli %228, %c8192_i32 : i32
2026-02-21T09:44:05.0311519Z         %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0311750Z         %241 = arith.addi %240, %89 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0312057Z         %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0312365Z         %243 = tt.load %242 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0312595Z         %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0312831Z         %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0313068Z         %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0313360Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0313694Z         %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0313972Z         %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0314208Z         %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0314458Z         %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0314688Z         %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0314917Z         %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:44:05.0315136Z         %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:44:05.0315427Z         %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0315884Z         %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0316231Z         scf.yield %256 : tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0316360Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:44:05.0316511Z       %91 = arith.truncf %90 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T09:44:05.0316675Z       %92 = arith.extsi %83 : i32 to i64
2026-02-21T09:44:05.0316789Z       %93 = arith.extsi %80 : i32 to i64
2026-02-21T09:44:05.0316948Z       %94 = tt.splat %92 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0317168Z       %95 = arith.addi %94, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0317420Z       %96 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0317652Z       %97 = arith.muli %96, %cst_7 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0317823Z       %98 = tt.broadcast %97 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0318022Z       %99 = tt.splat %93 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0318249Z       %100 = arith.addi %99, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0318509Z       %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0318786Z       %102 = tt.broadcast %101 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0318963Z       %103 = arith.addi %98, %102 : tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0319149Z       %104 = tt.addptr %18, %103 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0319342Z       %105 = arith.cmpi sge, %96, %cst_8 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0319506Z       %106 = arith.cmpi slt, %96, %cst_9 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0319663Z       %107 = arith.andi %105, %106 : tensor<32x1xi1, #mma>
2026-02-21T09:44:05.0319829Z       %108 = tt.broadcast %107 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0320015Z       %109 = arith.cmpi sge, %101, %cst_10 : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0320179Z       %110 = arith.cmpi slt, %101, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0320336Z       %111 = arith.andi %109, %110 : tensor<1x32xi1, #mma>
2026-02-21T09:44:05.0320504Z       %112 = tt.broadcast %111 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0320679Z       %113 = arith.andi %108, %112 : tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0320836Z       tt.store %104, %91, %113 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:05.0320980Z       %114 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:44:05.0321103Z       %115 = arith.divsi %114, %c256_i32 : i32
2026-02-21T09:44:05.0321223Z       %116 = arith.muli %115, %c2_i32 : i32
2026-02-21T09:44:05.0321343Z       %117 = arith.subi %c256_i32, %116 : i32
2026-02-21T09:44:05.0321462Z       %118 = arith.minsi %117, %c2_i32 : i32
2026-02-21T09:44:05.0321585Z       %119 = arith.remsi %114, %c256_i32 : i32
2026-02-21T09:44:05.0321706Z       %120 = arith.remsi %119, %118 : i32
2026-02-21T09:44:05.0321838Z       %121 = arith.addi %116, %120 : i32
2026-02-21T09:44:05.0321954Z       %122 = arith.divsi %119, %118 : i32
2026-02-21T09:44:05.0322070Z       %123 = arith.muli %121, %c32_i32 : i32
2026-02-21T09:44:05.0322282Z       %124 = tt.splat %123 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.0322865Z       %125 = arith.addi %124, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.0323081Z       %126 = arith.muli %122, %c32_i32 : i32
2026-02-21T09:44:05.0323250Z       %127 = tt.splat %126 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.0323474Z       %128 = arith.addi %127, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.0323793Z       %129 = tt.expand_dims %128 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:44:05.0332754Z       %130 = arith.muli %129, %cst_3 : tensor<32x1xi32, #blocked1>
2026-02-21T09:44:05.0332960Z       %131 = tt.broadcast %130 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0333320Z       %132 = tt.expand_dims %125 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0333765Z       %133 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T09:44:05.0333980Z         %200 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:44:05.0334155Z         %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0334380Z         %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0334677Z         %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:05.0334956Z         %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0335172Z         %205 = arith.addi %131, %204 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0335371Z         %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0335578Z         %207 = tt.load %206 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.0335847Z         %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0336250Z         %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0336535Z         %210 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:44:05.0336717Z         %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0336947Z         %212 = arith.addi %211, %132 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0337260Z         %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0337576Z         %214 = tt.load %213 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0337806Z         %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0338042Z         %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0338282Z         %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0338573Z         %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0338928Z         %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0339213Z         %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0339451Z         %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0339690Z         %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0339920Z         %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0340150Z         %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:44:05.0340369Z         %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:44:05.0340667Z         %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0341135Z         %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0341497Z         %228 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:44:05.0341651Z         %229 = arith.muli %228, %c2_i32 : i32
2026-02-21T09:44:05.0341822Z         %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0342046Z         %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0342325Z         %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:05.0342612Z         %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0342807Z         %234 = arith.addi %131, %233 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0343024Z         %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0343227Z         %236 = tt.load %235 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.0343494Z         %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0343891Z         %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0344171Z         %239 = arith.muli %228, %c8192_i32 : i32
2026-02-21T09:44:05.0344348Z         %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0344575Z         %241 = arith.addi %240, %132 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0344885Z         %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0345193Z         %243 = tt.load %242 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0345427Z         %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0345661Z         %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0345894Z         %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0346181Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0346531Z         %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0346812Z         %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0347051Z         %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0347283Z         %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0347512Z         %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0347738Z         %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:44:05.0347959Z         %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:44:05.0348252Z         %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0348713Z         %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0349059Z         scf.yield %256 : tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0349204Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:44:05.0349361Z       %134 = arith.truncf %133 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T09:44:05.0349535Z       %135 = arith.extsi %126 : i32 to i64
2026-02-21T09:44:05.0349652Z       %136 = arith.extsi %123 : i32 to i64
2026-02-21T09:44:05.0349813Z       %137 = tt.splat %135 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0350022Z       %138 = arith.addi %137, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0350300Z       %139 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0350541Z       %140 = arith.muli %139, %cst_7 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0350716Z       %141 = tt.broadcast %140 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0350937Z       %142 = tt.splat %136 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0351142Z       %143 = arith.addi %142, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0351403Z       %144 = tt.expand_dims %143 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0351661Z       %145 = tt.broadcast %144 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0351839Z       %146 = arith.addi %141, %145 : tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0352027Z       %147 = tt.addptr %18, %146 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0352221Z       %148 = arith.cmpi sge, %139, %cst_8 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0352387Z       %149 = arith.cmpi slt, %139, %cst_9 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0352541Z       %150 = arith.andi %148, %149 : tensor<32x1xi1, #mma>
2026-02-21T09:44:05.0352714Z       %151 = tt.broadcast %150 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0352902Z       %152 = arith.cmpi sge, %144, %cst_10 : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0353067Z       %153 = arith.cmpi slt, %144, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0353225Z       %154 = arith.andi %152, %153 : tensor<1x32xi1, #mma>
2026-02-21T09:44:05.0353390Z       %155 = tt.broadcast %154 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0353566Z       %156 = arith.andi %151, %155 : tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0353720Z       tt.store %147, %134, %156 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:05.0353868Z       %157 = arith.addi %arg3, %c3_i32 : i32
2026-02-21T09:44:05.0353992Z       %158 = arith.divsi %157, %c256_i32 : i32
2026-02-21T09:44:05.0354132Z       %159 = arith.muli %158, %c2_i32 : i32
2026-02-21T09:44:05.0354254Z       %160 = arith.subi %c256_i32, %159 : i32
2026-02-21T09:44:05.0354373Z       %161 = arith.minsi %160, %c2_i32 : i32
2026-02-21T09:44:05.0354493Z       %162 = arith.remsi %157, %c256_i32 : i32
2026-02-21T09:44:05.0354609Z       %163 = arith.remsi %162, %161 : i32
2026-02-21T09:44:05.0354724Z       %164 = arith.addi %159, %163 : i32
2026-02-21T09:44:05.0354836Z       %165 = arith.divsi %162, %161 : i32
2026-02-21T09:44:05.0354951Z       %166 = arith.muli %164, %c32_i32 : i32
2026-02-21T09:44:05.0355160Z       %167 = tt.splat %166 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.0355459Z       %168 = arith.addi %167, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.0355670Z       %169 = arith.muli %165, %c32_i32 : i32
2026-02-21T09:44:05.0355840Z       %170 = tt.splat %169 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.0356062Z       %171 = arith.addi %170, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.0356342Z       %172 = tt.expand_dims %171 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:44:05.0356606Z       %173 = arith.muli %172, %cst_3 : tensor<32x1xi32, #blocked1>
2026-02-21T09:44:05.0356801Z       %174 = tt.broadcast %173 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0357150Z       %175 = tt.expand_dims %168 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0357538Z       %176 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T09:44:05.0357766Z         %200 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:44:05.0357938Z         %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0358182Z         %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0358455Z         %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:05.0358730Z         %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0358924Z         %205 = arith.addi %174, %204 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0359120Z         %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0359324Z         %207 = tt.load %206 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.0359587Z         %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0359988Z         %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0360274Z         %210 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:44:05.0360453Z         %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0360681Z         %212 = arith.addi %211, %175 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0360991Z         %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0361299Z         %214 = tt.load %213 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0361531Z         %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0361785Z         %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0362020Z         %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0362310Z         %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0362733Z         %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0363018Z         %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0363253Z         %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0363489Z         %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0363724Z         %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0363951Z         %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:44:05.0364177Z         %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:44:05.0364486Z         %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0364945Z         %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0365291Z         %228 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:44:05.0365412Z         %229 = arith.muli %228, %c2_i32 : i32
2026-02-21T09:44:05.0365599Z         %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0365819Z         %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0366095Z         %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:05.0366395Z         %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0366586Z         %234 = arith.addi %174, %233 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0366786Z         %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0366986Z         %236 = tt.load %235 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.0367252Z         %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0367648Z         %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0367924Z         %239 = arith.muli %228, %c8192_i32 : i32
2026-02-21T09:44:05.0368103Z         %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0368330Z         %241 = arith.addi %240, %175 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0368643Z         %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0368952Z         %243 = tt.load %242 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0369181Z         %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0369417Z         %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0369669Z         %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0369956Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0370292Z         %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0370570Z         %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0370810Z         %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0371042Z         %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0371272Z         %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0371502Z         %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:44:05.0371717Z         %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:44:05.0372012Z         %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0372478Z         %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0372818Z         scf.yield %256 : tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0372943Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:44:05.0373096Z       %177 = arith.truncf %176 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T09:44:05.0373266Z       %178 = arith.extsi %169 : i32 to i64
2026-02-21T09:44:05.0373393Z       %179 = arith.extsi %166 : i32 to i64
2026-02-21T09:44:05.0373557Z       %180 = tt.splat %178 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0373768Z       %181 = arith.addi %180, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0374043Z       %182 = tt.expand_dims %181 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0374277Z       %183 = arith.muli %182, %cst_7 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0374451Z       %184 = tt.broadcast %183 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0374650Z       %185 = tt.splat %179 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0374852Z       %186 = arith.addi %185, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0375106Z       %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0375359Z       %188 = tt.broadcast %187 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0375535Z       %189 = arith.addi %184, %188 : tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0375716Z       %190 = tt.addptr %18, %189 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0375910Z       %191 = arith.cmpi sge, %182, %cst_8 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0376069Z       %192 = arith.cmpi slt, %182, %cst_9 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0376224Z       %193 = arith.andi %191, %192 : tensor<32x1xi1, #mma>
2026-02-21T09:44:05.0376390Z       %194 = tt.broadcast %193 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0376571Z       %195 = arith.cmpi sge, %187, %cst_10 : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0376733Z       %196 = arith.cmpi slt, %187, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0376883Z       %197 = arith.andi %195, %196 : tensor<1x32xi1, #mma>
2026-02-21T09:44:05.0377066Z       %198 = tt.broadcast %197 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0377236Z       %199 = arith.andi %194, %198 : tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0377392Z       tt.store %190, %177, %199 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:05.0377538Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:44:05.0377655Z     %25 = arith.subi %3, %24 : i32
2026-02-21T09:44:05.0377765Z     %26 = arith.muli %25, %c256_i32 : i32
2026-02-21T09:44:05.0377878Z     %27 = arith.subi %24, %c1_i32 : i32
2026-02-21T09:44:05.0378349Z     %28:8 = scf.for %arg3 = %c0_i32 to %26 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %27, %arg6 = %c0_i32, %arg7 = %cst, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_2, %arg11 = %cst_1) -> (i32, i32, i32, tensor<32x32xf32, #mma>, i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>)  : i32 {
2026-02-21T09:44:05.0378810Z       %29 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:44:05.0378934Z       %30 = arith.cmpi eq, %arg4, %c255_i32 : i32
2026-02-21T09:44:05.0379065Z       %31 = arith.select %30, %c0_i32, %29 : i32
2026-02-21T09:44:05.0379185Z       %32 = arith.cmpi eq, %31, %c0_i32 : i32
2026-02-21T09:44:05.0379311Z       %33 = arith.select %32, %c0_i32, %arg6 : i32
2026-02-21T09:44:05.0379538Z       %34:5 = scf.if %32 -> (i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32) {
2026-02-21T09:44:05.0379783Z         %95 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T09:44:05.0379902Z         %96 = arith.divsi %95, %c256_i32 : i32
2026-02-21T09:44:05.0380017Z         %97 = arith.muli %96, %c2_i32 : i32
2026-02-21T09:44:05.0380131Z         %98 = arith.subi %c256_i32, %97 : i32
2026-02-21T09:44:05.0380246Z         %99 = arith.minsi %98, %c2_i32 : i32
2026-02-21T09:44:05.0380363Z         %100 = arith.remsi %95, %c256_i32 : i32
2026-02-21T09:44:05.0380479Z         %101 = arith.remsi %100, %99 : i32
2026-02-21T09:44:05.0380594Z         %102 = arith.addi %97, %101 : i32
2026-02-21T09:44:05.0380721Z         %103 = arith.divsi %100, %99 : i32
2026-02-21T09:44:05.0380838Z         %104 = arith.muli %102, %c32_i32 : i32
2026-02-21T09:44:05.0381062Z         %105 = tt.splat %104 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.0381357Z         %106 = arith.addi %105, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.0381568Z         %107 = arith.muli %103, %c32_i32 : i32
2026-02-21T09:44:05.0381736Z         %108 = tt.splat %107 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.0381959Z         %109 = arith.addi %108, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.0382242Z         %110 = tt.expand_dims %109 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:44:05.0382496Z         %111 = arith.muli %110, %cst_3 : tensor<32x1xi32, #blocked1>
2026-02-21T09:44:05.0382691Z         %112 = tt.broadcast %111 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0383043Z         %113 = tt.expand_dims %106 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0383468Z         scf.yield %104, %107, %112, %113, %95 : i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32
2026-02-21T09:44:05.0383700Z       } else {
2026-02-21T09:44:05.0383922Z         scf.yield %arg8, %arg9, %arg10, %arg11, %arg5 : i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32
2026-02-21T09:44:05.0384172Z       }
2026-02-21T09:44:05.0384251Z       %35 = arith.muli %33, %c2_i32 : i32
2026-02-21T09:44:05.0384414Z       %36 = tt.splat %35 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0384646Z       %37 = arith.addi %36, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0384910Z       %38 = tt.expand_dims %37 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:05.0385177Z       %39 = tt.broadcast %38 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0385362Z       %40 = arith.addi %34#2, %39 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0385553Z       %41 = tt.addptr %9, %40 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0385747Z       %42 = tt.load %41 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.0386002Z       %43 = ttg.convert_layout %42 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0386393Z       %44 = arith.extf %43 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0386665Z       %45 = arith.muli %33, %c8192_i32 : i32
2026-02-21T09:44:05.0386830Z       %46 = tt.splat %45 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0387052Z       %47 = arith.addi %46, %34#3 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0387364Z       %48 = tt.addptr %10, %47 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0387666Z       %49 = tt.load %48 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0387886Z       %50 = arith.shli %49, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0388109Z       %51 = arith.shrsi %50, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0388355Z       %52 = arith.shrsi %49, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0388629Z       %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0388980Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0389249Z       %55 = tt.broadcast %53 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0389475Z       %56 = arith.select %15, %55, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0389702Z       %57 = tt.broadcast %54 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0389921Z       %58 = arith.select %17, %57, %56 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0390138Z       %59 = tt.reshape %58 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:44:05.0390347Z       %60 = arith.sitofp %59 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:44:05.0390628Z       %61 = ttg.convert_layout %60 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0391080Z       %62 = tt.dot %44, %61, %arg7, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0391412Z       %63 = arith.addi %33, %c1_i32 : i32
2026-02-21T09:44:05.0391526Z       %64 = arith.muli %63, %c2_i32 : i32
2026-02-21T09:44:05.0391683Z       %65 = tt.splat %64 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0391893Z       %66 = arith.addi %65, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.0392156Z       %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:44:05.0392438Z       %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0392625Z       %69 = arith.addi %34#2, %68 : tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0392814Z       %70 = tt.addptr %9, %69 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:44:05.0393007Z       %71 = tt.load %70 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.0393258Z       %72 = ttg.convert_layout %71 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0393642Z       %73 = arith.extf %72 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0393911Z       %74 = arith.muli %63, %c8192_i32 : i32
2026-02-21T09:44:05.0394074Z       %75 = tt.splat %74 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0394295Z       %76 = arith.addi %75, %34#3 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0394591Z       %77 = tt.addptr %10, %76 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0394902Z       %78 = tt.load %77 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0395123Z       %79 = arith.shli %78, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0395345Z       %80 = arith.shrsi %79, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0395569Z       %81 = arith.shrsi %78, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0395843Z       %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0396176Z       %83 = tt.expand_dims %81 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:44:05.0396445Z       %84 = tt.broadcast %82 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0396686Z       %85 = arith.select %15, %84, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0396912Z       %86 = tt.broadcast %83 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0397129Z       %87 = arith.select %17, %86, %85 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:44:05.0397344Z       %88 = tt.reshape %87 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:44:05.0397552Z       %89 = arith.sitofp %88 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:44:05.0397830Z       %90 = ttg.convert_layout %89 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.0398272Z       %91 = tt.dot %73, %90, %62, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0398602Z       %92 = arith.addi %33, %c2_i32 : i32
2026-02-21T09:44:05.0398724Z       %93 = arith.cmpi eq, %31, %c255_i32 : i32
2026-02-21T09:44:05.0398867Z       %94 = arith.select %93, %cst, %91 : tensor<32x32xf32, #mma>
2026-02-21T09:44:05.0398997Z       scf.if %93 {
2026-02-21T09:44:05.0399128Z         %95 = arith.truncf %91 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T09:44:05.0399289Z         %96 = arith.extsi %34#1 : i32 to i64
2026-02-21T09:44:05.0399406Z         %97 = arith.extsi %34#0 : i32 to i64
2026-02-21T09:44:05.0399562Z         %98 = tt.splat %96 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0399762Z         %99 = arith.addi %98, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.0400038Z         %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0400273Z         %101 = arith.muli %100, %cst_7 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0400447Z         %102 = tt.broadcast %101 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0400648Z         %103 = tt.splat %97 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0400853Z         %104 = arith.addi %103, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.0401112Z         %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0401363Z         %106 = tt.broadcast %105 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0401541Z         %107 = arith.addi %102, %106 : tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0401727Z         %108 = tt.addptr %18, %107 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi64, #mma>
2026-02-21T09:44:05.0401924Z         %109 = arith.cmpi sge, %100, %cst_8 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0402088Z         %110 = arith.cmpi slt, %100, %cst_9 : tensor<32x1xi64, #mma>
2026-02-21T09:44:05.0402243Z         %111 = arith.andi %109, %110 : tensor<32x1xi1, #mma>
2026-02-21T09:44:05.0402427Z         %112 = tt.broadcast %111 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0402692Z         %113 = arith.cmpi sge, %105, %cst_10 : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0402854Z         %114 = arith.cmpi slt, %105, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:44:05.0403007Z         %115 = arith.andi %113, %114 : tensor<1x32xi1, #mma>
2026-02-21T09:44:05.0403171Z         %116 = tt.broadcast %115 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0403344Z         %117 = arith.andi %112, %116 : tensor<32x32xi1, #mma>
2026-02-21T09:44:05.0403517Z         tt.store %108, %95, %117 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:05.0403648Z       }
2026-02-21T09:44:05.0403904Z       scf.yield %31, %34#4, %92, %94, %34#0, %34#1, %34#2, %34#3 : i32, i32, i32, tensor<32x32xf32, #mma>, i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.0404199Z     }
2026-02-21T09:44:05.0404273Z     tt.return
2026-02-21T09:44:05.0404348Z   }
2026-02-21T09:44:05.0404420Z }
2026-02-21T09:44:05.0404463Z 
2026-02-21T09:44:05.0404492Z {-#
2026-02-21T09:44:05.0404573Z   external_resources: {
2026-02-21T09:44:05.0404669Z     mlir_reproducer: {
2026-02-21T09:44:05.0405660Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:44:05.0406641Z       disable_threading: false,
2026-02-21T09:44:05.0406744Z       verify_each: true
2026-02-21T09:44:05.0406829Z     }
2026-02-21T09:44:05.0406897Z   }
2026-02-21T09:44:05.0406962Z #-}
2026-02-21T09:44:05.0407234Z /tmp/torchinductor_root/mo/cmoi5lzvsrwofsyxf4bcpxp5qbx23c7jegvec4ziy5bi6i6bxgsl.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:44:05.0407908Z /tmp/torchinductor_root/mo/cmoi5lzvsrwofsyxf4bcpxp5qbx23c7jegvec4ziy5bi6i6bxgsl.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:44:05.0408468Z [49s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:44:05.0409238Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 32], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[4, 2], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:44:05.0409937Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:44:05.0410099Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:44:05.4711487Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:44:05.4720347Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:44:05.4720668Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:44:05.4720982Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:44:05.4721363Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:44:05.4721615Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:44:05.4721794Z #smem = #ttg.shared_memory
2026-02-21T09:44:05.4722018Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:44:05.4722512Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:44:05.4722998Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x16xf32, #mma>
2026-02-21T09:44:05.4723187Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:44:05.4723301Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:44:05.4723414Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:44:05.4723561Z     %cst_0 = arith.constant dense<0> : tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4723714Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:44:05.4723824Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:44:05.4723930Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:44:05.4724037Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:44:05.4724160Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:44:05.4724343Z     %cst_1 = arith.constant dense<0> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4724597Z     %cst_2 = arith.constant dense<8192> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4724850Z     %cst_3 = arith.constant dense<0> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4725097Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4725348Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4725597Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4725819Z     %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:44:05.4726029Z     %cst_8 = arith.constant dense<4> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4726235Z     %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:05.4726409Z     %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:05.4726613Z     %cst_11 = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:44:05.4726754Z     %0 = tt.get_program_id x : i32
2026-02-21T09:44:05.4726871Z     %1 = arith.divsi %0, %c512_i32 : i32
2026-02-21T09:44:05.4726985Z     %2 = arith.muli %1, %c16_i32 : i32
2026-02-21T09:44:05.4727097Z     %3 = arith.subi %c512_i32, %2 : i32
2026-02-21T09:44:05.4727210Z     %4 = arith.minsi %3, %c16_i32 : i32
2026-02-21T09:44:05.4727320Z     %5 = arith.remsi %0, %c512_i32 : i32
2026-02-21T09:44:05.4727432Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:44:05.4727539Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:44:05.4727645Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:44:05.4727749Z     %9 = arith.muli %7, %c16_i32 : i32
2026-02-21T09:44:05.4727984Z     %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4728292Z     %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.4728531Z     %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.4728734Z     %13 = arith.addi %12, %11 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:05.4728909Z     %14 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:44:05.4729129Z     %15 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.4729399Z     %16 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.4729641Z     %17 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.4729849Z     %18 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.4730055Z     %19 = arith.addi %17, %15 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:05.4730284Z     %20 = arith.addi %18, %16 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:05.4730518Z     %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.4730835Z     %22 = tt.expand_dims %19 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:44:05.4731085Z     %23 = arith.muli %22, %cst_7 : tensor<128x1xi32, #blocked1>
2026-02-21T09:44:05.4731274Z     %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4731486Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.4731646Z     %26 = arith.extsi %9 : i32 to i64
2026-02-21T09:44:05.4731834Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4732146Z     %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4732579Z     %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4732986Z     %30 = tt.splat %26 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4733387Z     %31 = arith.extsi %10 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4733790Z     %32 = arith.addi %30, %31 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4734190Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4734608Z     %34 = tt.broadcast %33 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4734920Z     %35 = arith.cmpi sge, %33, %cst_3 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4735164Z     %36 = arith.cmpi slt, %33, %cst_2 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4735387Z     %37 = arith.andi %35, %36 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4735679Z     %38 = tt.broadcast %37 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4736032Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:44:05.4736439Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:44:05.4736834Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:05.4737109Z     %42 = arith.cmpi eq, %41, %cst_9 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:05.4737301Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked>
2026-02-21T09:44:05.4737502Z     %44 = arith.cmpi eq, %41, %cst_10 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:05.4737695Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked>
2026-02-21T09:44:05.4737955Z     %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg4 = %cst) -> (tensor<128x16xf32, #mma>)  : i32 {
2026-02-21T09:44:05.4738168Z       %56 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:44:05.4738355Z       %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.4738571Z       %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.4738877Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:44:05.4739151Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4739342Z       %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4739536Z       %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4739743Z       %63 = tt.load %62 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.4739961Z       %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:44:05.4740299Z       %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4740708Z       %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4740990Z       %67 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:44:05.4741197Z       %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4741484Z       %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4741863Z       %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4742213Z       %71 = arith.muli %70, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4742529Z       %72 = tt.broadcast %71 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4742828Z       %73 = arith.addi %72, %34 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4743131Z       %74 = tt.addptr %27, %73 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4743442Z       %75 = arith.cmpi sge, %70, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4743683Z       %76 = arith.cmpi slt, %70, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4743912Z       %77 = arith.andi %75, %76 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4744208Z       %78 = tt.broadcast %77 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4744510Z       %79 = arith.andi %78, %38 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4744749Z       %80 = tt.load %74, %79, %cst_1 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4744991Z       %81 = arith.shli %80, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4745230Z       %82 = arith.shrsi %81, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4745460Z       %83 = arith.shrsi %80, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4745738Z       %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:44:05.4746062Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:44:05.4746350Z       %86 = tt.broadcast %84 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4746578Z       %87 = arith.select %43, %86, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4746840Z       %88 = tt.broadcast %85 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4747066Z       %89 = arith.select %45, %88, %87 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4747283Z       %90 = tt.reshape %89 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T09:44:05.4747495Z       %91 = arith.sitofp %90 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T09:44:05.4747776Z       %92 = ttg.convert_layout %91 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4748249Z       %93 = tt.dot %66, %92, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma>
2026-02-21T09:44:05.4748598Z       %94 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:44:05.4748726Z       %95 = arith.muli %94, %c2_i32 : i32
2026-02-21T09:44:05.4748896Z       %96 = tt.splat %95 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.4749109Z       %97 = arith.addi %96, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.4749375Z       %98 = tt.expand_dims %97 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:44:05.4749642Z       %99 = tt.broadcast %98 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4749829Z       %100 = arith.addi %24, %99 : tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4750035Z       %101 = tt.addptr %25, %100 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4750259Z       %102 = tt.load %101 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.4750484Z       %103 = ttg.local_alloc %102 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:44:05.4750818Z       %104 = ttg.local_load %103 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4751229Z       %105 = arith.extf %104 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4751516Z       %106 = arith.extsi %94 : i32 to i64
2026-02-21T09:44:05.4751724Z       %107 = tt.splat %106 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4752023Z       %108 = arith.addi %107, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4752413Z       %109 = tt.expand_dims %108 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4752768Z       %110 = arith.muli %109, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4753090Z       %111 = tt.broadcast %110 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4753392Z       %112 = arith.addi %111, %34 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4753696Z       %113 = tt.addptr %27, %112 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4754010Z       %114 = arith.cmpi sge, %109, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4754265Z       %115 = arith.cmpi slt, %109, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4754500Z       %116 = arith.andi %114, %115 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4754817Z       %117 = tt.broadcast %116 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4755112Z       %118 = arith.andi %117, %38 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4755353Z       %119 = tt.load %113, %118, %cst_1 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4755595Z       %120 = arith.shli %119, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4755830Z       %121 = arith.shrsi %120, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4756071Z       %122 = arith.shrsi %119, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4756360Z       %123 = tt.expand_dims %121 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:44:05.4756694Z       %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:44:05.4757003Z       %125 = tt.broadcast %123 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4757237Z       %126 = arith.select %43, %125, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4757477Z       %127 = tt.broadcast %124 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4757708Z       %128 = arith.select %45, %127, %126 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4757934Z       %129 = tt.reshape %128 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T09:44:05.4758159Z       %130 = arith.sitofp %129 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T09:44:05.4758465Z       %131 = ttg.convert_layout %130 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4758925Z       %132 = tt.dot %105, %131, %93, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma>
2026-02-21T09:44:05.4759269Z       %133 = arith.addi %arg3, %c4_i32 : i32
2026-02-21T09:44:05.4759389Z       %134 = arith.muli %133, %c2_i32 : i32
2026-02-21T09:44:05.4759561Z       %135 = tt.splat %134 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.4759783Z       %136 = arith.addi %135, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.4760067Z       %137 = tt.expand_dims %136 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:44:05.4760347Z       %138 = tt.broadcast %137 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4760544Z       %139 = arith.addi %24, %138 : tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4760750Z       %140 = tt.addptr %25, %139 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4760971Z       %141 = tt.load %140 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.4761195Z       %142 = ttg.local_alloc %141 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:44:05.4761526Z       %143 = ttg.local_load %142 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4761932Z       %144 = arith.extf %143 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4762230Z       %145 = arith.extsi %133 : i32 to i64
2026-02-21T09:44:05.4762436Z       %146 = tt.splat %145 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4762786Z       %147 = arith.addi %146, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4763177Z       %148 = tt.expand_dims %147 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4763526Z       %149 = arith.muli %148, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4763833Z       %150 = tt.broadcast %149 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4764133Z       %151 = arith.addi %150, %34 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4764439Z       %152 = tt.addptr %27, %151 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4764755Z       %153 = arith.cmpi sge, %148, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4765000Z       %154 = arith.cmpi slt, %148, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4765238Z       %155 = arith.andi %153, %154 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4765536Z       %156 = tt.broadcast %155 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4765827Z       %157 = arith.andi %156, %38 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4766069Z       %158 = tt.load %152, %157, %cst_1 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4766331Z       %159 = arith.shli %158, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4766563Z       %160 = arith.shrsi %159, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4766797Z       %161 = arith.shrsi %158, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4767079Z       %162 = tt.expand_dims %160 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:44:05.4767410Z       %163 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:44:05.4767691Z       %164 = tt.broadcast %162 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4767926Z       %165 = arith.select %43, %164, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4768162Z       %166 = tt.broadcast %163 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4768391Z       %167 = arith.select %45, %166, %165 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4768617Z       %168 = tt.reshape %167 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T09:44:05.4768844Z       %169 = arith.sitofp %168 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T09:44:05.4769154Z       %170 = ttg.convert_layout %169 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4769612Z       %171 = tt.dot %144, %170, %132, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma>
2026-02-21T09:44:05.4769952Z       %172 = arith.addi %arg3, %c6_i32 : i32
2026-02-21T09:44:05.4770076Z       %173 = arith.muli %172, %c2_i32 : i32
2026-02-21T09:44:05.4770265Z       %174 = tt.splat %173 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.4770486Z       %175 = arith.addi %174, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:05.4770783Z       %176 = tt.expand_dims %175 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:44:05.4771055Z       %177 = tt.broadcast %176 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4771253Z       %178 = arith.addi %24, %177 : tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4771454Z       %179 = tt.addptr %25, %178 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:44:05.4771662Z       %180 = tt.load %179 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:05.4771884Z       %181 = ttg.local_alloc %180 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:44:05.4772211Z       %182 = ttg.local_load %181 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4772615Z       %183 = arith.extf %182 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4772902Z       %184 = arith.extsi %172 : i32 to i64
2026-02-21T09:44:05.4773111Z       %185 = tt.splat %184 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4773413Z       %186 = arith.addi %185, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:05.4773798Z       %187 = tt.expand_dims %186 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4774152Z       %188 = arith.muli %187, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4774472Z       %189 = tt.broadcast %188 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4774768Z       %190 = arith.addi %189, %34 : tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4775073Z       %191 = tt.addptr %27, %190 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4775386Z       %192 = arith.cmpi sge, %187, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4775623Z       %193 = arith.cmpi slt, %187, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4775856Z       %194 = arith.andi %192, %193 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4776149Z       %195 = tt.broadcast %194 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4776443Z       %196 = arith.andi %195, %38 : tensor<2x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4776683Z       %197 = tt.load %191, %196, %cst_1 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4776933Z       %198 = arith.shli %197, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4777163Z       %199 = arith.shrsi %198, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4777395Z       %200 = arith.shrsi %197, %cst_8 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:05.4777683Z       %201 = tt.expand_dims %199 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:44:05.4778029Z       %202 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:44:05.4778329Z       %203 = tt.broadcast %201 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4778563Z       %204 = arith.select %43, %203, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4778812Z       %205 = tt.broadcast %202 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4779040Z       %206 = arith.select %45, %205, %204 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:05.4779265Z       %207 = tt.reshape %206 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T09:44:05.4779478Z       %208 = arith.sitofp %207 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T09:44:05.4779767Z       %209 = ttg.convert_layout %208 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:05.4780220Z       %210 = tt.dot %183, %209, %171, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma>
2026-02-21T09:44:05.4780565Z       scf.yield %210 : tensor<128x16xf32, #mma>
2026-02-21T09:44:05.4780713Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:44:05.4780895Z     %47 = arith.truncf %46 : tensor<128x16xf32, #mma> to tensor<128x16xbf16, #mma>
2026-02-21T09:44:05.4781156Z     %48 = tt.expand_dims %20 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:44:05.4781386Z     %49 = arith.muli %48, %cst_11 : tensor<128x1xi32, #mma>
2026-02-21T09:44:05.4781614Z     %50 = tt.expand_dims %13 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma>
2026-02-21T09:44:05.4781865Z     %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x16xi32, #mma>
2026-02-21T09:44:05.4782060Z     %52 = tt.broadcast %50 : tensor<1x16xi32, #mma> -> tensor<128x16xi32, #mma>
2026-02-21T09:44:05.4782256Z     %53 = arith.addi %51, %52 : tensor<128x16xi32, #mma>
2026-02-21T09:44:05.4782421Z     %54 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:05.4782630Z     %55 = tt.addptr %54, %53 : tensor<128x16x!tt.ptr<bf16>, #mma>, tensor<128x16xi32, #mma>
2026-02-21T09:44:05.4782820Z     tt.store %55, %47 : tensor<128x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:05.4782946Z     tt.return
2026-02-21T09:44:05.4783030Z   }
2026-02-21T09:44:05.4783102Z }
2026-02-21T09:44:05.4783145Z 
2026-02-21T09:44:05.4783179Z {-#
2026-02-21T09:44:05.4783257Z   external_resources: {
2026-02-21T09:44:05.4783355Z     mlir_reproducer: {
2026-02-21T09:44:05.4784342Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:44:05.4785334Z       disable_threading: false,
2026-02-21T09:44:05.4785440Z       verify_each: true
2026-02-21T09:44:05.4785527Z     }
2026-02-21T09:44:05.4785599Z   }
2026-02-21T09:44:05.4785667Z #-}
2026-02-21T09:44:05.4785946Z /tmp/torchinductor_root/mr/cmr3ht6tszsgmysysw2xlrimjdi2xbhhqu2jvkhy3a7kq67i2dot.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:44:05.4786636Z /tmp/torchinductor_root/mr/cmr3ht6tszsgmysysw2xlrimjdi2xbhhqu2jvkhy3a7kq67i2dot.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:44:05.4787177Z [49s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:44:05.4787916Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 16], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:44:05.4788571Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:44:05.4788735Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:44:14.8072838Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:44:14.8075981Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:44:14.8076935Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:44:14.8077770Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:44:14.8078537Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [8, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:44:14.8079382Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:44:14.8080448Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:44:14.8081647Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x16xf32, #mma>
2026-02-21T09:44:14.8081997Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:44:14.8082270Z     %c38912_i32 = arith.constant 38912 : i32
2026-02-21T09:44:14.8082527Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:44:14.8082858Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T09:44:14.8083190Z     %cst_0 = arith.constant dense<0> : tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:14.8083515Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T09:44:14.8083775Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:44:14.8084023Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:44:14.8084269Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:44:14.8084518Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:44:14.8084759Z     %c255_i32 = arith.constant 255 : i32
2026-02-21T09:44:14.8085003Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:44:14.8085404Z     %cst_1 = arith.constant dense<0> : tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8085893Z     %cst_2 = arith.constant dense<0> : tensor<128x4xi32, #blocked1>
2026-02-21T09:44:14.8086224Z     %c-1_i32 = arith.constant -1 : i32
2026-02-21T09:44:14.8086470Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:44:14.8086781Z     %cst_3 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:44:14.8087351Z     %cst_4 = arith.constant dense<8192> : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8087908Z     %cst_5 = arith.constant dense<4> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8088371Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:14.8088745Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:14.8089119Z     %cst_8 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:44:14.8089491Z     %cst_9 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:44:14.8089922Z     %cst_10 = arith.constant dense<4096> : tensor<128x1xi64, #mma>
2026-02-21T09:44:14.8090276Z     %cst_11 = arith.constant dense<0> : tensor<1x16xi64, #mma>
2026-02-21T09:44:14.8090639Z     %cst_12 = arith.constant dense<8192> : tensor<1x16xi64, #mma>
2026-02-21T09:44:14.8090881Z     %0 = tt.get_program_id x : i32
2026-02-21T09:44:14.8091183Z     %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:14.8091607Z     %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:14.8092074Z     %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:14.8092541Z     %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:14.8093001Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:14.8093460Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:14.8093825Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:14.8094183Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8094638Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:44:14.8095265Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:44:14.8095874Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:14.8096288Z     %12 = arith.cmpi eq, %11, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:14.8096586Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked>
2026-02-21T09:44:14.8096879Z     %14 = arith.cmpi eq, %11, %cst_7 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:44:14.8097166Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked>
2026-02-21T09:44:14.8097474Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:14.8097887Z     %17 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:14.8098392Z     %18 = arith.extsi %4 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:14.8098735Z     %19 = arith.subi %c16384_i32, %0 : i32
2026-02-21T09:44:14.8098931Z     %20 = arith.ceildivsi %19, %c38912_i32 : i32
2026-02-21T09:44:14.8099120Z     %21 = arith.muli %20, %c256_i32 : i32
2026-02-21T09:44:14.8099299Z     %22 = arith.subi %0, %c38912_i32 : i32
2026-02-21T09:44:14.8100036Z     %23:8 = scf.for %arg3 = %c0_i32 to %21 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %22, %arg6 = %c0_i32, %arg7 = %cst, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_2, %arg11 = %cst_1) -> (i32, i32, i32, tensor<128x16xf32, #mma>, i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>)  : i32 {
2026-02-21T09:44:14.8100778Z       %24 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:44:14.8100959Z       %25 = arith.cmpi eq, %arg4, %c255_i32 : i32
2026-02-21T09:44:14.8101110Z       %26 = arith.select %25, %c0_i32, %24 : i32
2026-02-21T09:44:14.8101258Z       %27 = arith.cmpi eq, %26, %c0_i32 : i32
2026-02-21T09:44:14.8101405Z       %28 = arith.select %27, %c0_i32, %arg6 : i32
2026-02-21T09:44:14.8101697Z       %29:5 = scf.if %27 -> (i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32) {
2026-02-21T09:44:14.8101969Z         %64 = arith.addi %arg5, %c38912_i32 : i32
2026-02-21T09:44:14.8102138Z         %65 = arith.divsi %64, %c2048_i32 : i32
2026-02-21T09:44:14.8102280Z         %66 = arith.muli %65, %c4_i32 : i32
2026-02-21T09:44:14.8102415Z         %67 = arith.subi %c32_i32, %66 : i32
2026-02-21T09:44:14.8102554Z         %68 = arith.minsi %67, %c4_i32 : i32
2026-02-21T09:44:14.8102694Z         %69 = arith.remsi %64, %c2048_i32 : i32
2026-02-21T09:44:14.8102832Z         %70 = arith.remsi %69, %68 : i32
2026-02-21T09:44:14.8102965Z         %71 = arith.addi %66, %70 : i32
2026-02-21T09:44:14.8103097Z         %72 = arith.divsi %69, %68 : i32
2026-02-21T09:44:14.8103235Z         %73 = arith.muli %71, %c128_i32 : i32
2026-02-21T09:44:14.8103436Z         %74 = tt.splat %73 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:14.8103707Z         %75 = arith.addi %74, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:14.8103909Z         %76 = arith.muli %72, %c16_i32 : i32
2026-02-21T09:44:14.8104153Z         %77 = tt.splat %76 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:14.8104505Z         %78 = arith.addi %77, %3 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:14.8104881Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:44:14.8105186Z         %80 = arith.muli %79, %cst_3 : tensor<128x1xi32, #blocked1>
2026-02-21T09:44:14.8105416Z         %81 = tt.broadcast %80 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:44:14.8105831Z         %82 = tt.expand_dims %78 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8106362Z         %83 = tt.broadcast %82 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8106788Z         scf.yield %73, %76, %81, %83, %64 : i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32
2026-02-21T09:44:14.8107062Z       } else {
2026-02-21T09:44:14.8107334Z         scf.yield %arg8, %arg9, %arg10, %arg11, %arg5 : i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32
2026-02-21T09:44:14.8107626Z       }
2026-02-21T09:44:14.8107832Z       %30 = tt.splat %28 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:14.8108173Z       %31 = arith.addi %30, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:44:14.8108414Z       %32 = arith.muli %28, %c2_i32 : i32
2026-02-21T09:44:14.8108610Z       %33 = tt.splat %32 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:14.8108860Z       %34 = arith.addi %33, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:44:14.8109177Z       %35 = tt.expand_dims %34 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:44:14.8109517Z       %36 = tt.broadcast %35 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:44:14.8109742Z       %37 = arith.addi %29#2, %36 : tensor<128x4xi32, #blocked1>
2026-02-21T09:44:14.8109974Z       %38 = tt.addptr %7, %37 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:44:14.8110211Z       %39 = tt.load %38 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:44:14.8110523Z       %40 = ttg.convert_layout %39 : tensor<128x4xbf16, #blocked1> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:14.8110995Z       %41 = arith.extf %40 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:14.8111465Z       %42 = tt.expand_dims %31 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8111813Z       %43 = arith.muli %42, %cst_4 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8112111Z       %44 = tt.broadcast %43 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8112407Z       %45 = arith.addi %44, %29#3 : tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8112702Z       %46 = tt.addptr %8, %45 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8113004Z       %47 = tt.load %46 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8113227Z       %48 = arith.shli %47, %cst_5 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8113454Z       %49 = arith.shrsi %48, %cst_5 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8113684Z       %50 = arith.shrsi %47, %cst_5 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8113960Z       %51 = tt.expand_dims %49 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:44:14.8114285Z       %52 = tt.expand_dims %50 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:44:14.8114563Z       %53 = tt.broadcast %51 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:14.8114806Z       %54 = arith.select %13, %53, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:14.8115035Z       %55 = tt.broadcast %52 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:14.8115260Z       %56 = arith.select %15, %55, %54 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:44:14.8115476Z       %57 = tt.reshape %56 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T09:44:14.8115691Z       %58 = arith.sitofp %57 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T09:44:14.8115980Z       %59 = ttg.convert_layout %58 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:14.8116448Z       %60 = tt.dot %41, %59, %arg7, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma>
2026-02-21T09:44:14.8116796Z       %61 = arith.addi %28, %c2_i32 : i32
2026-02-21T09:44:14.8116921Z       %62 = arith.cmpi eq, %26, %c255_i32 : i32
2026-02-21T09:44:14.8117068Z       %63 = arith.select %62, %cst, %60 : tensor<128x16xf32, #mma>
2026-02-21T09:44:14.8117227Z       scf.if %62 {
2026-02-21T09:44:14.8117363Z         %64 = arith.truncf %60 : tensor<128x16xf32, #mma> to tensor<128x16xbf16, #mma>
2026-02-21T09:44:14.8117547Z         %65 = arith.extsi %29#0 : i32 to i64
2026-02-21T09:44:14.8117662Z         %66 = arith.extsi %29#1 : i32 to i64
2026-02-21T09:44:14.8117824Z         %67 = tt.splat %65 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:14.8118030Z         %68 = arith.addi %67, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:14.8118356Z         %69 = tt.expand_dims %68 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:44:14.8118636Z         %70 = arith.muli %69, %cst_8 : tensor<128x1xi64, #mma>
2026-02-21T09:44:14.8118856Z         %71 = tt.broadcast %70 : tensor<128x1xi64, #mma> -> tensor<128x16xi64, #mma>
2026-02-21T09:44:14.8119183Z         %72 = tt.splat %66 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:14.8119448Z         %73 = arith.addi %72, %18 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:14.8119730Z         %74 = tt.expand_dims %73 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi64, #mma>
2026-02-21T09:44:14.8120026Z         %75 = tt.broadcast %74 : tensor<1x16xi64, #mma> -> tensor<128x16xi64, #mma>
2026-02-21T09:44:14.8120249Z         %76 = arith.addi %71, %75 : tensor<128x16xi64, #mma>
2026-02-21T09:44:14.8120629Z         %77 = tt.addptr %16, %76 : tensor<128x16x!tt.ptr<bf16>, #mma>, tensor<128x16xi64, #mma>
2026-02-21T09:44:14.8120855Z         %78 = arith.cmpi sge, %69, %cst_9 : tensor<128x1xi64, #mma>
2026-02-21T09:44:14.8121044Z         %79 = arith.cmpi slt, %69, %cst_10 : tensor<128x1xi64, #mma>
2026-02-21T09:44:14.8121249Z         %80 = arith.andi %78, %79 : tensor<128x1xi1, #mma>
2026-02-21T09:44:14.8121449Z         %81 = tt.broadcast %80 : tensor<128x1xi1, #mma> -> tensor<128x16xi1, #mma>
2026-02-21T09:44:14.8121650Z         %82 = arith.cmpi sge, %74, %cst_11 : tensor<1x16xi64, #mma>
2026-02-21T09:44:14.8121849Z         %83 = arith.cmpi slt, %74, %cst_12 : tensor<1x16xi64, #mma>
2026-02-21T09:44:14.8122014Z         %84 = arith.andi %82, %83 : tensor<1x16xi1, #mma>
2026-02-21T09:44:14.8122219Z         %85 = tt.broadcast %84 : tensor<1x16xi1, #mma> -> tensor<128x16xi1, #mma>
2026-02-21T09:44:14.8122424Z         %86 = arith.andi %81, %85 : tensor<128x16xi1, #mma>
2026-02-21T09:44:14.8122638Z         tt.store %77, %64, %86 : tensor<128x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:14.8128940Z       }
2026-02-21T09:44:14.8129225Z       scf.yield %26, %29#4, %61, %63, %29#0, %29#1, %29#2, %29#3 : i32, i32, i32, tensor<128x16xf32, #mma>, i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:14.8129605Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32}
2026-02-21T09:44:14.8129741Z     tt.return
2026-02-21T09:44:14.8129823Z   }
2026-02-21T09:44:14.8129901Z }
2026-02-21T09:44:14.8129947Z 
2026-02-21T09:44:14.8129977Z {-#
2026-02-21T09:44:14.8130059Z   external_resources: {
2026-02-21T09:44:14.8130159Z     mlir_reproducer: {
2026-02-21T09:44:14.8131150Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:44:14.8132135Z       disable_threading: false,
2026-02-21T09:44:14.8132240Z       verify_each: true
2026-02-21T09:44:14.8132331Z     }
2026-02-21T09:44:14.8132402Z   }
2026-02-21T09:44:14.8132473Z #-}
2026-02-21T09:44:14.8132756Z /tmp/torchinductor_root/ra/crazeaaf2iq4gguqx6yt7kdjcm4mzgqgabqh3cprtf2cw7wot75s.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:44:14.8133458Z /tmp/torchinductor_root/ra/crazeaaf2iq4gguqx6yt7kdjcm4mzgqgabqh3cprtf2cw7wot75s.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:44:14.8134006Z [59s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:44:14.8134800Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:44:14.8135531Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:44:14.8135699Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:44:15.8405420Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:44:15.8410252Z #blocked = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:44:15.8411071Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:44:15.8411823Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:44:15.8412532Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:44:15.8413215Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:44:15.8413822Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:44:15.8414263Z #smem = #ttg.shared_memory
2026-02-21T09:44:15.8414816Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:44:15.8415999Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:44:15.8417099Z     %cst = arith.constant dense<4096> : tensor<256x1xi64, #mma>
2026-02-21T09:44:15.8417521Z     %cst_0 = arith.constant dense<0> : tensor<256x1xi64, #mma>
2026-02-21T09:44:15.8417933Z     %cst_1 = arith.constant dense<8192> : tensor<256x1xi64, #mma>
2026-02-21T09:44:15.8418333Z     %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #mma>
2026-02-21T09:44:15.8418718Z     %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #mma>
2026-02-21T09:44:15.8419123Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked>
2026-02-21T09:44:15.8419529Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked>
2026-02-21T09:44:15.8419939Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked>
2026-02-21T09:44:15.8420311Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:44:15.8420676Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:44:15.8421001Z     %cst_9 = arith.constant dense<0.000000e+00> : tensor<256x64xf32, #mma>
2026-02-21T09:44:15.8421278Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:44:15.8421491Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T09:44:15.8421822Z     %cst_10 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:15.8422209Z     %cst_11 = arith.constant dense<0> : tensor<1x64xi64, #blocked>
2026-02-21T09:44:15.8422564Z     %cst_12 = arith.constant dense<8192> : tensor<1x64xi64, #blocked>
2026-02-21T09:44:15.8422881Z     %cst_13 = arith.constant dense<1024> : tensor<256x1xi32, #blocked2>
2026-02-21T09:44:15.8423156Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:44:15.8423360Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:44:15.8423568Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:44:15.8423765Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:44:15.8424012Z     %cst_14 = arith.constant dense<0> : tensor<2x64xi8, #blocked>
2026-02-21T09:44:15.8424511Z     %cst_15 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked1>
2026-02-21T09:44:15.8424779Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:44:15.8424988Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:44:15.8425362Z     %cst_16 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:15.8425697Z     %0 = tt.get_program_id x : i32
2026-02-21T09:44:15.8425898Z     %1 = arith.divsi %0, %c1024_i32 : i32
2026-02-21T09:44:15.8426105Z     %2 = arith.muli %1, %c64_i32 : i32
2026-02-21T09:44:15.8426303Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:44:15.8426505Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T09:44:15.8426703Z     %5 = arith.remsi %0, %c1024_i32 : i32
2026-02-21T09:44:15.8426933Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:44:15.8427123Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:44:15.8427313Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:44:15.8427494Z     %9 = arith.muli %7, %c64_i32 : i32
2026-02-21T09:44:15.8427697Z     %10 = arith.muli %8, %c256_i32 : i32
2026-02-21T09:44:15.8428059Z     %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:44:15.8428568Z     %12 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:15.8429021Z     %13 = tt.splat %10 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:44:15.8429419Z     %14 = arith.addi %13, %11 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:44:15.8429865Z     %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:44:15.8430415Z     %16 = tt.expand_dims %14 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<256x1xi32, #blocked2>
2026-02-21T09:44:15.8430792Z     %17 = arith.muli %16, %cst_13 : tensor<256x1xi32, #blocked2>
2026-02-21T09:44:15.8431095Z     %18 = tt.broadcast %17 : tensor<256x1xi32, #blocked2> -> tensor<256x4xi32, #blocked2>
2026-02-21T09:44:15.8431404Z     %19 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<256x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:44:15.8431620Z     %20 = arith.extsi %9 : i32 to i64
2026-02-21T09:44:15.8431814Z     %21 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:44:15.8432118Z     %22 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:15.8432531Z     %23 = arith.extsi %22 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:15.8432902Z     %24 = tt.splat %20 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:44:15.8433224Z     %25 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:44:15.8433639Z     %26 = arith.extsi %25 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:44:15.8434015Z     %27 = arith.addi %24, %26 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:44:15.8434367Z     %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi64, #blocked>
2026-02-21T09:44:15.8434740Z     %29 = tt.broadcast %28 : tensor<1x64xi64, #blocked> -> tensor<2x64xi64, #blocked>
2026-02-21T09:44:15.8434997Z     %30 = arith.cmpi sge, %28, %cst_11 : tensor<1x64xi64, #blocked>
2026-02-21T09:44:15.8435214Z     %31 = arith.cmpi slt, %28, %cst_12 : tensor<1x64xi64, #blocked>
2026-02-21T09:44:15.8435427Z     %32 = arith.andi %30, %31 : tensor<1x64xi1, #blocked>
2026-02-21T09:44:15.8435660Z     %33 = tt.broadcast %32 : tensor<1x64xi1, #blocked> -> tensor<2x64xi1, #blocked>
2026-02-21T09:44:15.8436051Z     %34 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:44:15.8436607Z     %35 = tt.expand_dims %34 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:44:15.8437157Z     %36 = tt.expand_dims %35 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:44:15.8437492Z     %37 = arith.cmpi eq, %36, %cst_8 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:44:15.8437747Z     %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x64xi1, #blocked1>
2026-02-21T09:44:15.8437997Z     %39 = arith.cmpi eq, %36, %cst_7 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:44:15.8438247Z     %40 = tt.broadcast %39 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x64xi1, #blocked1>
2026-02-21T09:44:15.8438533Z     %41 = ttg.local_alloc : () -> !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable>
2026-02-21T09:44:15.8438888Z     %42 = tt.expand_dims %15 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:44:15.8439242Z     %43 = tt.broadcast %42 : tensor<1x4xi32, #blocked2> -> tensor<256x4xi32, #blocked2>
2026-02-21T09:44:15.8439493Z     %44 = arith.addi %18, %43 : tensor<256x4xi32, #blocked2>
2026-02-21T09:44:15.8439750Z     %45 = tt.addptr %19, %44 : tensor<256x4x!tt.ptr<bf16>, #blocked2>, tensor<256x4xi32, #blocked2>
2026-02-21T09:44:15.8440018Z     %46 = tt.load %45 : tensor<256x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:44:15.8440392Z     %47 = ttg.memdesc_index %41[%c0_i32] : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>
2026-02-21T09:44:15.8440869Z     ttg.local_store %46, %47 : tensor<256x4xbf16, #blocked2> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>
2026-02-21T09:44:15.8441456Z     %48:3 = scf.for %arg3 = %c0_i32 to %c510_i32 step %c2_i32 iter_args(%arg4 = %cst_9, %arg5 = %c0_i32, %arg6 = %47) -> (tensor<256x64xf32, #mma>, i32, !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>)  : i32 {
2026-02-21T09:44:15.8441895Z       %103 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:44:15.8442056Z       %104 = arith.muli %103, %c2_i32 : i32
2026-02-21T09:44:15.8442267Z       %105 = tt.splat %104 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:44:15.8442512Z       %106 = arith.addi %105, %15 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:44:15.8442916Z       %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:44:15.8443210Z       %108 = tt.broadcast %107 : tensor<1x4xi32, #blocked2> -> tensor<256x4xi32, #blocked2>
2026-02-21T09:44:15.8443414Z       %109 = arith.addi %18, %108 : tensor<256x4xi32, #blocked2>
2026-02-21T09:44:15.8443630Z       %110 = tt.addptr %19, %109 : tensor<256x4x!tt.ptr<bf16>, #blocked2>, tensor<256x4xi32, #blocked2>
2026-02-21T09:44:15.8443854Z       %111 = tt.load %110 : tensor<256x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:44:15.8444173Z       %112 = ttg.local_load %arg6 : !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:15.8444640Z       %113 = arith.extf %112 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:15.8444953Z       %114 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:44:15.8445133Z       %115 = tt.splat %114 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:15.8445363Z       %116 = arith.addi %115, %23 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:15.8445651Z       %117 = tt.expand_dims %116 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:44:15.8445922Z       %118 = arith.muli %117, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:44:15.8446119Z       %119 = tt.broadcast %118 : tensor<2x1xi64, #blocked> -> tensor<2x64xi64, #blocked>
2026-02-21T09:44:15.8446320Z       %120 = arith.addi %119, %29 : tensor<2x64xi64, #blocked>
2026-02-21T09:44:15.8448396Z       %121 = tt.addptr %21, %120 : tensor<2x64x!tt.ptr<i8>, #blocked>, tensor<2x64xi64, #blocked>
2026-02-21T09:44:15.8448610Z       %122 = arith.cmpi sge, %117, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:44:15.8448789Z       %123 = arith.cmpi slt, %117, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:44:15.8448960Z       %124 = arith.andi %122, %123 : tensor<2x1xi1, #blocked>
2026-02-21T09:44:15.8449152Z       %125 = tt.broadcast %124 : tensor<2x1xi1, #blocked> -> tensor<2x64xi1, #blocked>
2026-02-21T09:44:15.8449343Z       %126 = arith.andi %125, %33 : tensor<2x64xi1, #blocked>
2026-02-21T09:44:15.8449517Z       %127 = tt.load %121, %126, %cst_14 : tensor<2x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:44:15.8449794Z       %128 = ttg.convert_layout %127 : tensor<2x64xi8, #blocked> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:15.8450088Z       %129 = arith.shli %128, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:15.8450335Z       %130 = arith.shrsi %129, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:15.8450576Z       %131 = arith.shrsi %128, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:15.8450866Z       %132 = tt.expand_dims %130 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1>
2026-02-21T09:44:15.8451204Z       %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1>
2026-02-21T09:44:15.8451487Z       %134 = tt.broadcast %132 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1>
2026-02-21T09:44:15.8451733Z       %135 = arith.select %38, %134, %cst_15 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1>
2026-02-21T09:44:15.8451995Z       %136 = tt.broadcast %133 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1>
2026-02-21T09:44:15.8452233Z       %137 = arith.select %40, %136, %135 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1>
2026-02-21T09:44:15.8452464Z       %138 = tt.reshape %137 : tensor<2x2x64xi8, #blocked1> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:44:15.8452682Z       %139 = arith.sitofp %138 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:44:15.8452972Z       %140 = ttg.convert_layout %139 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:15.8453439Z       %141 = tt.dot %113, %140, %arg4, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x64xf32, #mma>
2026-02-21T09:44:15.8453783Z       %142 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T09:44:15.8453914Z       %143 = arith.cmpi slt, %142, %c1_i32 : i32
2026-02-21T09:44:15.8454041Z       %144 = arith.select %143, %142, %c0_i32 : i32
2026-02-21T09:44:15.8454309Z       %145 = ttg.memdesc_index %41[%144] : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>
2026-02-21T09:44:15.8454681Z       ttg.local_store %111, %145 : tensor<256x4xbf16, #blocked2> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>
2026-02-21T09:44:15.8454988Z       scf.yield %141, %144, %145 : tensor<256x64xf32, #mma>, i32, !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>
2026-02-21T09:44:15.8455237Z     } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:44:15.8455550Z     %49 = ttg.local_load %48#2 : !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:15.8456001Z     %50 = arith.extf %49 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:15.8456342Z     %51 = arith.addi %23, %cst_10 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:44:15.8456612Z     %52 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:44:15.8456851Z     %53 = arith.muli %52, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:44:15.8457031Z     %54 = tt.broadcast %53 : tensor<2x1xi64, #blocked> -> tensor<2x64xi64, #blocked>
2026-02-21T09:44:15.8457215Z     %55 = arith.addi %54, %29 : tensor<2x64xi64, #blocked>
2026-02-21T09:44:15.8457401Z     %56 = tt.addptr %21, %55 : tensor<2x64x!tt.ptr<i8>, #blocked>, tensor<2x64xi64, #blocked>
2026-02-21T09:44:15.8457595Z     %57 = arith.cmpi sge, %52, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:44:15.8457760Z     %58 = arith.cmpi slt, %52, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:44:15.8457914Z     %59 = arith.andi %57, %58 : tensor<2x1xi1, #blocked>
2026-02-21T09:44:15.8458086Z     %60 = tt.broadcast %59 : tensor<2x1xi1, #blocked> -> tensor<2x64xi1, #blocked>
2026-02-21T09:44:15.8458264Z     %61 = arith.andi %60, %33 : tensor<2x64xi1, #blocked>
2026-02-21T09:44:15.8458424Z     %62 = tt.load %56, %61, %cst_14 : tensor<2x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:44:15.8458669Z     %63 = ttg.convert_layout %62 : tensor<2x64xi8, #blocked> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:15.8458940Z     %64 = arith.shli %63, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:15.8459179Z     %65 = arith.shrsi %64, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:15.8459410Z     %66 = arith.shrsi %63, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:44:15.8459712Z     %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1>
2026-02-21T09:44:15.8460043Z     %68 = tt.expand_dims %66 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1>
2026-02-21T09:44:15.8460320Z     %69 = tt.broadcast %67 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1>
2026-02-21T09:44:15.8460554Z     %70 = arith.select %38, %69, %cst_15 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1>
2026-02-21T09:44:15.8460787Z     %71 = tt.broadcast %68 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1>
2026-02-21T09:44:15.8461015Z     %72 = arith.select %40, %71, %70 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1>
2026-02-21T09:44:15.8461238Z     %73 = tt.reshape %72 : tensor<2x2x64xi8, #blocked1> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:44:15.8461446Z     %74 = arith.sitofp %73 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:44:15.8461736Z     %75 = ttg.convert_layout %74 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:44:15.8462188Z     %76 = tt.dot %50, %75, %48#0, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x64xf32, #mma>
2026-02-21T09:44:15.8462581Z     ttg.local_dealloc %41 : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable>
2026-02-21T09:44:15.8462790Z     %77 = arith.truncf %76 : tensor<256x64xf32, #mma> to tensor<256x64xbf16, #mma>
2026-02-21T09:44:15.8462951Z     %78 = arith.extsi %10 : i32 to i64
2026-02-21T09:44:15.8463104Z     %79 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<256x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:15.8463302Z     %80 = tt.splat %78 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:15.8463590Z     %81 = arith.extsi %12 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:15.8463865Z     %82 = arith.addi %80, %81 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:44:15.8464137Z     %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma>
2026-02-21T09:44:15.8464371Z     %84 = arith.muli %83, %cst_1 : tensor<256x1xi64, #mma>
2026-02-21T09:44:15.8464543Z     %85 = tt.broadcast %84 : tensor<256x1xi64, #mma> -> tensor<256x64xi64, #mma>
2026-02-21T09:44:15.8464742Z     %86 = tt.splat %20 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:15.8464977Z     %87 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:15.8465272Z     %88 = arith.extsi %87 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:15.8465539Z     %89 = arith.addi %86, %88 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:44:15.8465788Z     %90 = tt.expand_dims %89 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi64, #mma>
2026-02-21T09:44:15.8466038Z     %91 = tt.broadcast %90 : tensor<1x64xi64, #mma> -> tensor<256x64xi64, #mma>
2026-02-21T09:44:15.8466209Z     %92 = arith.addi %85, %91 : tensor<256x64xi64, #mma>
2026-02-21T09:44:15.8466388Z     %93 = tt.addptr %79, %92 : tensor<256x64x!tt.ptr<bf16>, #mma>, tensor<256x64xi64, #mma>
2026-02-21T09:44:15.8466581Z     %94 = arith.cmpi sge, %83, %cst_0 : tensor<256x1xi64, #mma>
2026-02-21T09:44:15.8466739Z     %95 = arith.cmpi slt, %83, %cst : tensor<256x1xi64, #mma>
2026-02-21T09:44:15.8466889Z     %96 = arith.andi %94, %95 : tensor<256x1xi1, #mma>
2026-02-21T09:44:15.8467051Z     %97 = tt.broadcast %96 : tensor<256x1xi1, #mma> -> tensor<256x64xi1, #mma>
2026-02-21T09:44:15.8467229Z     %98 = arith.cmpi sge, %90, %cst_3 : tensor<1x64xi64, #mma>
2026-02-21T09:44:15.8467386Z     %99 = arith.cmpi slt, %90, %cst_2 : tensor<1x64xi64, #mma>
2026-02-21T09:44:15.8467550Z     %100 = arith.andi %98, %99 : tensor<1x64xi1, #mma>
2026-02-21T09:44:15.8467717Z     %101 = tt.broadcast %100 : tensor<1x64xi1, #mma> -> tensor<256x64xi1, #mma>
2026-02-21T09:44:15.8467892Z     %102 = arith.andi %97, %101 : tensor<256x64xi1, #mma>
2026-02-21T09:44:15.8468048Z     tt.store %93, %77, %102 : tensor<256x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:44:15.8468181Z     tt.return
2026-02-21T09:44:15.8468261Z   }
2026-02-21T09:44:15.8468333Z }
2026-02-21T09:44:15.8468377Z 
2026-02-21T09:44:15.8468408Z {-#
2026-02-21T09:44:15.8468488Z   external_resources: {
2026-02-21T09:44:15.8468585Z     mlir_reproducer: {
2026-02-21T09:44:15.8469580Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:44:15.8470583Z       disable_threading: false,
2026-02-21T09:44:15.8470687Z       verify_each: true
2026-02-21T09:44:15.8470778Z     }
2026-02-21T09:44:15.8470848Z   }
2026-02-21T09:44:15.8470918Z #-}
2026-02-21T09:44:15.8471201Z /tmp/torchinductor_root/a4/ca4zjozlac45fwgzml4rubpym6qwmt76ix5wkhh3pvepmzy2uwlu.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:44:15.8471883Z /tmp/torchinductor_root/a4/ca4zjozlac45fwgzml4rubpym6qwmt76ix5wkhh3pvepmzy2uwlu.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:44:15.8472444Z [60s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:44:15.8473162Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 256, 64], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:44:15.8473826Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:44:15.8473993Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:44:25.7480869Z Initial population exploring neighbors  44% ━━━━━━          44/100 1.7 configs/s
2026-02-21T09:44:25.7485055Z WARNING:tritonbench.utils.triton_op:Completed input ID 17:
2026-02-21T09:44:25.7485499Z x_val
2026-02-21T09:44:25.7485775Z ---------------------
2026-02-21T09:44:25.7486061Z (1, 4096, 8192, 1024)
2026-02-21T09:44:25.7486222Z 
2026-02-21T09:44:25.7503274Z  60%|██████    | 6/10 [41:46<23:33, 353.43s/it]WARNING:tritonbench.utils.triton_op:Running input ID 21:
2026-02-21T09:44:25.7503680Z x_val
2026-02-21T09:44:25.7503846Z ---------------------
2026-02-21T09:44:25.7504059Z (4, 4096, 8192, 1024)
2026-02-21T09:44:25.7507186Z INFO:tritonbench.utils.triton_op:Took 0.25ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T09:44:26.7812439Z INFO:tritonbench.utils.triton_op:Took 4.46ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T09:44:28.9459036Z INFO:tritonbench.utils.triton_op:Took 0.17ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T09:44:28.9468760Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:44:28.9469106Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:44:28.9469302Z               'dtype': 'torch.bfloat16',
2026-02-21T09:44:28.9469516Z               'shape': (4, 4096, 1024),
2026-02-21T09:44:28.9470152Z               'stride': (4194304, 1024, 1)},
2026-02-21T09:44:28.9470319Z             { 'device': 'cuda:0',
2026-02-21T09:44:28.9470487Z               'dtype': 'torch.int32',
2026-02-21T09:44:28.9470642Z               'shape': (1024, 8192),
2026-02-21T09:44:28.9470801Z               'stride': (8192, 1)}),
2026-02-21T09:44:28.9470965Z   'kwargs': {}}
2026-02-21T09:44:28.9487560Z INFO:tritonbench.utils.triton_op:Took 2.02ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T09:44:29.1300996Z [0s] Autotune random seed: 2138032649
2026-02-21T09:44:29.1598284Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:45:02.9298681Z [33s] Timeout after 30s compiling Config(block_sizes=[64, 256, 2], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T09:45:11.6856896Z [42s] Timeout after 30s compiling Config(block_sizes=[32, 2048, 2], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:45:15.6222341Z [46s] Timeout after 30s compiling Config(block_sizes=[16, 4096, 1], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T09:45:15.6246816Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.4 configs/s
2026-02-21T09:45:36.3470003Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:45:36.3473659Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:45:36.3474152Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:45:36.3474625Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:45:36.3475084Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:45:36.3475504Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:45:36.3477277Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:45:36.3477560Z #smem = #ttg.shared_memory
2026-02-21T09:45:36.3477945Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:45:36.3478634Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:45:36.3479327Z     %cst = arith.constant dense<8192> : tensor<1x512xi64, #mma>
2026-02-21T09:45:36.3479592Z     %cst_0 = arith.constant dense<0> : tensor<1x512xi64, #mma>
2026-02-21T09:45:36.3479843Z     %cst_1 = arith.constant dense<16384> : tensor<32x1xi64, #mma>
2026-02-21T09:45:36.3480411Z     %cst_2 = arith.constant dense<0> : tensor<32x1xi64, #mma>
2026-02-21T09:45:36.3480661Z     %cst_3 = arith.constant dense<8192> : tensor<32x1xi64, #mma>
2026-02-21T09:45:36.3480935Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:36.3481215Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:36.3481608Z     %cst_6 = arith.constant dense<0.000000e+00> : tensor<32x512xf32, #mma>
2026-02-21T09:45:36.3481917Z     %cst_7 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1>
2026-02-21T09:45:36.3482118Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:45:36.3482263Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:45:36.3482419Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:45:36.3482670Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:45:36.3482818Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:45:36.3482962Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:45:36.3483099Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:45:36.3483250Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:45:36.3483440Z     %cst_8 = arith.constant dense<0> : tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3483635Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:45:36.3483782Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:45:36.3483989Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:45:36.3484211Z     %cst_9 = arith.constant dense<4> : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3484456Z     %0 = tt.get_program_id x : i32
2026-02-21T09:45:36.3484601Z     %1 = arith.divsi %0, %c4096_i32 : i32
2026-02-21T09:45:36.3484748Z     %2 = arith.muli %1, %c8_i32 : i32
2026-02-21T09:45:36.3484889Z     %3 = arith.subi %c16_i32, %2 : i32
2026-02-21T09:45:36.3485035Z     %4 = arith.minsi %3, %c8_i32 : i32
2026-02-21T09:45:36.3485177Z     %5 = arith.remsi %0, %c4096_i32 : i32
2026-02-21T09:45:36.3485326Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:45:36.3485539Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:45:36.3485678Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:45:36.3485814Z     %9 = arith.muli %7, %c512_i32 : i32
2026-02-21T09:45:36.3486155Z     %10 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:45:36.3486514Z     %11 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:36.3486826Z     %12 = tt.splat %9 : i32 -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:45:36.3487113Z     %13 = arith.addi %12, %10 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:45:36.3487332Z     %14 = arith.muli %8, %c32_i32 : i32
2026-02-21T09:45:36.3487578Z     %15 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:36.3487912Z     %16 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:36.3488209Z     %17 = tt.splat %14 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:36.3488484Z     %18 = arith.addi %17, %15 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:36.3488786Z     %19 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:36.3489171Z     %20 = tt.expand_dims %18 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:45:36.3489481Z     %21 = arith.muli %20, %cst_7 : tensor<32x1xi32, #blocked1>
2026-02-21T09:45:36.3489717Z     %22 = tt.broadcast %21 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3489986Z     %23 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:36.3490318Z     %24 = tt.expand_dims %13 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3490676Z     %25 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T09:45:36.3491019Z     %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:45:36.3491451Z     %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:45:36.3491857Z     %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:36.3492112Z     %29 = arith.cmpi eq, %28, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:36.3492307Z     %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked>
2026-02-21T09:45:36.3492530Z     %31 = arith.cmpi eq, %28, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:36.3492728Z     %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x512xi1, #blocked>
2026-02-21T09:45:36.3492996Z     %33 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst_6) -> (tensor<32x512xf32, #mma>)  : i32 {
2026-02-21T09:45:36.3493216Z       %60 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:45:36.3493392Z       %61 = tt.splat %60 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:36.3493610Z       %62 = arith.addi %61, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:36.3493882Z       %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:36.3494157Z       %64 = tt.broadcast %63 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3494346Z       %65 = arith.addi %22, %64 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3494567Z       %66 = tt.addptr %23, %65 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3494771Z       %67 = tt.load %66 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:36.3495059Z       %68 = ttg.convert_layout %67 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3495459Z       %69 = arith.extf %68 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3495744Z       %70 = arith.muli %arg3, %c8192_i32 : i32
2026-02-21T09:45:36.3495892Z       %71 = tt.splat %70 : i32 -> tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3496049Z       %72 = arith.addi %71, %24 : tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3496247Z       %73 = tt.addptr %25, %72 : tensor<1x512x!tt.ptr<i8>, #blocked2>, tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3496446Z       %74 = tt.load %73 : tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T09:45:36.3496692Z       %75 = ttg.convert_layout %74 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3496975Z       %76 = arith.shli %75, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3497208Z       %77 = arith.shrsi %76, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3497448Z       %78 = arith.shrsi %75, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3497737Z       %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:45:36.3498077Z       %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:45:36.3498363Z       %81 = tt.broadcast %79 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3498626Z       %82 = arith.select %30, %81, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3498867Z       %83 = tt.broadcast %80 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3499095Z       %84 = arith.select %32, %83, %82 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3499339Z       %85 = tt.reshape %84 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3>
2026-02-21T09:45:36.3499562Z       %86 = arith.sitofp %85 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3>
2026-02-21T09:45:36.3499808Z       %87 = ttg.local_alloc %86 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem>
2026-02-21T09:45:36.3500135Z       %88 = ttg.local_load %87 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3500606Z       %89 = tt.dot %69, %88, %arg4, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma>
2026-02-21T09:45:36.3500953Z       %90 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:45:36.3501083Z       %91 = arith.muli %90, %c2_i32 : i32
2026-02-21T09:45:36.3501249Z       %92 = tt.splat %91 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:36.3501471Z       %93 = arith.addi %92, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:36.3501742Z       %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:36.3502016Z       %95 = tt.broadcast %94 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3502212Z       %96 = arith.addi %22, %95 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3502407Z       %97 = tt.addptr %23, %96 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3502629Z       %98 = tt.load %97 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:36.3502884Z       %99 = ttg.convert_layout %98 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3503299Z       %100 = arith.extf %99 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3503581Z       %101 = arith.muli %90, %c8192_i32 : i32
2026-02-21T09:45:36.3503720Z       %102 = tt.splat %101 : i32 -> tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3503879Z       %103 = arith.addi %102, %24 : tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3504075Z       %104 = tt.addptr %25, %103 : tensor<1x512x!tt.ptr<i8>, #blocked2>, tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3504277Z       %105 = tt.load %104 : tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T09:45:36.3504526Z       %106 = ttg.convert_layout %105 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3504806Z       %107 = arith.shli %106, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3505043Z       %108 = arith.shrsi %107, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3505275Z       %109 = arith.shrsi %106, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3505568Z       %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:45:36.3505905Z       %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:45:36.3506189Z       %112 = tt.broadcast %110 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3506432Z       %113 = arith.select %30, %112, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3506690Z       %114 = tt.broadcast %111 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3506925Z       %115 = arith.select %32, %114, %113 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3507158Z       %116 = tt.reshape %115 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3>
2026-02-21T09:45:36.3507393Z       %117 = arith.sitofp %116 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3>
2026-02-21T09:45:36.3507649Z       %118 = ttg.local_alloc %117 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem>
2026-02-21T09:45:36.3507972Z       %119 = ttg.local_load %118 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3508440Z       %120 = tt.dot %100, %119, %89, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma>
2026-02-21T09:45:36.3508781Z       %121 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:45:36.3508903Z       %122 = arith.muli %121, %c2_i32 : i32
2026-02-21T09:45:36.3509071Z       %123 = tt.splat %122 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:36.3509295Z       %124 = arith.addi %123, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:36.3509570Z       %125 = tt.expand_dims %124 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:36.3509845Z       %126 = tt.broadcast %125 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3510039Z       %127 = arith.addi %22, %126 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3510238Z       %128 = tt.addptr %23, %127 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3510478Z       %129 = tt.load %128 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:36.3510741Z       %130 = ttg.convert_layout %129 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3511155Z       %131 = arith.extf %130 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3511433Z       %132 = arith.muli %121, %c8192_i32 : i32
2026-02-21T09:45:36.3511576Z       %133 = tt.splat %132 : i32 -> tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3511731Z       %134 = arith.addi %133, %24 : tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3511930Z       %135 = tt.addptr %25, %134 : tensor<1x512x!tt.ptr<i8>, #blocked2>, tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3512133Z       %136 = tt.load %135 : tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T09:45:36.3512374Z       %137 = ttg.convert_layout %136 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3512655Z       %138 = arith.shli %137, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3512888Z       %139 = arith.shrsi %138, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3513123Z       %140 = arith.shrsi %137, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3513410Z       %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:45:36.3513746Z       %142 = tt.expand_dims %140 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:45:36.3514032Z       %143 = tt.broadcast %141 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3514270Z       %144 = arith.select %30, %143, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3514528Z       %145 = tt.broadcast %142 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3514761Z       %146 = arith.select %32, %145, %144 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3514991Z       %147 = tt.reshape %146 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3>
2026-02-21T09:45:36.3515233Z       %148 = arith.sitofp %147 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3>
2026-02-21T09:45:36.3515484Z       %149 = ttg.local_alloc %148 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem>
2026-02-21T09:45:36.3515807Z       %150 = ttg.local_load %149 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3516276Z       %151 = tt.dot %131, %150, %120, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma>
2026-02-21T09:45:36.3516614Z       %152 = arith.addi %arg3, %c3_i32 : i32
2026-02-21T09:45:36.3516737Z       %153 = arith.muli %152, %c2_i32 : i32
2026-02-21T09:45:36.3516907Z       %154 = tt.splat %153 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:36.3517133Z       %155 = arith.addi %154, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:36.3517408Z       %156 = tt.expand_dims %155 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:36.3517681Z       %157 = tt.broadcast %156 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3517873Z       %158 = arith.addi %22, %157 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3518077Z       %159 = tt.addptr %23, %158 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:36.3518297Z       %160 = tt.load %159 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:36.3518561Z       %161 = ttg.convert_layout %160 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3518968Z       %162 = arith.extf %161 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3519253Z       %163 = arith.muli %152, %c8192_i32 : i32
2026-02-21T09:45:36.3519396Z       %164 = tt.splat %163 : i32 -> tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3519550Z       %165 = arith.addi %164, %24 : tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3519746Z       %166 = tt.addptr %25, %165 : tensor<1x512x!tt.ptr<i8>, #blocked2>, tensor<1x512xi32, #blocked2>
2026-02-21T09:45:36.3519946Z       %167 = tt.load %166 : tensor<1x512x!tt.ptr<i8>, #blocked2>
2026-02-21T09:45:36.3520188Z       %168 = ttg.convert_layout %167 : tensor<1x512xi8, #blocked2> -> tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3520470Z       %169 = arith.shli %168, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3520707Z       %170 = arith.shrsi %169, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3520941Z       %171 = arith.shrsi %168, %cst_9 : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:36.3521229Z       %172 = tt.expand_dims %170 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:45:36.3521567Z       %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<1x512xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x512xi8, #blocked>
2026-02-21T09:45:36.3521852Z       %174 = tt.broadcast %172 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3522091Z       %175 = arith.select %30, %174, %cst_8 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3522351Z       %176 = tt.broadcast %173 : tensor<1x1x512xi8, #blocked> -> tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3522642Z       %177 = arith.select %32, %176, %175 : tensor<1x2x512xi1, #blocked>, tensor<1x2x512xi8, #blocked>
2026-02-21T09:45:36.3522877Z       %178 = tt.reshape %177 : tensor<1x2x512xi8, #blocked> -> tensor<2x512xi8, #blocked3>
2026-02-21T09:45:36.3523130Z       %179 = arith.sitofp %178 : tensor<2x512xi8, #blocked3> to tensor<2x512xf32, #blocked3>
2026-02-21T09:45:36.3523379Z       %180 = ttg.local_alloc %179 : (tensor<2x512xf32, #blocked3>) -> !ttg.memdesc<2x512xf32, #shared, #smem>
2026-02-21T09:45:36.3523703Z       %181 = ttg.local_load %180 : !ttg.memdesc<2x512xf32, #shared, #smem> -> tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:36.3524171Z       %182 = tt.dot %162, %181, %151, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x512xf32, #mma>
2026-02-21T09:45:36.3524516Z       scf.yield %182 : tensor<32x512xf32, #mma>
2026-02-21T09:45:36.3524643Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:45:36.3524796Z     %34 = arith.truncf %33 : tensor<32x512xf32, #mma> to tensor<32x512xbf16, #mma>
2026-02-21T09:45:36.3524963Z     %35 = arith.extsi %14 : i32 to i64
2026-02-21T09:45:36.3525077Z     %36 = arith.extsi %9 : i32 to i64
2026-02-21T09:45:36.3525231Z     %37 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:45:36.3525428Z     %38 = tt.splat %35 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:36.3525698Z     %39 = arith.extsi %16 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:36.3525965Z     %40 = arith.addi %38, %39 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:36.3526234Z     %41 = tt.expand_dims %40 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:45:36.3526466Z     %42 = arith.muli %41, %cst_3 : tensor<32x1xi64, #mma>
2026-02-21T09:45:36.3526653Z     %43 = tt.broadcast %42 : tensor<32x1xi64, #mma> -> tensor<32x512xi64, #mma>
2026-02-21T09:45:36.3526855Z     %44 = tt.splat %36 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:36.3527128Z     %45 = arith.extsi %11 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:36.3527396Z     %46 = arith.addi %44, %45 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:36.3527655Z     %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:45:36.3527906Z     %48 = tt.broadcast %47 : tensor<1x512xi64, #mma> -> tensor<32x512xi64, #mma>
2026-02-21T09:45:36.3528080Z     %49 = arith.addi %43, %48 : tensor<32x512xi64, #mma>
2026-02-21T09:45:36.3528264Z     %50 = tt.addptr %37, %49 : tensor<32x512x!tt.ptr<bf16>, #mma>, tensor<32x512xi64, #mma>
2026-02-21T09:45:36.3528454Z     %51 = arith.cmpi sge, %41, %cst_2 : tensor<32x1xi64, #mma>
2026-02-21T09:45:36.3528615Z     %52 = arith.cmpi slt, %41, %cst_1 : tensor<32x1xi64, #mma>
2026-02-21T09:45:36.3528765Z     %53 = arith.andi %51, %52 : tensor<32x1xi1, #mma>
2026-02-21T09:45:36.3528930Z     %54 = tt.broadcast %53 : tensor<32x1xi1, #mma> -> tensor<32x512xi1, #mma>
2026-02-21T09:45:36.3529105Z     %55 = arith.cmpi sge, %47, %cst_0 : tensor<1x512xi64, #mma>
2026-02-21T09:45:36.3529266Z     %56 = arith.cmpi slt, %47, %cst : tensor<1x512xi64, #mma>
2026-02-21T09:45:36.3529417Z     %57 = arith.andi %55, %56 : tensor<1x512xi1, #mma>
2026-02-21T09:45:36.3529581Z     %58 = tt.broadcast %57 : tensor<1x512xi1, #mma> -> tensor<32x512xi1, #mma>
2026-02-21T09:45:36.3529751Z     %59 = arith.andi %54, %58 : tensor<32x512xi1, #mma>
2026-02-21T09:45:36.3529903Z     tt.store %50, %34, %59 : tensor<32x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:45:36.3530060Z     tt.return
2026-02-21T09:45:36.3530141Z   }
2026-02-21T09:45:36.3530223Z }
2026-02-21T09:45:36.3530266Z 
2026-02-21T09:45:36.3530302Z {-#
2026-02-21T09:45:36.3530382Z   external_resources: {
2026-02-21T09:45:36.3530481Z     mlir_reproducer: {
2026-02-21T09:45:36.3531478Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:45:36.3532486Z       disable_threading: false,
2026-02-21T09:45:36.3532593Z       verify_each: true
2026-02-21T09:45:36.3532683Z     }
2026-02-21T09:45:36.3532756Z   }
2026-02-21T09:45:36.3532824Z #-}
2026-02-21T09:45:36.3533102Z /tmp/torchinductor_root/ly/cly5ppem7bczabonqsmt6tvx5rfi6ahp2vmwbcypndonimgii6tk.py:12:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:45:36.3533785Z /tmp/torchinductor_root/ly/cly5ppem7bczabonqsmt6tvx5rfi6ahp2vmwbcypndonimgii6tk.py:12:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:45:36.3534335Z [67s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:45:36.3535076Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 512], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:45:36.3535743Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:45:36.3535908Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:45:38.0702895Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:45:38.0711427Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:45:38.0712168Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:45:38.0712799Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:45:38.0713363Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [8, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:45:38.0714004Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:45:38.0714951Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:45:38.0715729Z     %cst = arith.constant dense<0.000000e+00> : tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0716044Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:45:38.0716279Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:45:38.0716528Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:45:38.0716779Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T09:45:38.0717028Z     %c216_i32 = arith.constant 216 : i32
2026-02-21T09:45:38.0717428Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:45:38.0717655Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:45:38.0717880Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:45:38.0718122Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:45:38.0718473Z     %cst_0 = arith.constant dense<0> : tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0718773Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:45:38.0718991Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:45:38.0719225Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:45:38.0719462Z     %c255_i32 = arith.constant 255 : i32
2026-02-21T09:45:38.0719831Z     %cst_1 = arith.constant dense<0> : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0720271Z     %cst_2 = arith.constant dense<0> : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0720570Z     %c-1_i32 = arith.constant -1 : i32
2026-02-21T09:45:38.0720859Z     %cst_3 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1>
2026-02-21T09:45:38.0721193Z     %cst_4 = arith.constant dense<4> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0721534Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:38.0721803Z     %cst_6 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:38.0722088Z     %cst_7 = arith.constant dense<8192> : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0722360Z     %cst_8 = arith.constant dense<0> : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0722713Z     %cst_9 = arith.constant dense<16384> : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0722990Z     %cst_10 = arith.constant dense<0> : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0723260Z     %cst_11 = arith.constant dense<8192> : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0723496Z     %0 = tt.get_program_id x : i32
2026-02-21T09:45:38.0723688Z     %1 = arith.muli %0, %c216_i32 : i32
2026-02-21T09:45:38.0723876Z     %2 = arith.addi %1, %c216_i32 : i32
2026-02-21T09:45:38.0729219Z     %3 = arith.minsi %2, %c131072_i32 : i32
2026-02-21T09:45:38.0729566Z     %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:38.0730070Z     %5 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0730561Z     %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:38.0731051Z     %7 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0731467Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0731817Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:38.0732107Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0732486Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:45:38.0732993Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:45:38.0733483Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:38.0733794Z     %14 = arith.cmpi eq, %13, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:38.0734033Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked>
2026-02-21T09:45:38.0734360Z     %16 = arith.cmpi eq, %13, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:38.0734596Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked>
2026-02-21T09:45:38.0734878Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:45:38.0735207Z     %19 = arith.extsi %5 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0735616Z     %20 = arith.extsi %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0735920Z     %21 = arith.subi %3, %1 : i32
2026-02-21T09:45:38.0736053Z     %22 = arith.remsi %21, %c4_i32 : i32
2026-02-21T09:45:38.0736201Z     %23 = arith.subi %21, %22 : i32
2026-02-21T09:45:38.0736338Z     %24 = arith.addi %1, %23 : i32
2026-02-21T09:45:38.0736489Z     scf.for %arg3 = %1 to %24 step %c4_i32  : i32 {
2026-02-21T09:45:38.0736655Z       %29 = arith.divsi %arg3, %c1024_i32 : i32
2026-02-21T09:45:38.0736808Z       %30 = arith.muli %29, %c2_i32 : i32
2026-02-21T09:45:38.0736959Z       %31 = arith.subi %c256_i32, %30 : i32
2026-02-21T09:45:38.0737106Z       %32 = arith.minsi %31, %c2_i32 : i32
2026-02-21T09:45:38.0737263Z       %33 = arith.remsi %arg3, %c1024_i32 : i32
2026-02-21T09:45:38.0737409Z       %34 = arith.remsi %33, %32 : i32
2026-02-21T09:45:38.0737552Z       %35 = arith.addi %30, %34 : i32
2026-02-21T09:45:38.0737684Z       %36 = arith.divsi %33, %32 : i32
2026-02-21T09:45:38.0737831Z       %37 = arith.muli %35, %c32_i32 : i32
2026-02-21T09:45:38.0738095Z       %38 = tt.splat %37 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:38.0738456Z       %39 = arith.addi %38, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:38.0738712Z       %40 = arith.muli %36, %c32_i32 : i32
2026-02-21T09:45:38.0738918Z       %41 = tt.splat %40 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:38.0739215Z       %42 = arith.addi %41, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:38.0739560Z       %43 = tt.expand_dims %42 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:45:38.0739884Z       %44 = arith.muli %43, %cst_3 : tensor<32x1xi32, #blocked1>
2026-02-21T09:45:38.0740124Z       %45 = tt.broadcast %44 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0740545Z       %46 = tt.expand_dims %39 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0740975Z       %47 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T09:45:38.0741195Z         %200 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:45:38.0741373Z         %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0741606Z         %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0741884Z         %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:38.0742167Z         %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0742370Z         %205 = arith.addi %45, %204 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0742573Z         %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0742786Z         %207 = tt.load %206 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:38.0743055Z         %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0743467Z         %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0743782Z         %210 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:45:38.0743966Z         %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0744203Z         %212 = arith.addi %211, %46 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0744531Z         %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0744848Z         %214 = tt.load %213 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0745088Z         %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0745329Z         %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0745576Z         %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0745868Z         %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0746211Z         %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0746503Z         %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0746743Z         %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0746985Z         %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0747220Z         %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0747473Z         %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:38.0747705Z         %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:38.0748002Z         %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0748496Z         %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0748855Z         %228 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:45:38.0748984Z         %229 = arith.muli %228, %c2_i32 : i32
2026-02-21T09:45:38.0749164Z         %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0749391Z         %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0749677Z         %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:38.0749963Z         %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0750161Z         %234 = arith.addi %45, %233 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0750366Z         %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0750574Z         %236 = tt.load %235 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:38.0750847Z         %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0751252Z         %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0751540Z         %239 = arith.muli %228, %c8192_i32 : i32
2026-02-21T09:45:38.0751744Z         %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0751974Z         %241 = arith.addi %240, %46 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0752299Z         %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0752628Z         %243 = tt.load %242 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0752862Z         %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0753103Z         %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0753341Z         %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0753642Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0753982Z         %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0754267Z         %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0754512Z         %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0754750Z         %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0754988Z         %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0755225Z         %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:38.0755453Z         %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:38.0755783Z         %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0756261Z         %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0756613Z         scf.yield %256 : tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0756745Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:45:38.0756904Z       %48 = arith.truncf %47 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T09:45:38.0757075Z       %49 = arith.extsi %40 : i32 to i64
2026-02-21T09:45:38.0757195Z       %50 = arith.extsi %37 : i32 to i64
2026-02-21T09:45:38.0757360Z       %51 = tt.splat %49 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0757570Z       %52 = arith.addi %51, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0757828Z       %53 = tt.expand_dims %52 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0758069Z       %54 = arith.muli %53, %cst_7 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0758243Z       %55 = tt.broadcast %54 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0758447Z       %56 = tt.splat %50 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0758648Z       %57 = arith.addi %56, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0758914Z       %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0759170Z       %59 = tt.broadcast %58 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0759345Z       %60 = arith.addi %55, %59 : tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0759550Z       %61 = tt.addptr %18, %60 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0759747Z       %62 = arith.cmpi sge, %53, %cst_8 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0759916Z       %63 = arith.cmpi slt, %53, %cst_9 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0760075Z       %64 = arith.andi %62, %63 : tensor<32x1xi1, #mma>
2026-02-21T09:45:38.0760255Z       %65 = tt.broadcast %64 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0760440Z       %66 = arith.cmpi sge, %58, %cst_10 : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0760602Z       %67 = arith.cmpi slt, %58, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0760759Z       %68 = arith.andi %66, %67 : tensor<1x32xi1, #mma>
2026-02-21T09:45:38.0760925Z       %69 = tt.broadcast %68 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0761102Z       %70 = arith.andi %65, %69 : tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0761258Z       tt.store %61, %48, %70 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:45:38.0761408Z       %71 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:45:38.0761536Z       %72 = arith.divsi %71, %c1024_i32 : i32
2026-02-21T09:45:38.0761660Z       %73 = arith.muli %72, %c2_i32 : i32
2026-02-21T09:45:38.0761784Z       %74 = arith.subi %c256_i32, %73 : i32
2026-02-21T09:45:38.0761905Z       %75 = arith.minsi %74, %c2_i32 : i32
2026-02-21T09:45:38.0762033Z       %76 = arith.remsi %71, %c1024_i32 : i32
2026-02-21T09:45:38.0762155Z       %77 = arith.remsi %76, %75 : i32
2026-02-21T09:45:38.0762274Z       %78 = arith.addi %73, %77 : i32
2026-02-21T09:45:38.0762392Z       %79 = arith.divsi %76, %75 : i32
2026-02-21T09:45:38.0762508Z       %80 = arith.muli %78, %c32_i32 : i32
2026-02-21T09:45:38.0762770Z       %81 = tt.splat %80 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:38.0763068Z       %82 = arith.addi %81, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:38.0763303Z       %83 = arith.muli %79, %c32_i32 : i32
2026-02-21T09:45:38.0763479Z       %84 = tt.splat %83 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:38.0763716Z       %85 = arith.addi %84, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:38.0763999Z       %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:45:38.0764249Z       %87 = arith.muli %86, %cst_3 : tensor<32x1xi32, #blocked1>
2026-02-21T09:45:38.0764441Z       %88 = tt.broadcast %87 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0764793Z       %89 = tt.expand_dims %82 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0765184Z       %90 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T09:45:38.0765406Z         %200 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:45:38.0765583Z         %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0765815Z         %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0766097Z         %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:38.0766378Z         %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0766578Z         %205 = arith.addi %88, %204 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0766783Z         %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0766997Z         %207 = tt.load %206 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:38.0767297Z         %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0767700Z         %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0768007Z         %210 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:45:38.0768189Z         %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0768422Z         %212 = arith.addi %211, %89 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0768736Z         %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0769045Z         %214 = tt.load %213 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0769288Z         %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0769525Z         %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0769769Z         %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0770075Z         %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0770414Z         %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0770706Z         %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0770949Z         %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0771207Z         %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0771448Z         %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0771698Z         %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:38.0771927Z         %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:38.0772227Z         %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0772698Z         %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0773049Z         %228 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:45:38.0773177Z         %229 = arith.muli %228, %c2_i32 : i32
2026-02-21T09:45:38.0773358Z         %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0773584Z         %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0773873Z         %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:38.0774161Z         %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0774357Z         %234 = arith.addi %88, %233 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0774565Z         %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0782935Z         %236 = tt.load %235 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:38.0783225Z         %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0783696Z         %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0783987Z         %239 = arith.muli %228, %c8192_i32 : i32
2026-02-21T09:45:38.0784170Z         %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0784429Z         %241 = arith.addi %240, %89 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0784743Z         %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0785059Z         %243 = tt.load %242 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0785298Z         %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0785542Z         %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0785788Z         %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0786084Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0786426Z         %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0786714Z         %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0786955Z         %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0787198Z         %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0787444Z         %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0787679Z         %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:38.0787927Z         %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:38.0788223Z         %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0788689Z         %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0789040Z         scf.yield %256 : tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0789175Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:45:38.0789338Z       %91 = arith.truncf %90 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T09:45:38.0789508Z       %92 = arith.extsi %83 : i32 to i64
2026-02-21T09:45:38.0789633Z       %93 = arith.extsi %80 : i32 to i64
2026-02-21T09:45:38.0789793Z       %94 = tt.splat %92 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0790006Z       %95 = arith.addi %94, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0790270Z       %96 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0790511Z       %97 = arith.muli %96, %cst_7 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0790690Z       %98 = tt.broadcast %97 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0790890Z       %99 = tt.splat %93 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0791101Z       %100 = arith.addi %99, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0791366Z       %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0791649Z       %102 = tt.broadcast %101 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0791834Z       %103 = arith.addi %98, %102 : tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0792023Z       %104 = tt.addptr %18, %103 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0792254Z       %105 = arith.cmpi sge, %96, %cst_8 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0792420Z       %106 = arith.cmpi slt, %96, %cst_9 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0792583Z       %107 = arith.andi %105, %106 : tensor<32x1xi1, #mma>
2026-02-21T09:45:38.0792761Z       %108 = tt.broadcast %107 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0792951Z       %109 = arith.cmpi sge, %101, %cst_10 : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0793123Z       %110 = arith.cmpi slt, %101, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0793285Z       %111 = arith.andi %109, %110 : tensor<1x32xi1, #mma>
2026-02-21T09:45:38.0793464Z       %112 = tt.broadcast %111 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0793643Z       %113 = arith.andi %108, %112 : tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0793809Z       tt.store %104, %91, %113 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:45:38.0793966Z       %114 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:45:38.0794095Z       %115 = arith.divsi %114, %c1024_i32 : i32
2026-02-21T09:45:38.0794225Z       %116 = arith.muli %115, %c2_i32 : i32
2026-02-21T09:45:38.0794350Z       %117 = arith.subi %c256_i32, %116 : i32
2026-02-21T09:45:38.0794476Z       %118 = arith.minsi %117, %c2_i32 : i32
2026-02-21T09:45:38.0794601Z       %119 = arith.remsi %114, %c1024_i32 : i32
2026-02-21T09:45:38.0794728Z       %120 = arith.remsi %119, %118 : i32
2026-02-21T09:45:38.0794846Z       %121 = arith.addi %116, %120 : i32
2026-02-21T09:45:38.0794967Z       %122 = arith.divsi %119, %118 : i32
2026-02-21T09:45:38.0795108Z       %123 = arith.muli %121, %c32_i32 : i32
2026-02-21T09:45:38.0795321Z       %124 = tt.splat %123 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:38.0795653Z       %125 = arith.addi %124, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:38.0795869Z       %126 = arith.muli %122, %c32_i32 : i32
2026-02-21T09:45:38.0796045Z       %127 = tt.splat %126 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:38.0796276Z       %128 = arith.addi %127, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:38.0796560Z       %129 = tt.expand_dims %128 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:45:38.0796823Z       %130 = arith.muli %129, %cst_3 : tensor<32x1xi32, #blocked1>
2026-02-21T09:45:38.0797023Z       %131 = tt.broadcast %130 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0797385Z       %132 = tt.expand_dims %125 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0797785Z       %133 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T09:45:38.0798006Z         %200 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:45:38.0798186Z         %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0798415Z         %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0798696Z         %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:38.0798979Z         %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0799216Z         %205 = arith.addi %131, %204 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0799423Z         %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0799631Z         %207 = tt.load %206 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:38.0799922Z         %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0800333Z         %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0800619Z         %210 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:45:38.0800805Z         %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0801038Z         %212 = arith.addi %211, %132 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0801361Z         %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0801678Z         %214 = tt.load %213 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0801916Z         %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0802156Z         %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0802397Z         %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0802742Z         %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0803108Z         %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0803396Z         %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0803640Z         %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0803900Z         %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0804139Z         %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0804374Z         %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:38.0804601Z         %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:38.0804902Z         %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0805373Z         %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0805722Z         %228 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:45:38.0805855Z         %229 = arith.muli %228, %c2_i32 : i32
2026-02-21T09:45:38.0806030Z         %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0806265Z         %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0806547Z         %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:38.0806829Z         %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0807030Z         %234 = arith.addi %131, %233 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0807256Z         %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0807469Z         %236 = tt.load %235 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:38.0807737Z         %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0808162Z         %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0808452Z         %239 = arith.muli %228, %c8192_i32 : i32
2026-02-21T09:45:38.0808630Z         %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0808870Z         %241 = arith.addi %240, %132 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0809189Z         %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0809500Z         %243 = tt.load %242 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0809739Z         %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0809977Z         %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0810219Z         %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0810514Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0810855Z         %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0811166Z         %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0811405Z         %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0811646Z         %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0811898Z         %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0812132Z         %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:38.0812358Z         %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:38.0812653Z         %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0813122Z         %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0813479Z         scf.yield %256 : tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0813608Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:45:38.0813773Z       %134 = arith.truncf %133 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T09:45:38.0813948Z       %135 = arith.extsi %126 : i32 to i64
2026-02-21T09:45:38.0814075Z       %136 = arith.extsi %123 : i32 to i64
2026-02-21T09:45:38.0814242Z       %137 = tt.splat %135 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0814457Z       %138 = arith.addi %137, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0814727Z       %139 = tt.expand_dims %138 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0814967Z       %140 = arith.muli %139, %cst_7 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0815150Z       %141 = tt.broadcast %140 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0815372Z       %142 = tt.splat %136 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0815585Z       %143 = arith.addi %142, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0815850Z       %144 = tt.expand_dims %143 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0816122Z       %145 = tt.broadcast %144 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0816308Z       %146 = arith.addi %141, %145 : tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0816495Z       %147 = tt.addptr %18, %146 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0816700Z       %148 = arith.cmpi sge, %139, %cst_8 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0816870Z       %149 = arith.cmpi slt, %139, %cst_9 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0817030Z       %150 = arith.andi %148, %149 : tensor<32x1xi1, #mma>
2026-02-21T09:45:38.0817210Z       %151 = tt.broadcast %150 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0817401Z       %152 = arith.cmpi sge, %144, %cst_10 : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0817575Z       %153 = arith.cmpi slt, %144, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0817737Z       %154 = arith.andi %152, %153 : tensor<1x32xi1, #mma>
2026-02-21T09:45:38.0817913Z       %155 = tt.broadcast %154 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0818094Z       %156 = arith.andi %151, %155 : tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0818254Z       tt.store %147, %134, %156 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:45:38.0818410Z       %157 = arith.addi %arg3, %c3_i32 : i32
2026-02-21T09:45:38.0818538Z       %158 = arith.divsi %157, %c1024_i32 : i32
2026-02-21T09:45:38.0818669Z       %159 = arith.muli %158, %c2_i32 : i32
2026-02-21T09:45:38.0818792Z       %160 = arith.subi %c256_i32, %159 : i32
2026-02-21T09:45:38.0818934Z       %161 = arith.minsi %160, %c2_i32 : i32
2026-02-21T09:45:38.0819063Z       %162 = arith.remsi %157, %c1024_i32 : i32
2026-02-21T09:45:38.0819187Z       %163 = arith.remsi %162, %161 : i32
2026-02-21T09:45:38.0819330Z       %164 = arith.addi %159, %163 : i32
2026-02-21T09:45:38.0819447Z       %165 = arith.divsi %162, %161 : i32
2026-02-21T09:45:38.0819570Z       %166 = arith.muli %164, %c32_i32 : i32
2026-02-21T09:45:38.0819781Z       %167 = tt.splat %166 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:38.0820092Z       %168 = arith.addi %167, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:38.0820310Z       %169 = arith.muli %165, %c32_i32 : i32
2026-02-21T09:45:38.0820481Z       %170 = tt.splat %169 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:38.0820709Z       %171 = arith.addi %170, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:38.0820993Z       %172 = tt.expand_dims %171 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:45:38.0821255Z       %173 = arith.muli %172, %cst_3 : tensor<32x1xi32, #blocked1>
2026-02-21T09:45:38.0821455Z       %174 = tt.broadcast %173 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0821811Z       %175 = tt.expand_dims %168 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0822209Z       %176 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x32xf32, #mma>)  : i32 {
2026-02-21T09:45:38.0822424Z         %200 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:45:38.0822603Z         %201 = tt.splat %200 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0822849Z         %202 = arith.addi %201, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0823127Z         %203 = tt.expand_dims %202 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:38.0823412Z         %204 = tt.broadcast %203 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0823622Z         %205 = arith.addi %174, %204 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0823829Z         %206 = tt.addptr %9, %205 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0824041Z         %207 = tt.load %206 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:38.0824309Z         %208 = ttg.convert_layout %207 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0824718Z         %209 = arith.extf %208 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0825012Z         %210 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:45:38.0825198Z         %211 = tt.splat %210 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0825437Z         %212 = arith.addi %211, %175 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0825756Z         %213 = tt.addptr %10, %212 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0826073Z         %214 = tt.load %213 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0826316Z         %215 = arith.shli %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0826552Z         %216 = arith.shrsi %215, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0826809Z         %217 = arith.shrsi %214, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0827098Z         %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0827451Z         %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0827737Z         %220 = tt.broadcast %218 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0827980Z         %221 = arith.select %15, %220, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0828219Z         %222 = tt.broadcast %219 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0828448Z         %223 = arith.select %17, %222, %221 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0828681Z         %224 = tt.reshape %223 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:38.0828908Z         %225 = arith.sitofp %224 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:38.0829204Z         %226 = ttg.convert_layout %225 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0829674Z         %227 = tt.dot %209, %226, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0830020Z         %228 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:45:38.0830143Z         %229 = arith.muli %228, %c2_i32 : i32
2026-02-21T09:45:38.0830316Z         %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0830539Z         %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0830833Z         %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:38.0831110Z         %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0831305Z         %234 = arith.addi %174, %233 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0831524Z         %235 = tt.addptr %9, %234 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0831728Z         %236 = tt.load %235 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:38.0831997Z         %237 = ttg.convert_layout %236 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0832395Z         %238 = arith.extf %237 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0832677Z         %239 = arith.muli %228, %c8192_i32 : i32
2026-02-21T09:45:38.0832859Z         %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0833090Z         %241 = arith.addi %240, %175 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0833403Z         %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0833714Z         %243 = tt.load %242 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0833942Z         %244 = arith.shli %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0834183Z         %245 = arith.shrsi %244, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0834420Z         %246 = arith.shrsi %243, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0834737Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0835072Z         %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0835373Z         %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0835612Z         %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0835846Z         %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0836080Z         %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0836310Z         %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:38.0836531Z         %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:38.0836829Z         %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0837291Z         %256 = tt.dot %238, %255, %227, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0837638Z         scf.yield %256 : tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0837764Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:45:38.0837919Z       %177 = arith.truncf %176 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T09:45:38.0838089Z       %178 = arith.extsi %169 : i32 to i64
2026-02-21T09:45:38.0838209Z       %179 = arith.extsi %166 : i32 to i64
2026-02-21T09:45:38.0838371Z       %180 = tt.splat %178 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0838580Z       %181 = arith.addi %180, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0838863Z       %182 = tt.expand_dims %181 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0839106Z       %183 = arith.muli %182, %cst_7 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0839280Z       %184 = tt.broadcast %183 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0839506Z       %185 = tt.splat %179 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0839709Z       %186 = arith.addi %185, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0839970Z       %187 = tt.expand_dims %186 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0840226Z       %188 = tt.broadcast %187 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0840403Z       %189 = arith.addi %184, %188 : tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0840588Z       %190 = tt.addptr %18, %189 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0840780Z       %191 = arith.cmpi sge, %182, %cst_8 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0840948Z       %192 = arith.cmpi slt, %182, %cst_9 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0841104Z       %193 = arith.andi %191, %192 : tensor<32x1xi1, #mma>
2026-02-21T09:45:38.0841274Z       %194 = tt.broadcast %193 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0841461Z       %195 = arith.cmpi sge, %187, %cst_10 : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0841624Z       %196 = arith.cmpi slt, %187, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0841783Z       %197 = arith.andi %195, %196 : tensor<1x32xi1, #mma>
2026-02-21T09:45:38.0841950Z       %198 = tt.broadcast %197 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0842123Z       %199 = arith.andi %194, %198 : tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0842296Z       tt.store %190, %177, %199 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:45:38.0842445Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:45:38.0842604Z     %25 = arith.subi %3, %24 : i32
2026-02-21T09:45:38.0842742Z     %26 = arith.muli %25, %c256_i32 : i32
2026-02-21T09:45:38.0842861Z     %27 = arith.subi %24, %c1_i32 : i32
2026-02-21T09:45:38.0843332Z     %28:8 = scf.for %arg3 = %c0_i32 to %26 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %27, %arg6 = %c0_i32, %arg7 = %cst, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_2, %arg11 = %cst_1) -> (i32, i32, i32, tensor<32x32xf32, #mma>, i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>)  : i32 {
2026-02-21T09:45:38.0843801Z       %29 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:45:38.0843929Z       %30 = arith.cmpi eq, %arg4, %c255_i32 : i32
2026-02-21T09:45:38.0844059Z       %31 = arith.select %30, %c0_i32, %29 : i32
2026-02-21T09:45:38.0844185Z       %32 = arith.cmpi eq, %31, %c0_i32 : i32
2026-02-21T09:45:38.0844309Z       %33 = arith.select %32, %c0_i32, %arg6 : i32
2026-02-21T09:45:38.0844541Z       %34:5 = scf.if %32 -> (i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32) {
2026-02-21T09:45:38.0844771Z         %95 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T09:45:38.0844894Z         %96 = arith.divsi %95, %c1024_i32 : i32
2026-02-21T09:45:38.0845017Z         %97 = arith.muli %96, %c2_i32 : i32
2026-02-21T09:45:38.0845134Z         %98 = arith.subi %c256_i32, %97 : i32
2026-02-21T09:45:38.0845253Z         %99 = arith.minsi %98, %c2_i32 : i32
2026-02-21T09:45:38.0845370Z         %100 = arith.remsi %95, %c1024_i32 : i32
2026-02-21T09:45:38.0845491Z         %101 = arith.remsi %100, %99 : i32
2026-02-21T09:45:38.0845608Z         %102 = arith.addi %97, %101 : i32
2026-02-21T09:45:38.0845721Z         %103 = arith.divsi %100, %99 : i32
2026-02-21T09:45:38.0845839Z         %104 = arith.muli %102, %c32_i32 : i32
2026-02-21T09:45:38.0846071Z         %105 = tt.splat %104 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:38.0846373Z         %106 = arith.addi %105, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:38.0846586Z         %107 = arith.muli %103, %c32_i32 : i32
2026-02-21T09:45:38.0846774Z         %108 = tt.splat %107 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:38.0847002Z         %109 = arith.addi %108, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:38.0847279Z         %110 = tt.expand_dims %109 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:45:38.0847535Z         %111 = arith.muli %110, %cst_3 : tensor<32x1xi32, #blocked1>
2026-02-21T09:45:38.0847734Z         %112 = tt.broadcast %111 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0848096Z         %113 = tt.expand_dims %106 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0848521Z         scf.yield %104, %107, %112, %113, %95 : i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32
2026-02-21T09:45:38.0848755Z       } else {
2026-02-21T09:45:38.0848981Z         scf.yield %arg8, %arg9, %arg10, %arg11, %arg5 : i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32
2026-02-21T09:45:38.0849228Z       }
2026-02-21T09:45:38.0849313Z       %35 = arith.muli %33, %c2_i32 : i32
2026-02-21T09:45:38.0849477Z       %36 = tt.splat %35 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0849691Z       %37 = arith.addi %36, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0849981Z       %38 = tt.expand_dims %37 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:38.0850249Z       %39 = tt.broadcast %38 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0850455Z       %40 = arith.addi %34#2, %39 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0850651Z       %41 = tt.addptr %9, %40 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0850850Z       %42 = tt.load %41 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:38.0851110Z       %43 = ttg.convert_layout %42 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0851502Z       %44 = arith.extf %43 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0851775Z       %45 = arith.muli %33, %c8192_i32 : i32
2026-02-21T09:45:38.0851946Z       %46 = tt.splat %45 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0852164Z       %47 = arith.addi %46, %34#3 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0852467Z       %48 = tt.addptr %10, %47 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0852771Z       %49 = tt.load %48 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0852997Z       %50 = arith.shli %49, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0853227Z       %51 = arith.shrsi %50, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0853455Z       %52 = arith.shrsi %49, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0853735Z       %53 = tt.expand_dims %51 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0854076Z       %54 = tt.expand_dims %52 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0854352Z       %55 = tt.broadcast %53 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0854583Z       %56 = arith.select %15, %55, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0854825Z       %57 = tt.broadcast %54 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0855046Z       %58 = arith.select %17, %57, %56 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0855263Z       %59 = tt.reshape %58 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:38.0855476Z       %60 = arith.sitofp %59 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:38.0855762Z       %61 = ttg.convert_layout %60 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0856240Z       %62 = tt.dot %44, %61, %arg7, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0856582Z       %63 = arith.addi %33, %c1_i32 : i32
2026-02-21T09:45:38.0856702Z       %64 = arith.muli %63, %c2_i32 : i32
2026-02-21T09:45:38.0856866Z       %65 = tt.splat %64 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0857079Z       %66 = arith.addi %65, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:38.0857348Z       %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:38.0857615Z       %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0857817Z       %69 = arith.addi %34#2, %68 : tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0858012Z       %70 = tt.addptr %9, %69 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:45:38.0858224Z       %71 = tt.load %70 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:38.0858480Z       %72 = ttg.convert_layout %71 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0858872Z       %73 = arith.extf %72 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0859146Z       %74 = arith.muli %63, %c8192_i32 : i32
2026-02-21T09:45:38.0859313Z       %75 = tt.splat %74 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0859536Z       %76 = arith.addi %75, %34#3 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0859841Z       %77 = tt.addptr %10, %76 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0860145Z       %78 = tt.load %77 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0860372Z       %79 = arith.shli %78, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0860598Z       %80 = arith.shrsi %79, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0860826Z       %81 = arith.shrsi %78, %cst_4 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0861103Z       %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0861429Z       %83 = tt.expand_dims %81 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:38.0861703Z       %84 = tt.broadcast %82 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0861945Z       %85 = arith.select %15, %84, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0862175Z       %86 = tt.broadcast %83 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0862396Z       %87 = arith.select %17, %86, %85 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:38.0862626Z       %88 = tt.reshape %87 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:38.0862838Z       %89 = arith.sitofp %88 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:38.0863122Z       %90 = ttg.convert_layout %89 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:38.0863572Z       %91 = tt.dot %73, %90, %62, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0863904Z       %92 = arith.addi %33, %c2_i32 : i32
2026-02-21T09:45:38.0864029Z       %93 = arith.cmpi eq, %31, %c255_i32 : i32
2026-02-21T09:45:38.0864175Z       %94 = arith.select %93, %cst, %91 : tensor<32x32xf32, #mma>
2026-02-21T09:45:38.0864310Z       scf.if %93 {
2026-02-21T09:45:38.0864447Z         %95 = arith.truncf %91 : tensor<32x32xf32, #mma> to tensor<32x32xbf16, #mma>
2026-02-21T09:45:38.0864610Z         %96 = arith.extsi %34#1 : i32 to i64
2026-02-21T09:45:38.0864731Z         %97 = arith.extsi %34#0 : i32 to i64
2026-02-21T09:45:38.0864888Z         %98 = tt.splat %96 : i64 -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0865089Z         %99 = arith.addi %98, %19 : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:38.0865350Z         %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0865598Z         %101 = arith.muli %100, %cst_7 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0865773Z         %102 = tt.broadcast %101 : tensor<32x1xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0865993Z         %103 = tt.splat %97 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0866199Z         %104 = arith.addi %103, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:38.0866464Z         %105 = tt.expand_dims %104 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0866715Z         %106 = tt.broadcast %105 : tensor<1x32xi64, #mma> -> tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0866900Z         %107 = arith.addi %102, %106 : tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0867088Z         %108 = tt.addptr %18, %107 : tensor<32x32x!tt.ptr<bf16>, #mma>, tensor<32x32xi64, #mma>
2026-02-21T09:45:38.0867283Z         %109 = arith.cmpi sge, %100, %cst_8 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0867450Z         %110 = arith.cmpi slt, %100, %cst_9 : tensor<32x1xi64, #mma>
2026-02-21T09:45:38.0867605Z         %111 = arith.andi %109, %110 : tensor<32x1xi1, #mma>
2026-02-21T09:45:38.0867780Z         %112 = tt.broadcast %111 : tensor<32x1xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0867965Z         %113 = arith.cmpi sge, %105, %cst_10 : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0868135Z         %114 = arith.cmpi slt, %105, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:45:38.0868295Z         %115 = arith.andi %113, %114 : tensor<1x32xi1, #mma>
2026-02-21T09:45:38.0868465Z         %116 = tt.broadcast %115 : tensor<1x32xi1, #mma> -> tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0868643Z         %117 = arith.andi %112, %116 : tensor<32x32xi1, #mma>
2026-02-21T09:45:38.0868800Z         tt.store %108, %95, %117 : tensor<32x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:45:38.0868934Z       }
2026-02-21T09:45:38.0869215Z       scf.yield %31, %34#4, %92, %94, %34#0, %34#1, %34#2, %34#3 : i32, i32, i32, tensor<32x32xf32, #mma>, i32, i32, tensor<32x2xi32, #blocked1>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:38.0869495Z     }
2026-02-21T09:45:38.0869572Z     tt.return
2026-02-21T09:45:38.0869650Z   }
2026-02-21T09:45:38.0869724Z }
2026-02-21T09:45:38.0869767Z 
2026-02-21T09:45:38.0869812Z {-#
2026-02-21T09:45:38.0869894Z   external_resources: {
2026-02-21T09:45:38.0869991Z     mlir_reproducer: {
2026-02-21T09:45:38.0870994Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:45:38.0871980Z       disable_threading: false,
2026-02-21T09:45:38.0872087Z       verify_each: true
2026-02-21T09:45:38.0872181Z     }
2026-02-21T09:45:38.0872252Z   }
2026-02-21T09:45:38.0872326Z #-}
2026-02-21T09:45:38.0872607Z /tmp/torchinductor_root/4g/c4gqla7w3u5bg7mh47a4i7i3ad5wu42gcpqh6igdtxb2rmf43rrj.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:45:38.0873280Z /tmp/torchinductor_root/4g/c4gqla7w3u5bg7mh47a4i7i3ad5wu42gcpqh6igdtxb2rmf43rrj.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:45:38.0873825Z [68s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:45:38.0874608Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 32], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[4, 2], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:45:38.0875317Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:45:38.0875487Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:45:39.4818015Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:45:39.4822025Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:45:39.4822915Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:45:39.4823624Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:45:39.4824302Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:45:39.4824891Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:45:39.4825318Z #smem = #ttg.shared_memory
2026-02-21T09:45:39.4825855Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:45:39.4826942Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:45:39.4827865Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32, #mma>
2026-02-21T09:45:39.4828490Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:45:39.4828757Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:45:39.4829029Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:45:39.4829290Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:45:39.4829551Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T09:45:39.4829961Z     %cst_0 = arith.constant dense<0> : tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4830316Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:45:39.4830566Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:45:39.4830822Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:45:39.4831082Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:45:39.4831339Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:45:39.4831592Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:45:39.4831850Z     %c512_i64 = arith.constant 512 : i64
2026-02-21T09:45:39.4832102Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T09:45:39.4832410Z     %cst_1 = arith.constant dense<0> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4832838Z     %cst_2 = arith.constant dense<8192> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4833255Z     %cst_3 = arith.constant dense<0> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4833614Z     %cst_4 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:45:39.4833964Z     %cst_5 = arith.constant dense<4> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4834313Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:39.4834591Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:39.4834869Z     %cst_8 = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:45:39.4835104Z     %0 = tt.get_program_id x : i32
2026-02-21T09:45:39.4835286Z     %1 = arith.divsi %0, %c2048_i32 : i32
2026-02-21T09:45:39.4835513Z     %2 = arith.muli %1, %c16_i32 : i32
2026-02-21T09:45:39.4835697Z     %3 = arith.subi %c256_i32, %2 : i32
2026-02-21T09:45:39.4835885Z     %4 = arith.minsi %3, %c16_i32 : i32
2026-02-21T09:45:39.4836115Z     %5 = arith.remsi %0, %c2048_i32 : i32
2026-02-21T09:45:39.4836303Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:45:39.4836484Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:45:39.4836657Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:45:39.4836828Z     %9 = arith.muli %7, %c32_i32 : i32
2026-02-21T09:45:39.4837220Z     %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:39.4837732Z     %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:39.4838118Z     %12 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:39.4838457Z     %13 = arith.addi %12, %11 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:45:39.4838723Z     %14 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:45:39.4839046Z     %15 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:39.4839508Z     %16 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:39.4839913Z     %17 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:39.4840257Z     %18 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:39.4840601Z     %19 = arith.addi %17, %15 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:45:39.4840935Z     %20 = arith.addi %18, %16 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:45:39.4841330Z     %21 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:39.4841853Z     %22 = tt.expand_dims %19 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:45:39.4842202Z     %23 = arith.muli %22, %cst_4 : tensor<128x1xi32, #blocked1>
2026-02-21T09:45:39.4842442Z     %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4842789Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:39.4842990Z     %26 = arith.extsi %9 : i32 to i64
2026-02-21T09:45:39.4843223Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4843574Z     %28 = tt.splat %26 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:39.4844088Z     %29 = arith.extsi %10 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:39.4844583Z     %30 = arith.addi %28, %29 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:45:39.4845058Z     %31 = tt.expand_dims %30 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4845504Z     %32 = arith.cmpi sge, %31, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4845799Z     %33 = arith.cmpi slt, %31, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4846080Z     %34 = arith.andi %32, %33 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4846428Z     %35 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:45:39.4846958Z     %36 = tt.expand_dims %35 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:45:39.4847470Z     %37 = tt.expand_dims %36 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:39.4847778Z     %38 = arith.cmpi eq, %37, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:39.4848014Z     %39 = tt.broadcast %38 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked>
2026-02-21T09:45:39.4848254Z     %40 = arith.cmpi eq, %37, %cst_7 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:45:39.4848485Z     %41 = tt.broadcast %40 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked>
2026-02-21T09:45:39.4848805Z     %42 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst) -> (tensor<128x32xf32, #mma>)  : i32 {
2026-02-21T09:45:39.4849067Z       %52 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:45:39.4849275Z       %53 = tt.splat %52 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:39.4849537Z       %54 = arith.addi %53, %21 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:39.4849871Z       %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:39.4850209Z       %56 = tt.broadcast %55 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4850444Z       %57 = arith.addi %24, %56 : tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4850688Z       %58 = tt.addptr %25, %57 : tensor<128x2x!tt.ptr<bf16>, #blocked1>, tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4850935Z       %59 = tt.load %58 : tensor<128x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:39.4851206Z       %60 = ttg.local_alloc %59 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem>
2026-02-21T09:45:39.4851640Z       %61 = ttg.local_load %60 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4852143Z       %62 = arith.extf %61 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4852506Z       %63 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:45:39.4852630Z       %64 = arith.muli %63, %c8192_i64 : i64
2026-02-21T09:45:39.4852839Z       %65 = tt.splat %64 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4853057Z       %66 = arith.addi %65, %31 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4853359Z       %67 = tt.addptr %27, %66 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4853675Z       %68 = tt.load %67, %34, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4853913Z       %69 = arith.shli %68, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4854144Z       %70 = arith.shrsi %69, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4854373Z       %71 = arith.shrsi %68, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4854657Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:39.4854982Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:39.4855254Z       %74 = tt.broadcast %72 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4855484Z       %75 = arith.select %39, %74, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4855726Z       %76 = tt.broadcast %73 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4855951Z       %77 = arith.select %41, %76, %75 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4856189Z       %78 = tt.reshape %77 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:39.4856405Z       %79 = arith.sitofp %78 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:39.4856691Z       %80 = ttg.convert_layout %79 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4857150Z       %81 = tt.dot %62, %80, %arg4, inputPrecision = tf32 : tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:45:39.4857496Z       %82 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:45:39.4857614Z       %83 = arith.muli %82, %c2_i32 : i32
2026-02-21T09:45:39.4857778Z       %84 = tt.splat %83 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:39.4857994Z       %85 = arith.addi %84, %21 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:39.4858263Z       %86 = tt.expand_dims %85 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:39.4858536Z       %87 = tt.broadcast %86 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4858724Z       %88 = arith.addi %24, %87 : tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4858919Z       %89 = tt.addptr %25, %88 : tensor<128x2x!tt.ptr<bf16>, #blocked1>, tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4859121Z       %90 = tt.load %89 : tensor<128x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:39.4859338Z       %91 = ttg.local_alloc %90 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem>
2026-02-21T09:45:39.4859691Z       %92 = ttg.local_load %91 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4860094Z       %93 = arith.extf %92 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4860386Z       %94 = arith.extsi %82 : i32 to i64
2026-02-21T09:45:39.4860504Z       %95 = arith.muli %94, %c8192_i64 : i64
2026-02-21T09:45:39.4860670Z       %96 = tt.splat %95 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4860892Z       %97 = arith.addi %96, %31 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4861194Z       %98 = tt.addptr %27, %97 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4861453Z       %99 = arith.cmpi slt, %94, %c512_i64 : i64
2026-02-21T09:45:39.4861628Z       %100 = tt.splat %99 : i1 -> tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4861850Z       %101 = arith.andi %100, %34 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4862088Z       %102 = tt.load %98, %101, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4862328Z       %103 = arith.shli %102, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4862568Z       %104 = arith.shrsi %103, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4862802Z       %105 = arith.shrsi %102, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4863085Z       %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:39.4863440Z       %107 = tt.expand_dims %105 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:39.4863719Z       %108 = tt.broadcast %106 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4863971Z       %109 = arith.select %39, %108, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4864202Z       %110 = tt.broadcast %107 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4864429Z       %111 = arith.select %41, %110, %109 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4864655Z       %112 = tt.reshape %111 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:39.4864871Z       %113 = arith.sitofp %112 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:39.4865171Z       %114 = ttg.convert_layout %113 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4865633Z       %115 = tt.dot %93, %114, %81, inputPrecision = tf32 : tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:45:39.4865972Z       %116 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:45:39.4866093Z       %117 = arith.muli %116, %c2_i32 : i32
2026-02-21T09:45:39.4866260Z       %118 = tt.splat %117 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:39.4866480Z       %119 = arith.addi %118, %21 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:39.4866755Z       %120 = tt.expand_dims %119 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:39.4867028Z       %121 = tt.broadcast %120 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4867222Z       %122 = arith.addi %24, %121 : tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4867439Z       %123 = tt.addptr %25, %122 : tensor<128x2x!tt.ptr<bf16>, #blocked1>, tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4867650Z       %124 = tt.load %123 : tensor<128x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:39.4867883Z       %125 = ttg.local_alloc %124 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem>
2026-02-21T09:45:39.4868217Z       %126 = ttg.local_load %125 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4868646Z       %127 = arith.extf %126 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4868922Z       %128 = arith.extsi %116 : i32 to i64
2026-02-21T09:45:39.4869045Z       %129 = arith.muli %128, %c8192_i64 : i64
2026-02-21T09:45:39.4869217Z       %130 = tt.splat %129 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4869442Z       %131 = arith.addi %130, %31 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4869754Z       %132 = tt.addptr %27, %131 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4870013Z       %133 = arith.cmpi slt, %128, %c512_i64 : i64
2026-02-21T09:45:39.4870187Z       %134 = tt.splat %133 : i1 -> tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4870422Z       %135 = arith.andi %134, %34 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4870662Z       %136 = tt.load %132, %135, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4870912Z       %137 = arith.shli %136, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4871149Z       %138 = arith.shrsi %137, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4871413Z       %139 = arith.shrsi %136, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4871705Z       %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:39.4872053Z       %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:39.4872339Z       %142 = tt.broadcast %140 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4872581Z       %143 = arith.select %39, %142, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4872817Z       %144 = tt.broadcast %141 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4873055Z       %145 = arith.select %41, %144, %143 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4873283Z       %146 = tt.reshape %145 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:39.4873514Z       %147 = arith.sitofp %146 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:39.4873811Z       %148 = ttg.convert_layout %147 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4874273Z       %149 = tt.dot %127, %148, %115, inputPrecision = tf32 : tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:45:39.4874625Z       %150 = arith.addi %arg3, %c3_i32 : i32
2026-02-21T09:45:39.4874753Z       %151 = arith.muli %150, %c2_i32 : i32
2026-02-21T09:45:39.4874922Z       %152 = tt.splat %151 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:39.4875148Z       %153 = arith.addi %152, %21 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:45:39.4875445Z       %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:45:39.4875727Z       %155 = tt.broadcast %154 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4875926Z       %156 = arith.addi %24, %155 : tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4876147Z       %157 = tt.addptr %25, %156 : tensor<128x2x!tt.ptr<bf16>, #blocked1>, tensor<128x2xi32, #blocked1>
2026-02-21T09:45:39.4876360Z       %158 = tt.load %157 : tensor<128x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:45:39.4876584Z       %159 = ttg.local_alloc %158 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem>
2026-02-21T09:45:39.4876921Z       %160 = ttg.local_load %159 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4877331Z       %161 = arith.extf %160 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4877618Z       %162 = arith.extsi %150 : i32 to i64
2026-02-21T09:45:39.4877747Z       %163 = arith.muli %162, %c8192_i64 : i64
2026-02-21T09:45:39.4877926Z       %164 = tt.splat %163 : i64 -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4878163Z       %165 = arith.addi %164, %31 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4878475Z       %166 = tt.addptr %27, %165 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4878749Z       %167 = arith.cmpi slt, %162, %c512_i64 : i64
2026-02-21T09:45:39.4878931Z       %168 = tt.splat %167 : i1 -> tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4879157Z       %169 = arith.andi %168, %34 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4879423Z       %170 = tt.load %166, %169, %cst_1 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4879667Z       %171 = arith.shli %170, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4879924Z       %172 = arith.shrsi %171, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4880165Z       %173 = arith.shrsi %170, %cst_5 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:45:39.4880451Z       %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:39.4880786Z       %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:45:39.4881065Z       %176 = tt.broadcast %174 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4881306Z       %177 = arith.select %39, %176, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4881547Z       %178 = tt.broadcast %175 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4881779Z       %179 = arith.select %41, %178, %177 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:45:39.4882012Z       %180 = tt.reshape %179 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:45:39.4882234Z       %181 = arith.sitofp %180 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:45:39.4882537Z       %182 = ttg.convert_layout %181 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:45:39.4883046Z       %183 = tt.dot %161, %182, %149, inputPrecision = tf32 : tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:45:39.4883396Z       scf.yield %183 : tensor<128x32xf32, #mma>
2026-02-21T09:45:39.4883585Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:45:39.4883775Z     %43 = arith.truncf %42 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma>
2026-02-21T09:45:39.4884040Z     %44 = tt.expand_dims %20 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:45:39.4884299Z     %45 = arith.muli %44, %cst_8 : tensor<128x1xi32, #mma>
2026-02-21T09:45:39.4884523Z     %46 = tt.expand_dims %13 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma>
2026-02-21T09:45:39.4884779Z     %47 = tt.broadcast %45 : tensor<128x1xi32, #mma> -> tensor<128x32xi32, #mma>
2026-02-21T09:45:39.4884978Z     %48 = tt.broadcast %46 : tensor<1x32xi32, #mma> -> tensor<128x32xi32, #mma>
2026-02-21T09:45:39.4885157Z     %49 = arith.addi %47, %48 : tensor<128x32xi32, #mma>
2026-02-21T09:45:39.4885333Z     %50 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:45:39.4885548Z     %51 = tt.addptr %50, %49 : tensor<128x32x!tt.ptr<bf16>, #mma>, tensor<128x32xi32, #mma>
2026-02-21T09:45:39.4885743Z     tt.store %51, %43 : tensor<128x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:45:39.4885875Z     tt.return
2026-02-21T09:45:39.4885962Z   }
2026-02-21T09:45:39.4886037Z }
2026-02-21T09:45:39.4886086Z 
2026-02-21T09:45:39.4886120Z {-#
2026-02-21T09:45:39.4886208Z   external_resources: {
2026-02-21T09:45:39.4886309Z     mlir_reproducer: {
2026-02-21T09:45:39.4887317Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:45:39.4888310Z       disable_threading: false,
2026-02-21T09:45:39.4888438Z       verify_each: true
2026-02-21T09:45:39.4888534Z     }
2026-02-21T09:45:39.4888608Z   }
2026-02-21T09:45:39.4888684Z #-}
2026-02-21T09:45:39.4888967Z /tmp/torchinductor_root/ur/curprybqiqqa3re3xtfhh4eybc7zmkddnatmkgwbjlfqorv2budm.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:45:39.4889652Z /tmp/torchinductor_root/ur/curprybqiqqa3re3xtfhh4eybc7zmkddnatmkgwbjlfqorv2budm.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:45:39.4890200Z [70s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:45:39.4890926Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 32], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:45:39.4891596Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:45:39.4891771Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:46:17.8999207Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:46:17.9000797Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:46:17.9002049Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:46:17.9002969Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:46:17.9003747Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [8, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:46:17.9004639Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:46:17.9005664Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:46:17.9006524Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x16xf32, #mma>
2026-02-21T09:46:17.9006875Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:46:17.9007140Z     %c38912_i32 = arith.constant 38912 : i32
2026-02-21T09:46:17.9007423Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:46:17.9007675Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T09:46:17.9007994Z     %cst_0 = arith.constant dense<0> : tensor<2x2x16xi8, #blocked>
2026-02-21T09:46:17.9008327Z     %c65536_i32 = arith.constant 65536 : i32
2026-02-21T09:46:17.9008577Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:46:17.9008826Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:46:17.9009070Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:46:17.9009317Z     %c255_i32 = arith.constant 255 : i32
2026-02-21T09:46:17.9009561Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:46:17.9009966Z     %cst_1 = arith.constant dense<0> : tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9010453Z     %cst_2 = arith.constant dense<0> : tensor<128x4xi32, #blocked1>
2026-02-21T09:46:17.9010783Z     %c-1_i32 = arith.constant -1 : i32
2026-02-21T09:46:17.9011029Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:46:17.9011425Z     %cst_3 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:46:17.9011902Z     %cst_4 = arith.constant dense<8192> : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9012557Z     %cst_5 = arith.constant dense<4> : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9013022Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:46:17.9013390Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:46:17.9013755Z     %cst_8 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:46:17.9014117Z     %cst_9 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:46:17.9014477Z     %cst_10 = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:46:17.9014782Z     %cst_11 = arith.constant dense<0> : tensor<1x16xi64, #mma>
2026-02-21T09:46:17.9015028Z     %cst_12 = arith.constant dense<8192> : tensor<1x16xi64, #mma>
2026-02-21T09:46:17.9015250Z     %0 = tt.get_program_id x : i32
2026-02-21T09:46:17.9015551Z     %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:17.9015981Z     %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:46:17.9016455Z     %3 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:46:17.9016918Z     %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:46:17.9017378Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:46:17.9017839Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:46:17.9018200Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:46:17.9018610Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9019072Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:46:17.9019728Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:46:17.9020341Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:46:17.9020719Z     %12 = arith.cmpi eq, %11, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:46:17.9021016Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked>
2026-02-21T09:46:17.9021318Z     %14 = arith.cmpi eq, %11, %cst_7 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:46:17.9021593Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x16xi1, #blocked>
2026-02-21T09:46:17.9021909Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:46:17.9022318Z     %17 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:46:17.9022820Z     %18 = arith.extsi %4 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:46:17.9023170Z     %19 = arith.subi %c65536_i32, %0 : i32
2026-02-21T09:46:17.9023364Z     %20 = arith.ceildivsi %19, %c38912_i32 : i32
2026-02-21T09:46:17.9023553Z     %21 = arith.muli %20, %c256_i32 : i32
2026-02-21T09:46:17.9023732Z     %22 = arith.subi %0, %c38912_i32 : i32
2026-02-21T09:46:17.9024500Z     %23:8 = scf.for %arg3 = %c0_i32 to %21 step %c1_i32 iter_args(%arg4 = %c-1_i32, %arg5 = %22, %arg6 = %c0_i32, %arg7 = %cst, %arg8 = %c0_i32, %arg9 = %c0_i32, %arg10 = %cst_2, %arg11 = %cst_1) -> (i32, i32, i32, tensor<128x16xf32, #mma>, i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>)  : i32 {
2026-02-21T09:46:17.9025168Z       %24 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:46:17.9025317Z       %25 = arith.cmpi eq, %arg4, %c255_i32 : i32
2026-02-21T09:46:17.9025468Z       %26 = arith.select %25, %c0_i32, %24 : i32
2026-02-21T09:46:17.9025613Z       %27 = arith.cmpi eq, %26, %c0_i32 : i32
2026-02-21T09:46:17.9025758Z       %28 = arith.select %27, %c0_i32, %arg6 : i32
2026-02-21T09:46:17.9026034Z       %29:5 = scf.if %27 -> (i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32) {
2026-02-21T09:46:17.9026302Z         %64 = arith.addi %arg5, %c38912_i32 : i32
2026-02-21T09:46:17.9026452Z         %65 = arith.divsi %64, %c2048_i32 : i32
2026-02-21T09:46:17.9026595Z         %66 = arith.muli %65, %c4_i32 : i32
2026-02-21T09:46:17.9026740Z         %67 = arith.subi %c128_i32, %66 : i32
2026-02-21T09:46:17.9026877Z         %68 = arith.minsi %67, %c4_i32 : i32
2026-02-21T09:46:17.9027022Z         %69 = arith.remsi %64, %c2048_i32 : i32
2026-02-21T09:46:17.9027165Z         %70 = arith.remsi %69, %68 : i32
2026-02-21T09:46:17.9027297Z         %71 = arith.addi %66, %70 : i32
2026-02-21T09:46:17.9027432Z         %72 = arith.divsi %69, %68 : i32
2026-02-21T09:46:17.9027568Z         %73 = arith.muli %71, %c128_i32 : i32
2026-02-21T09:46:17.9027772Z         %74 = tt.splat %73 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:17.9028039Z         %75 = arith.addi %74, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:17.9028243Z         %76 = arith.muli %72, %c16_i32 : i32
2026-02-21T09:46:17.9028489Z         %77 = tt.splat %76 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:46:17.9028855Z         %78 = arith.addi %77, %3 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:46:17.9029227Z         %79 = tt.expand_dims %75 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:46:17.9029528Z         %80 = arith.muli %79, %cst_3 : tensor<128x1xi32, #blocked1>
2026-02-21T09:46:17.9029775Z         %81 = tt.broadcast %80 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:46:17.9030194Z         %82 = tt.expand_dims %78 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9030698Z         %83 = tt.broadcast %82 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9031129Z         scf.yield %73, %76, %81, %83, %64 : i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32
2026-02-21T09:46:17.9031405Z       } else {
2026-02-21T09:46:17.9031671Z         scf.yield %arg8, %arg9, %arg10, %arg11, %arg5 : i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>, i32
2026-02-21T09:46:17.9031969Z       }
2026-02-21T09:46:17.9032169Z       %30 = tt.splat %28 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:46:17.9032547Z       %31 = arith.addi %30, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:46:17.9032791Z       %32 = arith.muli %28, %c2_i32 : i32
2026-02-21T09:46:17.9032978Z       %33 = tt.splat %32 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:46:17.9033230Z       %34 = arith.addi %33, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:46:17.9033563Z       %35 = tt.expand_dims %34 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:46:17.9033883Z       %36 = tt.broadcast %35 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:46:17.9034133Z       %37 = arith.addi %29#2, %36 : tensor<128x4xi32, #blocked1>
2026-02-21T09:46:17.9034332Z       %38 = tt.addptr %7, %37 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:46:17.9034536Z       %39 = tt.load %38 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:46:17.9034799Z       %40 = ttg.convert_layout %39 : tensor<128x4xbf16, #blocked1> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:46:17.9035198Z       %41 = arith.extf %40 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:46:17.9035652Z       %42 = tt.expand_dims %31 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9042543Z       %43 = arith.muli %42, %cst_4 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9042906Z       %44 = tt.broadcast %43 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9043205Z       %45 = arith.addi %44, %29#3 : tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9043510Z       %46 = tt.addptr %8, %45 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9043809Z       %47 = tt.load %46 : tensor<2x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9044035Z       %48 = arith.shli %47, %cst_5 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9044318Z       %49 = arith.shrsi %48, %cst_5 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9044547Z       %50 = arith.shrsi %47, %cst_5 : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9044828Z       %51 = tt.expand_dims %49 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:46:17.9045175Z       %52 = tt.expand_dims %50 {axis = 1 : i32} : tensor<2x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x16xi8, #blocked>
2026-02-21T09:46:17.9045449Z       %53 = tt.broadcast %51 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:46:17.9045677Z       %54 = arith.select %13, %53, %cst_0 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:46:17.9045907Z       %55 = tt.broadcast %52 : tensor<2x1x16xi8, #blocked> -> tensor<2x2x16xi8, #blocked>
2026-02-21T09:46:17.9046134Z       %56 = arith.select %15, %55, %54 : tensor<2x2x16xi1, #blocked>, tensor<2x2x16xi8, #blocked>
2026-02-21T09:46:17.9046355Z       %57 = tt.reshape %56 : tensor<2x2x16xi8, #blocked> -> tensor<4x16xi8, #blocked2>
2026-02-21T09:46:17.9046569Z       %58 = arith.sitofp %57 : tensor<4x16xi8, #blocked2> to tensor<4x16xf32, #blocked2>
2026-02-21T09:46:17.9046853Z       %59 = ttg.convert_layout %58 : tensor<4x16xf32, #blocked2> -> tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:46:17.9047314Z       %60 = tt.dot %41, %59, %arg7, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x16xf32, #mma>
2026-02-21T09:46:17.9047655Z       %61 = arith.addi %28, %c2_i32 : i32
2026-02-21T09:46:17.9047780Z       %62 = arith.cmpi eq, %26, %c255_i32 : i32
2026-02-21T09:46:17.9047928Z       %63 = arith.select %62, %cst, %60 : tensor<128x16xf32, #mma>
2026-02-21T09:46:17.9048064Z       scf.if %62 {
2026-02-21T09:46:17.9048220Z         %64 = arith.truncf %60 : tensor<128x16xf32, #mma> to tensor<128x16xbf16, #mma>
2026-02-21T09:46:17.9048388Z         %65 = arith.extsi %29#0 : i32 to i64
2026-02-21T09:46:17.9048507Z         %66 = arith.extsi %29#1 : i32 to i64
2026-02-21T09:46:17.9048688Z         %67 = tt.splat %65 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:46:17.9048894Z         %68 = arith.addi %67, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:46:17.9049158Z         %69 = tt.expand_dims %68 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:46:17.9049392Z         %70 = arith.muli %69, %cst_8 : tensor<128x1xi64, #mma>
2026-02-21T09:46:17.9049568Z         %71 = tt.broadcast %70 : tensor<128x1xi64, #mma> -> tensor<128x16xi64, #mma>
2026-02-21T09:46:17.9049770Z         %72 = tt.splat %66 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:46:17.9049971Z         %73 = arith.addi %72, %18 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:46:17.9050228Z         %74 = tt.expand_dims %73 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi64, #mma>
2026-02-21T09:46:17.9050482Z         %75 = tt.broadcast %74 : tensor<1x16xi64, #mma> -> tensor<128x16xi64, #mma>
2026-02-21T09:46:17.9050660Z         %76 = arith.addi %71, %75 : tensor<128x16xi64, #mma>
2026-02-21T09:46:17.9050844Z         %77 = tt.addptr %16, %76 : tensor<128x16x!tt.ptr<bf16>, #mma>, tensor<128x16xi64, #mma>
2026-02-21T09:46:17.9051041Z         %78 = arith.cmpi sge, %69, %cst_9 : tensor<128x1xi64, #mma>
2026-02-21T09:46:17.9051206Z         %79 = arith.cmpi slt, %69, %cst_10 : tensor<128x1xi64, #mma>
2026-02-21T09:46:17.9051361Z         %80 = arith.andi %78, %79 : tensor<128x1xi1, #mma>
2026-02-21T09:46:17.9051533Z         %81 = tt.broadcast %80 : tensor<128x1xi1, #mma> -> tensor<128x16xi1, #mma>
2026-02-21T09:46:17.9051712Z         %82 = arith.cmpi sge, %74, %cst_11 : tensor<1x16xi64, #mma>
2026-02-21T09:46:17.9051896Z         %83 = arith.cmpi slt, %74, %cst_12 : tensor<1x16xi64, #mma>
2026-02-21T09:46:17.9052049Z         %84 = arith.andi %82, %83 : tensor<1x16xi1, #mma>
2026-02-21T09:46:17.9052213Z         %85 = tt.broadcast %84 : tensor<1x16xi1, #mma> -> tensor<128x16xi1, #mma>
2026-02-21T09:46:17.9052385Z         %86 = arith.andi %81, %85 : tensor<128x16xi1, #mma>
2026-02-21T09:46:17.9052553Z         tt.store %77, %64, %86 : tensor<128x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:46:17.9052685Z       }
2026-02-21T09:46:17.9052947Z       scf.yield %26, %29#4, %61, %63, %29#0, %29#1, %29#2, %29#3 : i32, i32, i32, tensor<128x16xf32, #mma>, i32, i32, tensor<128x4xi32, #blocked1>, tensor<2x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:17.9053264Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 4 : i32}
2026-02-21T09:46:17.9053399Z     tt.return
2026-02-21T09:46:17.9053478Z   }
2026-02-21T09:46:17.9053559Z }
2026-02-21T09:46:17.9053603Z 
2026-02-21T09:46:17.9053633Z {-#
2026-02-21T09:46:17.9053717Z   external_resources: {
2026-02-21T09:46:17.9053815Z     mlir_reproducer: {
2026-02-21T09:46:17.9054806Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:46:17.9055787Z       disable_threading: false,
2026-02-21T09:46:17.9055894Z       verify_each: true
2026-02-21T09:46:17.9055983Z     }
2026-02-21T09:46:17.9056059Z   }
2026-02-21T09:46:17.9056126Z #-}
2026-02-21T09:46:17.9056421Z /tmp/torchinductor_root/2i/c2i3iz5awefnwj6vhr5ioojv2job23trwyh547leswpj2jravbs2.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:46:17.9057093Z /tmp/torchinductor_root/2i/c2i3iz5awefnwj6vhr5ioojv2job23trwyh547leswpj2jravbs2.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:46:17.9057657Z [108s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:46:17.9058442Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 16], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:46:17.9059158Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:46:17.9059327Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:46:21.2035688Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:46:21.2037249Z #blocked = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:46:21.2038212Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:46:21.2039077Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:46:21.2039899Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:46:21.2041184Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:46:21.2041881Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:46:21.2042401Z #smem = #ttg.shared_memory
2026-02-21T09:46:21.2043270Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:46:21.2044566Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:46:21.2045745Z     %cst = arith.constant dense<16384> : tensor<256x1xi64, #mma>
2026-02-21T09:46:21.2045966Z     %cst_0 = arith.constant dense<0> : tensor<256x1xi64, #mma>
2026-02-21T09:46:21.2046179Z     %cst_1 = arith.constant dense<8192> : tensor<256x1xi64, #mma>
2026-02-21T09:46:21.2046392Z     %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #mma>
2026-02-21T09:46:21.2046597Z     %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #mma>
2026-02-21T09:46:21.2046807Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked>
2026-02-21T09:46:21.2047014Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked>
2026-02-21T09:46:21.2047227Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked>
2026-02-21T09:46:21.2047443Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:46:21.2047659Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:46:21.2047881Z     %cst_9 = arith.constant dense<0.000000e+00> : tensor<256x64xf32, #mma>
2026-02-21T09:46:21.2048083Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:46:21.2048231Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T09:46:21.2048460Z     %cst_10 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:21.2048800Z     %cst_11 = arith.constant dense<0> : tensor<1x64xi64, #blocked>
2026-02-21T09:46:21.2049018Z     %cst_12 = arith.constant dense<8192> : tensor<1x64xi64, #blocked>
2026-02-21T09:46:21.2049324Z     %cst_13 = arith.constant dense<1024> : tensor<256x1xi32, #blocked2>
2026-02-21T09:46:21.2049510Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:46:21.2049657Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:46:21.2049804Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:46:21.2049944Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:46:21.2050121Z     %cst_14 = arith.constant dense<0> : tensor<2x64xi8, #blocked>
2026-02-21T09:46:21.2050333Z     %cst_15 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked1>
2026-02-21T09:46:21.2050522Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:46:21.2050666Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:46:21.2050892Z     %cst_16 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:21.2051132Z     %0 = tt.get_program_id x : i32
2026-02-21T09:46:21.2051274Z     %1 = arith.divsi %0, %c4096_i32 : i32
2026-02-21T09:46:21.2051420Z     %2 = arith.muli %1, %c64_i32 : i32
2026-02-21T09:46:21.2051561Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:46:21.2051704Z     %4 = arith.minsi %3, %c64_i32 : i32
2026-02-21T09:46:21.2051844Z     %5 = arith.remsi %0, %c4096_i32 : i32
2026-02-21T09:46:21.2051988Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:46:21.2052121Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:46:21.2052254Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:46:21.2052386Z     %9 = arith.muli %7, %c64_i32 : i32
2026-02-21T09:46:21.2052529Z     %10 = arith.muli %8, %c256_i32 : i32
2026-02-21T09:46:21.2052786Z     %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:46:21.2053143Z     %12 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:46:21.2053488Z     %13 = tt.splat %10 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:46:21.2053765Z     %14 = arith.addi %13, %11 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:46:21.2054082Z     %15 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:46:21.2054518Z     %16 = tt.expand_dims %14 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<256x1xi32, #blocked2>
2026-02-21T09:46:21.2054836Z     %17 = arith.muli %16, %cst_13 : tensor<256x1xi32, #blocked2>
2026-02-21T09:46:21.2055078Z     %18 = tt.broadcast %17 : tensor<256x1xi32, #blocked2> -> tensor<256x4xi32, #blocked2>
2026-02-21T09:46:21.2055350Z     %19 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<256x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:46:21.2055555Z     %20 = arith.extsi %9 : i32 to i64
2026-02-21T09:46:21.2055728Z     %21 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:46:21.2055963Z     %22 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:21.2056280Z     %23 = arith.extsi %22 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:21.2056570Z     %24 = tt.splat %20 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:46:21.2056825Z     %25 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:46:21.2057148Z     %26 = arith.extsi %25 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:46:21.2057444Z     %27 = arith.addi %24, %26 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:46:21.2057735Z     %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi64, #blocked>
2026-02-21T09:46:21.2058006Z     %29 = tt.broadcast %28 : tensor<1x64xi64, #blocked> -> tensor<2x64xi64, #blocked>
2026-02-21T09:46:21.2058225Z     %30 = arith.cmpi sge, %28, %cst_11 : tensor<1x64xi64, #blocked>
2026-02-21T09:46:21.2058393Z     %31 = arith.cmpi slt, %28, %cst_12 : tensor<1x64xi64, #blocked>
2026-02-21T09:46:21.2058557Z     %32 = arith.andi %30, %31 : tensor<1x64xi1, #blocked>
2026-02-21T09:46:21.2058737Z     %33 = tt.broadcast %32 : tensor<1x64xi1, #blocked> -> tensor<2x64xi1, #blocked>
2026-02-21T09:46:21.2059023Z     %34 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:46:21.2059455Z     %35 = tt.expand_dims %34 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:46:21.2059864Z     %36 = tt.expand_dims %35 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:46:21.2060120Z     %37 = arith.cmpi eq, %36, %cst_8 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:46:21.2060317Z     %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x64xi1, #blocked1>
2026-02-21T09:46:21.2060515Z     %39 = arith.cmpi eq, %36, %cst_7 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:46:21.2060709Z     %40 = tt.broadcast %39 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x64xi1, #blocked1>
2026-02-21T09:46:21.2060925Z     %41 = ttg.local_alloc : () -> !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable>
2026-02-21T09:46:21.2061202Z     %42 = tt.expand_dims %15 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:46:21.2061471Z     %43 = tt.broadcast %42 : tensor<1x4xi32, #blocked2> -> tensor<256x4xi32, #blocked2>
2026-02-21T09:46:21.2061664Z     %44 = arith.addi %18, %43 : tensor<256x4xi32, #blocked2>
2026-02-21T09:46:21.2061881Z     %45 = tt.addptr %19, %44 : tensor<256x4x!tt.ptr<bf16>, #blocked2>, tensor<256x4xi32, #blocked2>
2026-02-21T09:46:21.2062086Z     %46 = tt.load %45 : tensor<256x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:46:21.2062372Z     %47 = ttg.memdesc_index %41[%c0_i32] : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>
2026-02-21T09:46:21.2062766Z     ttg.local_store %46, %47 : tensor<256x4xbf16, #blocked2> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>
2026-02-21T09:46:21.2063198Z     %48:3 = scf.for %arg3 = %c0_i32 to %c510_i32 step %c2_i32 iter_args(%arg4 = %cst_9, %arg5 = %c0_i32, %arg6 = %47) -> (tensor<256x64xf32, #mma>, i32, !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>)  : i32 {
2026-02-21T09:46:21.2063529Z       %103 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:46:21.2063651Z       %104 = arith.muli %103, %c2_i32 : i32
2026-02-21T09:46:21.2063824Z       %105 = tt.splat %104 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:46:21.2064049Z       %106 = arith.addi %105, %15 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:46:21.2064329Z       %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:46:21.2064611Z       %108 = tt.broadcast %107 : tensor<1x4xi32, #blocked2> -> tensor<256x4xi32, #blocked2>
2026-02-21T09:46:21.2064807Z       %109 = arith.addi %18, %108 : tensor<256x4xi32, #blocked2>
2026-02-21T09:46:21.2065012Z       %110 = tt.addptr %19, %109 : tensor<256x4x!tt.ptr<bf16>, #blocked2>, tensor<256x4xi32, #blocked2>
2026-02-21T09:46:21.2065218Z       %111 = tt.load %110 : tensor<256x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:46:21.2065521Z       %112 = ttg.local_load %arg6 : !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:46:21.2065977Z       %113 = arith.extf %112 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:46:21.2066271Z       %114 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:46:21.2066440Z       %115 = tt.splat %114 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:21.2066659Z       %116 = arith.addi %115, %23 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:21.2066933Z       %117 = tt.expand_dims %116 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:46:21.2067177Z       %118 = arith.muli %117, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:46:21.2067359Z       %119 = tt.broadcast %118 : tensor<2x1xi64, #blocked> -> tensor<2x64xi64, #blocked>
2026-02-21T09:46:21.2067548Z       %120 = arith.addi %119, %29 : tensor<2x64xi64, #blocked>
2026-02-21T09:46:21.2067740Z       %121 = tt.addptr %21, %120 : tensor<2x64x!tt.ptr<i8>, #blocked>, tensor<2x64xi64, #blocked>
2026-02-21T09:46:21.2067945Z       %122 = arith.cmpi sge, %117, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:46:21.2068113Z       %123 = arith.cmpi slt, %117, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:46:21.2068274Z       %124 = arith.andi %122, %123 : tensor<2x1xi1, #blocked>
2026-02-21T09:46:21.2068457Z       %125 = tt.broadcast %124 : tensor<2x1xi1, #blocked> -> tensor<2x64xi1, #blocked>
2026-02-21T09:46:21.2068641Z       %126 = arith.andi %125, %33 : tensor<2x64xi1, #blocked>
2026-02-21T09:46:21.2068807Z       %127 = tt.load %121, %126, %cst_14 : tensor<2x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:46:21.2069063Z       %128 = ttg.convert_layout %127 : tensor<2x64xi8, #blocked> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:21.2069345Z       %129 = arith.shli %128, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:21.2069610Z       %130 = arith.shrsi %129, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:21.2069868Z       %131 = arith.shrsi %128, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:21.2070157Z       %132 = tt.expand_dims %130 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1>
2026-02-21T09:46:21.2070507Z       %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1>
2026-02-21T09:46:21.2070793Z       %134 = tt.broadcast %132 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1>
2026-02-21T09:46:21.2071037Z       %135 = arith.select %38, %134, %cst_15 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1>
2026-02-21T09:46:21.2071280Z       %136 = tt.broadcast %133 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1>
2026-02-21T09:46:21.2071515Z       %137 = arith.select %40, %136, %135 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1>
2026-02-21T09:46:21.2071746Z       %138 = tt.reshape %137 : tensor<2x2x64xi8, #blocked1> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:46:21.2071966Z       %139 = arith.sitofp %138 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:46:21.2072256Z       %140 = ttg.convert_layout %139 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:46:21.2072729Z       %141 = tt.dot %113, %140, %arg4, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x64xf32, #mma>
2026-02-21T09:46:21.2073082Z       %142 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T09:46:21.2073207Z       %143 = arith.cmpi slt, %142, %c1_i32 : i32
2026-02-21T09:46:21.2073337Z       %144 = arith.select %143, %142, %c0_i32 : i32
2026-02-21T09:46:21.2073621Z       %145 = ttg.memdesc_index %41[%144] : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>
2026-02-21T09:46:21.2073981Z       ttg.local_store %111, %145 : tensor<256x4xbf16, #blocked2> -> !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>
2026-02-21T09:46:21.2074314Z       scf.yield %141, %144, %145 : tensor<256x64xf32, #mma>, i32, !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4>
2026-02-21T09:46:21.2074563Z     } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:46:21.2074874Z     %49 = ttg.local_load %48#2 : !ttg.memdesc<256x4xbf16, #shared, #smem, mutable, 1x256x4> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:46:21.2075296Z     %50 = arith.extf %49 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:46:21.2075620Z     %51 = arith.addi %23, %cst_10 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:46:21.2075896Z     %52 = tt.expand_dims %51 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:46:21.2076133Z     %53 = arith.muli %52, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:46:21.2076317Z     %54 = tt.broadcast %53 : tensor<2x1xi64, #blocked> -> tensor<2x64xi64, #blocked>
2026-02-21T09:46:21.2076499Z     %55 = arith.addi %54, %29 : tensor<2x64xi64, #blocked>
2026-02-21T09:46:21.2076687Z     %56 = tt.addptr %21, %55 : tensor<2x64x!tt.ptr<i8>, #blocked>, tensor<2x64xi64, #blocked>
2026-02-21T09:46:21.2076884Z     %57 = arith.cmpi sge, %52, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:46:21.2077048Z     %58 = arith.cmpi slt, %52, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:46:21.2077205Z     %59 = arith.andi %57, %58 : tensor<2x1xi1, #blocked>
2026-02-21T09:46:21.2077376Z     %60 = tt.broadcast %59 : tensor<2x1xi1, #blocked> -> tensor<2x64xi1, #blocked>
2026-02-21T09:46:21.2077557Z     %61 = arith.andi %60, %33 : tensor<2x64xi1, #blocked>
2026-02-21T09:46:21.2077735Z     %62 = tt.load %56, %61, %cst_14 : tensor<2x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:46:21.2077983Z     %63 = ttg.convert_layout %62 : tensor<2x64xi8, #blocked> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:21.2078258Z     %64 = arith.shli %63, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:21.2078502Z     %65 = arith.shrsi %64, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:21.2078734Z     %66 = arith.shrsi %63, %cst_16 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:46:21.2079015Z     %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1>
2026-02-21T09:46:21.2079343Z     %68 = tt.expand_dims %66 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x64xi8, #blocked1>
2026-02-21T09:46:21.2079621Z     %69 = tt.broadcast %67 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1>
2026-02-21T09:46:21.2079852Z     %70 = arith.select %38, %69, %cst_15 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1>
2026-02-21T09:46:21.2080086Z     %71 = tt.broadcast %68 : tensor<2x1x64xi8, #blocked1> -> tensor<2x2x64xi8, #blocked1>
2026-02-21T09:46:21.2080314Z     %72 = arith.select %40, %71, %70 : tensor<2x2x64xi1, #blocked1>, tensor<2x2x64xi8, #blocked1>
2026-02-21T09:46:21.2080535Z     %73 = tt.reshape %72 : tensor<2x2x64xi8, #blocked1> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:46:21.2080748Z     %74 = arith.sitofp %73 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:46:21.2081033Z     %75 = ttg.convert_layout %74 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:46:21.2081504Z     %76 = tt.dot %50, %75, %48#0, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x64xf32, #mma>
2026-02-21T09:46:21.2081885Z     ttg.local_dealloc %41 : !ttg.memdesc<1x256x4xbf16, #shared, #smem, mutable>
2026-02-21T09:46:21.2082107Z     %77 = arith.truncf %76 : tensor<256x64xf32, #mma> to tensor<256x64xbf16, #mma>
2026-02-21T09:46:21.2082270Z     %78 = arith.extsi %10 : i32 to i64
2026-02-21T09:46:21.2082423Z     %79 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<256x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:46:21.2082671Z     %80 = tt.splat %78 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:46:21.2082944Z     %81 = arith.extsi %12 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:46:21.2083210Z     %82 = arith.addi %80, %81 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:46:21.2083466Z     %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma>
2026-02-21T09:46:21.2083700Z     %84 = arith.muli %83, %cst_1 : tensor<256x1xi64, #mma>
2026-02-21T09:46:21.2083872Z     %85 = tt.broadcast %84 : tensor<256x1xi64, #mma> -> tensor<256x64xi64, #mma>
2026-02-21T09:46:21.2084072Z     %86 = tt.splat %20 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:46:21.2084306Z     %87 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:46:21.2084603Z     %88 = arith.extsi %87 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:46:21.2084866Z     %89 = arith.addi %86, %88 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:46:21.2085116Z     %90 = tt.expand_dims %89 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi64, #mma>
2026-02-21T09:46:21.2085365Z     %91 = tt.broadcast %90 : tensor<1x64xi64, #mma> -> tensor<256x64xi64, #mma>
2026-02-21T09:46:21.2085558Z     %92 = arith.addi %85, %91 : tensor<256x64xi64, #mma>
2026-02-21T09:46:21.2085739Z     %93 = tt.addptr %79, %92 : tensor<256x64x!tt.ptr<bf16>, #mma>, tensor<256x64xi64, #mma>
2026-02-21T09:46:21.2085935Z     %94 = arith.cmpi sge, %83, %cst_0 : tensor<256x1xi64, #mma>
2026-02-21T09:46:21.2086095Z     %95 = arith.cmpi slt, %83, %cst : tensor<256x1xi64, #mma>
2026-02-21T09:46:21.2086257Z     %96 = arith.andi %94, %95 : tensor<256x1xi1, #mma>
2026-02-21T09:46:21.2086422Z     %97 = tt.broadcast %96 : tensor<256x1xi1, #mma> -> tensor<256x64xi1, #mma>
2026-02-21T09:46:21.2086601Z     %98 = arith.cmpi sge, %90, %cst_3 : tensor<1x64xi64, #mma>
2026-02-21T09:46:21.2086758Z     %99 = arith.cmpi slt, %90, %cst_2 : tensor<1x64xi64, #mma>
2026-02-21T09:46:21.2086908Z     %100 = arith.andi %98, %99 : tensor<1x64xi1, #mma>
2026-02-21T09:46:21.2087074Z     %101 = tt.broadcast %100 : tensor<1x64xi1, #mma> -> tensor<256x64xi1, #mma>
2026-02-21T09:46:21.2087253Z     %102 = arith.andi %97, %101 : tensor<256x64xi1, #mma>
2026-02-21T09:46:21.2087410Z     tt.store %93, %77, %102 : tensor<256x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:46:21.2087548Z     tt.return
2026-02-21T09:46:21.2087629Z   }
2026-02-21T09:46:21.2087704Z }
2026-02-21T09:46:21.2087746Z 
2026-02-21T09:46:21.2087780Z {-#
2026-02-21T09:46:21.2087860Z   external_resources: {
2026-02-21T09:46:21.2087963Z     mlir_reproducer: {
2026-02-21T09:46:21.2088950Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:46:21.2089982Z       disable_threading: false,
2026-02-21T09:46:21.2090089Z       verify_each: true
2026-02-21T09:46:21.2090178Z     }
2026-02-21T09:46:21.2090270Z   }
2026-02-21T09:46:21.2090341Z #-}
2026-02-21T09:46:21.2090620Z /tmp/torchinductor_root/hi/chidcxn4ljnvy3bnthwz6udzxevspp27es5to4zt73hthpghpum5.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:46:21.2091305Z /tmp/torchinductor_root/hi/chidcxn4ljnvy3bnthwz6udzxevspp27es5to4zt73hthpghpum5.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:46:21.2091851Z [112s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:46:21.2092575Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 256, 64], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:46:21.2093237Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:46:21.2093409Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:47:02.0608467Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:47:02.0611767Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:47:02.0612130Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:47:02.0612735Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:47:02.0613031Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:47:02.0613328Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:47:02.0613661Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:47:02.0613840Z #smem = #ttg.shared_memory
2026-02-21T09:47:02.0614073Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:47:02.0614540Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:47:02.0614943Z     %cst = arith.constant dense<8192> : tensor<16x1xi32, #mma>
2026-02-21T09:47:02.0615123Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:02.0615298Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:02.0615478Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x256xf32, #mma>
2026-02-21T09:47:02.0615664Z     %cst_3 = arith.constant dense<0> : tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0615845Z     %cst_4 = arith.constant dense<8192> : tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0616023Z     %cst_5 = arith.constant dense<1024> : tensor<16x1xi32, #blocked2>
2026-02-21T09:47:02.0616175Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:47:02.0616296Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:47:02.0616413Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:47:02.0616529Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:47:02.0616649Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T09:47:02.0616828Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:47:02.0616970Z     %cst_6 = arith.constant dense<0> : tensor<1x256xi8, #blocked1>
2026-02-21T09:47:02.0617113Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:47:02.0617311Z     %c512_i64 = arith.constant 512 : i64
2026-02-21T09:47:02.0617429Z     %c8192_i64 = arith.constant 8192 : i64
2026-02-21T09:47:02.0617576Z     %cst_7 = arith.constant dense<0> : tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0617718Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:47:02.0617829Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:47:02.0617940Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:47:02.0618053Z     %c1216_i32 = arith.constant 1216 : i32
2026-02-21T09:47:02.0618238Z     %cst_8 = arith.constant dense<4> : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0618432Z     %0 = tt.get_program_id x : i32
2026-02-21T09:47:02.0618628Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:47:02.0618901Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:02.0619173Z     %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:02.0619446Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:02.0619706Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:47:02.0619971Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x2x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:47:02.0620172Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:47:02.0620454Z     %8 = arith.extsi %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:02.0620835Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:47:02.0621246Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:47:02.0621666Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:02.0621919Z     %12 = arith.cmpi eq, %11, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:02.0622112Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:47:02.0622309Z     %14 = arith.cmpi eq, %11, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:02.0622495Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x256xi1, #blocked>
2026-02-21T09:47:02.0622707Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:02.0622894Z     scf.for %arg3 = %0 to %c32768_i32 step %c1216_i32  : i32 {
2026-02-21T09:47:02.0623045Z       %17 = arith.remsi %arg3, %c1024_i32 : i32
2026-02-21T09:47:02.0623174Z       %18 = arith.divsi %arg3, %c1024_i32 : i32
2026-02-21T09:47:02.0623294Z       %19 = arith.muli %17, %c16_i32 : i32
2026-02-21T09:47:02.0623463Z       %20 = tt.splat %19 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:47:02.0623675Z       %21 = tt.splat %19 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:02.0623883Z       %22 = arith.addi %20, %1 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:47:02.0624095Z       %23 = arith.addi %21, %2 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:02.0624254Z       %24 = arith.muli %18, %c256_i32 : i32
2026-02-21T09:47:02.0624436Z       %25 = tt.splat %24 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:02.0624638Z       %26 = arith.addi %25, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:02.0624904Z       %27 = tt.expand_dims %22 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T09:47:02.0625172Z       %28 = arith.muli %27, %cst_5 : tensor<16x1xi32, #blocked2>
2026-02-21T09:47:02.0625362Z       %29 = tt.broadcast %28 : tensor<16x1xi32, #blocked2> -> tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0625534Z       %30 = arith.extsi %24 : i32 to i64
2026-02-21T09:47:02.0625697Z       %31 = tt.splat %30 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:02.0625917Z       %32 = arith.addi %31, %8 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:02.0626193Z       %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0626448Z       %34 = arith.cmpi sge, %33, %cst_3 : tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0626625Z       %35 = arith.cmpi slt, %33, %cst_4 : tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0626793Z       %36 = arith.andi %34, %35 : tensor<1x256xi1, #blocked1>
2026-02-21T09:47:02.0627025Z       %37 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<16x256xf32, #mma>)  : i32 {
2026-02-21T09:47:02.0627246Z         %46 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:47:02.0627413Z         %47 = tt.splat %46 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:47:02.0627630Z         %48 = arith.addi %47, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:47:02.0627899Z         %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x2xi32, #blocked2>
2026-02-21T09:47:02.0628174Z         %50 = tt.broadcast %49 : tensor<1x2xi32, #blocked2> -> tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0628413Z         %51 = arith.addi %29, %50 : tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0628608Z         %52 = tt.addptr %6, %51 : tensor<16x2x!tt.ptr<bf16>, #blocked2>, tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0628816Z         %53 = tt.load %52 : tensor<16x2x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:47:02.0629100Z         %54 = ttg.convert_layout %53 : tensor<16x2xbf16, #blocked2> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0629496Z         %55 = arith.extf %54 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0629778Z         %56 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:47:02.0629900Z         %57 = arith.muli %56, %c8192_i64 : i64
2026-02-21T09:47:02.0630039Z         %58 = tt.splat %57 : i64 -> tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0630192Z         %59 = arith.addi %58, %33 : tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0630394Z         %60 = tt.addptr %7, %59 : tensor<1x256x!tt.ptr<i8>, #blocked1>, tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0630599Z         %61 = tt.load %60, %36, %cst_6 : tensor<1x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:47:02.0630858Z         %62 = ttg.convert_layout %61 : tensor<1x256xi8, #blocked1> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0631138Z         %63 = arith.shli %62, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0631367Z         %64 = arith.shrsi %63, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0631598Z         %65 = arith.shrsi %62, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0631880Z         %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:47:02.0632235Z         %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:47:02.0632519Z         %68 = tt.broadcast %66 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0632777Z         %69 = arith.select %13, %68, %cst_7 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0633014Z         %70 = tt.broadcast %67 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0633241Z         %71 = arith.select %15, %70, %69 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0633466Z         %72 = tt.reshape %71 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3>
2026-02-21T09:47:02.0633689Z         %73 = arith.sitofp %72 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3>
2026-02-21T09:47:02.0633935Z         %74 = ttg.local_alloc %73 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:47:02.0634262Z         %75 = ttg.local_load %74 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0634737Z         %76 = tt.dot %55, %75, %arg5, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:47:02.0635083Z         %77 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:47:02.0635206Z         %78 = arith.muli %77, %c2_i32 : i32
2026-02-21T09:47:02.0635371Z         %79 = tt.splat %78 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:47:02.0635589Z         %80 = arith.addi %79, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:47:02.0635863Z         %81 = tt.expand_dims %80 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x2xi32, #blocked2>
2026-02-21T09:47:02.0636152Z         %82 = tt.broadcast %81 : tensor<1x2xi32, #blocked2> -> tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0636342Z         %83 = arith.addi %29, %82 : tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0636536Z         %84 = tt.addptr %6, %83 : tensor<16x2x!tt.ptr<bf16>, #blocked2>, tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0636738Z         %85 = tt.load %84 : tensor<16x2x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:47:02.0637013Z         %86 = ttg.convert_layout %85 : tensor<16x2xbf16, #blocked2> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0637406Z         %87 = arith.extf %86 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0637681Z         %88 = arith.extsi %77 : i32 to i64
2026-02-21T09:47:02.0637800Z         %89 = arith.muli %88, %c8192_i64 : i64
2026-02-21T09:47:02.0637938Z         %90 = tt.splat %89 : i64 -> tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0638093Z         %91 = arith.addi %90, %33 : tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0638286Z         %92 = tt.addptr %7, %91 : tensor<1x256x!tt.ptr<i8>, #blocked1>, tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0638477Z         %93 = arith.cmpi slt, %88, %c512_i64 : i64
2026-02-21T09:47:02.0638622Z         %94 = tt.splat %93 : i1 -> tensor<1x256xi1, #blocked1>
2026-02-21T09:47:02.0638778Z         %95 = arith.andi %94, %36 : tensor<1x256xi1, #blocked1>
2026-02-21T09:47:02.0638943Z         %96 = tt.load %92, %95, %cst_6 : tensor<1x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:47:02.0639201Z         %97 = ttg.convert_layout %96 : tensor<1x256xi8, #blocked1> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0639486Z         %98 = arith.shli %97, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0639719Z         %99 = arith.shrsi %98, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0639986Z         %100 = arith.shrsi %97, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0640277Z         %101 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:47:02.0640658Z         %102 = tt.expand_dims %100 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:47:02.0640954Z         %103 = tt.broadcast %101 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0641202Z         %104 = arith.select %13, %103, %cst_7 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0641448Z         %105 = tt.broadcast %102 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0641692Z         %106 = arith.select %15, %105, %104 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0641929Z         %107 = tt.reshape %106 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3>
2026-02-21T09:47:02.0642160Z         %108 = arith.sitofp %107 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3>
2026-02-21T09:47:02.0642418Z         %109 = ttg.local_alloc %108 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:47:02.0642842Z         %110 = ttg.local_load %109 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0643315Z         %111 = tt.dot %87, %110, %76, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:47:02.0643659Z         %112 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:47:02.0643790Z         %113 = arith.muli %112, %c2_i32 : i32
2026-02-21T09:47:02.0643971Z         %114 = tt.splat %113 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:47:02.0644221Z         %115 = arith.addi %114, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:47:02.0644502Z         %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x2xi32, #blocked2>
2026-02-21T09:47:02.0644801Z         %117 = tt.broadcast %116 : tensor<1x2xi32, #blocked2> -> tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0645002Z         %118 = arith.addi %29, %117 : tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0645208Z         %119 = tt.addptr %6, %118 : tensor<16x2x!tt.ptr<bf16>, #blocked2>, tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0645415Z         %120 = tt.load %119 : tensor<16x2x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:47:02.0645692Z         %121 = ttg.convert_layout %120 : tensor<16x2xbf16, #blocked2> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0646098Z         %122 = arith.extf %121 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0646385Z         %123 = arith.extsi %112 : i32 to i64
2026-02-21T09:47:02.0646517Z         %124 = arith.muli %123, %c8192_i64 : i64
2026-02-21T09:47:02.0646664Z         %125 = tt.splat %124 : i64 -> tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0646834Z         %126 = arith.addi %125, %33 : tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0647037Z         %127 = tt.addptr %7, %126 : tensor<1x256x!tt.ptr<i8>, #blocked1>, tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0647239Z         %128 = arith.cmpi slt, %123, %c512_i64 : i64
2026-02-21T09:47:02.0647387Z         %129 = tt.splat %128 : i1 -> tensor<1x256xi1, #blocked1>
2026-02-21T09:47:02.0647552Z         %130 = arith.andi %129, %36 : tensor<1x256xi1, #blocked1>
2026-02-21T09:47:02.0647728Z         %131 = tt.load %127, %130, %cst_6 : tensor<1x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:47:02.0648014Z         %132 = ttg.convert_layout %131 : tensor<1x256xi8, #blocked1> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0648304Z         %133 = arith.shli %132, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0648563Z         %134 = arith.shrsi %133, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0648806Z         %135 = arith.shrsi %132, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0649103Z         %136 = tt.expand_dims %134 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:47:02.0649442Z         %137 = tt.expand_dims %135 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:47:02.0649733Z         %138 = tt.broadcast %136 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0649980Z         %139 = arith.select %13, %138, %cst_7 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0650225Z         %140 = tt.broadcast %137 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0650468Z         %141 = arith.select %15, %140, %139 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0650701Z         %142 = tt.reshape %141 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3>
2026-02-21T09:47:02.0650933Z         %143 = arith.sitofp %142 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3>
2026-02-21T09:47:02.0651186Z         %144 = ttg.local_alloc %143 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:47:02.0651515Z         %145 = ttg.local_load %144 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0652007Z         %146 = tt.dot %122, %145, %111, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:47:02.0652354Z         %147 = arith.addi %arg4, %c3_i32 : i32
2026-02-21T09:47:02.0652487Z         %148 = arith.muli %147, %c2_i32 : i32
2026-02-21T09:47:02.0652665Z         %149 = tt.splat %148 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:47:02.0652905Z         %150 = arith.addi %149, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:47:02.0653188Z         %151 = tt.expand_dims %150 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x2xi32, #blocked2>
2026-02-21T09:47:02.0653465Z         %152 = tt.broadcast %151 : tensor<1x2xi32, #blocked2> -> tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0653663Z         %153 = arith.addi %29, %152 : tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0653867Z         %154 = tt.addptr %6, %153 : tensor<16x2x!tt.ptr<bf16>, #blocked2>, tensor<16x2xi32, #blocked2>
2026-02-21T09:47:02.0654080Z         %155 = tt.load %154 : tensor<16x2x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:47:02.0654351Z         %156 = ttg.convert_layout %155 : tensor<16x2xbf16, #blocked2> -> tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0654758Z         %157 = arith.extf %156 : tensor<16x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0655048Z         %158 = arith.extsi %147 : i32 to i64
2026-02-21T09:47:02.0655173Z         %159 = arith.muli %158, %c8192_i64 : i64
2026-02-21T09:47:02.0655322Z         %160 = tt.splat %159 : i64 -> tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0655488Z         %161 = arith.addi %160, %33 : tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0655691Z         %162 = tt.addptr %7, %161 : tensor<1x256x!tt.ptr<i8>, #blocked1>, tensor<1x256xi64, #blocked1>
2026-02-21T09:47:02.0655909Z         %163 = arith.cmpi slt, %158, %c512_i64 : i64
2026-02-21T09:47:02.0656057Z         %164 = tt.splat %163 : i1 -> tensor<1x256xi1, #blocked1>
2026-02-21T09:47:02.0656220Z         %165 = arith.andi %164, %36 : tensor<1x256xi1, #blocked1>
2026-02-21T09:47:02.0656413Z         %166 = tt.load %162, %165, %cst_6 : tensor<1x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:47:02.0656677Z         %167 = ttg.convert_layout %166 : tensor<1x256xi8, #blocked1> -> tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0656968Z         %168 = arith.shli %167, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0657208Z         %169 = arith.shrsi %168, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0657451Z         %170 = arith.shrsi %167, %cst_8 : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:02.0657750Z         %171 = tt.expand_dims %169 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:47:02.0658092Z         %172 = tt.expand_dims %170 {axis = 1 : i32} : tensor<1x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x256xi8, #blocked>
2026-02-21T09:47:02.0658389Z         %173 = tt.broadcast %171 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0658633Z         %174 = arith.select %13, %173, %cst_7 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0658881Z         %175 = tt.broadcast %172 : tensor<1x1x256xi8, #blocked> -> tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0659127Z         %176 = arith.select %15, %175, %174 : tensor<1x2x256xi1, #blocked>, tensor<1x2x256xi8, #blocked>
2026-02-21T09:47:02.0659368Z         %177 = tt.reshape %176 : tensor<1x2x256xi8, #blocked> -> tensor<2x256xi8, #blocked3>
2026-02-21T09:47:02.0659602Z         %178 = arith.sitofp %177 : tensor<2x256xi8, #blocked3> to tensor<2x256xf32, #blocked3>
2026-02-21T09:47:02.0659881Z         %179 = ttg.local_alloc %178 : (tensor<2x256xf32, #blocked3>) -> !ttg.memdesc<2x256xf32, #shared, #smem>
2026-02-21T09:47:02.0660212Z         %180 = ttg.local_load %179 : !ttg.memdesc<2x256xf32, #shared, #smem> -> tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:02.0660684Z         %181 = tt.dot %157, %180, %146, inputPrecision = tf32 : tensor<16x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x256xf32, #mma>
2026-02-21T09:47:02.0661046Z         scf.yield %181 : tensor<16x256xf32, #mma>
2026-02-21T09:47:02.0661176Z       }
2026-02-21T09:47:02.0661307Z       %38 = arith.truncf %37 : tensor<16x256xf32, #mma> to tensor<16x256xbf16, #mma>
2026-02-21T09:47:02.0661577Z       %39 = tt.expand_dims %23 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:47:02.0661813Z       %40 = arith.muli %39, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:47:02.0662047Z       %41 = tt.expand_dims %26 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:47:02.0662308Z       %42 = tt.broadcast %40 : tensor<16x1xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:47:02.0662509Z       %43 = tt.broadcast %41 : tensor<1x256xi32, #mma> -> tensor<16x256xi32, #mma>
2026-02-21T09:47:02.0662690Z       %44 = arith.addi %42, %43 : tensor<16x256xi32, #mma>
2026-02-21T09:47:02.0662877Z       %45 = tt.addptr %16, %44 : tensor<16x256x!tt.ptr<bf16>, #mma>, tensor<16x256xi32, #mma>
2026-02-21T09:47:02.0663072Z       tt.store %45, %38 : tensor<16x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:02.0663282Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:47:02.0663465Z     tt.return
2026-02-21T09:47:02.0663552Z   }
2026-02-21T09:47:02.0663632Z }
2026-02-21T09:47:02.0663682Z 
2026-02-21T09:47:02.0663715Z {-#
2026-02-21T09:47:02.0663798Z   external_resources: {
2026-02-21T09:47:02.0663921Z     mlir_reproducer: {
2026-02-21T09:47:02.0664926Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:47:02.0665942Z       disable_threading: false,
2026-02-21T09:47:02.0666055Z       verify_each: true
2026-02-21T09:47:02.0666151Z     }
2026-02-21T09:47:02.0666227Z   }
2026-02-21T09:47:02.0666303Z #-}
2026-02-21T09:47:02.0666590Z /tmp/torchinductor_root/zh/czhziawqzz4jltg55krmykgp4tqxs6x5wtip7pfr6hq44mrh2l3m.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:47:02.0667271Z /tmp/torchinductor_root/zh/czhziawqzz4jltg55krmykgp4tqxs6x5wtip7pfr6hq44mrh2l3m.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:47:02.0667829Z [152s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:47:02.0668626Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 16, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, None], range_num_stages=[2, 0], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:47:02.0669352Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:47:02.0669525Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:47:08.6420627Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:47:08.6434031Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:47:08.6434871Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:47:08.6435526Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:47:08.6436138Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:47:08.6436695Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 32, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:47:08.6437178Z #smem = #ttg.shared_memory
2026-02-21T09:47:08.6437819Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:47:08.6438843Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:47:08.6439681Z     %cst = arith.constant dense<0.000000e+00> : tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6440017Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:47:08.6440251Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:47:08.6440481Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:47:08.6440711Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:47:08.6441118Z     %cst_0 = arith.constant dense<0> : tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6441418Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:47:08.6441647Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T09:47:08.6441870Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:47:08.6442228Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T09:47:08.6442473Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:47:08.6442810Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:47:08.6443067Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:47:08.6443299Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:47:08.6443720Z     %cst_1 = arith.constant dense<4153344> : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6444270Z     %cst_2 = arith.constant dense<4128768> : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6444657Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:47:08.6444878Z     %c7_i32 = arith.constant 7 : i32
2026-02-21T09:47:08.6445237Z     %cst_3 = arith.constant dense<10> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6445743Z     %cst_4 = arith.constant dense<8> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6446237Z     %cst_5 = arith.constant dense<6> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6446720Z     %cst_6 = arith.constant dense<4> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6447205Z     %cst_7 = arith.constant dense<2> : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6447568Z     %c504_i32 = arith.constant 504 : i32
2026-02-21T09:47:08.6447797Z     %c5_i32 = arith.constant 5 : i32
2026-02-21T09:47:08.6448015Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:47:08.6448303Z     %cst_8 = arith.constant dense<1024> : tensor<256x1xi32, #blocked1>
2026-02-21T09:47:08.6448737Z     %cst_9 = arith.constant dense<4> : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6449287Z     %cst_10 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:08.6449537Z     %cst_11 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:08.6449783Z     %cst_12 = arith.constant dense<8192> : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6450024Z     %cst_13 = arith.constant dense<0> : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6450292Z     %cst_14 = arith.constant dense<16384> : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6450528Z     %cst_15 = arith.constant dense<0> : tensor<1x32xi64, #mma>
2026-02-21T09:47:08.6450767Z     %cst_16 = arith.constant dense<8192> : tensor<1x32xi64, #mma>
2026-02-21T09:47:08.6450972Z     %0 = tt.get_program_id x : i32
2026-02-21T09:47:08.6451133Z     %1 = arith.muli %0, %c4_i32 : i32
2026-02-21T09:47:08.6451291Z     %2 = arith.addi %1, %c4_i32 : i32
2026-02-21T09:47:08.6451454Z     %3 = arith.minsi %2, %c16384_i32 : i32
2026-02-21T09:47:08.6451779Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:08.6452179Z     %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:08.6452623Z     %6 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:08.6453062Z     %7 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:08.6453434Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6453781Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6454122Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6454582Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:47:08.6455175Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:47:08.6455774Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:08.6456136Z     %14 = arith.cmpi eq, %13, %cst_10 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:08.6456419Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked>
2026-02-21T09:47:08.6456694Z     %16 = arith.cmpi eq, %13, %cst_11 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:08.6456962Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x32xi1, #blocked>
2026-02-21T09:47:08.6457260Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<256x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:08.6457648Z     %19 = arith.extsi %5 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:08.6458123Z     %20 = arith.extsi %7 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:08.6458445Z     %21 = arith.subi %3, %1 : i32
2026-02-21T09:47:08.6458621Z     %22 = arith.remsi %21, %c2_i32 : i32
2026-02-21T09:47:08.6458785Z     %23 = arith.subi %21, %22 : i32
2026-02-21T09:47:08.6458941Z     %24 = arith.addi %1, %23 : i32
2026-02-21T09:47:08.6459119Z     scf.for %arg3 = %1 to %24 step %c2_i32  : i32 {
2026-02-21T09:47:08.6459283Z       %25 = arith.divsi %arg3, %c16384_i32 : i32
2026-02-21T09:47:08.6459425Z       %26 = arith.muli %25, %c64_i32 : i32
2026-02-21T09:47:08.6459553Z       %27 = arith.subi %c64_i32, %26 : i32
2026-02-21T09:47:08.6459680Z       %28 = arith.minsi %27, %c64_i32 : i32
2026-02-21T09:47:08.6459813Z       %29 = arith.remsi %arg3, %c16384_i32 : i32
2026-02-21T09:47:08.6459950Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:47:08.6460102Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:47:08.6460226Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:47:08.6460354Z       %33 = arith.muli %31, %c256_i32 : i32
2026-02-21T09:47:08.6460539Z       %34 = tt.splat %33 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:08.6460806Z       %35 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:08.6460996Z       %36 = arith.muli %32, %c32_i32 : i32
2026-02-21T09:47:08.6461221Z       %37 = tt.splat %36 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:08.6461549Z       %38 = arith.addi %37, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:08.6461899Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1>
2026-02-21T09:47:08.6462182Z       %40 = arith.muli %39, %cst_8 : tensor<256x1xi32, #blocked1>
2026-02-21T09:47:08.6462398Z       %41 = tt.broadcast %40 : tensor<256x1xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6462791Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6463179Z       %43 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6463417Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6463648Z       %45 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6463948Z       %46 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6464268Z       %47 = tt.broadcast %46 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6464481Z       %48 = arith.addi %41, %47 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6464717Z       %49 = tt.addptr %9, %48 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6464942Z       %50 = tt.load %49 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6465153Z       %51 = arith.addi %8, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6465457Z       %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6465759Z       %53 = tt.broadcast %52 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6465966Z       %54 = arith.addi %41, %53 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6466184Z       %55 = tt.addptr %9, %54 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6466407Z       %56 = tt.load %55 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6466615Z       %57 = arith.addi %8, %cst_6 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6466925Z       %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6467225Z       %59 = tt.broadcast %58 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6467433Z       %60 = arith.addi %41, %59 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6467644Z       %61 = tt.addptr %9, %60 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6467865Z       %62 = tt.load %61 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6468178Z       %63 = ttg.memdesc_index %43[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6468598Z       ttg.local_store %50, %63 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6468996Z       %64 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6469394Z       ttg.local_store %56, %64 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6469800Z       %65 = ttg.memdesc_index %45[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6470154Z       ttg.local_store %62, %65 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6470421Z       %66 = arith.addi %8, %cst_5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6470699Z       %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6470967Z       %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6471155Z       %69 = arith.addi %41, %68 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6471348Z       %70 = tt.addptr %9, %69 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6471545Z       %71 = tt.load %70 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6471729Z       %72 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6471997Z       %73 = tt.expand_dims %72 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6472264Z       %74 = tt.broadcast %73 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6472449Z       %75 = arith.addi %41, %74 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6472655Z       %76 = tt.addptr %9, %75 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6472853Z       %77 = tt.load %76 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6473067Z       %78 = arith.addi %8, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6473340Z       %79 = tt.expand_dims %78 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6473606Z       %80 = tt.broadcast %79 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6473787Z       %81 = arith.addi %41, %80 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6473979Z       %82 = tt.addptr %9, %81 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6474174Z       %83 = tt.load %82 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6474455Z       %84 = ttg.memdesc_index %43[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6474807Z       ttg.local_store %71, %84 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6475162Z       %85 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6475515Z       ttg.local_store %77, %85 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6475863Z       %86 = ttg.memdesc_index %45[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6476213Z       ttg.local_store %83, %86 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6477208Z       %87:12 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c3_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %63, %arg8 = %84, %arg9 = %64, %arg10 = %85, %arg11 = %c1_i32, %arg12 = %c4_i32, %arg13 = %65, %arg14 = %86, %arg15 = %c2_i32, %arg16 = %c5_i32) -> (tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32)  : i32 {
2026-02-21T09:47:08.6478104Z         %415 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:47:08.6478229Z         %416 = arith.muli %415, %c2_i32 : i32
2026-02-21T09:47:08.6478401Z         %417 = tt.splat %416 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6478627Z         %418 = arith.addi %417, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6478908Z         %419 = tt.expand_dims %418 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6479186Z         %420 = tt.broadcast %419 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6479380Z         %421 = arith.addi %41, %420 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6479584Z         %422 = tt.addptr %9, %421 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6479793Z         %423 = tt.load %422 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6480097Z         %424 = ttg.local_load %arg7 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6480539Z         %425 = arith.extf %424 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6480842Z         %426 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:47:08.6481023Z         %427 = tt.splat %426 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6481268Z         %428 = arith.addi %427, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6481581Z         %429 = tt.addptr %10, %428 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6481891Z         %430 = tt.load %429 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6482122Z         %431 = arith.shli %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6482357Z         %432 = arith.shrsi %431, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6482661Z         %433 = arith.shrsi %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6482951Z         %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6483291Z         %435 = tt.expand_dims %433 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6483577Z         %436 = tt.broadcast %434 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6483814Z         %437 = arith.select %15, %436, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6484053Z         %438 = tt.broadcast %435 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6484283Z         %439 = arith.select %17, %438, %437 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6484515Z         %440 = tt.reshape %439 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6484760Z         %441 = arith.sitofp %440 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6485050Z         %442 = ttg.convert_layout %441 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6485524Z         %443 = tt.dot %425, %442, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6485891Z         %444 = arith.addi %arg4, %c7_i32 : i32
2026-02-21T09:47:08.6486043Z         %445 = arith.muli %444, %c2_i32 : i32
2026-02-21T09:47:08.6486242Z         %446 = tt.splat %445 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6486464Z         %447 = arith.addi %446, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6486743Z         %448 = tt.expand_dims %447 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6487023Z         %449 = tt.broadcast %448 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6487218Z         %450 = arith.addi %41, %449 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6487421Z         %451 = tt.addptr %9, %450 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6487625Z         %452 = tt.load %451 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6487930Z         %453 = ttg.local_load %arg9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6488364Z         %454 = arith.extf %453 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6488651Z         %455 = arith.muli %arg11, %c8192_i32 : i32
2026-02-21T09:47:08.6488851Z         %456 = tt.splat %455 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6489075Z         %457 = arith.addi %456, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6489403Z         %458 = tt.addptr %10, %457 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6489715Z         %459 = tt.load %458 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6489946Z         %460 = arith.shli %459, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6490182Z         %461 = arith.shrsi %460, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6490416Z         %462 = arith.shrsi %459, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6490707Z         %463 = tt.expand_dims %461 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6491042Z         %464 = tt.expand_dims %462 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6491322Z         %465 = tt.broadcast %463 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6491591Z         %466 = arith.select %15, %465, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6491823Z         %467 = tt.broadcast %464 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6492054Z         %468 = arith.select %17, %467, %466 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6492286Z         %469 = tt.reshape %468 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6492509Z         %470 = arith.sitofp %469 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6492827Z         %471 = ttg.convert_layout %470 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6493287Z         %472 = tt.dot %454, %471, %443, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6493675Z         %473 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:47:08.6493802Z         %474 = arith.muli %473, %c2_i32 : i32
2026-02-21T09:47:08.6493972Z         %475 = tt.splat %474 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6494197Z         %476 = arith.addi %475, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6494474Z         %477 = tt.expand_dims %476 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6494754Z         %478 = tt.broadcast %477 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6494951Z         %479 = arith.addi %41, %478 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6495156Z         %480 = tt.addptr %9, %479 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6495364Z         %481 = tt.load %480 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6495669Z         %482 = ttg.local_load %arg13 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6496109Z         %483 = arith.extf %482 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6496397Z         %484 = arith.muli %arg15, %c8192_i32 : i32
2026-02-21T09:47:08.6496579Z         %485 = tt.splat %484 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6496827Z         %486 = arith.addi %485, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6497134Z         %487 = tt.addptr %10, %486 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6497461Z         %488 = tt.load %487 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6497695Z         %489 = arith.shli %488, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6497927Z         %490 = arith.shrsi %489, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6498164Z         %491 = arith.shrsi %488, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6505913Z         %492 = tt.expand_dims %490 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6506331Z         %493 = tt.expand_dims %491 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6506669Z         %494 = tt.broadcast %492 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6506906Z         %495 = arith.select %15, %494, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6507149Z         %496 = tt.broadcast %493 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6507382Z         %497 = arith.select %17, %496, %495 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6507611Z         %498 = tt.reshape %497 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6507836Z         %499 = arith.sitofp %498 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6508133Z         %500 = ttg.convert_layout %499 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6508657Z         %501 = tt.dot %483, %500, %472, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6509026Z         %502 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:47:08.6509154Z         %503 = arith.cmpi slt, %502, %c2_i32 : i32
2026-02-21T09:47:08.6509289Z         %504 = arith.select %503, %502, %c0_i32 : i32
2026-02-21T09:47:08.6509557Z         %505 = ttg.memdesc_index %43[%504] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6509919Z         ttg.local_store %423, %505 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6510284Z         %506 = ttg.memdesc_index %44[%504] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6510636Z         ttg.local_store %452, %506 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6510995Z         %507 = ttg.memdesc_index %45[%504] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6511351Z         ttg.local_store %481, %507 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6512157Z         scf.yield %501, %504, %arg8, %505, %arg10, %506, %arg12, %444, %arg14, %507, %arg16, %473 : tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32
2026-02-21T09:47:08.6512833Z       }
2026-02-21T09:47:08.6513075Z       %88 = ttg.local_load %87#2 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6513519Z       %89 = arith.extf %88 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6513855Z       %90 = arith.addi %42, %cst_2 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6514164Z       %91 = tt.addptr %10, %90 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6514462Z       %92 = tt.load %91 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6514689Z       %93 = arith.shli %92, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6514918Z       %94 = arith.shrsi %93, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6515152Z       %95 = arith.shrsi %92, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6515437Z       %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6515762Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6516035Z       %98 = tt.broadcast %96 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6516266Z       %99 = arith.select %15, %98, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6516500Z       %100 = tt.broadcast %97 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6521345Z       %101 = arith.select %17, %100, %99 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6521588Z       %102 = tt.reshape %101 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6521811Z       %103 = arith.sitofp %102 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6522131Z       %104 = ttg.convert_layout %103 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6522655Z       %105 = tt.dot %89, %104, %87#0, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6523160Z       %106 = ttg.local_load %87#4 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6523602Z       %107 = arith.extf %106 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6523894Z       %108 = arith.muli %87#6, %c8192_i32 : i32
2026-02-21T09:47:08.6524076Z       %109 = tt.splat %108 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6524304Z       %110 = arith.addi %109, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6524612Z       %111 = tt.addptr %10, %110 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6524918Z       %112 = tt.load %111 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6525151Z       %113 = arith.shli %112, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6525409Z       %114 = arith.shrsi %113, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6525650Z       %115 = arith.shrsi %112, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6525961Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6526295Z       %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6526579Z       %118 = tt.broadcast %116 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6526818Z       %119 = arith.select %15, %118, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6527054Z       %120 = tt.broadcast %117 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6527290Z       %121 = arith.select %17, %120, %119 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6527520Z       %122 = tt.reshape %121 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6527738Z       %123 = arith.sitofp %122 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6528032Z       %124 = ttg.convert_layout %123 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6528491Z       %125 = tt.dot %107, %124, %105, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6528986Z       %126 = ttg.local_load %87#8 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6529524Z       %127 = arith.extf %126 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6529833Z       %128 = arith.muli %87#10, %c8192_i32 : i32
2026-02-21T09:47:08.6530013Z       %129 = tt.splat %128 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6530240Z       %130 = arith.addi %129, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6530567Z       %131 = tt.addptr %10, %130 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6530878Z       %132 = tt.load %131 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6531107Z       %133 = arith.shli %132, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6531344Z       %134 = arith.shrsi %133, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6531580Z       %135 = arith.shrsi %132, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6531865Z       %136 = tt.expand_dims %134 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6532204Z       %137 = tt.expand_dims %135 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6532485Z       %138 = tt.broadcast %136 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6532722Z       %139 = arith.select %15, %138, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6532958Z       %140 = tt.broadcast %137 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6533191Z       %141 = arith.select %17, %140, %139 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6533419Z       %142 = tt.reshape %141 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6533656Z       %143 = arith.sitofp %142 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6533949Z       %144 = ttg.convert_layout %143 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6534422Z       %145 = tt.dot %127, %144, %125, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6534916Z       %146 = ttg.local_load %87#3 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6535351Z       %147 = arith.extf %146 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6535686Z       %148 = arith.addi %42, %cst_1 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6536002Z       %149 = tt.addptr %10, %148 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6536315Z       %150 = tt.load %149 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6536544Z       %151 = arith.shli %150, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6536805Z       %152 = arith.shrsi %151, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6537042Z       %153 = arith.shrsi %150, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6537330Z       %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6537666Z       %155 = tt.expand_dims %153 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6537963Z       %156 = tt.broadcast %154 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6538199Z       %157 = arith.select %15, %156, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6538435Z       %158 = tt.broadcast %155 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6538682Z       %159 = arith.select %17, %158, %157 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6538910Z       %160 = tt.reshape %159 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6539131Z       %161 = arith.sitofp %160 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6539423Z       %162 = ttg.convert_layout %161 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6539885Z       %163 = tt.dot %147, %162, %145, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6540382Z       %164 = ttg.local_load %87#5 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6540813Z       %165 = arith.extf %164 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6541098Z       %166 = arith.muli %87#7, %c8192_i32 : i32
2026-02-21T09:47:08.6541276Z       %167 = tt.splat %166 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6541504Z       %168 = arith.addi %167, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6541832Z       %169 = tt.addptr %10, %168 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6542141Z       %170 = tt.load %169 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6542392Z       %171 = arith.shli %170, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6542625Z       %172 = arith.shrsi %171, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6542862Z       %173 = arith.shrsi %170, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6543145Z       %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6543481Z       %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6543763Z       %176 = tt.broadcast %174 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6543999Z       %177 = arith.select %15, %176, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6544235Z       %178 = tt.broadcast %175 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6544464Z       %179 = arith.select %17, %178, %177 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6544695Z       %180 = tt.reshape %179 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6544916Z       %181 = arith.sitofp %180 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6545209Z       %182 = ttg.convert_layout %181 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6545674Z       %183 = tt.dot %165, %182, %163, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6546186Z       %184 = ttg.local_load %87#9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6546616Z       %185 = arith.extf %184 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6546918Z       %186 = arith.muli %87#11, %c8192_i32 : i32
2026-02-21T09:47:08.6547095Z       %187 = tt.splat %186 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6547322Z       %188 = arith.addi %187, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6547633Z       %189 = tt.addptr %10, %188 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6547940Z       %190 = tt.load %189 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6548170Z       %191 = arith.shli %190, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6548407Z       %192 = arith.shrsi %191, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6548644Z       %193 = arith.shrsi %190, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6548932Z       %194 = tt.expand_dims %192 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6549262Z       %195 = tt.expand_dims %193 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6549544Z       %196 = tt.broadcast %194 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6549796Z       %197 = arith.select %15, %196, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6550029Z       %198 = tt.broadcast %195 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6550263Z       %199 = arith.select %17, %198, %197 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6550506Z       %200 = tt.reshape %199 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6550728Z       %201 = arith.sitofp %200 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6551021Z       %202 = ttg.convert_layout %201 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6551478Z       %203 = tt.dot %185, %202, %183, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6551871Z       ttg.local_dealloc %45 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6552081Z       ttg.local_dealloc %44 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6552287Z       ttg.local_dealloc %43 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6552551Z       %204 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %203) -> (tensor<256x32xf32, #mma>)  : i32 {
2026-02-21T09:47:08.6552776Z         %415 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:47:08.6552953Z         %416 = tt.splat %415 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6553179Z         %417 = arith.addi %416, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6553460Z         %418 = tt.expand_dims %417 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6553741Z         %419 = tt.broadcast %418 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6553959Z         %420 = arith.addi %41, %419 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6554165Z         %421 = tt.addptr %9, %420 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6554374Z         %422 = tt.load %421 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6554662Z         %423 = ttg.convert_layout %422 : tensor<256x2xbf16, #blocked1> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6555069Z         %424 = arith.extf %423 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6555359Z         %425 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:47:08.6555539Z         %426 = tt.splat %425 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6555766Z         %427 = arith.addi %426, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6556079Z         %428 = tt.addptr %10, %427 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6556391Z         %429 = tt.load %428 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6556624Z         %430 = arith.shli %429, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6556861Z         %431 = arith.shrsi %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6557097Z         %432 = arith.shrsi %429, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6557389Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6557742Z         %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6558025Z         %435 = tt.broadcast %433 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6558285Z         %436 = arith.select %15, %435, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6558524Z         %437 = tt.broadcast %434 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6558761Z         %438 = arith.select %17, %437, %436 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6558991Z         %439 = tt.reshape %438 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6559212Z         %440 = arith.sitofp %439 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6559506Z         %441 = ttg.convert_layout %440 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6559976Z         %442 = tt.dot %424, %441, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6560329Z         scf.yield %442 : tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6560458Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:47:08.6560619Z       %205 = arith.truncf %204 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma>
2026-02-21T09:47:08.6560821Z       %206 = arith.extsi %33 : i32 to i64
2026-02-21T09:47:08.6560939Z       %207 = arith.extsi %36 : i32 to i64
2026-02-21T09:47:08.6561107Z       %208 = tt.splat %206 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:08.6561338Z       %209 = arith.addi %208, %19 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:08.6561606Z       %210 = tt.expand_dims %209 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6561870Z       %211 = arith.muli %210, %cst_12 : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6562056Z       %212 = tt.broadcast %211 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6562265Z       %213 = tt.splat %207 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:08.6562487Z       %214 = arith.addi %213, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:08.6562799Z       %215 = tt.expand_dims %214 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:47:08.6563058Z       %216 = tt.broadcast %215 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6563240Z       %217 = arith.addi %212, %216 : tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6563431Z       %218 = tt.addptr %18, %217 : tensor<256x32x!tt.ptr<bf16>, #mma>, tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6563631Z       %219 = arith.cmpi sge, %210, %cst_13 : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6563800Z       %220 = arith.cmpi slt, %210, %cst_14 : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6563958Z       %221 = arith.andi %219, %220 : tensor<256x1xi1, #mma>
2026-02-21T09:47:08.6564134Z       %222 = tt.broadcast %221 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:47:08.6564320Z       %223 = arith.cmpi sge, %215, %cst_15 : tensor<1x32xi64, #mma>
2026-02-21T09:47:08.6564485Z       %224 = arith.cmpi slt, %215, %cst_16 : tensor<1x32xi64, #mma>
2026-02-21T09:47:08.6564642Z       %225 = arith.andi %223, %224 : tensor<1x32xi1, #mma>
2026-02-21T09:47:08.6564811Z       %226 = tt.broadcast %225 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:47:08.6564988Z       %227 = arith.andi %222, %226 : tensor<256x32xi1, #mma>
2026-02-21T09:47:08.6565146Z       tt.store %218, %205, %227 : tensor<256x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:08.6565296Z       %228 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:47:08.6565421Z       %229 = arith.divsi %228, %c16384_i32 : i32
2026-02-21T09:47:08.6565567Z       %230 = arith.muli %229, %c64_i32 : i32
2026-02-21T09:47:08.6565684Z       %231 = arith.subi %c64_i32, %230 : i32
2026-02-21T09:47:08.6565820Z       %232 = arith.minsi %231, %c64_i32 : i32
2026-02-21T09:47:08.6565943Z       %233 = arith.remsi %228, %c16384_i32 : i32
2026-02-21T09:47:08.6566062Z       %234 = arith.remsi %233, %232 : i32
2026-02-21T09:47:08.6566178Z       %235 = arith.addi %230, %234 : i32
2026-02-21T09:47:08.6566290Z       %236 = arith.divsi %233, %232 : i32
2026-02-21T09:47:08.6566407Z       %237 = arith.muli %235, %c256_i32 : i32
2026-02-21T09:47:08.6566579Z       %238 = tt.splat %237 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:08.6566802Z       %239 = arith.addi %238, %4 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:08.6566976Z       %240 = arith.muli %236, %c32_i32 : i32
2026-02-21T09:47:08.6567180Z       %241 = tt.splat %240 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:08.6567483Z       %242 = arith.addi %241, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:08.6567806Z       %243 = tt.expand_dims %239 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1>
2026-02-21T09:47:08.6568061Z       %244 = arith.muli %243, %cst_8 : tensor<256x1xi32, #blocked1>
2026-02-21T09:47:08.6568258Z       %245 = tt.broadcast %244 : tensor<256x1xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6568638Z       %246 = tt.expand_dims %242 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6568988Z       %247 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6569202Z       %248 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6569428Z       %249 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6569615Z       %250 = arith.addi %245, %47 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6569815Z       %251 = tt.addptr %9, %250 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6570049Z       %252 = tt.load %251 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6570211Z       %253 = arith.addi %245, %53 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6570405Z       %254 = tt.addptr %9, %253 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6570610Z       %255 = tt.load %254 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6570766Z       %256 = arith.addi %245, %59 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6570962Z       %257 = tt.addptr %9, %256 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6571164Z       %258 = tt.load %257 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6571450Z       %259 = ttg.memdesc_index %247[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6571817Z       ttg.local_store %252, %259 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6572181Z       %260 = ttg.memdesc_index %248[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6572567Z       ttg.local_store %255, %260 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6572928Z       %261 = ttg.memdesc_index %249[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6573314Z       ttg.local_store %258, %261 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6573553Z       %262 = arith.addi %245, %68 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6573788Z       %263 = tt.addptr %9, %262 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6573992Z       %264 = tt.load %263 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6574151Z       %265 = arith.addi %245, %74 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6574344Z       %266 = tt.addptr %9, %265 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6574545Z       %267 = tt.load %266 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6574700Z       %268 = arith.addi %245, %80 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6574895Z       %269 = tt.addptr %9, %268 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6575096Z       %270 = tt.load %269 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6575373Z       %271 = ttg.memdesc_index %247[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6575733Z       ttg.local_store %264, %271 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6576090Z       %272 = ttg.memdesc_index %248[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6576446Z       ttg.local_store %267, %272 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6576803Z       %273 = ttg.memdesc_index %249[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6577176Z       ttg.local_store %270, %273 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6578169Z       %274:12 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c3_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %259, %arg8 = %271, %arg9 = %260, %arg10 = %272, %arg11 = %c1_i32, %arg12 = %c4_i32, %arg13 = %261, %arg14 = %273, %arg15 = %c2_i32, %arg16 = %c5_i32) -> (tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32)  : i32 {
2026-02-21T09:47:08.6579071Z         %415 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:47:08.6579195Z         %416 = arith.muli %415, %c2_i32 : i32
2026-02-21T09:47:08.6579368Z         %417 = tt.splat %416 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6579592Z         %418 = arith.addi %417, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6579870Z         %419 = tt.expand_dims %418 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6580171Z         %420 = tt.broadcast %419 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6580365Z         %421 = arith.addi %245, %420 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6580567Z         %422 = tt.addptr %9, %421 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6580774Z         %423 = tt.load %422 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6581077Z         %424 = ttg.local_load %arg7 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6581538Z         %425 = arith.extf %424 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6581839Z         %426 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:47:08.6582020Z         %427 = tt.splat %426 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6582252Z         %428 = arith.addi %427, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6582561Z         %429 = tt.addptr %10, %428 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6582866Z         %430 = tt.load %429 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6583096Z         %431 = arith.shli %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6583403Z         %432 = arith.shrsi %431, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6583642Z         %433 = arith.shrsi %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6583936Z         %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6584272Z         %435 = tt.expand_dims %433 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6584557Z         %436 = tt.broadcast %434 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6584797Z         %437 = arith.select %15, %436, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6585038Z         %438 = tt.broadcast %435 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6585269Z         %439 = arith.select %17, %438, %437 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6585516Z         %440 = tt.reshape %439 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6585736Z         %441 = arith.sitofp %440 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6586029Z         %442 = ttg.convert_layout %441 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6586509Z         %443 = tt.dot %425, %442, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6586857Z         %444 = arith.addi %arg4, %c7_i32 : i32
2026-02-21T09:47:08.6586979Z         %445 = arith.muli %444, %c2_i32 : i32
2026-02-21T09:47:08.6587149Z         %446 = tt.splat %445 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6587375Z         %447 = arith.addi %446, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6587651Z         %448 = tt.expand_dims %447 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6587933Z         %449 = tt.broadcast %448 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6588130Z         %450 = arith.addi %245, %449 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6588331Z         %451 = tt.addptr %9, %450 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6588535Z         %452 = tt.load %451 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6588837Z         %453 = ttg.local_load %arg9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6589285Z         %454 = arith.extf %453 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6589569Z         %455 = arith.muli %arg11, %c8192_i32 : i32
2026-02-21T09:47:08.6589749Z         %456 = tt.splat %455 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6589996Z         %457 = arith.addi %456, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6590309Z         %458 = tt.addptr %10, %457 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6590616Z         %459 = tt.load %458 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6590845Z         %460 = arith.shli %459, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6591080Z         %461 = arith.shrsi %460, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6591317Z         %462 = arith.shrsi %459, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6591602Z         %463 = tt.expand_dims %461 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6591935Z         %464 = tt.expand_dims %462 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6592214Z         %465 = tt.broadcast %463 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6592451Z         %466 = arith.select %15, %465, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6592685Z         %467 = tt.broadcast %464 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6592914Z         %468 = arith.select %17, %467, %466 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6593142Z         %469 = tt.reshape %468 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6593375Z         %470 = arith.sitofp %469 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6593670Z         %471 = ttg.convert_layout %470 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6594142Z         %472 = tt.dot %454, %471, %443, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6594484Z         %473 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:47:08.6594606Z         %474 = arith.muli %473, %c2_i32 : i32
2026-02-21T09:47:08.6594774Z         %475 = tt.splat %474 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6594995Z         %476 = arith.addi %475, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6595268Z         %477 = tt.expand_dims %476 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6595542Z         %478 = tt.broadcast %477 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6595739Z         %479 = arith.addi %245, %478 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6595940Z         %480 = tt.addptr %9, %479 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6596146Z         %481 = tt.load %480 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6596452Z         %482 = ttg.local_load %arg13 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6596889Z         %483 = arith.extf %482 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6597196Z         %484 = arith.muli %arg15, %c8192_i32 : i32
2026-02-21T09:47:08.6597375Z         %485 = tt.splat %484 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6597620Z         %486 = arith.addi %485, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6597927Z         %487 = tt.addptr %10, %486 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6598238Z         %488 = tt.load %487 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6598470Z         %489 = arith.shli %488, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6598704Z         %490 = arith.shrsi %489, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6598940Z         %491 = arith.shrsi %488, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6599230Z         %492 = tt.expand_dims %490 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6599559Z         %493 = tt.expand_dims %491 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6599841Z         %494 = tt.broadcast %492 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6600077Z         %495 = arith.select %15, %494, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6600308Z         %496 = tt.broadcast %493 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6600540Z         %497 = arith.select %17, %496, %495 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6600767Z         %498 = tt.reshape %497 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6600988Z         %499 = arith.sitofp %498 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6601305Z         %500 = ttg.convert_layout %499 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6601765Z         %501 = tt.dot %483, %500, %472, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6602121Z         %502 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:47:08.6602246Z         %503 = arith.cmpi slt, %502, %c2_i32 : i32
2026-02-21T09:47:08.6602378Z         %504 = arith.select %503, %502, %c0_i32 : i32
2026-02-21T09:47:08.6602704Z         %505 = ttg.memdesc_index %247[%504] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6603068Z         ttg.local_store %423, %505 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6603427Z         %506 = ttg.memdesc_index %248[%504] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6603787Z         ttg.local_store %452, %506 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6604142Z         %507 = ttg.memdesc_index %249[%504] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6604497Z         ttg.local_store %481, %507 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6605303Z         scf.yield %501, %504, %arg8, %505, %arg10, %506, %arg12, %444, %arg14, %507, %arg16, %473 : tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32
2026-02-21T09:47:08.6605993Z       }
2026-02-21T09:47:08.6606234Z       %275 = ttg.local_load %274#2 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6606664Z       %276 = arith.extf %275 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6606996Z       %277 = arith.addi %246, %cst_2 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6607305Z       %278 = tt.addptr %10, %277 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6607610Z       %279 = tt.load %278 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6607841Z       %280 = arith.shli %279, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6608074Z       %281 = arith.shrsi %280, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6608307Z       %282 = arith.shrsi %279, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6608591Z       %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6608920Z       %284 = tt.expand_dims %282 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6609201Z       %285 = tt.broadcast %283 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6609453Z       %286 = arith.select %15, %285, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6609687Z       %287 = tt.broadcast %284 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6609916Z       %288 = arith.select %17, %287, %286 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6610158Z       %289 = tt.reshape %288 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6610379Z       %290 = arith.sitofp %289 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6610669Z       %291 = ttg.convert_layout %290 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6611127Z       %292 = tt.dot %276, %291, %274#0, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6611627Z       %293 = ttg.local_load %274#4 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6612055Z       %294 = arith.extf %293 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6612339Z       %295 = arith.muli %274#6, %c8192_i32 : i32
2026-02-21T09:47:08.6612514Z       %296 = tt.splat %295 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6612740Z       %297 = arith.addi %296, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6613049Z       %298 = tt.addptr %10, %297 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6613352Z       %299 = tt.load %298 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6613597Z       %300 = arith.shli %299, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6613832Z       %301 = arith.shrsi %300, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6614083Z       %302 = arith.shrsi %299, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6614372Z       %303 = tt.expand_dims %301 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6614705Z       %304 = tt.expand_dims %302 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6614988Z       %305 = tt.broadcast %303 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6615229Z       %306 = arith.select %15, %305, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6615465Z       %307 = tt.broadcast %304 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6615696Z       %308 = arith.select %17, %307, %306 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6615921Z       %309 = tt.reshape %308 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6616141Z       %310 = arith.sitofp %309 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6616432Z       %311 = ttg.convert_layout %310 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6616886Z       %312 = tt.dot %294, %311, %292, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6617385Z       %313 = ttg.local_load %274#8 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6617833Z       %314 = arith.extf %313 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6618116Z       %315 = arith.muli %274#10, %c8192_i32 : i32
2026-02-21T09:47:08.6618308Z       %316 = tt.splat %315 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6618537Z       %317 = arith.addi %316, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6618847Z       %318 = tt.addptr %10, %317 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6619153Z       %319 = tt.load %318 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6619379Z       %320 = arith.shli %319, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6619616Z       %321 = arith.shrsi %320, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6619847Z       %322 = arith.shrsi %319, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6620132Z       %323 = tt.expand_dims %321 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6620467Z       %324 = tt.expand_dims %322 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6620744Z       %325 = tt.broadcast %323 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6620979Z       %326 = arith.select %15, %325, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6621210Z       %327 = tt.broadcast %324 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6621457Z       %328 = arith.select %17, %327, %326 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6621684Z       %329 = tt.reshape %328 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6621925Z       %330 = arith.sitofp %329 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6622215Z       %331 = ttg.convert_layout %330 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6622670Z       %332 = tt.dot %314, %331, %312, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6623168Z       %333 = ttg.local_load %274#3 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6623604Z       %334 = arith.extf %333 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6623939Z       %335 = arith.addi %246, %cst_1 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6624254Z       %336 = tt.addptr %10, %335 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6624564Z       %337 = tt.load %336 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6624791Z       %338 = arith.shli %337, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6625026Z       %339 = arith.shrsi %338, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6625259Z       %340 = arith.shrsi %337, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6625566Z       %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6625900Z       %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6626182Z       %343 = tt.broadcast %341 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6626438Z       %344 = arith.select %15, %343, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6626679Z       %345 = tt.broadcast %342 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6626911Z       %346 = arith.select %17, %345, %344 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6627144Z       %347 = tt.reshape %346 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6627364Z       %348 = arith.sitofp %347 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6627662Z       %349 = ttg.convert_layout %348 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6628125Z       %350 = tt.dot %334, %349, %332, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6628618Z       %351 = ttg.local_load %274#5 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6629056Z       %352 = arith.extf %351 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6629346Z       %353 = arith.muli %274#7, %c8192_i32 : i32
2026-02-21T09:47:08.6629525Z       %354 = tt.splat %353 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6629776Z       %355 = arith.addi %354, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6630087Z       %356 = tt.addptr %10, %355 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6630424Z       %357 = tt.load %356 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6630661Z       %358 = arith.shli %357, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6630898Z       %359 = arith.shrsi %358, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6631141Z       %360 = arith.shrsi %357, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6631434Z       %361 = tt.expand_dims %359 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6631777Z       %362 = tt.expand_dims %360 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6632065Z       %363 = tt.broadcast %361 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6632309Z       %364 = arith.select %15, %363, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6632552Z       %365 = tt.broadcast %362 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6632785Z       %366 = arith.select %17, %365, %364 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6633020Z       %367 = tt.reshape %366 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6633249Z       %368 = arith.sitofp %367 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6633543Z       %369 = ttg.convert_layout %368 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6634026Z       %370 = tt.dot %352, %369, %350, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6634521Z       %371 = ttg.local_load %274#9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6634976Z       %372 = arith.extf %371 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6635272Z       %373 = arith.muli %274#11, %c8192_i32 : i32
2026-02-21T09:47:08.6635453Z       %374 = tt.splat %373 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6635692Z       %375 = arith.addi %374, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6636010Z       %376 = tt.addptr %10, %375 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6636322Z       %377 = tt.load %376 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6636560Z       %378 = arith.shli %377, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6636801Z       %379 = arith.shrsi %378, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6637043Z       %380 = arith.shrsi %377, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6637339Z       %381 = tt.expand_dims %379 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6637677Z       %382 = tt.expand_dims %380 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6637978Z       %383 = tt.broadcast %381 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6638218Z       %384 = arith.select %15, %383, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6638476Z       %385 = tt.broadcast %382 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6638712Z       %386 = arith.select %17, %385, %384 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6638942Z       %387 = tt.reshape %386 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6639166Z       %388 = arith.sitofp %387 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6639457Z       %389 = ttg.convert_layout %388 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6639927Z       %390 = tt.dot %372, %389, %370, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6640316Z       ttg.local_dealloc %249 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6640533Z       ttg.local_dealloc %248 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6640749Z       ttg.local_dealloc %247 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6641010Z       %391 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %390) -> (tensor<256x32xf32, #mma>)  : i32 {
2026-02-21T09:47:08.6641239Z         %415 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:47:08.6641423Z         %416 = tt.splat %415 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6641653Z         %417 = arith.addi %416, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6641954Z         %418 = tt.expand_dims %417 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6642237Z         %419 = tt.broadcast %418 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6642446Z         %420 = arith.addi %245, %419 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6642746Z         %421 = tt.addptr %9, %420 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6642960Z         %422 = tt.load %421 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6643240Z         %423 = ttg.convert_layout %422 : tensor<256x2xbf16, #blocked1> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6643652Z         %424 = arith.extf %423 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6643946Z         %425 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:47:08.6644138Z         %426 = tt.splat %425 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6644372Z         %427 = arith.addi %426, %246 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6644695Z         %428 = tt.addptr %10, %427 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6645012Z         %429 = tt.load %428 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6645247Z         %430 = arith.shli %429, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6645489Z         %431 = arith.shrsi %430, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6645730Z         %432 = arith.shrsi %429, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6646049Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6646389Z         %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6646694Z         %435 = tt.broadcast %433 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6646938Z         %436 = arith.select %15, %435, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6647177Z         %437 = tt.broadcast %434 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6647416Z         %438 = arith.select %17, %437, %436 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6647652Z         %439 = tt.reshape %438 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6647879Z         %440 = arith.sitofp %439 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6648180Z         %441 = ttg.convert_layout %440 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6648649Z         %442 = tt.dot %424, %441, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6649008Z         scf.yield %442 : tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6649145Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:47:08.6649311Z       %392 = arith.truncf %391 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma>
2026-02-21T09:47:08.6649493Z       %393 = arith.extsi %237 : i32 to i64
2026-02-21T09:47:08.6649616Z       %394 = arith.extsi %240 : i32 to i64
2026-02-21T09:47:08.6649788Z       %395 = tt.splat %393 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:08.6650011Z       %396 = arith.addi %395, %19 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:08.6650303Z       %397 = tt.expand_dims %396 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6650556Z       %398 = arith.muli %397, %cst_12 : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6650760Z       %399 = tt.broadcast %398 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6650975Z       %400 = tt.splat %394 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:08.6651185Z       %401 = arith.addi %400, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:08.6651453Z       %402 = tt.expand_dims %401 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:47:08.6651718Z       %403 = tt.broadcast %402 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6651903Z       %404 = arith.addi %399, %403 : tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6652101Z       %405 = tt.addptr %18, %404 : tensor<256x32x!tt.ptr<bf16>, #mma>, tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6652306Z       %406 = arith.cmpi sge, %397, %cst_13 : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6652482Z       %407 = arith.cmpi slt, %397, %cst_14 : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6652648Z       %408 = arith.andi %406, %407 : tensor<256x1xi1, #mma>
2026-02-21T09:47:08.6652826Z       %409 = tt.broadcast %408 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:47:08.6653018Z       %410 = arith.cmpi sge, %402, %cst_15 : tensor<1x32xi64, #mma>
2026-02-21T09:47:08.6653185Z       %411 = arith.cmpi slt, %402, %cst_16 : tensor<1x32xi64, #mma>
2026-02-21T09:47:08.6653349Z       %412 = arith.andi %410, %411 : tensor<1x32xi1, #mma>
2026-02-21T09:47:08.6653524Z       %413 = tt.broadcast %412 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:47:08.6653709Z       %414 = arith.andi %409, %413 : tensor<256x32xi1, #mma>
2026-02-21T09:47:08.6653892Z       tt.store %405, %392, %414 : tensor<256x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:08.6654041Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:47:08.6654169Z     scf.for %arg3 = %24 to %3 step %c1_i32  : i32 {
2026-02-21T09:47:08.6654326Z       %25 = arith.divsi %arg3, %c16384_i32 : i32
2026-02-21T09:47:08.6654462Z       %26 = arith.muli %25, %c64_i32 : i32
2026-02-21T09:47:08.6654585Z       %27 = arith.subi %c64_i32, %26 : i32
2026-02-21T09:47:08.6654710Z       %28 = arith.minsi %27, %c64_i32 : i32
2026-02-21T09:47:08.6654838Z       %29 = arith.remsi %arg3, %c16384_i32 : i32
2026-02-21T09:47:08.6654963Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:47:08.6655083Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:47:08.6655199Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:47:08.6655319Z       %33 = arith.muli %31, %c256_i32 : i32
2026-02-21T09:47:08.6655491Z       %34 = tt.splat %33 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:08.6655725Z       %35 = arith.addi %34, %4 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:08.6655902Z       %36 = arith.muli %32, %c32_i32 : i32
2026-02-21T09:47:08.6656113Z       %37 = tt.splat %36 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:08.6656415Z       %38 = arith.addi %37, %6 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:08.6656732Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1>
2026-02-21T09:47:08.6656989Z       %40 = arith.muli %39, %cst_8 : tensor<256x1xi32, #blocked1>
2026-02-21T09:47:08.6657189Z       %41 = tt.broadcast %40 : tensor<256x1xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6657545Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6657910Z       %43 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6658127Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6658338Z       %45 = ttg.local_alloc : () -> !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6658618Z       %46 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6658888Z       %47 = tt.broadcast %46 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6659084Z       %48 = arith.addi %41, %47 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6659282Z       %49 = tt.addptr %9, %48 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6659488Z       %50 = tt.load %49 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6659682Z       %51 = arith.addi %8, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6659955Z       %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6660229Z       %53 = tt.broadcast %52 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6660419Z       %54 = arith.addi %41, %53 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6660615Z       %55 = tt.addptr %9, %54 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6660816Z       %56 = tt.load %55 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6661005Z       %57 = arith.addi %8, %cst_6 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6661279Z       %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6661568Z       %59 = tt.broadcast %58 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6661762Z       %60 = arith.addi %41, %59 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6661973Z       %61 = tt.addptr %9, %60 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6662175Z       %62 = tt.load %61 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6662459Z       %63 = ttg.memdesc_index %43[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6662816Z       ttg.local_store %50, %63 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6663173Z       %64 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6663529Z       ttg.local_store %56, %64 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6663881Z       %65 = ttg.memdesc_index %45[%c0_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6664236Z       ttg.local_store %62, %65 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6664502Z       %66 = arith.addi %8, %cst_5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6664774Z       %67 = tt.expand_dims %66 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6665043Z       %68 = tt.broadcast %67 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6665228Z       %69 = arith.addi %41, %68 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6665425Z       %70 = tt.addptr %9, %69 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6665638Z       %71 = tt.load %70 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6665825Z       %72 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6666099Z       %73 = tt.expand_dims %72 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6666379Z       %74 = tt.broadcast %73 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6666566Z       %75 = arith.addi %41, %74 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6666756Z       %76 = tt.addptr %9, %75 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6666956Z       %77 = tt.load %76 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6667143Z       %78 = arith.addi %8, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6667415Z       %79 = tt.expand_dims %78 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6667683Z       %80 = tt.broadcast %79 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6667873Z       %81 = arith.addi %41, %80 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6668070Z       %82 = tt.addptr %9, %81 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6668269Z       %83 = tt.load %82 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6668547Z       %84 = ttg.memdesc_index %43[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6668902Z       ttg.local_store %71, %84 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6669270Z       %85 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6669623Z       ttg.local_store %77, %85 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6669989Z       %86 = ttg.memdesc_index %45[%c1_i32] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6670339Z       ttg.local_store %83, %86 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6671320Z       %87:12 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c3_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %63, %arg8 = %84, %arg9 = %64, %arg10 = %85, %arg11 = %c1_i32, %arg12 = %c4_i32, %arg13 = %65, %arg14 = %86, %arg15 = %c2_i32, %arg16 = %c5_i32) -> (tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32)  : i32 {
2026-02-21T09:47:08.6672195Z         %228 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:47:08.6672322Z         %229 = arith.muli %228, %c2_i32 : i32
2026-02-21T09:47:08.6672498Z         %230 = tt.splat %229 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6672724Z         %231 = arith.addi %230, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6673001Z         %232 = tt.expand_dims %231 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6673281Z         %233 = tt.broadcast %232 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6673477Z         %234 = arith.addi %41, %233 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6673702Z         %235 = tt.addptr %9, %234 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6673910Z         %236 = tt.load %235 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6674213Z         %237 = ttg.local_load %arg7 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6674665Z         %238 = arith.extf %237 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6674949Z         %239 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:47:08.6675131Z         %240 = tt.splat %239 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6675363Z         %241 = arith.addi %240, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6675674Z         %242 = tt.addptr %10, %241 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6675983Z         %243 = tt.load %242 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6676215Z         %244 = arith.shli %243, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6676451Z         %245 = arith.shrsi %244, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6676689Z         %246 = arith.shrsi %243, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6676976Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6677333Z         %248 = tt.expand_dims %246 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6677647Z         %249 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6677885Z         %250 = arith.select %15, %249, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6678140Z         %251 = tt.broadcast %248 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6678373Z         %252 = arith.select %17, %251, %250 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6678602Z         %253 = tt.reshape %252 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6678824Z         %254 = arith.sitofp %253 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6679116Z         %255 = ttg.convert_layout %254 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6679587Z         %256 = tt.dot %238, %255, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6679934Z         %257 = arith.addi %arg4, %c7_i32 : i32
2026-02-21T09:47:08.6680062Z         %258 = arith.muli %257, %c2_i32 : i32
2026-02-21T09:47:08.6680236Z         %259 = tt.splat %258 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6680458Z         %260 = arith.addi %259, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6680733Z         %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6681008Z         %262 = tt.broadcast %261 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6681207Z         %263 = arith.addi %41, %262 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6681411Z         %264 = tt.addptr %9, %263 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6681633Z         %265 = tt.load %264 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6681937Z         %266 = ttg.local_load %arg9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6682393Z         %267 = arith.extf %266 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6682897Z         %268 = arith.muli %arg11, %c8192_i32 : i32
2026-02-21T09:47:08.6683089Z         %269 = tt.splat %268 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6683321Z         %270 = arith.addi %269, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6683639Z         %271 = tt.addptr %10, %270 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6683951Z         %272 = tt.load %271 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6684186Z         %273 = arith.shli %272, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6684422Z         %274 = arith.shrsi %273, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6684659Z         %275 = arith.shrsi %272, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6684948Z         %276 = tt.expand_dims %274 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6685280Z         %277 = tt.expand_dims %275 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6685563Z         %278 = tt.broadcast %276 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6685877Z         %279 = arith.select %15, %278, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6686110Z         %280 = tt.broadcast %277 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6686369Z         %281 = arith.select %17, %280, %279 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6686597Z         %282 = tt.reshape %281 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6686819Z         %283 = arith.sitofp %282 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6687114Z         %284 = ttg.convert_layout %283 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6687581Z         %285 = tt.dot %267, %284, %256, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6687930Z         %286 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:47:08.6688056Z         %287 = arith.muli %286, %c2_i32 : i32
2026-02-21T09:47:08.6688229Z         %288 = tt.splat %287 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6688455Z         %289 = arith.addi %288, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6688732Z         %290 = tt.expand_dims %289 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6689014Z         %291 = tt.broadcast %290 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6689214Z         %292 = arith.addi %41, %291 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6689415Z         %293 = tt.addptr %9, %292 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6689624Z         %294 = tt.load %293 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6689963Z         %295 = ttg.local_load %arg13 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6690403Z         %296 = arith.extf %295 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6690712Z         %297 = arith.muli %arg15, %c8192_i32 : i32
2026-02-21T09:47:08.6690890Z         %298 = tt.splat %297 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6691116Z         %299 = arith.addi %298, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6691424Z         %300 = tt.addptr %10, %299 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6691733Z         %301 = tt.load %300 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6691968Z         %302 = arith.shli %301, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6692202Z         %303 = arith.shrsi %302, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6692438Z         %304 = arith.shrsi %301, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6692724Z         %305 = tt.expand_dims %303 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6693060Z         %306 = tt.expand_dims %304 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6693342Z         %307 = tt.broadcast %305 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6693592Z         %308 = arith.select %15, %307, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6693830Z         %309 = tt.broadcast %306 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6694061Z         %310 = arith.select %17, %309, %308 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6694309Z         %311 = tt.reshape %310 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6694532Z         %312 = arith.sitofp %311 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6694826Z         %313 = ttg.convert_layout %312 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6695291Z         %314 = tt.dot %296, %313, %285, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6695638Z         %315 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:47:08.6695768Z         %316 = arith.cmpi slt, %315, %c2_i32 : i32
2026-02-21T09:47:08.6695902Z         %317 = arith.select %316, %315, %c0_i32 : i32
2026-02-21T09:47:08.6696172Z         %318 = ttg.memdesc_index %43[%317] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6696538Z         ttg.local_store %236, %318 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6696895Z         %319 = ttg.memdesc_index %44[%317] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6697248Z         ttg.local_store %265, %319 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6697604Z         %320 = ttg.memdesc_index %45[%317] : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6697977Z         ttg.local_store %294, %320 : tensor<256x2xbf16, #blocked1> -> !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>
2026-02-21T09:47:08.6698760Z         scf.yield %314, %317, %arg8, %318, %arg10, %319, %arg12, %257, %arg14, %320, %arg16, %286 : tensor<256x32xf32, #mma>, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2>, i32, i32
2026-02-21T09:47:08.6699445Z       }
2026-02-21T09:47:08.6699684Z       %88 = ttg.local_load %87#2 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6700116Z       %89 = arith.extf %88 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6700450Z       %90 = arith.addi %42, %cst_2 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6700756Z       %91 = tt.addptr %10, %90 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6701060Z       %92 = tt.load %91 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6701285Z       %93 = arith.shli %92, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6701513Z       %94 = arith.shrsi %93, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6701743Z       %95 = arith.shrsi %92, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6702037Z       %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6702364Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6702655Z       %98 = tt.broadcast %96 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6702885Z       %99 = arith.select %15, %98, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6703117Z       %100 = tt.broadcast %97 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6703345Z       %101 = arith.select %17, %100, %99 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6703574Z       %102 = tt.reshape %101 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6703797Z       %103 = arith.sitofp %102 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6704088Z       %104 = ttg.convert_layout %103 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6704545Z       %105 = tt.dot %89, %104, %87#0, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6705039Z       %106 = ttg.local_load %87#4 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6705466Z       %107 = arith.extf %106 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6705750Z       %108 = arith.muli %87#6, %c8192_i32 : i32
2026-02-21T09:47:08.6705927Z       %109 = tt.splat %108 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6706171Z       %110 = arith.addi %109, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6706478Z       %111 = tt.addptr %10, %110 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6706788Z       %112 = tt.load %111 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6707035Z       %113 = arith.shli %112, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6707269Z       %114 = arith.shrsi %113, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6707509Z       %115 = arith.shrsi %112, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6707799Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6708136Z       %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6708420Z       %118 = tt.broadcast %116 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6708660Z       %119 = arith.select %15, %118, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6708898Z       %120 = tt.broadcast %117 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6709133Z       %121 = arith.select %17, %120, %119 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6709360Z       %122 = tt.reshape %121 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6709582Z       %123 = arith.sitofp %122 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6709871Z       %124 = ttg.convert_layout %123 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6710346Z       %125 = tt.dot %107, %124, %105, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6710855Z       %126 = ttg.local_load %87#8 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6711289Z       %127 = arith.extf %126 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6711578Z       %128 = arith.muli %87#10, %c8192_i32 : i32
2026-02-21T09:47:08.6711758Z       %129 = tt.splat %128 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6711983Z       %130 = arith.addi %129, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6712292Z       %131 = tt.addptr %10, %130 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6712601Z       %132 = tt.load %131 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6712833Z       %133 = arith.shli %132, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6713071Z       %134 = arith.shrsi %133, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6713306Z       %135 = arith.shrsi %132, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6713593Z       %136 = tt.expand_dims %134 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6713928Z       %137 = tt.expand_dims %135 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6714233Z       %138 = tt.broadcast %136 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6714470Z       %139 = arith.select %15, %138, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6714704Z       %140 = tt.broadcast %137 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6714952Z       %141 = arith.select %17, %140, %139 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6715179Z       %142 = tt.reshape %141 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6715397Z       %143 = arith.sitofp %142 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6715691Z       %144 = ttg.convert_layout %143 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6716148Z       %145 = tt.dot %127, %144, %125, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6716639Z       %146 = ttg.local_load %87#3 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6717072Z       %147 = arith.extf %146 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6717403Z       %148 = arith.addi %42, %cst_1 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6717716Z       %149 = tt.addptr %10, %148 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6718022Z       %150 = tt.load %149 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6718267Z       %151 = arith.shli %150, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6718501Z       %152 = arith.shrsi %151, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6718748Z       %153 = arith.shrsi %150, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6719034Z       %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6719366Z       %155 = tt.expand_dims %153 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6719642Z       %156 = tt.broadcast %154 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6719880Z       %157 = arith.select %15, %156, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6720113Z       %158 = tt.broadcast %155 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6720345Z       %159 = arith.select %17, %158, %157 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6720573Z       %160 = tt.reshape %159 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6720789Z       %161 = arith.sitofp %160 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6721082Z       %162 = ttg.convert_layout %161 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6721544Z       %163 = tt.dot %147, %162, %145, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6722036Z       %164 = ttg.local_load %87#5 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6722490Z       %165 = arith.extf %164 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6722842Z       %166 = arith.muli %87#7, %c8192_i32 : i32
2026-02-21T09:47:08.6723020Z       %167 = tt.splat %166 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6723269Z       %168 = arith.addi %167, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6723573Z       %169 = tt.addptr %10, %168 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6723879Z       %170 = tt.load %169 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6729094Z       %171 = arith.shli %170, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6729338Z       %172 = arith.shrsi %171, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6729575Z       %173 = arith.shrsi %170, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6729865Z       %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6730202Z       %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6730486Z       %176 = tt.broadcast %174 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6730723Z       %177 = arith.select %15, %176, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6730962Z       %178 = tt.broadcast %175 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6731244Z       %179 = arith.select %17, %178, %177 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6731469Z       %180 = tt.reshape %179 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6731688Z       %181 = arith.sitofp %180 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6732004Z       %182 = ttg.convert_layout %181 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6732469Z       %183 = tt.dot %165, %182, %163, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6732967Z       %184 = ttg.local_load %87#9 : !ttg.memdesc<256x2xbf16, #shared, #smem, mutable, 2x256x2> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6733397Z       %185 = arith.extf %184 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6733684Z       %186 = arith.muli %87#11, %c8192_i32 : i32
2026-02-21T09:47:08.6733864Z       %187 = tt.splat %186 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6734091Z       %188 = arith.addi %187, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6734398Z       %189 = tt.addptr %10, %188 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6734700Z       %190 = tt.load %189 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6734931Z       %191 = arith.shli %190, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6735164Z       %192 = arith.shrsi %191, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6735417Z       %193 = arith.shrsi %190, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6735702Z       %194 = tt.expand_dims %192 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6736034Z       %195 = tt.expand_dims %193 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6736328Z       %196 = tt.broadcast %194 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6736564Z       %197 = arith.select %15, %196, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6736794Z       %198 = tt.broadcast %195 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6737022Z       %199 = arith.select %17, %198, %197 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6737246Z       %200 = tt.reshape %199 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6737466Z       %201 = arith.sitofp %200 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6737756Z       %202 = ttg.convert_layout %201 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6738211Z       %203 = tt.dot %185, %202, %183, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6738593Z       ttg.local_dealloc %45 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6738803Z       ttg.local_dealloc %44 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6739003Z       ttg.local_dealloc %43 : !ttg.memdesc<2x256x2xbf16, #shared, #smem, mutable>
2026-02-21T09:47:08.6739277Z       %204 = scf.for %arg4 = %c510_i32 to %c512_i32 step %c1_i32 iter_args(%arg5 = %203) -> (tensor<256x32xf32, #mma>)  : i32 {
2026-02-21T09:47:08.6739326Z         %228 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:47:08.6739422Z         %229 = tt.splat %228 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6739532Z         %230 = arith.addi %229, %8 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.6739679Z         %231 = tt.expand_dims %230 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.6739772Z         %232 = tt.broadcast %231 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6739839Z         %233 = arith.addi %41, %232 : tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6739945Z         %234 = tt.addptr %9, %233 : tensor<256x2x!tt.ptr<bf16>, #blocked1>, tensor<256x2xi32, #blocked1>
2026-02-21T09:47:08.6740010Z         %235 = tt.load %234 : tensor<256x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.6740180Z         %236 = ttg.convert_layout %235 : tensor<256x2xbf16, #blocked1> -> tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6740378Z         %237 = arith.extf %236 : tensor<256x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6740427Z         %238 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:47:08.6740524Z         %239 = tt.splat %238 : i32 -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6740618Z         %240 = arith.addi %239, %42 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6740795Z         %241 = tt.addptr %10, %240 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6740890Z         %242 = tt.load %241 : tensor<1x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6741004Z         %243 = arith.shli %242, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6741101Z         %244 = arith.shrsi %243, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6741200Z         %245 = arith.shrsi %242, %cst_9 : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.6741362Z         %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6741507Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<1x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x32xi8, #blocked>
2026-02-21T09:47:08.6741602Z         %248 = tt.broadcast %246 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6741703Z         %249 = arith.select %15, %248, %cst_0 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6741796Z         %250 = tt.broadcast %247 : tensor<1x1x32xi8, #blocked> -> tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6741897Z         %251 = arith.select %17, %250, %249 : tensor<1x2x32xi1, #blocked>, tensor<1x2x32xi8, #blocked>
2026-02-21T09:47:08.6741985Z         %252 = tt.reshape %251 : tensor<1x2x32xi8, #blocked> -> tensor<2x32xi8, #blocked2>
2026-02-21T09:47:08.6742076Z         %253 = arith.sitofp %252 : tensor<2x32xi8, #blocked2> to tensor<2x32xf32, #blocked2>
2026-02-21T09:47:08.6742239Z         %254 = ttg.convert_layout %253 : tensor<2x32xf32, #blocked2> -> tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.6742503Z         %255 = tt.dot %237, %254, %arg5, inputPrecision = tf32 : tensor<256x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6742553Z         scf.yield %255 : tensor<256x32xf32, #mma>
2026-02-21T09:47:08.6742599Z       } {tt.num_stages = 1 : i32}
2026-02-21T09:47:08.6742699Z       %205 = arith.truncf %204 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma>
2026-02-21T09:47:08.6742742Z       %206 = arith.extsi %33 : i32 to i64
2026-02-21T09:47:08.6742806Z       %207 = arith.extsi %36 : i32 to i64
2026-02-21T09:47:08.6742897Z       %208 = tt.splat %206 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:08.6742982Z       %209 = arith.addi %208, %19 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:08.6743126Z       %210 = tt.expand_dims %209 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6743191Z       %211 = arith.muli %210, %cst_12 : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6743274Z       %212 = tt.broadcast %211 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6743357Z       %213 = tt.splat %207 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:08.6743442Z       %214 = arith.addi %213, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:08.6743579Z       %215 = tt.expand_dims %214 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:47:08.6743662Z       %216 = tt.broadcast %215 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6743724Z       %217 = arith.addi %212, %216 : tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6743819Z       %218 = tt.addptr %18, %217 : tensor<256x32x!tt.ptr<bf16>, #mma>, tensor<256x32xi64, #mma>
2026-02-21T09:47:08.6743887Z       %219 = arith.cmpi sge, %210, %cst_13 : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6743954Z       %220 = arith.cmpi slt, %210, %cst_14 : tensor<256x1xi64, #mma>
2026-02-21T09:47:08.6744014Z       %221 = arith.andi %219, %220 : tensor<256x1xi1, #mma>
2026-02-21T09:47:08.6744094Z       %222 = tt.broadcast %221 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:47:08.6744161Z       %223 = arith.cmpi sge, %215, %cst_15 : tensor<1x32xi64, #mma>
2026-02-21T09:47:08.6744245Z       %224 = arith.cmpi slt, %215, %cst_16 : tensor<1x32xi64, #mma>
2026-02-21T09:47:08.6744300Z       %225 = arith.andi %223, %224 : tensor<1x32xi1, #mma>
2026-02-21T09:47:08.6744379Z       %226 = tt.broadcast %225 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:47:08.6744438Z       %227 = arith.andi %222, %226 : tensor<256x32xi1, #mma>
2026-02-21T09:47:08.6744517Z       tt.store %218, %205, %227 : tensor<256x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:08.6744558Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:47:08.6744595Z     tt.return
2026-02-21T09:47:08.6744626Z   }
2026-02-21T09:47:08.6744660Z }
2026-02-21T09:47:08.6744665Z 
2026-02-21T09:47:08.6744697Z {-#
2026-02-21T09:47:08.6744737Z   external_resources: {
2026-02-21T09:47:08.6744775Z     mlir_reproducer: {
2026-02-21T09:47:08.6745712Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:47:08.6745754Z       disable_threading: false,
2026-02-21T09:47:08.6745792Z       verify_each: true
2026-02-21T09:47:08.6745824Z     }
2026-02-21T09:47:08.6745854Z   }
2026-02-21T09:47:08.6745884Z #-}
2026-02-21T09:47:08.6746120Z /tmp/torchinductor_root/gt/cgtsosf4i4p64k42w2lgukhe2w75x3xfgm3sybvafk5ra2aegofb.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:47:08.6746560Z /tmp/torchinductor_root/gt/cgtsosf4i4p64k42w2lgukhe2w75x3xfgm3sybvafk5ra2aegofb.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:47:08.6746672Z [159s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:47:08.6747312Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 256, 32], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:47:08.6747367Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:47:08.6747447Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:47:08.8419752Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:47:08.8422669Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:47:08.8423587Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:47:08.8424410Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:47:08.8425174Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:47:08.8426036Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:47:08.8427501Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:47:08.8428576Z     %cst = arith.constant dense<0.000000e+00> : tensor<32x16xf32, #mma>
2026-02-21T09:47:08.8429023Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:47:08.8429402Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:47:08.8429734Z     %c262144_i32 = arith.constant 262144 : i32
2026-02-21T09:47:08.8430064Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:47:08.8430464Z     %cst_0 = arith.constant dense<0> : tensor<1x2x16xi8, #blocked>
2026-02-21T09:47:08.8430871Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:47:08.8431018Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:47:08.8431132Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:47:08.8431246Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:47:08.8431358Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:47:08.8431505Z     %cst_1 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1>
2026-02-21T09:47:08.8431723Z     %cst_2 = arith.constant dense<4> : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8431937Z     %cst_3 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:08.8432103Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:08.8432271Z     %cst_5 = arith.constant dense<8192> : tensor<32x1xi32, #mma>
2026-02-21T09:47:08.8432410Z     %0 = tt.get_program_id x : i32
2026-02-21T09:47:08.8432652Z     %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:08.8432957Z     %2 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:08.8433225Z     %3 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:08.8433510Z     %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:08.8433765Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.8434021Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.8434256Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8434556Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:47:08.8434970Z     %9 = tt.expand_dims %8 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:47:08.8435389Z     %10 = tt.expand_dims %9 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:08.8435637Z     %11 = arith.cmpi eq, %10, %cst_3 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:08.8435832Z     %12 = tt.broadcast %11 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x16xi1, #blocked>
2026-02-21T09:47:08.8436025Z     %13 = arith.cmpi eq, %10, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:08.8436209Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x16xi1, #blocked>
2026-02-21T09:47:08.8436412Z     %15 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:08.8436591Z     scf.for %arg3 = %0 to %c262144_i32 step %c2432_i32  : i32 {
2026-02-21T09:47:08.8436740Z       %16 = arith.divsi %arg3, %c8192_i32 : i32
2026-02-21T09:47:08.8436861Z       %17 = arith.muli %16, %c16_i32 : i32
2026-02-21T09:47:08.8436982Z       %18 = arith.subi %c512_i32, %17 : i32
2026-02-21T09:47:08.8437095Z       %19 = arith.minsi %18, %c16_i32 : i32
2026-02-21T09:47:08.8437213Z       %20 = arith.remsi %arg3, %c8192_i32 : i32
2026-02-21T09:47:08.8437353Z       %21 = arith.remsi %20, %19 : i32
2026-02-21T09:47:08.8437463Z       %22 = arith.addi %17, %21 : i32
2026-02-21T09:47:08.8437573Z       %23 = arith.divsi %20, %19 : i32
2026-02-21T09:47:08.8437682Z       %24 = arith.muli %22, %c16_i32 : i32
2026-02-21T09:47:08.8437887Z       %25 = tt.splat %24 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:08.8438150Z       %26 = tt.splat %24 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:08.8438397Z       %27 = arith.addi %25, %1 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:08.8438645Z       %28 = arith.addi %26, %2 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:08.8438804Z       %29 = arith.muli %23, %c32_i32 : i32
2026-02-21T09:47:08.8438968Z       %30 = tt.splat %29 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:08.8439176Z       %31 = tt.splat %29 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:08.8439382Z       %32 = arith.addi %30, %3 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:08.8439586Z       %33 = arith.addi %31, %4 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:08.8439847Z       %34 = tt.expand_dims %32 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:47:08.8440091Z       %35 = arith.muli %34, %cst_1 : tensor<32x1xi32, #blocked1>
2026-02-21T09:47:08.8440276Z       %36 = tt.broadcast %35 : tensor<32x1xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:47:08.8440622Z       %37 = tt.expand_dims %27 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8441021Z       %38 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x16xf32, #mma>)  : i32 {
2026-02-21T09:47:08.8441231Z         %47 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:47:08.8441398Z         %48 = tt.splat %47 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.8441630Z         %49 = arith.addi %48, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.8441900Z         %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.8442173Z         %51 = tt.broadcast %50 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:47:08.8442359Z         %52 = arith.addi %36, %51 : tensor<32x2xi32, #blocked1>
2026-02-21T09:47:08.8442553Z         %53 = tt.addptr %6, %52 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:47:08.8442794Z         %54 = tt.load %53 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.8443053Z         %55 = ttg.convert_layout %54 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.8443454Z         %56 = arith.extf %55 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.8443734Z         %57 = arith.muli %arg4, %c8192_i32 : i32
2026-02-21T09:47:08.8443906Z         %58 = tt.splat %57 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8444124Z         %59 = arith.addi %58, %37 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8444424Z         %60 = tt.addptr %7, %59 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8444723Z         %61 = tt.load %60 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8444968Z         %62 = arith.shli %61, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8445196Z         %63 = arith.shrsi %62, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8445427Z         %64 = arith.shrsi %61, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8445722Z         %65 = tt.expand_dims %63 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T09:47:08.8446044Z         %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T09:47:08.8446314Z         %67 = tt.broadcast %65 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T09:47:08.8446540Z         %68 = arith.select %12, %67, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T09:47:08.8446768Z         %69 = tt.broadcast %66 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T09:47:08.8446989Z         %70 = arith.select %14, %69, %68 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T09:47:08.8447208Z         %71 = tt.reshape %70 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T09:47:08.8447420Z         %72 = arith.sitofp %71 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T09:47:08.8447702Z         %73 = ttg.convert_layout %72 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.8448156Z         %74 = tt.dot %56, %73, %arg5, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x16xf32, #mma>
2026-02-21T09:47:08.8448495Z         %75 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:47:08.8448615Z         %76 = arith.muli %75, %c2_i32 : i32
2026-02-21T09:47:08.8448815Z         %77 = tt.splat %76 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.8449027Z         %78 = arith.addi %77, %5 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:08.8449313Z         %79 = tt.expand_dims %78 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:47:08.8449579Z         %80 = tt.broadcast %79 : tensor<1x2xi32, #blocked1> -> tensor<32x2xi32, #blocked1>
2026-02-21T09:47:08.8449763Z         %81 = arith.addi %36, %80 : tensor<32x2xi32, #blocked1>
2026-02-21T09:47:08.8449954Z         %82 = tt.addptr %6, %81 : tensor<32x2x!tt.ptr<bf16>, #blocked1>, tensor<32x2xi32, #blocked1>
2026-02-21T09:47:08.8450152Z         %83 = tt.load %82 : tensor<32x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:08.8450403Z         %84 = ttg.convert_layout %83 : tensor<32x2xbf16, #blocked1> -> tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.8450795Z         %85 = arith.extf %84 : tensor<32x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.8451068Z         %86 = arith.muli %75, %c8192_i32 : i32
2026-02-21T09:47:08.8451239Z         %87 = tt.splat %86 : i32 -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8451457Z         %88 = arith.addi %87, %37 : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8451753Z         %89 = tt.addptr %7, %88 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8452049Z         %90 = tt.load %89 : tensor<1x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8452270Z         %91 = arith.shli %90, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8452498Z         %92 = arith.shrsi %91, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8452743Z         %93 = arith.shrsi %90, %cst_2 : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:08.8453021Z         %94 = tt.expand_dims %92 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T09:47:08.8453362Z         %95 = tt.expand_dims %93 {axis = 1 : i32} : tensor<1x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x16xi8, #blocked>
2026-02-21T09:47:08.8453636Z         %96 = tt.broadcast %94 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T09:47:08.8453863Z         %97 = arith.select %12, %96, %cst_0 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T09:47:08.8454088Z         %98 = tt.broadcast %95 : tensor<1x1x16xi8, #blocked> -> tensor<1x2x16xi8, #blocked>
2026-02-21T09:47:08.8454307Z         %99 = arith.select %14, %98, %97 : tensor<1x2x16xi1, #blocked>, tensor<1x2x16xi8, #blocked>
2026-02-21T09:47:08.8454528Z         %100 = tt.reshape %99 : tensor<1x2x16xi8, #blocked> -> tensor<2x16xi8, #blocked2>
2026-02-21T09:47:08.8454745Z         %101 = arith.sitofp %100 : tensor<2x16xi8, #blocked2> to tensor<2x16xf32, #blocked2>
2026-02-21T09:47:08.8455034Z         %102 = ttg.convert_layout %101 : tensor<2x16xf32, #blocked2> -> tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:08.8455486Z         %103 = tt.dot %85, %102, %74, inputPrecision = tf32 : tensor<32x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x16xf32, #mma>
2026-02-21T09:47:08.8455822Z         scf.yield %103 : tensor<32x16xf32, #mma>
2026-02-21T09:47:08.8455983Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:47:08.8456178Z       %39 = arith.truncf %38 : tensor<32x16xf32, #mma> to tensor<32x16xbf16, #mma>
2026-02-21T09:47:08.8456445Z       %40 = tt.expand_dims %33 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma>
2026-02-21T09:47:08.8456673Z       %41 = arith.muli %40, %cst_5 : tensor<32x1xi32, #mma>
2026-02-21T09:47:08.8456894Z       %42 = tt.expand_dims %28 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma>
2026-02-21T09:47:08.8457155Z       %43 = tt.broadcast %41 : tensor<32x1xi32, #mma> -> tensor<32x16xi32, #mma>
2026-02-21T09:47:08.8457344Z       %44 = tt.broadcast %42 : tensor<1x16xi32, #mma> -> tensor<32x16xi32, #mma>
2026-02-21T09:47:08.8457512Z       %45 = arith.addi %43, %44 : tensor<32x16xi32, #mma>
2026-02-21T09:47:08.8457685Z       %46 = tt.addptr %15, %45 : tensor<32x16x!tt.ptr<bf16>, #mma>, tensor<32x16xi32, #mma>
2026-02-21T09:47:08.8457867Z       tt.store %46, %39 : tensor<32x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:08.8457988Z     }
2026-02-21T09:47:08.8458061Z     tt.return
2026-02-21T09:47:08.8458138Z   }
2026-02-21T09:47:08.8458206Z }
2026-02-21T09:47:08.8458248Z 
2026-02-21T09:47:08.8458279Z {-#
2026-02-21T09:47:08.8458357Z   external_resources: {
2026-02-21T09:47:08.8458456Z     mlir_reproducer: {
2026-02-21T09:47:08.8459451Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:47:08.8460431Z       disable_threading: false,
2026-02-21T09:47:08.8460534Z       verify_each: true
2026-02-21T09:47:08.8460619Z     }
2026-02-21T09:47:08.8460687Z   }
2026-02-21T09:47:08.8460753Z #-}
2026-02-21T09:47:08.8461043Z /tmp/torchinductor_root/cl/cclt7caafp2qbph4tyfhx3ejkoceexr4wcdrgb56w7si5kzmws65.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:47:08.8461713Z /tmp/torchinductor_root/cl/cclt7caafp2qbph4tyfhx3ejkoceexr4wcdrgb56w7si5kzmws65.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:47:08.8462277Z [159s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:47:08.8463049Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:47:08.8463750Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:47:08.8463913Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:47:16.3701396Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:47:16.3703833Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:47:16.3704448Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:47:16.3704959Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:47:16.3706005Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:47:16.3706441Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:47:16.3706837Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:47:16.3707246Z #smem = #ttg.shared_memory
2026-02-21T09:47:16.3707647Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:47:16.3708407Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:47:16.3709075Z     %cst = arith.constant dense<0.000000e+00> : tensor<32x64xf32, #mma>
2026-02-21T09:47:16.3709335Z     %c65536_i32 = arith.constant 65536 : i32
2026-02-21T09:47:16.3709509Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:47:16.3709689Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:47:16.3709848Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:47:16.3710060Z     %cst_0 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:47:16.3710270Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:47:16.3710449Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:47:16.3710674Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:47:16.3710842Z     %c1216_i32 = arith.constant 1216 : i32
2026-02-21T09:47:16.3711063Z     %cst_1 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1>
2026-02-21T09:47:16.3711382Z     %cst_2 = arith.constant dense<8192> : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3711750Z     %cst_3 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3712058Z     %cst_4 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:16.3712308Z     %cst_5 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:16.3712675Z     %cst_6 = arith.constant dense<8192> : tensor<32x1xi32, #mma>
2026-02-21T09:47:16.3712878Z     %0 = tt.get_program_id x : i32
2026-02-21T09:47:16.3713170Z     %1 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:16.3713560Z     %2 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:16.3714099Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:16.3714533Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:16.3714968Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:16.3715407Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:16.3715764Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:16.3716102Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3716540Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:47:16.3717130Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:47:16.3717697Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:16.3718059Z     %12 = arith.cmpi eq, %11, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:16.3718366Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:47:16.3718655Z     %14 = arith.cmpi eq, %11, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:16.3718923Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:47:16.3719260Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:16.3719469Z     scf.for %arg3 = %0 to %c65536_i32 step %c1216_i32  : i32 {
2026-02-21T09:47:16.3719634Z       %17 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:47:16.3719779Z       %18 = arith.muli %17, %c2_i32 : i32
2026-02-21T09:47:16.3719912Z       %19 = arith.subi %c512_i32, %18 : i32
2026-02-21T09:47:16.3720055Z       %20 = arith.minsi %19, %c2_i32 : i32
2026-02-21T09:47:16.3720195Z       %21 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:47:16.3720335Z       %22 = arith.remsi %21, %20 : i32
2026-02-21T09:47:16.3720466Z       %23 = arith.addi %18, %22 : i32
2026-02-21T09:47:16.3720595Z       %24 = arith.divsi %21, %20 : i32
2026-02-21T09:47:16.3720726Z       %25 = arith.muli %23, %c32_i32 : i32
2026-02-21T09:47:16.3720916Z       %26 = tt.splat %25 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:16.3721163Z       %27 = tt.splat %25 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:16.3721399Z       %28 = arith.addi %26, %1 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:16.3721639Z       %29 = arith.addi %27, %2 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:16.3721829Z       %30 = arith.muli %24, %c64_i32 : i32
2026-02-21T09:47:16.3722056Z       %31 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:16.3722340Z       %32 = tt.splat %30 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:16.3722700Z       %33 = arith.addi %31, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:16.3723010Z       %34 = arith.addi %32, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:16.3723308Z       %35 = tt.expand_dims %28 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:47:16.3723588Z       %36 = arith.muli %35, %cst_1 : tensor<32x1xi32, #blocked1>
2026-02-21T09:47:16.3723824Z       %37 = tt.broadcast %36 : tensor<32x1xi32, #blocked1> -> tensor<32x4xi32, #blocked1>
2026-02-21T09:47:16.3724207Z       %38 = tt.expand_dims %33 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3724689Z       %39 = tt.broadcast %38 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3725073Z       %40 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<32x64xf32, #mma>)  : i32 {
2026-02-21T09:47:16.3725418Z         %49 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:16.3725754Z         %50 = arith.addi %49, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:16.3725998Z         %51 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:47:16.3726190Z         %52 = tt.splat %51 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:16.3726441Z         %53 = arith.addi %52, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:16.3726749Z         %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:47:16.3727064Z         %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<32x4xi32, #blocked1>
2026-02-21T09:47:16.3727301Z         %56 = arith.addi %37, %55 : tensor<32x4xi32, #blocked1>
2026-02-21T09:47:16.3727524Z         %57 = tt.addptr %7, %56 : tensor<32x4x!tt.ptr<bf16>, #blocked1>, tensor<32x4xi32, #blocked1>
2026-02-21T09:47:16.3727782Z         %58 = tt.load %57 : tensor<32x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:16.3728030Z         %59 = ttg.local_alloc %58 : (tensor<32x4xbf16, #blocked1>) -> !ttg.memdesc<32x4xbf16, #shared, #smem>
2026-02-21T09:47:16.3728404Z         %60 = ttg.local_load %59 : !ttg.memdesc<32x4xbf16, #shared, #smem> -> tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:16.3728866Z         %61 = arith.extf %60 : tensor<32x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:16.3729384Z         %62 = tt.expand_dims %50 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3729757Z         %63 = arith.muli %62, %cst_2 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3730064Z         %64 = tt.broadcast %63 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3730370Z         %65 = arith.addi %64, %39 : tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3730679Z         %66 = tt.addptr %8, %65 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3730986Z         %67 = tt.load %66 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3731221Z         %68 = arith.shli %67, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3731459Z         %69 = arith.shrsi %68, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3731713Z         %70 = arith.shrsi %67, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:16.3732001Z         %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:47:16.3732330Z         %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:47:16.3732628Z         %73 = tt.broadcast %71 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:47:16.3732894Z         %74 = arith.select %13, %73, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:47:16.3733133Z         %75 = tt.broadcast %72 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:47:16.3733365Z         %76 = arith.select %15, %75, %74 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:47:16.3733591Z         %77 = tt.reshape %76 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:47:16.3733818Z         %78 = arith.sitofp %77 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:47:16.3734065Z         %79 = ttg.local_alloc %78 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:47:16.3734388Z         %80 = ttg.local_load %79 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:16.3734857Z         %81 = tt.dot %61, %80, %arg5, inputPrecision = tf32 : tensor<32x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x64xf32, #mma>
2026-02-21T09:47:16.3735208Z         scf.yield %81 : tensor<32x64xf32, #mma>
2026-02-21T09:47:16.3735344Z       } {tt.loop_unroll_factor = 1 : i32}
2026-02-21T09:47:16.3735508Z       %41 = arith.truncf %40 : tensor<32x64xf32, #mma> to tensor<32x64xbf16, #mma>
2026-02-21T09:47:16.3735787Z       %42 = tt.expand_dims %29 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma>
2026-02-21T09:47:16.3736026Z       %43 = arith.muli %42, %cst_6 : tensor<32x1xi32, #mma>
2026-02-21T09:47:16.3736272Z       %44 = tt.expand_dims %34 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:47:16.3736531Z       %45 = tt.broadcast %43 : tensor<32x1xi32, #mma> -> tensor<32x64xi32, #mma>
2026-02-21T09:47:16.3736726Z       %46 = tt.broadcast %44 : tensor<1x64xi32, #mma> -> tensor<32x64xi32, #mma>
2026-02-21T09:47:16.3736906Z       %47 = arith.addi %45, %46 : tensor<32x64xi32, #mma>
2026-02-21T09:47:16.3737090Z       %48 = tt.addptr %16, %47 : tensor<32x64x!tt.ptr<bf16>, #mma>, tensor<32x64xi32, #mma>
2026-02-21T09:47:16.3737278Z       tt.store %48, %41 : tensor<32x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:16.3737449Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:47:16.3737589Z     tt.return
2026-02-21T09:47:16.3737681Z   }
2026-02-21T09:47:16.3737763Z }
2026-02-21T09:47:16.3737813Z 
2026-02-21T09:47:16.3737846Z {-#
2026-02-21T09:47:16.3737930Z   external_resources: {
2026-02-21T09:47:16.3738039Z     mlir_reproducer: {
2026-02-21T09:47:16.3739038Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:47:16.3740031Z       disable_threading: false,
2026-02-21T09:47:16.3740143Z       verify_each: true
2026-02-21T09:47:16.3740241Z     }
2026-02-21T09:47:16.3740320Z   }
2026-02-21T09:47:16.3740440Z #-}
2026-02-21T09:47:16.3740724Z /tmp/torchinductor_root/bh/cbhtwx7xaagwkpticpyxiicu3kro4ylitkdddkylogh3car5inkw.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:47:16.3741451Z /tmp/torchinductor_root/bh/cbhtwx7xaagwkpticpyxiicu3kro4ylitkdddkylogh3car5inkw.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:47:16.3742021Z [167s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:47:16.3742800Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 32, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[2, 0], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:47:16.3743503Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:47:16.3743675Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:47:37.4558991Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:47:37.4563272Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:47:37.4567840Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:47:37.4569244Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:47:37.4570007Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [16, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:47:37.4570658Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:47:37.4571382Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:47:37.4571837Z #smem = #ttg.shared_memory
2026-02-21T09:47:37.4572428Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:47:37.4573572Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:47:37.4574533Z     %cst = arith.constant dense<0.000000e+00> : tensor<4096x16xf32, #mma>
2026-02-21T09:47:37.4574931Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:47:37.4575212Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:47:37.4575492Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:47:37.4575863Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T09:47:37.4576221Z     %cst_0 = arith.constant dense<0> : tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4576577Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:47:37.4576848Z     %c12_i32 = arith.constant 12 : i32
2026-02-21T09:47:37.4577084Z     %c504_i32 = arith.constant 504 : i32
2026-02-21T09:47:37.4577361Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:47:37.4577550Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:47:37.4577741Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:47:37.4578041Z     %cst_1 = arith.constant dense<0> : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4578478Z     %cst_2 = arith.constant dense<8192> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4579015Z     %cst_3 = arith.constant dense<0> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4579388Z     %cst_4 = arith.constant dense<1024> : tensor<4096x1xi32, #blocked1>
2026-02-21T09:47:37.4579747Z     %cst_5 = arith.constant dense<4> : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4580092Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:37.4580467Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:37.4580808Z     %cst_8 = arith.constant dense<8192> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4581228Z     %cst_9 = arith.constant dense<0> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4581644Z     %cst_10 = arith.constant dense<512> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4581992Z     %cst_11 = arith.constant dense<0> : tensor<1x16xi64, #mma>
2026-02-21T09:47:37.4582277Z     %cst_12 = arith.constant dense<8192> : tensor<1x16xi64, #mma>
2026-02-21T09:47:37.4582568Z     %cst_13 = arith.constant dense<8192> : tensor<4096x1xi64, #mma>
2026-02-21T09:47:37.4582847Z     %cst_14 = arith.constant dense<0> : tensor<4096x1xi64, #mma>
2026-02-21T09:47:37.4583132Z     %cst_15 = arith.constant dense<16384> : tensor<4096x1xi64, #mma>
2026-02-21T09:47:37.4583385Z     %0 = tt.get_program_id x : i32
2026-02-21T09:47:37.4583570Z     %1 = arith.divsi %0, %c2048_i32 : i32
2026-02-21T09:47:37.4583767Z     %2 = arith.muli %1, %c4_i32 : i32
2026-02-21T09:47:37.4583957Z     %3 = arith.subi %c4_i32, %2 : i32
2026-02-21T09:47:37.4584135Z     %4 = arith.minsi %3, %c4_i32 : i32
2026-02-21T09:47:37.4584327Z     %5 = arith.remsi %0, %c2048_i32 : i32
2026-02-21T09:47:37.4584517Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:47:37.4584695Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:47:37.4584875Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:47:37.4585051Z     %9 = arith.muli %7, %c4096_i32 : i32
2026-02-21T09:47:37.4585437Z     %10 = tt.make_range {end = 4096 : i32, start = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:37.4585915Z     %11 = tt.make_range {end = 4096 : i32, start = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:37.4586369Z     %12 = tt.splat %9 : i32 -> tensor<4096xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:37.4586747Z     %13 = arith.addi %12, %10 : tensor<4096xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:37.4587044Z     %14 = arith.muli %8, %c16_i32 : i32
2026-02-21T09:47:37.4587366Z     %15 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:37.4587855Z     %16 = tt.expand_dims %13 {axis = 1 : i32} : tensor<4096xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4096x1xi32, #blocked1>
2026-02-21T09:47:37.4588197Z     %17 = arith.muli %16, %cst_4 : tensor<4096x1xi32, #blocked1>
2026-02-21T09:47:37.4588445Z     %18 = tt.broadcast %17 : tensor<4096x1xi32, #blocked1> -> tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4588727Z     %19 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<4096x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:37.4588934Z     %20 = arith.extsi %14 : i32 to i64
2026-02-21T09:47:37.4589170Z     %21 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4589555Z     %22 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4590088Z     %23 = arith.extsi %22 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4590591Z     %24 = tt.splat %20 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4591018Z     %25 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4591553Z     %26 = arith.extsi %25 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4592069Z     %27 = arith.addi %24, %26 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4592549Z     %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4593076Z     %29 = tt.broadcast %28 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4593463Z     %30 = arith.cmpi sge, %28, %cst_3 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4593764Z     %31 = arith.cmpi slt, %28, %cst_2 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4594047Z     %32 = arith.andi %30, %31 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4594419Z     %33 = tt.broadcast %32 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4594862Z     %34 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:47:37.4595374Z     %35 = tt.expand_dims %34 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:47:37.4595878Z     %36 = tt.expand_dims %35 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:37.4596211Z     %37 = arith.cmpi eq, %36, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:37.4596452Z     %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked>
2026-02-21T09:47:37.4596712Z     %39 = arith.cmpi eq, %36, %cst_7 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:37.4596943Z     %40 = tt.broadcast %39 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked>
2026-02-21T09:47:37.4597283Z     %41 = scf.for %arg3 = %c0_i32 to %c504_i32 step %c12_i32 iter_args(%arg4 = %cst) -> (tensor<4096x16xf32, #mma>)  : i32 {
2026-02-21T09:47:37.4597552Z       %69 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:47:37.4597751Z       %70 = tt.splat %69 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:37.4597970Z       %71 = arith.addi %70, %15 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:37.4598248Z       %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:47:37.4598524Z       %73 = tt.broadcast %72 : tensor<1x8xi32, #blocked1> -> tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4598717Z       %74 = arith.addi %18, %73 : tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4598924Z       %75 = tt.addptr %19, %74 : tensor<4096x8x!tt.ptr<bf16>, #blocked1>, tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4599136Z       %76 = tt.load %75 : tensor<4096x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:37.4599361Z       %77 = ttg.local_alloc %76 : (tensor<4096x8xbf16, #blocked1>) -> !ttg.memdesc<4096x8xbf16, #shared, #smem>
2026-02-21T09:47:37.4599696Z       %78 = ttg.local_load %77 : !ttg.memdesc<4096x8xbf16, #shared, #smem> -> tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4600110Z       %79 = arith.extf %78 : tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4600401Z       %80 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:47:37.4600629Z       %81 = tt.splat %80 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4600925Z       %82 = arith.addi %81, %23 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4601328Z       %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4601678Z       %84 = arith.muli %83, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4601980Z       %85 = tt.broadcast %84 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4602276Z       %86 = arith.addi %85, %29 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4602654Z       %87 = tt.addptr %21, %86 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4602971Z       %88 = arith.cmpi sge, %83, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4603210Z       %89 = arith.cmpi slt, %83, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4603440Z       %90 = arith.andi %88, %89 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4603733Z       %91 = tt.broadcast %90 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4604025Z       %92 = arith.andi %91, %33 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4604263Z       %93 = tt.load %87, %92, %cst_1 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4604523Z       %94 = arith.shli %93, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4604758Z       %95 = arith.shrsi %94, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4605014Z       %96 = arith.shrsi %93, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4605296Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:37.4605628Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:37.4605904Z       %99 = tt.broadcast %97 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4606141Z       %100 = arith.select %38, %99, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4606380Z       %101 = tt.broadcast %98 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4606610Z       %102 = arith.select %40, %101, %100 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4606843Z       %103 = tt.reshape %102 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2>
2026-02-21T09:47:37.4607066Z       %104 = arith.sitofp %103 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2>
2026-02-21T09:47:37.4607318Z       %105 = ttg.local_alloc %104 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem>
2026-02-21T09:47:37.4607646Z       %106 = ttg.local_load %105 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4608131Z       %107 = tt.dot %79, %106, %arg4, inputPrecision = tf32 : tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<4096x16xf32, #mma>
2026-02-21T09:47:37.4608483Z       %108 = arith.addi %arg3, %c4_i32 : i32
2026-02-21T09:47:37.4608638Z       %109 = arith.muli %108, %c2_i32 : i32
2026-02-21T09:47:37.4608807Z       %110 = tt.splat %109 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:37.4609036Z       %111 = arith.addi %110, %15 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:37.4609324Z       %112 = tt.expand_dims %111 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:47:37.4609601Z       %113 = tt.broadcast %112 : tensor<1x8xi32, #blocked1> -> tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4609795Z       %114 = arith.addi %18, %113 : tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4610001Z       %115 = tt.addptr %19, %114 : tensor<4096x8x!tt.ptr<bf16>, #blocked1>, tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4610214Z       %116 = tt.load %115 : tensor<4096x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:37.4610437Z       %117 = ttg.local_alloc %116 : (tensor<4096x8xbf16, #blocked1>) -> !ttg.memdesc<4096x8xbf16, #shared, #smem>
2026-02-21T09:47:37.4610772Z       %118 = ttg.local_load %117 : !ttg.memdesc<4096x8xbf16, #shared, #smem> -> tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4611178Z       %119 = arith.extf %118 : tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4611462Z       %120 = arith.extsi %108 : i32 to i64
2026-02-21T09:47:37.4611670Z       %121 = tt.splat %120 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4611962Z       %122 = arith.addi %121, %23 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4612362Z       %123 = tt.expand_dims %122 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4612719Z       %124 = arith.muli %123, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4613023Z       %125 = tt.broadcast %124 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4613339Z       %126 = arith.addi %125, %29 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4613646Z       %127 = tt.addptr %21, %126 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4613977Z       %128 = arith.cmpi sge, %123, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4614223Z       %129 = arith.cmpi slt, %123, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4614475Z       %130 = arith.andi %128, %129 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4614772Z       %131 = tt.broadcast %130 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4615092Z       %132 = arith.andi %131, %33 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4615340Z       %133 = tt.load %127, %132, %cst_1 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4615585Z       %134 = arith.shli %133, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4615820Z       %135 = arith.shrsi %134, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4616062Z       %136 = arith.shrsi %133, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4616346Z       %137 = tt.expand_dims %135 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:37.4616706Z       %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:37.4616987Z       %139 = tt.broadcast %137 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4617228Z       %140 = arith.select %38, %139, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4617481Z       %141 = tt.broadcast %138 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4617710Z       %142 = arith.select %40, %141, %140 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4617943Z       %143 = tt.reshape %142 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2>
2026-02-21T09:47:37.4618163Z       %144 = arith.sitofp %143 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2>
2026-02-21T09:47:37.4618421Z       %145 = ttg.local_alloc %144 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem>
2026-02-21T09:47:37.4618751Z       %146 = ttg.local_load %145 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4619221Z       %147 = tt.dot %119, %146, %107, inputPrecision = tf32 : tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<4096x16xf32, #mma>
2026-02-21T09:47:37.4619596Z       %148 = arith.addi %arg3, %c8_i32 : i32
2026-02-21T09:47:37.4619732Z       %149 = arith.muli %148, %c2_i32 : i32
2026-02-21T09:47:37.4619924Z       %150 = tt.splat %149 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:37.4620152Z       %151 = arith.addi %150, %15 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:37.4620441Z       %152 = tt.expand_dims %151 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:47:37.4620750Z       %153 = tt.broadcast %152 : tensor<1x8xi32, #blocked1> -> tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4620965Z       %154 = arith.addi %18, %153 : tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4621191Z       %155 = tt.addptr %19, %154 : tensor<4096x8x!tt.ptr<bf16>, #blocked1>, tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4621414Z       %156 = tt.load %155 : tensor<4096x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:37.4621649Z       %157 = ttg.local_alloc %156 : (tensor<4096x8xbf16, #blocked1>) -> !ttg.memdesc<4096x8xbf16, #shared, #smem>
2026-02-21T09:47:37.4621987Z       %158 = ttg.local_load %157 : !ttg.memdesc<4096x8xbf16, #shared, #smem> -> tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4622402Z       %159 = arith.extf %158 : tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4622713Z       %160 = arith.extsi %148 : i32 to i64
2026-02-21T09:47:37.4622940Z       %161 = tt.splat %160 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4623242Z       %162 = arith.addi %161, %23 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4623635Z       %163 = tt.expand_dims %162 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4623991Z       %164 = arith.muli %163, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4624300Z       %165 = tt.broadcast %164 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4624608Z       %166 = arith.addi %165, %29 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4624934Z       %167 = tt.addptr %21, %166 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4625254Z       %168 = arith.cmpi sge, %163, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4625504Z       %169 = arith.cmpi slt, %163, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4625753Z       %170 = arith.andi %168, %169 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4626054Z       %171 = tt.broadcast %170 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4626347Z       %172 = arith.andi %171, %33 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4626593Z       %173 = tt.load %167, %172, %cst_1 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4626842Z       %174 = arith.shli %173, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4627073Z       %175 = arith.shrsi %174, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4627312Z       %176 = arith.shrsi %173, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4627598Z       %177 = tt.expand_dims %175 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:37.4627937Z       %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:37.4628218Z       %179 = tt.broadcast %177 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4628453Z       %180 = arith.select %38, %179, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4628706Z       %181 = tt.broadcast %178 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4628937Z       %182 = arith.select %40, %181, %180 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4629169Z       %183 = tt.reshape %182 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2>
2026-02-21T09:47:37.4629413Z       %184 = arith.sitofp %183 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2>
2026-02-21T09:47:37.4629663Z       %185 = ttg.local_alloc %184 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem>
2026-02-21T09:47:37.4629987Z       %186 = ttg.local_load %185 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4630459Z       %187 = tt.dot %159, %186, %147, inputPrecision = tf32 : tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<4096x16xf32, #mma>
2026-02-21T09:47:37.4630810Z       scf.yield %187 : tensor<4096x16xf32, #mma>
2026-02-21T09:47:37.4630949Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:47:37.4631161Z     %42 = scf.for %arg3 = %c504_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %41) -> (tensor<4096x16xf32, #mma>)  : i32 {
2026-02-21T09:47:37.4631380Z       %69 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:47:37.4631551Z       %70 = tt.splat %69 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:37.4631770Z       %71 = arith.addi %70, %15 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:37.4632046Z       %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:47:37.4632317Z       %73 = tt.broadcast %72 : tensor<1x8xi32, #blocked1> -> tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4632516Z       %74 = arith.addi %18, %73 : tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4632726Z       %75 = tt.addptr %19, %74 : tensor<4096x8x!tt.ptr<bf16>, #blocked1>, tensor<4096x8xi32, #blocked1>
2026-02-21T09:47:37.4632960Z       %76 = tt.load %75 : tensor<4096x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:37.4633184Z       %77 = ttg.local_alloc %76 : (tensor<4096x8xbf16, #blocked1>) -> !ttg.memdesc<4096x8xbf16, #shared, #smem>
2026-02-21T09:47:37.4633516Z       %78 = ttg.local_load %77 : !ttg.memdesc<4096x8xbf16, #shared, #smem> -> tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4633941Z       %79 = arith.extf %78 : tensor<4096x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4634227Z       %80 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:47:37.4634433Z       %81 = tt.splat %80 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4634728Z       %82 = arith.addi %81, %23 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:37.4635112Z       %83 = tt.expand_dims %82 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4635464Z       %84 = arith.muli %83, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4635766Z       %85 = tt.broadcast %84 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4636061Z       %86 = arith.addi %85, %29 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4636371Z       %87 = tt.addptr %21, %86 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4636681Z       %88 = arith.cmpi sge, %83, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4636941Z       %89 = arith.cmpi slt, %83, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4637170Z       %90 = arith.andi %88, %89 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4637479Z       %91 = tt.broadcast %90 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4637775Z       %92 = arith.andi %91, %33 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4638012Z       %93 = tt.load %87, %92, %cst_1 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4638254Z       %94 = arith.shli %93, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4638485Z       %95 = arith.shrsi %94, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4638713Z       %96 = arith.shrsi %93, %cst_5 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:37.4638996Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:37.4639328Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:37.4639603Z       %99 = tt.broadcast %97 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4639840Z       %100 = arith.select %38, %99, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4640076Z       %101 = tt.broadcast %98 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4640309Z       %102 = arith.select %40, %101, %100 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:37.4640541Z       %103 = tt.reshape %102 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2>
2026-02-21T09:47:37.4640762Z       %104 = arith.sitofp %103 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2>
2026-02-21T09:47:37.4641030Z       %105 = ttg.local_alloc %104 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem>
2026-02-21T09:47:37.4641355Z       %106 = ttg.local_load %105 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:37.4641852Z       %107 = tt.dot %79, %106, %arg4, inputPrecision = tf32 : tensor<4096x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<4096x16xf32, #mma>
2026-02-21T09:47:37.4642210Z       scf.yield %107 : tensor<4096x16xf32, #mma>
2026-02-21T09:47:37.4642343Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:47:37.4642520Z     %43 = arith.truncf %42 : tensor<4096x16xf32, #mma> to tensor<4096x16xbf16, #mma>
2026-02-21T09:47:37.4642733Z     %44 = arith.extsi %9 : i32 to i64
2026-02-21T09:47:37.4642897Z     %45 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<4096x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:37.4643107Z     %46 = tt.splat %44 : i64 -> tensor<4096xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:37.4643394Z     %47 = arith.extsi %11 : tensor<4096xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<4096xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:37.4643678Z     %48 = arith.addi %46, %47 : tensor<4096xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:37.4643944Z     %49 = tt.expand_dims %48 {axis = 1 : i32} : tensor<4096xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<4096x1xi64, #mma>
2026-02-21T09:47:37.4644190Z     %50 = arith.muli %49, %cst_13 : tensor<4096x1xi64, #mma>
2026-02-21T09:47:37.4644371Z     %51 = tt.broadcast %50 : tensor<4096x1xi64, #mma> -> tensor<4096x16xi64, #mma>
2026-02-21T09:47:37.4644582Z     %52 = tt.splat %20 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:37.4644839Z     %53 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:37.4645138Z     %54 = arith.extsi %53 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:37.4645427Z     %55 = arith.addi %52, %54 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:37.4645680Z     %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi64, #mma>
2026-02-21T09:47:37.4645939Z     %57 = tt.broadcast %56 : tensor<1x16xi64, #mma> -> tensor<4096x16xi64, #mma>
2026-02-21T09:47:37.4646121Z     %58 = arith.addi %51, %57 : tensor<4096x16xi64, #mma>
2026-02-21T09:47:37.4646308Z     %59 = tt.addptr %45, %58 : tensor<4096x16x!tt.ptr<bf16>, #mma>, tensor<4096x16xi64, #mma>
2026-02-21T09:47:37.4646513Z     %60 = arith.cmpi sge, %49, %cst_14 : tensor<4096x1xi64, #mma>
2026-02-21T09:47:37.4646679Z     %61 = arith.cmpi slt, %49, %cst_15 : tensor<4096x1xi64, #mma>
2026-02-21T09:47:37.4646843Z     %62 = arith.andi %60, %61 : tensor<4096x1xi1, #mma>
2026-02-21T09:47:37.4647018Z     %63 = tt.broadcast %62 : tensor<4096x1xi1, #mma> -> tensor<4096x16xi1, #mma>
2026-02-21T09:47:37.4647201Z     %64 = arith.cmpi sge, %56, %cst_11 : tensor<1x16xi64, #mma>
2026-02-21T09:47:37.4647366Z     %65 = arith.cmpi slt, %56, %cst_12 : tensor<1x16xi64, #mma>
2026-02-21T09:47:37.4647518Z     %66 = arith.andi %64, %65 : tensor<1x16xi1, #mma>
2026-02-21T09:47:37.4647687Z     %67 = tt.broadcast %66 : tensor<1x16xi1, #mma> -> tensor<4096x16xi1, #mma>
2026-02-21T09:47:37.4647859Z     %68 = arith.andi %63, %67 : tensor<4096x16xi1, #mma>
2026-02-21T09:47:37.4648017Z     tt.store %59, %43, %68 : tensor<4096x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:37.4648159Z     tt.return
2026-02-21T09:47:37.4648241Z   }
2026-02-21T09:47:37.4648326Z }
2026-02-21T09:47:37.4648371Z 
2026-02-21T09:47:37.4648407Z {-#
2026-02-21T09:47:37.4648497Z   external_resources: {
2026-02-21T09:47:37.4648598Z     mlir_reproducer: {
2026-02-21T09:47:37.4649626Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:47:37.4650640Z       disable_threading: false,
2026-02-21T09:47:37.4650749Z       verify_each: true
2026-02-21T09:47:37.4650844Z     }
2026-02-21T09:47:37.4650918Z   }
2026-02-21T09:47:37.4650993Z #-}
2026-02-21T09:47:37.4651277Z /tmp/torchinductor_root/vr/cvrwrnmhifgmlzk2dlcw2x5idbs7jh2w5orzxxh7f53mslhtvpcn.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:47:37.4651996Z /tmp/torchinductor_root/vr/cvrwrnmhifgmlzk2dlcw2x5idbs7jh2w5orzxxh7f53mslhtvpcn.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:47:37.4652565Z [188s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:47:37.4653291Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 4096, 16], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:47:37.4653948Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:47:37.4654134Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:47:55.3321338Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:47:55.3323652Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:47:55.3324574Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:47:55.3325319Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:47:55.3325982Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:47:55.3326576Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:47:55.3327148Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:47:55.3327573Z #smem = #ttg.shared_memory
2026-02-21T09:47:55.3328117Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:47:55.3329236Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:47:55.3330177Z     %cst = arith.constant dense<0.000000e+00> : tensor<1024x32xf32, #mma>
2026-02-21T09:47:55.3330564Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:47:55.3330838Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:47:55.3331113Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:47:55.3331377Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:47:55.3331644Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:47:55.3332156Z     %cst_0 = arith.constant dense<0> : tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3332494Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:47:55.3332758Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:47:55.3333172Z     %cst_1 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:55.3333973Z     %cst_2 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3334666Z     %cst_3 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3335024Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:47:55.3335225Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:47:55.3335470Z     %cst_4 = arith.constant dense<1024> : tensor<1024x1xi32, #blocked1>
2026-02-21T09:47:55.3335828Z     %cst_5 = arith.constant dense<8192> : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3336240Z     %cst_6 = arith.constant dense<4> : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3336590Z     %cst_7 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:55.3336862Z     %cst_8 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:55.3337141Z     %cst_9 = arith.constant dense<8192> : tensor<1024x1xi32, #mma>
2026-02-21T09:47:55.3337372Z     %0 = tt.get_program_id x : i32
2026-02-21T09:47:55.3337601Z     %1 = arith.divsi %0, %c64_i32 : i32
2026-02-21T09:47:55.3337781Z     %2 = arith.muli %1, %c4_i32 : i32
2026-02-21T09:47:55.3337965Z     %3 = arith.subi %c256_i32, %2 : i32
2026-02-21T09:47:55.3338138Z     %4 = arith.minsi %3, %c4_i32 : i32
2026-02-21T09:47:55.3338322Z     %5 = arith.remsi %0, %c64_i32 : i32
2026-02-21T09:47:55.3338540Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:47:55.3338703Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:47:55.3338881Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:47:55.3339137Z     %9 = arith.muli %7, %c32_i32 : i32
2026-02-21T09:47:55.3339514Z     %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3340037Z     %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:55.3340498Z     %12 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3340899Z     %13 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:55.3341302Z     %14 = arith.addi %12, %10 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3341699Z     %15 = arith.addi %13, %11 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:55.3341961Z     %16 = arith.muli %8, %c1024_i32 : i32
2026-02-21T09:47:55.3342285Z     %17 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:55.3342739Z     %18 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:55.3343134Z     %19 = tt.splat %16 : i32 -> tensor<1024xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:55.3352114Z     %20 = tt.splat %16 : i32 -> tensor<1024xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:55.3352424Z     %21 = arith.addi %19, %17 : tensor<1024xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:55.3352693Z     %22 = arith.addi %20, %18 : tensor<1024xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:55.3353038Z     %23 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3353419Z     %24 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:55.3353866Z     %25 = tt.expand_dims %21 {axis = 1 : i32} : tensor<1024xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<1024x1xi32, #blocked1>
2026-02-21T09:47:55.3354182Z     %26 = arith.muli %25, %cst_4 : tensor<1024x1xi32, #blocked1>
2026-02-21T09:47:55.3354426Z     %27 = tt.broadcast %26 : tensor<1024x1xi32, #blocked1> -> tensor<1024x4xi32, #blocked1>
2026-02-21T09:47:55.3354683Z     %28 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1024x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:55.3355030Z     %29 = tt.expand_dims %14 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3355457Z     %30 = tt.broadcast %29 : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3355785Z     %31 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3356098Z     %32 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:47:55.3356510Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:47:55.3356909Z     %34 = tt.expand_dims %33 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:55.3357166Z     %35 = arith.cmpi eq, %34, %cst_7 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:55.3357366Z     %36 = tt.broadcast %35 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x32xi1, #blocked>
2026-02-21T09:47:55.3357561Z     %37 = arith.cmpi eq, %34, %cst_8 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:55.3357760Z     %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x32xi1, #blocked>
2026-02-21T09:47:55.3357994Z     %39 = ttg.local_alloc : () -> !ttg.memdesc<2x1024x4xbf16, #shared, #smem, mutable>
2026-02-21T09:47:55.3358267Z     %40 = tt.expand_dims %24 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:47:55.3358554Z     %41 = tt.broadcast %40 : tensor<1x4xi32, #blocked1> -> tensor<1024x4xi32, #blocked1>
2026-02-21T09:47:55.3358753Z     %42 = arith.addi %27, %41 : tensor<1024x4xi32, #blocked1>
2026-02-21T09:47:55.3358960Z     %43 = tt.addptr %28, %42 : tensor<1024x4x!tt.ptr<bf16>, #blocked1>, tensor<1024x4xi32, #blocked1>
2026-02-21T09:47:55.3359169Z     %44 = tt.load %43 : tensor<1024x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:55.3359463Z     %45 = ttg.memdesc_index %39[%c0_i32] : !ttg.memdesc<2x1024x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>
2026-02-21T09:47:55.3359831Z     ttg.local_store %44, %45 : tensor<1024x4xbf16, #blocked1> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>
2026-02-21T09:47:55.3360119Z     %46 = arith.addi %24, %cst_1 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:55.3360399Z     %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:47:55.3360673Z     %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked1> -> tensor<1024x4xi32, #blocked1>
2026-02-21T09:47:55.3360870Z     %49 = arith.addi %27, %48 : tensor<1024x4xi32, #blocked1>
2026-02-21T09:47:55.3361071Z     %50 = tt.addptr %28, %49 : tensor<1024x4x!tt.ptr<bf16>, #blocked1>, tensor<1024x4xi32, #blocked1>
2026-02-21T09:47:55.3361282Z     %51 = tt.load %50 : tensor<1024x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:55.3361571Z     %52 = ttg.memdesc_index %39[%c1_i32] : !ttg.memdesc<2x1024x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>
2026-02-21T09:47:55.3361958Z     ttg.local_store %51, %52 : tensor<1024x4xbf16, #blocked1> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>
2026-02-21T09:47:55.3362498Z     %53:4 = scf.for %arg3 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg4 = %cst, %arg5 = %c1_i32, %arg6 = %45, %arg7 = %52) -> (tensor<1024x32xf32, #mma>, i32, !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>, !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>)  : i32 {
2026-02-21T09:47:55.3363099Z       %109 = tt.splat %arg3 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3363401Z       %110 = arith.addi %109, %23 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3363626Z       %111 = arith.addi %arg3, %c4_i32 : i32
2026-02-21T09:47:55.3363752Z       %112 = arith.muli %111, %c2_i32 : i32
2026-02-21T09:47:55.3363929Z       %113 = tt.splat %112 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:55.3364163Z       %114 = arith.addi %113, %24 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:55.3364442Z       %115 = tt.expand_dims %114 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:47:55.3364730Z       %116 = tt.broadcast %115 : tensor<1x4xi32, #blocked1> -> tensor<1024x4xi32, #blocked1>
2026-02-21T09:47:55.3364930Z       %117 = arith.addi %27, %116 : tensor<1024x4xi32, #blocked1>
2026-02-21T09:47:55.3365144Z       %118 = tt.addptr %28, %117 : tensor<1024x4x!tt.ptr<bf16>, #blocked1>, tensor<1024x4xi32, #blocked1>
2026-02-21T09:47:55.3365365Z       %119 = tt.load %118 : tensor<1024x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:55.3365680Z       %120 = ttg.local_load %arg6 : !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> -> tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3366156Z       %121 = arith.extf %120 : tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3366620Z       %122 = tt.expand_dims %110 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3367000Z       %123 = arith.muli %122, %cst_5 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3367314Z       %124 = tt.broadcast %123 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3367621Z       %125 = arith.addi %124, %30 : tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3367936Z       %126 = tt.addptr %31, %125 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3368251Z       %127 = tt.load %126 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3368485Z       %128 = arith.shli %127, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3368727Z       %129 = arith.shrsi %128, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3368966Z       %130 = arith.shrsi %127, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3369266Z       %131 = tt.expand_dims %129 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:47:55.3369606Z       %132 = tt.expand_dims %130 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:47:55.3369894Z       %133 = tt.broadcast %131 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3370139Z       %134 = arith.select %36, %133, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3370403Z       %135 = tt.broadcast %132 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3370641Z       %136 = arith.select %38, %135, %134 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3370877Z       %137 = tt.reshape %136 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:47:55.3371120Z       %138 = arith.sitofp %137 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:47:55.3371375Z       %139 = ttg.local_alloc %138 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:47:55.3371698Z       %140 = ttg.local_load %139 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3372183Z       %141 = tt.dot %121, %140, %arg4, inputPrecision = tf32 : tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<1024x32xf32, #mma>
2026-02-21T09:47:55.3372545Z       %142 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T09:47:55.3372675Z       %143 = arith.cmpi slt, %142, %c2_i32 : i32
2026-02-21T09:47:55.3372813Z       %144 = arith.select %143, %142, %c0_i32 : i32
2026-02-21T09:47:55.3373088Z       %145 = ttg.memdesc_index %39[%144] : !ttg.memdesc<2x1024x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>
2026-02-21T09:47:55.3373462Z       ttg.local_store %119, %145 : tensor<1024x4xbf16, #blocked1> -> !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>
2026-02-21T09:47:55.3373880Z       scf.yield %141, %144, %arg7, %145 : tensor<1024x32xf32, #mma>, i32, !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>, !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4>
2026-02-21T09:47:55.3374216Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:47:55.3374477Z     %54 = arith.addi %23, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3374855Z     %55 = ttg.local_load %53#2 : !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> -> tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3375307Z     %56 = arith.extf %55 : tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3375766Z     %57 = tt.expand_dims %54 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3376117Z     %58 = arith.muli %57, %cst_5 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3376417Z     %59 = tt.broadcast %58 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3376718Z     %60 = arith.addi %59, %30 : tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3377018Z     %61 = tt.addptr %31, %60 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3377327Z     %62 = tt.load %61 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3377556Z     %63 = arith.shli %62, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3377786Z     %64 = arith.shrsi %63, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3378019Z     %65 = arith.shrsi %62, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3378299Z     %66 = tt.expand_dims %64 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:47:55.3378634Z     %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:47:55.3378929Z     %68 = tt.broadcast %66 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3379162Z     %69 = arith.select %36, %68, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3379392Z     %70 = tt.broadcast %67 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3379638Z     %71 = arith.select %38, %70, %69 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3379864Z     %72 = tt.reshape %71 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:47:55.3380080Z     %73 = arith.sitofp %72 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:47:55.3380321Z     %74 = ttg.local_alloc %73 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:47:55.3380642Z     %75 = ttg.local_load %74 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3381107Z     %76 = tt.dot %56, %75, %53#0, inputPrecision = tf32 : tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<1024x32xf32, #mma>
2026-02-21T09:47:55.3381540Z     %77 = arith.addi %23, %cst_2 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3381936Z     %78 = ttg.local_load %53#3 : !ttg.memdesc<1024x4xbf16, #shared, #smem, mutable, 2x1024x4> -> tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3382375Z     %79 = arith.extf %78 : tensor<1024x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3382851Z     %80 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3383206Z     %81 = arith.muli %80, %cst_5 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3383521Z     %82 = tt.broadcast %81 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3383819Z     %83 = arith.addi %82, %30 : tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3384122Z     %84 = tt.addptr %31, %83 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3384423Z     %85 = tt.load %84 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3384651Z     %86 = arith.shli %85, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3384884Z     %87 = arith.shrsi %86, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3385112Z     %88 = arith.shrsi %85, %cst_6 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3385392Z     %89 = tt.expand_dims %87 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:47:55.3385718Z     %90 = tt.expand_dims %88 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:47:55.3385997Z     %91 = tt.broadcast %89 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3386233Z     %92 = arith.select %36, %91, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3386461Z     %93 = tt.broadcast %90 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3386688Z     %94 = arith.select %38, %93, %92 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:47:55.3386909Z     %95 = tt.reshape %94 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:47:55.3387145Z     %96 = arith.sitofp %95 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:47:55.3387392Z     %97 = ttg.local_alloc %96 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:47:55.3387707Z     %98 = ttg.local_load %97 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3388185Z     %99 = tt.dot %79, %98, %76, inputPrecision = tf32 : tensor<1024x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<1024x32xf32, #mma>
2026-02-21T09:47:55.3388568Z     ttg.local_dealloc %39 : !ttg.memdesc<2x1024x4xbf16, #shared, #smem, mutable>
2026-02-21T09:47:55.3388788Z     %100 = arith.truncf %99 : tensor<1024x32xf32, #mma> to tensor<1024x32xbf16, #mma>
2026-02-21T09:47:55.3389067Z     %101 = tt.expand_dims %22 {axis = 1 : i32} : tensor<1024xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<1024x1xi32, #mma>
2026-02-21T09:47:55.3389310Z     %102 = arith.muli %101, %cst_9 : tensor<1024x1xi32, #mma>
2026-02-21T09:47:55.3389546Z     %103 = tt.expand_dims %15 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma>
2026-02-21T09:47:55.3389805Z     %104 = tt.broadcast %102 : tensor<1024x1xi32, #mma> -> tensor<1024x32xi32, #mma>
2026-02-21T09:47:55.3390018Z     %105 = tt.broadcast %103 : tensor<1x32xi32, #mma> -> tensor<1024x32xi32, #mma>
2026-02-21T09:47:55.3390205Z     %106 = arith.addi %104, %105 : tensor<1024x32xi32, #mma>
2026-02-21T09:47:55.3390383Z     %107 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1024x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:55.3390609Z     %108 = tt.addptr %107, %106 : tensor<1024x32x!tt.ptr<bf16>, #mma>, tensor<1024x32xi32, #mma>
2026-02-21T09:47:55.3390812Z     tt.store %108, %100 : tensor<1024x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:55.3390956Z     tt.return
2026-02-21T09:47:55.3391054Z   }
2026-02-21T09:47:55.3391143Z }
2026-02-21T09:47:55.3391188Z 
2026-02-21T09:47:55.3391227Z {-#
2026-02-21T09:47:55.3391310Z   external_resources: {
2026-02-21T09:47:55.3391434Z     mlir_reproducer: {
2026-02-21T09:47:55.3392434Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:47:55.3393426Z       disable_threading: false,
2026-02-21T09:47:55.3393540Z       verify_each: true
2026-02-21T09:47:55.3393637Z     }
2026-02-21T09:47:55.3393715Z   }
2026-02-21T09:47:55.3393788Z #-}
2026-02-21T09:47:55.3394071Z /tmp/torchinductor_root/3j/c3jhspwcswupy7forased75qyulwdzwqthgnq6ccgfjrupw4dxy6.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:47:55.3394761Z /tmp/torchinductor_root/3j/c3jhspwcswupy7forased75qyulwdzwqthgnq6ccgfjrupw4dxy6.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:47:55.3395312Z [206s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:47:55.3396068Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 1024, 32], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:47:55.3396727Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:47:55.3396899Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:47:55.3945148Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:47:55.3948826Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [16, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:47:55.3949168Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:47:55.3949495Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:47:55.3949797Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [8, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:47:55.3950065Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:47:55.3950306Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:47:55.3950491Z #smem = #ttg.shared_memory
2026-02-21T09:47:55.3950728Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:47:55.3951203Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:47:55.3951588Z     %cst = arith.constant dense<0.000000e+00> : tensor<32x16xf32, #mma>
2026-02-21T09:47:55.3951756Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:47:55.3951917Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:47:55.3952042Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:47:55.3952157Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:47:55.3952301Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:47:55.3952425Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:47:55.3952550Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T09:47:55.3952706Z     %cst_0 = arith.constant dense<0> : tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:55.3952854Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:47:55.3953042Z     %cst_1 = arith.constant dense<0> : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3953300Z     %cst_2 = arith.constant dense<8192> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3953558Z     %cst_3 = arith.constant dense<0> : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3953816Z     %cst_4 = arith.constant dense<512> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3954069Z     %cst_5 = arith.constant dense<0> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3954322Z     %cst_6 = arith.constant dense<8192> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3954544Z     %cst_7 = arith.constant dense<1024> : tensor<32x1xi32, #blocked1>
2026-02-21T09:47:55.3954762Z     %cst_8 = arith.constant dense<4> : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3954978Z     %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:55.3955154Z     %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:55.3955338Z     %cst_11 = arith.constant dense<8192> : tensor<32x1xi32, #mma>
2026-02-21T09:47:55.3955485Z     %0 = tt.get_program_id x : i32
2026-02-21T09:47:55.3955605Z     %1 = arith.divsi %0, %c2048_i32 : i32
2026-02-21T09:47:55.3955726Z     %2 = arith.muli %1, %c4_i32 : i32
2026-02-21T09:47:55.3955877Z     %3 = arith.subi %c512_i32, %2 : i32
2026-02-21T09:47:55.3956020Z     %4 = arith.minsi %3, %c4_i32 : i32
2026-02-21T09:47:55.3956148Z     %5 = arith.remsi %0, %c2048_i32 : i32
2026-02-21T09:47:55.3956271Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:47:55.3956386Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:47:55.3956519Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:47:55.3956631Z     %9 = arith.muli %7, %c32_i32 : i32
2026-02-21T09:47:55.3956841Z     %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:55.3957117Z     %11 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:55.3957366Z     %12 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:55.3957580Z     %13 = tt.splat %9 : i32 -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:55.3957800Z     %14 = arith.addi %12, %10 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:47:55.3958016Z     %15 = arith.addi %13, %11 : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:47:55.3958184Z     %16 = arith.muli %8, %c16_i32 : i32
2026-02-21T09:47:55.3958425Z     %17 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3958735Z     %18 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:55.3958975Z     %19 = tt.splat %16 : i32 -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:55.3959187Z     %20 = arith.addi %19, %18 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:47:55.3959428Z     %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:55.3959763Z     %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1>
2026-02-21T09:47:55.3960013Z     %23 = arith.muli %22, %cst_7 : tensor<32x1xi32, #blocked1>
2026-02-21T09:47:55.3960233Z     %24 = tt.broadcast %23 : tensor<32x1xi32, #blocked1> -> tensor<32x8xi32, #blocked1>
2026-02-21T09:47:55.3960451Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<32x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:55.3960617Z     %26 = arith.extsi %16 : i32 to i64
2026-02-21T09:47:55.3960818Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3961125Z     %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3961561Z     %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3961967Z     %30 = tt.splat %26 : i64 -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3962367Z     %31 = arith.extsi %17 : tensor<16xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3962838Z     %32 = arith.addi %30, %31 : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3963229Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3963651Z     %34 = tt.broadcast %33 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3963969Z     %35 = arith.cmpi sge, %33, %cst_3 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3964241Z     %36 = arith.cmpi slt, %33, %cst_2 : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3964473Z     %37 = arith.andi %35, %36 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3964773Z     %38 = tt.broadcast %37 : tensor<1x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3965166Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:47:55.3965580Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:47:55.3965986Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:55.3966239Z     %42 = arith.cmpi eq, %41, %cst_9 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:55.3966437Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked>
2026-02-21T09:47:55.3966636Z     %44 = arith.cmpi eq, %41, %cst_10 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:47:55.3966830Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x16xi1, #blocked>
2026-02-21T09:47:55.3967093Z     %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg4 = %cst) -> (tensor<32x16xf32, #mma>)  : i32 {
2026-02-21T09:47:55.3967307Z       %56 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:47:55.3967480Z       %57 = tt.splat %56 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:55.3967698Z       %58 = arith.addi %57, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:55.3967991Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:47:55.3968274Z       %60 = tt.broadcast %59 : tensor<1x8xi32, #blocked1> -> tensor<32x8xi32, #blocked1>
2026-02-21T09:47:55.3968482Z       %61 = arith.addi %24, %60 : tensor<32x8xi32, #blocked1>
2026-02-21T09:47:55.3968684Z       %62 = tt.addptr %25, %61 : tensor<32x8x!tt.ptr<bf16>, #blocked1>, tensor<32x8xi32, #blocked1>
2026-02-21T09:47:55.3968890Z       %63 = tt.load %62 : tensor<32x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:55.3969112Z       %64 = ttg.local_alloc %63 : (tensor<32x8xbf16, #blocked1>) -> !ttg.memdesc<32x8xbf16, #shared, #smem>
2026-02-21T09:47:55.3969447Z       %65 = ttg.local_load %64 : !ttg.memdesc<32x8xbf16, #shared, #smem> -> tensor<32x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3969854Z       %66 = arith.extf %65 : tensor<32x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3970142Z       %67 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:47:55.3970351Z       %68 = tt.splat %67 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3970650Z       %69 = arith.addi %68, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3971037Z       %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3971388Z       %71 = arith.muli %70, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3971695Z       %72 = tt.broadcast %71 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3971997Z       %73 = arith.addi %72, %34 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3972325Z       %74 = tt.addptr %27, %73 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3972642Z       %75 = arith.cmpi sge, %70, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3972884Z       %76 = arith.cmpi slt, %70, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3973132Z       %77 = arith.andi %75, %76 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3973454Z       %78 = tt.broadcast %77 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3973751Z       %79 = arith.andi %78, %38 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3973998Z       %80 = tt.load %74, %79, %cst_1 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3974248Z       %81 = arith.shli %80, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3974479Z       %82 = arith.shrsi %81, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3974719Z       %83 = arith.shrsi %80, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3975002Z       %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:55.3975337Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:55.3975621Z       %86 = tt.broadcast %84 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:55.3975854Z       %87 = arith.select %43, %86, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:55.3976087Z       %88 = tt.broadcast %85 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:55.3976330Z       %89 = arith.select %45, %88, %87 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:55.3976556Z       %90 = tt.reshape %89 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2>
2026-02-21T09:47:55.3976791Z       %91 = arith.sitofp %90 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2>
2026-02-21T09:47:55.3977035Z       %92 = ttg.local_alloc %91 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem>
2026-02-21T09:47:55.3977359Z       %93 = ttg.local_load %92 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3977833Z       %94 = tt.dot %66, %93, %arg4, inputPrecision = tf32 : tensor<32x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x16xf32, #mma>
2026-02-21T09:47:55.3978178Z       %95 = arith.addi %arg3, %c4_i32 : i32
2026-02-21T09:47:55.3978305Z       %96 = arith.muli %95, %c2_i32 : i32
2026-02-21T09:47:55.3978474Z       %97 = tt.splat %96 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:55.3978701Z       %98 = arith.addi %97, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:47:55.3978977Z       %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:47:55.3979258Z       %100 = tt.broadcast %99 : tensor<1x8xi32, #blocked1> -> tensor<32x8xi32, #blocked1>
2026-02-21T09:47:55.3979462Z       %101 = arith.addi %24, %100 : tensor<32x8xi32, #blocked1>
2026-02-21T09:47:55.3979666Z       %102 = tt.addptr %25, %101 : tensor<32x8x!tt.ptr<bf16>, #blocked1>, tensor<32x8xi32, #blocked1>
2026-02-21T09:47:55.3979881Z       %103 = tt.load %102 : tensor<32x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:47:55.3980110Z       %104 = ttg.local_alloc %103 : (tensor<32x8xbf16, #blocked1>) -> !ttg.memdesc<32x8xbf16, #shared, #smem>
2026-02-21T09:47:55.3980465Z       %105 = ttg.local_load %104 : !ttg.memdesc<32x8xbf16, #shared, #smem> -> tensor<32x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3980878Z       %106 = arith.extf %105 : tensor<32x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<32x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3981169Z       %107 = arith.extsi %95 : i32 to i64
2026-02-21T09:47:55.3981386Z       %108 = tt.splat %107 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3981691Z       %109 = arith.addi %108, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:47:55.3982090Z       %110 = tt.expand_dims %109 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3982456Z       %111 = arith.muli %110, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3982764Z       %112 = tt.broadcast %111 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3983079Z       %113 = arith.addi %112, %34 : tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3983394Z       %114 = tt.addptr %27, %113 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x16xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3983715Z       %115 = arith.cmpi sge, %110, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3983961Z       %116 = arith.cmpi slt, %110, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3984200Z       %117 = arith.andi %115, %116 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3984520Z       %118 = tt.broadcast %117 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3984823Z       %119 = arith.andi %118, %38 : tensor<4x16xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3985082Z       %120 = tt.load %114, %119, %cst_1 : tensor<4x16x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3985338Z       %121 = arith.shli %120, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3985580Z       %122 = arith.shrsi %121, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3985819Z       %123 = arith.shrsi %120, %cst_8 : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:47:55.3986114Z       %124 = tt.expand_dims %122 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:55.3986453Z       %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<4x16xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x16xi8, #blocked>
2026-02-21T09:47:55.3986745Z       %126 = tt.broadcast %124 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:55.3986992Z       %127 = arith.select %43, %126, %cst_0 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:55.3987230Z       %128 = tt.broadcast %125 : tensor<4x1x16xi8, #blocked> -> tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:55.3987468Z       %129 = arith.select %45, %128, %127 : tensor<4x2x16xi1, #blocked>, tensor<4x2x16xi8, #blocked>
2026-02-21T09:47:55.3987701Z       %130 = tt.reshape %129 : tensor<4x2x16xi8, #blocked> -> tensor<8x16xi8, #blocked2>
2026-02-21T09:47:55.3987927Z       %131 = arith.sitofp %130 : tensor<8x16xi8, #blocked2> to tensor<8x16xf32, #blocked2>
2026-02-21T09:47:55.3988188Z       %132 = ttg.local_alloc %131 : (tensor<8x16xf32, #blocked2>) -> !ttg.memdesc<8x16xf32, #shared1, #smem>
2026-02-21T09:47:55.3988532Z       %133 = ttg.local_load %132 : !ttg.memdesc<8x16xf32, #shared1, #smem> -> tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:47:55.3989002Z       %134 = tt.dot %106, %133, %94, inputPrecision = tf32 : tensor<32x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x16xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<32x16xf32, #mma>
2026-02-21T09:47:55.3989364Z       scf.yield %134 : tensor<32x16xf32, #mma>
2026-02-21T09:47:55.3989496Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:47:55.3989653Z     %47 = arith.truncf %46 : tensor<32x16xf32, #mma> to tensor<32x16xbf16, #mma>
2026-02-21T09:47:55.3989912Z     %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<32x1xi32, #mma>
2026-02-21T09:47:55.3990155Z     %49 = arith.muli %48, %cst_11 : tensor<32x1xi32, #mma>
2026-02-21T09:47:55.3990380Z     %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x16xi32, #mma>
2026-02-21T09:47:55.3990637Z     %51 = tt.broadcast %49 : tensor<32x1xi32, #mma> -> tensor<32x16xi32, #mma>
2026-02-21T09:47:55.3990841Z     %52 = tt.broadcast %50 : tensor<1x16xi32, #mma> -> tensor<32x16xi32, #mma>
2026-02-21T09:47:55.3991016Z     %53 = arith.addi %51, %52 : tensor<32x16xi32, #mma>
2026-02-21T09:47:55.3991193Z     %54 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<32x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:55.3991402Z     %55 = tt.addptr %54, %53 : tensor<32x16x!tt.ptr<bf16>, #mma>, tensor<32x16xi32, #mma>
2026-02-21T09:47:55.3991594Z     tt.store %55, %47 : tensor<32x16x!tt.ptr<bf16>, #mma>
2026-02-21T09:47:55.3991725Z     tt.return
2026-02-21T09:47:55.3991810Z   }
2026-02-21T09:47:55.3991888Z }
2026-02-21T09:47:55.3991932Z 
2026-02-21T09:47:55.3991963Z {-#
2026-02-21T09:47:55.3992050Z   external_resources: {
2026-02-21T09:47:55.3992149Z     mlir_reproducer: {
2026-02-21T09:47:55.3993182Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:47:55.3994201Z       disable_threading: false,
2026-02-21T09:47:55.3994308Z       verify_each: true
2026-02-21T09:47:55.3994403Z     }
2026-02-21T09:47:55.3994477Z   }
2026-02-21T09:47:55.3994553Z #-}
2026-02-21T09:47:55.3994838Z /tmp/torchinductor_root/lm/clmld3y3zmlww2g53qdkhjga4fhrgryynl2vnxsuowkkhbf4naug.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:47:55.3995549Z /tmp/torchinductor_root/lm/clmld3y3zmlww2g53qdkhjga4fhrgryynl2vnxsuowkkhbf4naug.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:47:55.3996111Z [206s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:47:55.3996843Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 32, 16], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:47:55.3997504Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:47:55.3997679Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:48:03.6817391Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:48:03.6818930Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:48:03.6819980Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:48:03.6820820Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:48:03.6821591Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:48:03.6822290Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 32, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:48:03.6822932Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:48:03.6823437Z #smem = #ttg.shared_memory
2026-02-21T09:48:03.6823712Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:48:03.6824188Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:48:03.6824610Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:48:03.6824781Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:48:03.6824902Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:48:03.6825026Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:48:03.6825144Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:48:03.6825298Z     %cst_0 = arith.constant dense<0> : tensor<1x2x128xi8, #blocked>
2026-02-21T09:48:03.6825452Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:48:03.6825663Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:48:03.6825784Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:48:03.6825904Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:48:03.6826163Z     %cst_1 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:48:03.6826382Z     %cst_2 = arith.constant dense<4> : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6826608Z     %cst_3 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:03.6826783Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:03.6826988Z     %cst_5 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:48:03.6827161Z     %cst_6 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:48:03.6827328Z     %cst_7 = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:48:03.6827504Z     %cst_8 = arith.constant dense<0> : tensor<1x128xi64, #mma>
2026-02-21T09:48:03.6827680Z     %cst_9 = arith.constant dense<8192> : tensor<1x128xi64, #mma>
2026-02-21T09:48:03.6827830Z     %0 = tt.get_program_id x : i32
2026-02-21T09:48:03.6827949Z     %1 = arith.divsi %0, %c256_i32 : i32
2026-02-21T09:48:03.6828070Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:48:03.6828191Z     %3 = arith.subi %c64_i32, %2 : i32
2026-02-21T09:48:03.6828308Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:48:03.6828426Z     %5 = arith.remsi %0, %c256_i32 : i32
2026-02-21T09:48:03.6828541Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:48:03.6828656Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:48:03.6828769Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:48:03.6828879Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:48:03.6829089Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:03.6829374Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:03.6829723Z     %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:03.6830041Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:03.6830344Z     %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:03.6830646Z     %15 = arith.addi %14, %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:03.6830857Z     %16 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:48:03.6831031Z     %17 = tt.splat %16 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:03.6831259Z     %18 = arith.addi %17, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:03.6831507Z     %19 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:03.6831820Z     %20 = tt.expand_dims %18 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:48:03.6832074Z     %21 = arith.muli %20, %cst_1 : tensor<128x1xi32, #blocked1>
2026-02-21T09:48:03.6832316Z     %22 = tt.broadcast %21 : tensor<128x1xi32, #blocked1> -> tensor<128x2xi32, #blocked1>
2026-02-21T09:48:03.6832541Z     %23 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:03.6832886Z     %24 = tt.expand_dims %15 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6833264Z     %25 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6833596Z     %26 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:48:03.6834005Z     %27 = tt.expand_dims %26 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:48:03.6834422Z     %28 = tt.expand_dims %27 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:03.6834676Z     %29 = arith.cmpi eq, %28, %cst_3 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:03.6834873Z     %30 = tt.broadcast %29 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked>
2026-02-21T09:48:03.6835073Z     %31 = arith.cmpi eq, %28, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:03.6835262Z     %32 = tt.broadcast %31 : tensor<1x2x1xi1, #blocked> -> tensor<1x2x128xi1, #blocked>
2026-02-21T09:48:03.6835529Z     %33 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:48:03.6835749Z       %60 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:48:03.6835917Z       %61 = tt.splat %60 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:03.6836136Z       %62 = arith.addi %61, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:03.6836405Z       %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:48:03.6836676Z       %64 = tt.broadcast %63 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1>
2026-02-21T09:48:03.6836869Z       %65 = arith.addi %22, %64 : tensor<128x2xi32, #blocked1>
2026-02-21T09:48:03.6837063Z       %66 = tt.addptr %23, %65 : tensor<128x2x!tt.ptr<bf16>, #blocked1>, tensor<128x2xi32, #blocked1>
2026-02-21T09:48:03.6837265Z       %67 = tt.load %66 : tensor<128x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:03.6837484Z       %68 = ttg.local_alloc %67 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem>
2026-02-21T09:48:03.6837834Z       %69 = ttg.local_load %68 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:03.6838242Z       %70 = arith.extf %69 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:03.6838541Z       %71 = arith.muli %arg3, %c8192_i32 : i32
2026-02-21T09:48:03.6838719Z       %72 = tt.splat %71 : i32 -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6838942Z       %73 = arith.addi %72, %24 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6839248Z       %74 = tt.addptr %25, %73 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6839556Z       %75 = tt.load %74 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6839780Z       %76 = arith.shli %75, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6840010Z       %77 = arith.shrsi %76, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6840239Z       %78 = arith.shrsi %75, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6840525Z       %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:48:03.6840859Z       %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:48:03.6841138Z       %81 = tt.broadcast %79 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:48:03.6841377Z       %82 = arith.select %30, %81, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:48:03.6841626Z       %83 = tt.broadcast %80 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:48:03.6841853Z       %84 = arith.select %32, %83, %82 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:48:03.6842098Z       %85 = tt.reshape %84 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2>
2026-02-21T09:48:03.6842319Z       %86 = arith.sitofp %85 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2>
2026-02-21T09:48:03.6842647Z       %87 = ttg.local_alloc %86 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared1, #smem>
2026-02-21T09:48:03.6842974Z       %88 = ttg.local_load %87 : !ttg.memdesc<2x128xf32, #shared1, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:03.6843447Z       %89 = tt.dot %70, %88, %arg4, inputPrecision = tf32 : tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:48:03.6843798Z       %90 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:48:03.6843919Z       %91 = arith.muli %90, %c2_i32 : i32
2026-02-21T09:48:03.6844091Z       %92 = tt.splat %91 : i32 -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:03.6844302Z       %93 = arith.addi %92, %19 : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:03.6844577Z       %94 = tt.expand_dims %93 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x2xi32, #blocked1>
2026-02-21T09:48:03.6844848Z       %95 = tt.broadcast %94 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1>
2026-02-21T09:48:03.6845042Z       %96 = arith.addi %22, %95 : tensor<128x2xi32, #blocked1>
2026-02-21T09:48:03.6845244Z       %97 = tt.addptr %23, %96 : tensor<128x2x!tt.ptr<bf16>, #blocked1>, tensor<128x2xi32, #blocked1>
2026-02-21T09:48:03.6845445Z       %98 = tt.load %97 : tensor<128x2x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:03.6845684Z       %99 = ttg.local_alloc %98 : (tensor<128x2xbf16, #blocked1>) -> !ttg.memdesc<128x2xbf16, #shared, #smem>
2026-02-21T09:48:03.6846011Z       %100 = ttg.local_load %99 : !ttg.memdesc<128x2xbf16, #shared, #smem> -> tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:03.6846417Z       %101 = arith.extf %100 : tensor<128x2xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:03.6846726Z       %102 = arith.muli %90, %c8192_i32 : i32
2026-02-21T09:48:03.6846901Z       %103 = tt.splat %102 : i32 -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6847134Z       %104 = arith.addi %103, %24 : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6847449Z       %105 = tt.addptr %25, %104 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6847765Z       %106 = tt.load %105 : tensor<1x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6848002Z       %107 = arith.shli %106, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6848240Z       %108 = arith.shrsi %107, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6848480Z       %109 = arith.shrsi %106, %cst_2 : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:03.6848772Z       %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:48:03.6849110Z       %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<1x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1x128xi8, #blocked>
2026-02-21T09:48:03.6849405Z       %112 = tt.broadcast %110 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:48:03.6849698Z       %113 = arith.select %30, %112, %cst_0 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:48:03.6849935Z       %114 = tt.broadcast %111 : tensor<1x1x128xi8, #blocked> -> tensor<1x2x128xi8, #blocked>
2026-02-21T09:48:03.6850191Z       %115 = arith.select %32, %114, %113 : tensor<1x2x128xi1, #blocked>, tensor<1x2x128xi8, #blocked>
2026-02-21T09:48:03.6850421Z       %116 = tt.reshape %115 : tensor<1x2x128xi8, #blocked> -> tensor<2x128xi8, #blocked2>
2026-02-21T09:48:03.6850647Z       %117 = arith.sitofp %116 : tensor<2x128xi8, #blocked2> to tensor<2x128xf32, #blocked2>
2026-02-21T09:48:03.6850903Z       %118 = ttg.local_alloc %117 : (tensor<2x128xf32, #blocked2>) -> !ttg.memdesc<2x128xf32, #shared1, #smem>
2026-02-21T09:48:03.6851229Z       %119 = ttg.local_load %118 : !ttg.memdesc<2x128xf32, #shared1, #smem> -> tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:03.6851702Z       %120 = tt.dot %101, %119, %89, inputPrecision = tf32 : tensor<128x2xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<2x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:48:03.6852051Z       scf.yield %120 : tensor<128x128xf32, #mma>
2026-02-21T09:48:03.6852175Z     } {tt.flatten}
2026-02-21T09:48:03.6852316Z     %34 = arith.truncf %33 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:48:03.6852487Z     %35 = arith.extsi %16 : i32 to i64
2026-02-21T09:48:03.6852604Z     %36 = arith.extsi %9 : i32 to i64
2026-02-21T09:48:03.6852758Z     %37 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:03.6852962Z     %38 = tt.splat %35 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:03.6853238Z     %39 = arith.extsi %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:03.6853574Z     %40 = arith.extsi %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:03.6853863Z     %41 = arith.addi %38, %39 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:03.6854120Z     %42 = tt.expand_dims %41 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:48:03.6854376Z     %43 = arith.muli %42, %cst_5 : tensor<128x1xi64, #mma>
2026-02-21T09:48:03.6854548Z     %44 = tt.broadcast %43 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:48:03.6854750Z     %45 = tt.splat %36 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:03.6854954Z     %46 = arith.addi %45, %40 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:03.6855209Z     %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:48:03.6855466Z     %48 = tt.broadcast %47 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:48:03.6855642Z     %49 = arith.addi %44, %48 : tensor<128x128xi64, #mma>
2026-02-21T09:48:03.6855830Z     %50 = tt.addptr %37, %49 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi64, #mma>
2026-02-21T09:48:03.6856028Z     %51 = arith.cmpi sge, %42, %cst_6 : tensor<128x1xi64, #mma>
2026-02-21T09:48:03.6856187Z     %52 = arith.cmpi slt, %42, %cst_7 : tensor<128x1xi64, #mma>
2026-02-21T09:48:03.6856343Z     %53 = arith.andi %51, %52 : tensor<128x1xi1, #mma>
2026-02-21T09:48:03.6856510Z     %54 = tt.broadcast %53 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:48:03.6856693Z     %55 = arith.cmpi sge, %47, %cst_8 : tensor<1x128xi64, #mma>
2026-02-21T09:48:03.6856849Z     %56 = arith.cmpi slt, %47, %cst_9 : tensor<1x128xi64, #mma>
2026-02-21T09:48:03.6857000Z     %57 = arith.andi %55, %56 : tensor<1x128xi1, #mma>
2026-02-21T09:48:03.6857167Z     %58 = tt.broadcast %57 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:48:03.6857355Z     %59 = arith.andi %54, %58 : tensor<128x128xi1, #mma>
2026-02-21T09:48:03.6857511Z     tt.store %50, %34, %59 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:03.6857644Z     tt.return
2026-02-21T09:48:03.6857744Z   }
2026-02-21T09:48:03.6857823Z }
2026-02-21T09:48:03.6857869Z 
2026-02-21T09:48:03.6857900Z {-#
2026-02-21T09:48:03.6857981Z   external_resources: {
2026-02-21T09:48:03.6858084Z     mlir_reproducer: {
2026-02-21T09:48:03.6859080Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:48:03.6860069Z       disable_threading: false,
2026-02-21T09:48:03.6860173Z       verify_each: true
2026-02-21T09:48:03.6860263Z     }
2026-02-21T09:48:03.6860339Z   }
2026-02-21T09:48:03.6860410Z #-}
2026-02-21T09:48:03.6860686Z /tmp/torchinductor_root/o3/co3bzd4cuo6myrozjopy26yiptt777ypmetejcg2rol46jmdbobw.py:12:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:48:03.6861376Z /tmp/torchinductor_root/o3/co3bzd4cuo6myrozjopy26yiptt777ypmetejcg2rol46jmdbobw.py:12:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:48:03.6861939Z [214s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:48:03.6862680Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 128], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T09:48:03.6863333Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:48:03.6863519Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:48:13.4421889Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:48:13.4424191Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:48:13.4425213Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:48:13.4426133Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:48:13.4426958Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:48:13.4427754Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:48:13.4428481Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:48:13.4428985Z #smem = #ttg.shared_memory
2026-02-21T09:48:13.4429627Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:48:13.4430936Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:48:13.4432593Z     %cst = arith.constant dense<8192> : tensor<16x1xi32, #mma>
2026-02-21T09:48:13.4433088Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:13.4433634Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:13.4433924Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<16x32xf32, #mma>
2026-02-21T09:48:13.4434155Z     %cst_3 = arith.constant dense<8192> : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4434359Z     %cst_4 = arith.constant dense<0> : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4434565Z     %cst_5 = arith.constant dense<512> : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4434763Z     %cst_6 = arith.constant dense<0> : tensor<1x32xi64, #blocked1>
2026-02-21T09:48:13.4434967Z     %cst_7 = arith.constant dense<8192> : tensor<1x32xi64, #blocked1>
2026-02-21T09:48:13.4435170Z     %cst_8 = arith.constant dense<1024> : tensor<16x1xi32, #blocked2>
2026-02-21T09:48:13.4435351Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:48:13.4435490Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:48:13.4435622Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:48:13.4435760Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:48:13.4435902Z     %c504_i32 = arith.constant 504 : i32
2026-02-21T09:48:13.4436038Z     %c12_i32 = arith.constant 12 : i32
2026-02-21T09:48:13.4436203Z     %cst_9 = arith.constant dense<0> : tensor<4x32xi8, #blocked1>
2026-02-21T09:48:13.4436373Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:48:13.4436540Z     %cst_10 = arith.constant dense<0> : tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4436707Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:48:13.4436844Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:48:13.4436975Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:48:13.4437190Z     %cst_11 = arith.constant dense<4> : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4437415Z     %0 = tt.get_program_id x : i32
2026-02-21T09:48:13.4437549Z     %1 = arith.divsi %0, %c512_i32 : i32
2026-02-21T09:48:13.4437779Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:48:13.4437913Z     %3 = arith.subi %c1024_i32, %2 : i32
2026-02-21T09:48:13.4438076Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:48:13.4438217Z     %5 = arith.remsi %0, %c512_i32 : i32
2026-02-21T09:48:13.4438470Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:48:13.4438601Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:48:13.4438733Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:48:13.4438865Z     %9 = arith.muli %7, %c16_i32 : i32
2026-02-21T09:48:13.4439106Z     %10 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:13.4439439Z     %11 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:13.4439733Z     %12 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:13.4439993Z     %13 = tt.splat %9 : i32 -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:13.4440248Z     %14 = arith.addi %12, %10 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:13.4440501Z     %15 = arith.addi %13, %11 : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:13.4440700Z     %16 = arith.muli %8, %c32_i32 : i32
2026-02-21T09:48:13.4440938Z     %17 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:13.4441261Z     %18 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:13.4441541Z     %19 = tt.splat %16 : i32 -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:13.4441777Z     %20 = arith.addi %19, %18 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:13.4442061Z     %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:13.4442444Z     %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<16x1xi32, #blocked2>
2026-02-21T09:48:13.4442843Z     %23 = arith.muli %22, %cst_8 : tensor<16x1xi32, #blocked2>
2026-02-21T09:48:13.4443069Z     %24 = tt.broadcast %23 : tensor<16x1xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4443326Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:13.4443503Z     %26 = arith.extsi %16 : i32 to i64
2026-02-21T09:48:13.4443662Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x32x!tt.ptr<i8>, #blocked1>
2026-02-21T09:48:13.4443905Z     %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:13.4444227Z     %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:13.4444527Z     %30 = tt.splat %26 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:13.4444830Z     %31 = arith.extsi %17 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:13.4445126Z     %32 = arith.addi %30, %31 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:13.4445407Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x32xi64, #blocked1>
2026-02-21T09:48:13.4445682Z     %34 = tt.broadcast %33 : tensor<1x32xi64, #blocked1> -> tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4445885Z     %35 = arith.cmpi sge, %33, %cst_6 : tensor<1x32xi64, #blocked1>
2026-02-21T09:48:13.4446059Z     %36 = arith.cmpi slt, %33, %cst_7 : tensor<1x32xi64, #blocked1>
2026-02-21T09:48:13.4446225Z     %37 = arith.andi %35, %36 : tensor<1x32xi1, #blocked1>
2026-02-21T09:48:13.4446411Z     %38 = tt.broadcast %37 : tensor<1x32xi1, #blocked1> -> tensor<4x32xi1, #blocked1>
2026-02-21T09:48:13.4446718Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:48:13.4447136Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:48:13.4447560Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:13.4447810Z     %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:13.4448010Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked>
2026-02-21T09:48:13.4448203Z     %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:13.4448395Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked>
2026-02-21T09:48:13.4448670Z     %46 = scf.for %arg3 = %c0_i32 to %c504_i32 step %c12_i32 iter_args(%arg4 = %cst_2) -> (tensor<16x32xf32, #mma>)  : i32 {
2026-02-21T09:48:13.4448888Z       %57 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:48:13.4449064Z       %58 = tt.splat %57 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:13.4449289Z       %59 = arith.addi %58, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:13.4449568Z       %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:48:13.4449847Z       %61 = tt.broadcast %60 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4450038Z       %62 = arith.addi %24, %61 : tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4450239Z       %63 = tt.addptr %25, %62 : tensor<16x8x!tt.ptr<bf16>, #blocked2>, tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4450460Z       %64 = tt.load %63 : tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:13.4450733Z       %65 = ttg.convert_layout %64 : tensor<16x8xbf16, #blocked2> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4451154Z       %66 = arith.extf %65 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4451439Z       %67 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:48:13.4451614Z       %68 = tt.splat %67 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:13.4451832Z       %69 = arith.addi %68, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:13.4452110Z       %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4452365Z       %71 = arith.muli %70, %cst_3 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4452557Z       %72 = tt.broadcast %71 : tensor<4x1xi64, #blocked1> -> tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4452751Z       %73 = arith.addi %72, %34 : tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4452947Z       %74 = tt.addptr %27, %73 : tensor<4x32x!tt.ptr<i8>, #blocked1>, tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4453157Z       %75 = arith.cmpi sge, %70, %cst_4 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4453333Z       %76 = arith.cmpi slt, %70, %cst_5 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4453494Z       %77 = arith.andi %75, %76 : tensor<4x1xi1, #blocked1>
2026-02-21T09:48:13.4453680Z       %78 = tt.broadcast %77 : tensor<4x1xi1, #blocked1> -> tensor<4x32xi1, #blocked1>
2026-02-21T09:48:13.4453867Z       %79 = arith.andi %78, %38 : tensor<4x32xi1, #blocked1>
2026-02-21T09:48:13.4454036Z       %80 = tt.load %74, %79, %cst_9 : tensor<4x32x!tt.ptr<i8>, #blocked1>
2026-02-21T09:48:13.4454292Z       %81 = ttg.convert_layout %80 : tensor<4x32xi8, #blocked1> -> tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4454639Z       %82 = arith.shli %81, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4454877Z       %83 = arith.shrsi %82, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4455116Z       %84 = arith.shrsi %81, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4455427Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:13.4455757Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:13.4456039Z       %87 = tt.broadcast %85 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4456279Z       %88 = arith.select %43, %87, %cst_10 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4456515Z       %89 = tt.broadcast %86 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4456744Z       %90 = arith.select %45, %89, %88 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4456965Z       %91 = tt.reshape %90 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked3>
2026-02-21T09:48:13.4457186Z       %92 = arith.sitofp %91 : tensor<8x32xi8, #blocked3> to tensor<8x32xf32, #blocked3>
2026-02-21T09:48:13.4457434Z       %93 = ttg.local_alloc %92 : (tensor<8x32xf32, #blocked3>) -> !ttg.memdesc<8x32xf32, #shared, #smem>
2026-02-21T09:48:13.4457750Z       %94 = ttg.local_load %93 : !ttg.memdesc<8x32xf32, #shared, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4458229Z       %95 = tt.dot %66, %94, %arg4, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma>
2026-02-21T09:48:13.4458595Z       %96 = arith.addi %arg3, %c4_i32 : i32
2026-02-21T09:48:13.4458722Z       %97 = arith.muli %96, %c2_i32 : i32
2026-02-21T09:48:13.4458897Z       %98 = tt.splat %97 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:13.4459130Z       %99 = arith.addi %98, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:13.4459411Z       %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:48:13.4459699Z       %101 = tt.broadcast %100 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4459893Z       %102 = arith.addi %24, %101 : tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4460101Z       %103 = tt.addptr %25, %102 : tensor<16x8x!tt.ptr<bf16>, #blocked2>, tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4460308Z       %104 = tt.load %103 : tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:13.4460579Z       %105 = ttg.convert_layout %104 : tensor<16x8xbf16, #blocked2> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4460991Z       %106 = arith.extf %105 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4461275Z       %107 = arith.extsi %96 : i32 to i64
2026-02-21T09:48:13.4461452Z       %108 = tt.splat %107 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:13.4461676Z       %109 = arith.addi %108, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:13.4461958Z       %110 = tt.expand_dims %109 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4462217Z       %111 = arith.muli %110, %cst_3 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4462410Z       %112 = tt.broadcast %111 : tensor<4x1xi64, #blocked1> -> tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4462627Z       %113 = arith.addi %112, %34 : tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4462825Z       %114 = tt.addptr %27, %113 : tensor<4x32x!tt.ptr<i8>, #blocked1>, tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4463046Z       %115 = arith.cmpi sge, %110, %cst_4 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4463237Z       %116 = arith.cmpi slt, %110, %cst_5 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4463411Z       %117 = arith.andi %115, %116 : tensor<4x1xi1, #blocked1>
2026-02-21T09:48:13.4463603Z       %118 = tt.broadcast %117 : tensor<4x1xi1, #blocked1> -> tensor<4x32xi1, #blocked1>
2026-02-21T09:48:13.4463794Z       %119 = arith.andi %118, %38 : tensor<4x32xi1, #blocked1>
2026-02-21T09:48:13.4463968Z       %120 = tt.load %114, %119, %cst_9 : tensor<4x32x!tt.ptr<i8>, #blocked1>
2026-02-21T09:48:13.4464230Z       %121 = ttg.convert_layout %120 : tensor<4x32xi8, #blocked1> -> tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4464521Z       %122 = arith.shli %121, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4464766Z       %123 = arith.shrsi %122, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4465006Z       %124 = arith.shrsi %121, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4465301Z       %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:13.4465641Z       %126 = tt.expand_dims %124 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:13.4465928Z       %127 = tt.broadcast %125 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4466175Z       %128 = arith.select %43, %127, %cst_10 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4466428Z       %129 = tt.broadcast %126 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4466670Z       %130 = arith.select %45, %129, %128 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4466918Z       %131 = tt.reshape %130 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked3>
2026-02-21T09:48:13.4467144Z       %132 = arith.sitofp %131 : tensor<8x32xi8, #blocked3> to tensor<8x32xf32, #blocked3>
2026-02-21T09:48:13.4467400Z       %133 = ttg.local_alloc %132 : (tensor<8x32xf32, #blocked3>) -> !ttg.memdesc<8x32xf32, #shared, #smem>
2026-02-21T09:48:13.4467725Z       %134 = ttg.local_load %133 : !ttg.memdesc<8x32xf32, #shared, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4468196Z       %135 = tt.dot %106, %134, %95, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma>
2026-02-21T09:48:13.4468543Z       %136 = arith.addi %arg3, %c8_i32 : i32
2026-02-21T09:48:13.4468674Z       %137 = arith.muli %136, %c2_i32 : i32
2026-02-21T09:48:13.4468851Z       %138 = tt.splat %137 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:13.4469077Z       %139 = arith.addi %138, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:13.4469361Z       %140 = tt.expand_dims %139 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:48:13.4469640Z       %141 = tt.broadcast %140 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4469841Z       %142 = arith.addi %24, %141 : tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4470049Z       %143 = tt.addptr %25, %142 : tensor<16x8x!tt.ptr<bf16>, #blocked2>, tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4470261Z       %144 = tt.load %143 : tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:13.4470549Z       %145 = ttg.convert_layout %144 : tensor<16x8xbf16, #blocked2> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4470952Z       %146 = arith.extf %145 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4471240Z       %147 = arith.extsi %136 : i32 to i64
2026-02-21T09:48:13.4471572Z       %148 = tt.splat %147 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:13.4471799Z       %149 = arith.addi %148, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:13.4472085Z       %150 = tt.expand_dims %149 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4472373Z       %151 = arith.muli %150, %cst_3 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4472565Z       %152 = tt.broadcast %151 : tensor<4x1xi64, #blocked1> -> tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4472766Z       %153 = arith.addi %152, %34 : tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4472966Z       %154 = tt.addptr %27, %153 : tensor<4x32x!tt.ptr<i8>, #blocked1>, tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4473184Z       %155 = arith.cmpi sge, %150, %cst_4 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4473362Z       %156 = arith.cmpi slt, %150, %cst_5 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4473530Z       %157 = arith.andi %155, %156 : tensor<4x1xi1, #blocked1>
2026-02-21T09:48:13.4473724Z       %158 = tt.broadcast %157 : tensor<4x1xi1, #blocked1> -> tensor<4x32xi1, #blocked1>
2026-02-21T09:48:13.4473916Z       %159 = arith.andi %158, %38 : tensor<4x32xi1, #blocked1>
2026-02-21T09:48:13.4474091Z       %160 = tt.load %154, %159, %cst_9 : tensor<4x32x!tt.ptr<i8>, #blocked1>
2026-02-21T09:48:13.4474354Z       %161 = ttg.convert_layout %160 : tensor<4x32xi8, #blocked1> -> tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4474654Z       %162 = arith.shli %161, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4474899Z       %163 = arith.shrsi %162, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4475155Z       %164 = arith.shrsi %161, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4475451Z       %165 = tt.expand_dims %163 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:13.4475793Z       %166 = tt.expand_dims %164 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:13.4476078Z       %167 = tt.broadcast %165 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4476322Z       %168 = arith.select %43, %167, %cst_10 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4476561Z       %169 = tt.broadcast %166 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4476806Z       %170 = arith.select %45, %169, %168 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4477041Z       %171 = tt.reshape %170 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked3>
2026-02-21T09:48:13.4477267Z       %172 = arith.sitofp %171 : tensor<8x32xi8, #blocked3> to tensor<8x32xf32, #blocked3>
2026-02-21T09:48:13.4477526Z       %173 = ttg.local_alloc %172 : (tensor<8x32xf32, #blocked3>) -> !ttg.memdesc<8x32xf32, #shared, #smem>
2026-02-21T09:48:13.4477853Z       %174 = ttg.local_load %173 : !ttg.memdesc<8x32xf32, #shared, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4478325Z       %175 = tt.dot %146, %174, %135, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma>
2026-02-21T09:48:13.4478681Z       scf.yield %175 : tensor<16x32xf32, #mma>
2026-02-21T09:48:13.4478833Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:48:13.4479041Z     %47 = scf.for %arg3 = %c504_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %46) -> (tensor<16x32xf32, #mma>)  : i32 {
2026-02-21T09:48:13.4479258Z       %57 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:48:13.4479434Z       %58 = tt.splat %57 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:13.4479670Z       %59 = arith.addi %58, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:13.4479944Z       %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:48:13.4480221Z       %61 = tt.broadcast %60 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4480415Z       %62 = arith.addi %24, %61 : tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4480616Z       %63 = tt.addptr %25, %62 : tensor<16x8x!tt.ptr<bf16>, #blocked2>, tensor<16x8xi32, #blocked2>
2026-02-21T09:48:13.4480826Z       %64 = tt.load %63 : tensor<16x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:13.4481096Z       %65 = ttg.convert_layout %64 : tensor<16x8xbf16, #blocked2> -> tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4481503Z       %66 = arith.extf %65 : tensor<16x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4481784Z       %67 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:48:13.4481958Z       %68 = tt.splat %67 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:13.4482183Z       %69 = arith.addi %68, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:13.4482464Z       %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4482791Z       %71 = arith.muli %70, %cst_3 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4482981Z       %72 = tt.broadcast %71 : tensor<4x1xi64, #blocked1> -> tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4483176Z       %73 = arith.addi %72, %34 : tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4483403Z       %74 = tt.addptr %27, %73 : tensor<4x32x!tt.ptr<i8>, #blocked1>, tensor<4x32xi64, #blocked1>
2026-02-21T09:48:13.4483612Z       %75 = arith.cmpi sge, %70, %cst_4 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4483786Z       %76 = arith.cmpi slt, %70, %cst_5 : tensor<4x1xi64, #blocked1>
2026-02-21T09:48:13.4483948Z       %77 = arith.andi %75, %76 : tensor<4x1xi1, #blocked1>
2026-02-21T09:48:13.4484137Z       %78 = tt.broadcast %77 : tensor<4x1xi1, #blocked1> -> tensor<4x32xi1, #blocked1>
2026-02-21T09:48:13.4484327Z       %79 = arith.andi %78, %38 : tensor<4x32xi1, #blocked1>
2026-02-21T09:48:13.4484495Z       %80 = tt.load %74, %79, %cst_9 : tensor<4x32x!tt.ptr<i8>, #blocked1>
2026-02-21T09:48:13.4484755Z       %81 = ttg.convert_layout %80 : tensor<4x32xi8, #blocked1> -> tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4485031Z       %82 = arith.shli %81, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4485275Z       %83 = arith.shrsi %82, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4485510Z       %84 = arith.shrsi %81, %cst_11 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:13.4485799Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:13.4486137Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:13.4486413Z       %87 = tt.broadcast %85 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4486651Z       %88 = arith.select %43, %87, %cst_10 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4486904Z       %89 = tt.broadcast %86 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4487136Z       %90 = arith.select %45, %89, %88 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:13.4487368Z       %91 = tt.reshape %90 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked3>
2026-02-21T09:48:13.4487600Z       %92 = arith.sitofp %91 : tensor<8x32xi8, #blocked3> to tensor<8x32xf32, #blocked3>
2026-02-21T09:48:13.4495447Z       %93 = ttg.local_alloc %92 : (tensor<8x32xf32, #blocked3>) -> !ttg.memdesc<8x32xf32, #shared, #smem>
2026-02-21T09:48:13.4495804Z       %94 = ttg.local_load %93 : !ttg.memdesc<8x32xf32, #shared, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:13.4496277Z       %95 = tt.dot %66, %94, %arg4, inputPrecision = tf32 : tensor<16x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<16x32xf32, #mma>
2026-02-21T09:48:13.4496635Z       scf.yield %95 : tensor<16x32xf32, #mma>
2026-02-21T09:48:13.4496768Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:48:13.4496926Z     %48 = arith.truncf %47 : tensor<16x32xf32, #mma> to tensor<16x32xbf16, #mma>
2026-02-21T09:48:13.4497189Z     %49 = tt.expand_dims %15 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<16x1xi32, #mma>
2026-02-21T09:48:13.4497427Z     %50 = arith.muli %49, %cst : tensor<16x1xi32, #mma>
2026-02-21T09:48:13.4497657Z     %51 = tt.expand_dims %20 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi32, #mma>
2026-02-21T09:48:13.4497912Z     %52 = tt.broadcast %50 : tensor<16x1xi32, #mma> -> tensor<16x32xi32, #mma>
2026-02-21T09:48:13.4498108Z     %53 = tt.broadcast %51 : tensor<1x32xi32, #mma> -> tensor<16x32xi32, #mma>
2026-02-21T09:48:13.4498286Z     %54 = arith.addi %52, %53 : tensor<16x32xi32, #mma>
2026-02-21T09:48:13.4498502Z     %55 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<16x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:13.4498717Z     %56 = tt.addptr %55, %54 : tensor<16x32x!tt.ptr<bf16>, #mma>, tensor<16x32xi32, #mma>
2026-02-21T09:48:13.4498925Z     tt.store %56, %48 : tensor<16x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:13.4499061Z     tt.return
2026-02-21T09:48:13.4499152Z   }
2026-02-21T09:48:13.4499236Z }
2026-02-21T09:48:13.4499282Z 
2026-02-21T09:48:13.4499321Z {-#
2026-02-21T09:48:13.4499406Z   external_resources: {
2026-02-21T09:48:13.4499512Z     mlir_reproducer: {
2026-02-21T09:48:13.4500516Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:48:13.4501523Z       disable_threading: false,
2026-02-21T09:48:13.4501633Z       verify_each: true
2026-02-21T09:48:13.4501730Z     }
2026-02-21T09:48:13.4501809Z   }
2026-02-21T09:48:13.4501887Z #-}
2026-02-21T09:48:13.4502167Z /tmp/torchinductor_root/nb/cnboqpelzdadahownptqgzjviokjf56pgn6uoo537sfljnhv5aii.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:48:13.4502860Z /tmp/torchinductor_root/nb/cnboqpelzdadahownptqgzjviokjf56pgn6uoo537sfljnhv5aii.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:48:13.4503419Z [224s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:48:13.4504159Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:48:13.4509496Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:48:13.4509672Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:48:13.4510188Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 0.9 configs/s
2026-02-21T09:48:13.4510433Z [224s] Adaptive compile timeout: 30s (90% percentile=15.1s, bounds=[30.0s, 30s])
2026-02-21T09:48:13.4510635Z [224s] Initial random population of 100, 5 starting points: 
2026-02-21T09:48:13.4510778Z error=15
2026-02-21T09:48:13.4510866Z timeout=3
2026-02-21T09:48:13.4510950Z ok=82
2026-02-21T09:48:13.4511038Z min=1.5749
2026-02-21T09:48:13.4511119Z mid=115.6728
2026-02-21T09:48:13.4511205Z max=3930.6316
2026-02-21T09:48:13.4511304Z best={'block_sizes': [8, 128, 512],
2026-02-21T09:48:13.4511445Z  'indexing': ['pointer', 'block_ptr', 'block_ptr'],
2026-02-21T09:48:13.4511577Z  'l2_groupings': [1],
2026-02-21T09:48:13.4511690Z  'load_eviction_policies': ['', ''],
2026-02-21T09:48:13.4511807Z  'loop_orders': [[0, 1]],
2026-02-21T09:48:13.4511923Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:48:13.4512084Z  'num_sm_multiplier': 128,
2026-02-21T09:48:13.4512187Z  'num_stages': 4,
2026-02-21T09:48:13.4512281Z  'num_warps': 16,
2026-02-21T09:48:13.4512379Z  'pid_type': 'persistent_blocked',
2026-02-21T09:48:13.4512501Z  'range_flattens': [False, True],
2026-02-21T09:48:13.4512616Z  'range_multi_buffers': [True, True],
2026-02-21T09:48:13.4512738Z  'range_num_stages': [1, 1],
2026-02-21T09:48:13.4512845Z  'range_unroll_factors': [3, 3],
2026-02-21T09:48:13.4512993Z  'range_warp_specializes': [],
2026-02-21T09:48:13.4513102Z  'waves_per_eu': 1}
2026-02-21T09:48:13.4513215Z [224s] Fitting surrogate: 100 points, 100 targets
2026-02-21T09:48:14.6224240Z [225s] Generation 1 starting: 108 neighbors, 5 active search path(s)
2026-02-21T09:48:40.0404193Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 111/111 0.7 configs/s
2026-02-21T09:48:41.3123092Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:48:41.3136108Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:48:41.3139503Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}>
2026-02-21T09:48:41.3140234Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:48:41.3140781Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}>
2026-02-21T09:48:41.3141303Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:48:41.3141769Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:48:41.3142181Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:48:41.3142501Z #smem = #ttg.shared_memory
2026-02-21T09:48:41.3142911Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:48:41.3143754Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:48:41.3144768Z     %cst = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3145072Z     %cst_0 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3145377Z     %cst_1 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3145671Z     %cst_2 = arith.constant dense<8192> : tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3146054Z     %cst_3 = arith.constant dense<0> : tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3146353Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3146644Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3146940Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3147252Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:48:41.3147533Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:48:41.3147807Z     %cst_9 = arith.constant dense<0.000000e+00> : tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3148084Z     %cst_10 = arith.constant dense<0> : tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3148353Z     %cst_11 = arith.constant dense<8192> : tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3148622Z     %cst_12 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:48:41.3148936Z     %cst_13 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3149301Z     %cst_14 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3149576Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:48:41.3149785Z     %cst_15 = arith.constant dense<0> : tensor<2x512xi8, #blocked>
2026-02-21T09:48:41.3149990Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:48:41.3150157Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T09:48:41.3150314Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:48:41.3150485Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T09:48:41.3150720Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:48:41.3150886Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:48:41.3151056Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:48:41.3151342Z     %cst_16 = arith.constant dense<0> : tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3151553Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:48:41.3151715Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:48:41.3151970Z     %cst_17 = arith.constant dense<4> : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3152243Z     %0 = tt.get_program_id x : i32
2026-02-21T09:48:41.3152405Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:48:41.3152565Z     %2 = arith.minsi %1, %c2048_i32 : i32
2026-02-21T09:48:41.3152859Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:41.3153260Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:41.3153635Z     %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3153985Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3154261Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3154594Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3155035Z     %9 = arith.extsi %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3155485Z     %10 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:48:41.3155947Z     %11 = arith.extsi %10 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:48:41.3156497Z     %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:48:41.3157084Z     %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:48:41.3157635Z     %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:48:41.3157938Z     %15 = arith.cmpi eq, %14, %cst_8 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:48:41.3158167Z     %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1>
2026-02-21T09:48:41.3158416Z     %17 = arith.cmpi eq, %14, %cst_7 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:48:41.3158651Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1>
2026-02-21T09:48:41.3158898Z     %19 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:41.3159189Z     %20 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:41.3159524Z     %21 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:41.3159853Z     %22 = arith.extsi %21 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:41.3160102Z     %23 = arith.subi %2, %0 : i32
2026-02-21T09:48:41.3160228Z     %24 = arith.remsi %23, %c3_i32 : i32
2026-02-21T09:48:41.3160351Z     %25 = arith.subi %23, %24 : i32
2026-02-21T09:48:41.3160470Z     %26 = arith.addi %0, %25 : i32
2026-02-21T09:48:41.3160653Z     %27 = arith.addi %5, %cst_13 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3160970Z     %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3161268Z     %29 = tt.broadcast %28 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3161531Z     %30 = arith.addi %9, %cst_14 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3161848Z     %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3162107Z     %32 = arith.muli %31, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3162309Z     %33 = tt.broadcast %32 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3162518Z     %34 = arith.cmpi sge, %31, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3162770Z     %35 = arith.cmpi slt, %31, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3162944Z     %36 = arith.andi %34, %35 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3163134Z     %37 = tt.broadcast %36 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3163331Z     scf.for %arg3 = %0 to %26 step %c3_i32  : i32 {
2026-02-21T09:48:41.3163483Z       %38 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:48:41.3163628Z       %39 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:48:41.3163764Z       %40 = arith.muli %38, %c128_i32 : i32
2026-02-21T09:48:41.3163952Z       %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:41.3164196Z       %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:41.3164385Z       %43 = arith.muli %39, %c512_i32 : i32
2026-02-21T09:48:41.3164627Z       %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:48:41.3164901Z       %45 = arith.muli %44, %cst_12 : tensor<128x1xi32, #blocked2>
2026-02-21T09:48:41.3165115Z       %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3165325Z       %47 = arith.extsi %43 : i32 to i64
2026-02-21T09:48:41.3165507Z       %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:48:41.3165740Z       %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:48:41.3166058Z       %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3166358Z       %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3166579Z       %52 = arith.cmpi sge, %50, %cst_10 : tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3166765Z       %53 = arith.cmpi slt, %50, %cst_11 : tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3166942Z       %54 = arith.andi %52, %53 : tensor<1x512xi1, #blocked>
2026-02-21T09:48:41.3167135Z       %55 = tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3167400Z       %56 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_9) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:48:41.3167639Z         %238 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:41.3167812Z         %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3168037Z         %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3168309Z         %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3168586Z         %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3168781Z         %243 = arith.addi %46, %242 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3168981Z         %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3169211Z         %245 = tt.load %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3169433Z         %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3169787Z         %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3170200Z         %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3170483Z         %249 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:41.3170657Z         %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3170872Z         %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3171148Z         %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3171391Z         %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3171583Z         %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3171778Z         %255 = arith.addi %254, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3171972Z         %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3172183Z         %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3172357Z         %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3172517Z         %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3172702Z         %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3172888Z         %261 = arith.andi %260, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3173102Z         %262 = tt.load %256, %261, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3173362Z         %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3173647Z         %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3173905Z         %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3174148Z         %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3174450Z         %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3174801Z         %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3175352Z         %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3175606Z         %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3175852Z         %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3176097Z         %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3176335Z         %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3176559Z         %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3176815Z         %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3177155Z         %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3177640Z         %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3178009Z         %278 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:48:41.3178132Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:48:41.3178304Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3178529Z         %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3178804Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3179083Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3179278Z         %284 = arith.addi %46, %283 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3179481Z         %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3179689Z         %286 = tt.load %285 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3179916Z         %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3180247Z         %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3180654Z         %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3180938Z         %290 = arith.extsi %278 : i32 to i64
2026-02-21T09:48:41.3181128Z         %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3181345Z         %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3181620Z         %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3181877Z         %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3182069Z         %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3182259Z         %296 = arith.addi %295, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3182454Z         %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3182663Z         %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3182831Z         %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3182996Z         %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3183179Z         %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3183370Z         %302 = arith.andi %301, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3183537Z         %303 = tt.load %297, %302, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3183797Z         %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3184086Z         %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3184327Z         %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3184572Z         %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3184886Z         %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3185229Z         %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3185538Z         %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3185787Z         %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3186035Z         %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3186279Z         %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3186513Z         %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3186744Z         %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3186999Z         %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3187335Z         %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3187812Z         %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3188156Z         %319 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:41.3188282Z         %320 = arith.muli %319, %c2_i32 : i32
2026-02-21T09:48:41.3188455Z         %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3188674Z         %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3188968Z         %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3189244Z         %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3189439Z         %325 = arith.addi %46, %324 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3189663Z         %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3189867Z         %327 = tt.load %326 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3190089Z         %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3190417Z         %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3190827Z         %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3191112Z         %331 = arith.extsi %319 : i32 to i64
2026-02-21T09:48:41.3191280Z         %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3191501Z         %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3191770Z         %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3192015Z         %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3192206Z         %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3192395Z         %337 = arith.addi %336, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3192608Z         %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3192813Z         %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3192982Z         %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3193159Z         %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3193345Z         %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3193532Z         %343 = arith.andi %342, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3193697Z         %344 = tt.load %338, %343, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3193957Z         %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3194247Z         %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3194490Z         %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3194736Z         %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3195031Z         %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3195377Z         %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3195668Z         %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3195915Z         %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3196164Z         %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3196407Z         %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3196660Z         %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3196889Z         %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3197141Z         %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3197487Z         %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3197953Z         %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3198304Z         scf.yield %359 : tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3198438Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:41.3198580Z       %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3198778Z       %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3198979Z       %59 = tt.load %58 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3199197Z       %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3199520Z       %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3199920Z       %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3200216Z       %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3200424Z       %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3200618Z       %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3200779Z       %66 = tt.load %64, %65, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3201050Z       %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3201330Z       %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3201562Z       %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3201799Z       %70 = arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3202091Z       %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3202426Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3202760Z       %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3203003Z       %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3203244Z       %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3203477Z       %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3203702Z       %77 = tt.reshape %76 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3203921Z       %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3204166Z       %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3204511Z       %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3204971Z       %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3205366Z       %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:48:41.3205538Z       %83 = arith.extsi %40 : i32 to i64
2026-02-21T09:48:41.3205698Z       %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:41.3205905Z       %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:41.3206166Z       %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3206399Z       %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3206576Z       %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3206777Z       %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:41.3206980Z       %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:41.3207238Z       %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3207493Z       %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3207670Z       %93 = arith.addi %88, %92 : tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3207856Z       %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3208054Z       %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3240815Z       %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3241095Z       %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma>
2026-02-21T09:48:41.3241267Z       %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3241526Z       %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3241691Z       %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3241847Z       %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma>
2026-02-21T09:48:41.3242019Z       %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3242199Z       %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3242357Z       tt.store %94, %82, %103 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:41.3242507Z       %104 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:48:41.3242674Z       %105 = arith.remsi %104, %c128_i32 : i32
2026-02-21T09:48:41.3242797Z       %106 = arith.divsi %104, %c128_i32 : i32
2026-02-21T09:48:41.3242918Z       %107 = arith.muli %105, %c128_i32 : i32
2026-02-21T09:48:41.3243091Z       %108 = tt.splat %107 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:41.3243323Z       %109 = arith.addi %108, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:41.3243499Z       %110 = arith.muli %106, %c512_i32 : i32
2026-02-21T09:48:41.3243728Z       %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:48:41.3243984Z       %112 = arith.muli %111, %cst_12 : tensor<128x1xi32, #blocked2>
2026-02-21T09:48:41.3244184Z       %113 = tt.broadcast %112 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3244363Z       %114 = arith.extsi %110 : i32 to i64
2026-02-21T09:48:41.3244530Z       %115 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:48:41.3244775Z       %116 = arith.addi %115, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:48:41.3245052Z       %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3245334Z       %118 = tt.broadcast %117 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3245559Z       %119 = arith.cmpi sge, %117, %cst_10 : tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3245731Z       %120 = arith.cmpi slt, %117, %cst_11 : tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3245899Z       %121 = arith.andi %119, %120 : tensor<1x512xi1, #blocked>
2026-02-21T09:48:41.3246084Z       %122 = tt.broadcast %121 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3246350Z       %123 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_9) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:48:41.3246570Z         %238 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:41.3246743Z         %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3246968Z         %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3247244Z         %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3247524Z         %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3247720Z         %243 = arith.addi %113, %242 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3247924Z         %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3248138Z         %245 = tt.load %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3248362Z         %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3248723Z         %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3249152Z         %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3249433Z         %249 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:41.3249606Z         %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3249822Z         %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3250094Z         %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3250337Z         %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3250527Z         %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3250719Z         %255 = arith.addi %254, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3250915Z         %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3251122Z         %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3251290Z         %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3251454Z         %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3251638Z         %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3251825Z         %261 = arith.andi %260, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3251995Z         %262 = tt.load %256, %261, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3252254Z         %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3252559Z         %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3252805Z         %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3253064Z         %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3253362Z         %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3253705Z         %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3253997Z         %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3254250Z         %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3254497Z         %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3254739Z         %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3254972Z         %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3255199Z         %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3255457Z         %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3255785Z         %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3256287Z         %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3256645Z         %278 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:48:41.3256786Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:48:41.3256961Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3257185Z         %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3257464Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3257746Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3257944Z         %284 = arith.addi %113, %283 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3258152Z         %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3258361Z         %286 = tt.load %285 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3258589Z         %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3258927Z         %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3259334Z         %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3259622Z         %290 = arith.extsi %278 : i32 to i64
2026-02-21T09:48:41.3259794Z         %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3260021Z         %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3260316Z         %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3260562Z         %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3260759Z         %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3260965Z         %296 = arith.addi %295, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3261168Z         %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3261379Z         %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3261550Z         %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3261719Z         %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3261903Z         %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3262100Z         %302 = arith.andi %301, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3262269Z         %303 = tt.load %297, %302, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3262537Z         %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3262827Z         %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3263073Z         %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3263324Z         %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3263624Z         %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3263999Z         %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3264296Z         %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3264566Z         %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3264818Z         %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3265060Z         %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3265302Z         %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3265533Z         %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3265788Z         %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3266123Z         %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3266603Z         %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3266955Z         %319 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:41.3267085Z         %320 = arith.muli %319, %c2_i32 : i32
2026-02-21T09:48:41.3267258Z         %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3267484Z         %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3267762Z         %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3268038Z         %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3268254Z         %325 = arith.addi %113, %324 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3268458Z         %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3268669Z         %327 = tt.load %326 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3268912Z         %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3269241Z         %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3269653Z         %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3269934Z         %331 = arith.extsi %319 : i32 to i64
2026-02-21T09:48:41.3270112Z         %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3270338Z         %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3270616Z         %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3270866Z         %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3271062Z         %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3271263Z         %337 = arith.addi %336, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3271467Z         %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3271678Z         %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3271855Z         %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3272036Z         %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3272225Z         %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3272432Z         %343 = arith.andi %342, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3272606Z         %344 = tt.load %338, %343, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3272874Z         %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3273160Z         %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3273407Z         %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3273652Z         %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3273956Z         %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3274308Z         %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3274603Z         %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3274859Z         %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3275112Z         %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3275356Z         %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3275600Z         %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3275830Z         %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3276119Z         %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3276455Z         %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3276941Z         %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3277295Z         scf.yield %359 : tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3277430Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:41.3277581Z       %124 = arith.addi %113, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3277789Z       %125 = tt.addptr %6, %124 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3277998Z       %126 = tt.load %125 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3278227Z       %127 = ttg.local_alloc %126 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3278555Z       %128 = ttg.local_load %127 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3278964Z       %129 = arith.extf %128 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3279268Z       %130 = arith.addi %33, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3279466Z       %131 = tt.addptr %7, %130 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3279670Z       %132 = arith.andi %37, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3279838Z       %133 = tt.load %131, %132, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3280116Z       %134 = ttg.convert_layout %133 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3280420Z       %135 = arith.shli %134, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3280662Z       %136 = arith.shrsi %135, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3280908Z       %137 = arith.shrsi %134, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3281203Z       %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3281550Z       %139 = tt.expand_dims %137 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3281845Z       %140 = tt.broadcast %138 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3282095Z       %141 = arith.select %16, %140, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3282344Z       %142 = tt.broadcast %139 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3282643Z       %143 = arith.select %18, %142, %141 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3282887Z       %144 = tt.reshape %143 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3283117Z       %145 = arith.sitofp %144 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3283372Z       %146 = ttg.local_alloc %145 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3283707Z       %147 = ttg.local_load %146 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3284204Z       %148 = tt.dot %129, %147, %123, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3284596Z       %149 = arith.truncf %148 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:48:41.3284798Z       %150 = arith.extsi %107 : i32 to i64
2026-02-21T09:48:41.3284965Z       %151 = tt.splat %150 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:41.3285186Z       %152 = arith.addi %151, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:41.3285462Z       %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3285703Z       %154 = arith.muli %153, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3285890Z       %155 = tt.broadcast %154 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3286105Z       %156 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:41.3286321Z       %157 = arith.addi %156, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:41.3286591Z       %158 = tt.expand_dims %157 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3286864Z       %159 = tt.broadcast %158 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3287056Z       %160 = arith.addi %155, %159 : tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3287248Z       %161 = tt.addptr %19, %160 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3287456Z       %162 = arith.cmpi sge, %153, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3287622Z       %163 = arith.cmpi slt, %153, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3287783Z       %164 = arith.andi %162, %163 : tensor<128x1xi1, #mma>
2026-02-21T09:48:41.3287983Z       %165 = tt.broadcast %164 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3288171Z       %166 = arith.cmpi sge, %158, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3288361Z       %167 = arith.cmpi slt, %158, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3288519Z       %168 = arith.andi %166, %167 : tensor<1x512xi1, #mma>
2026-02-21T09:48:41.3288699Z       %169 = tt.broadcast %168 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3288880Z       %170 = arith.andi %165, %169 : tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3289051Z       tt.store %161, %149, %170 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:41.3289208Z       %171 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:48:41.3289334Z       %172 = arith.remsi %171, %c128_i32 : i32
2026-02-21T09:48:41.3289464Z       %173 = arith.divsi %171, %c128_i32 : i32
2026-02-21T09:48:41.3289587Z       %174 = arith.muli %172, %c128_i32 : i32
2026-02-21T09:48:41.3289765Z       %175 = tt.splat %174 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:41.3289993Z       %176 = arith.addi %175, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:41.3290176Z       %177 = arith.muli %173, %c512_i32 : i32
2026-02-21T09:48:41.3290406Z       %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:48:41.3290663Z       %179 = arith.muli %178, %cst_12 : tensor<128x1xi32, #blocked2>
2026-02-21T09:48:41.3290865Z       %180 = tt.broadcast %179 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3291042Z       %181 = arith.extsi %177 : i32 to i64
2026-02-21T09:48:41.3291213Z       %182 = tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:48:41.3291439Z       %183 = arith.addi %182, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:48:41.3291744Z       %184 = tt.expand_dims %183 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3292028Z       %185 = tt.broadcast %184 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3292234Z       %186 = arith.cmpi sge, %184, %cst_10 : tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3292443Z       %187 = arith.cmpi slt, %184, %cst_11 : tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3292613Z       %188 = arith.andi %186, %187 : tensor<1x512xi1, #blocked>
2026-02-21T09:48:41.3292822Z       %189 = tt.broadcast %188 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3293092Z       %190 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_9) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:48:41.3293310Z         %238 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:41.3293483Z         %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3293704Z         %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3293980Z         %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3294259Z         %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3294454Z         %243 = arith.addi %180, %242 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3294658Z         %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3294864Z         %245 = tt.load %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3295087Z         %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3295433Z         %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3295837Z         %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3296141Z         %249 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:41.3296312Z         %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3296527Z         %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3296798Z         %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3297041Z         %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3297231Z         %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3297423Z         %255 = arith.addi %254, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3297618Z         %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3297824Z         %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3297996Z         %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3298160Z         %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3298343Z         %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3298532Z         %261 = arith.andi %260, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3298701Z         %262 = tt.load %256, %261, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3298957Z         %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3299244Z         %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3299500Z         %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3299745Z         %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3300040Z         %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3300399Z         %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3300689Z         %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3300935Z         %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3301184Z         %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3301426Z         %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3301660Z         %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3301887Z         %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3302139Z         %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3302466Z         %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3302942Z         %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3303302Z         %278 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:48:41.3303428Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:48:41.3303598Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3303836Z         %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3304113Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3304388Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3304584Z         %284 = arith.addi %180, %283 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3304786Z         %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3304992Z         %286 = tt.load %285 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3305217Z         %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3305542Z         %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3305950Z         %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3306232Z         %290 = arith.extsi %278 : i32 to i64
2026-02-21T09:48:41.3306399Z         %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3306617Z         %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3306885Z         %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3307132Z         %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3307443Z         %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3307635Z         %296 = arith.addi %295, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3307832Z         %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3308051Z         %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3308222Z         %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3308382Z         %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3308571Z         %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3308760Z         %302 = arith.andi %301, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3308928Z         %303 = tt.load %297, %302, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3309193Z         %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3309481Z         %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3309725Z         %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3309970Z         %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3310265Z         %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3310611Z         %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3310899Z         %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3311168Z         %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3311415Z         %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3311673Z         %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3311912Z         %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3312136Z         %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3312392Z         %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3312725Z         %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3313198Z         %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3313550Z         %319 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:41.3313676Z         %320 = arith.muli %319, %c2_i32 : i32
2026-02-21T09:48:41.3313844Z         %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3314068Z         %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3314341Z         %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3314616Z         %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3314813Z         %325 = arith.addi %180, %324 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3315032Z         %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3315239Z         %327 = tt.load %326 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3315460Z         %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3315801Z         %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3316206Z         %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3316484Z         %331 = arith.extsi %319 : i32 to i64
2026-02-21T09:48:41.3316652Z         %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3316868Z         %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3317140Z         %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3317387Z         %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3317578Z         %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3317771Z         %337 = arith.addi %336, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3317963Z         %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3318170Z         %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3318341Z         %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3318500Z         %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3318699Z         %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3318888Z         %343 = arith.andi %342, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3319056Z         %344 = tt.load %338, %343, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3319334Z         %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3319620Z         %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3319860Z         %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3320102Z         %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3320400Z         %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3320744Z         %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3321031Z         %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3321284Z         %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3321529Z         %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3321769Z         %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3322005Z         %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3322227Z         %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3322485Z         %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3322896Z         %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3323374Z         %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3323741Z         scf.yield %359 : tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3323873Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:41.3324017Z       %191 = arith.addi %180, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3324216Z       %192 = tt.addptr %6, %191 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3324425Z       %193 = tt.load %192 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3324648Z       %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3324974Z       %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3325377Z       %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3325675Z       %197 = arith.addi %33, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3325871Z       %198 = tt.addptr %7, %197 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3326067Z       %199 = arith.andi %37, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3326232Z       %200 = tt.load %198, %199, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3326491Z       %201 = ttg.convert_layout %200 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3326791Z       %202 = arith.shli %201, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3327033Z       %203 = arith.shrsi %202, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3327293Z       %204 = arith.shrsi %201, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3327588Z       %205 = tt.expand_dims %203 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3327929Z       %206 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3328215Z       %207 = tt.broadcast %205 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3328464Z       %208 = arith.select %16, %207, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3328711Z       %209 = tt.broadcast %206 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3328949Z       %210 = arith.select %18, %209, %208 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3329188Z       %211 = tt.reshape %210 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3329414Z       %212 = arith.sitofp %211 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3329670Z       %213 = ttg.local_alloc %212 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3330000Z       %214 = ttg.local_load %213 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3330468Z       %215 = tt.dot %196, %214, %190, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3330873Z       %216 = arith.truncf %215 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:48:41.3331052Z       %217 = arith.extsi %174 : i32 to i64
2026-02-21T09:48:41.3331215Z       %218 = tt.splat %217 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:41.3331442Z       %219 = arith.addi %218, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:41.3331709Z       %220 = tt.expand_dims %219 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3331949Z       %221 = arith.muli %220, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3332132Z       %222 = tt.broadcast %221 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3332340Z       %223 = tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:41.3332550Z       %224 = arith.addi %223, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:41.3332815Z       %225 = tt.expand_dims %224 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3333080Z       %226 = tt.broadcast %225 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3333262Z       %227 = arith.addi %222, %226 : tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3333454Z       %228 = tt.addptr %19, %227 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3333658Z       %229 = arith.cmpi sge, %220, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3333819Z       %230 = arith.cmpi slt, %220, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3333976Z       %231 = arith.andi %229, %230 : tensor<128x1xi1, #mma>
2026-02-21T09:48:41.3334151Z       %232 = tt.broadcast %231 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3334354Z       %233 = arith.cmpi sge, %225, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3334518Z       %234 = arith.cmpi slt, %225, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3334673Z       %235 = arith.andi %233, %234 : tensor<1x512xi1, #mma>
2026-02-21T09:48:41.3334861Z       %236 = tt.broadcast %235 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3335040Z       %237 = arith.andi %232, %236 : tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3335202Z       tt.store %228, %216, %237 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:41.3335346Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:48:41.3335467Z     scf.for %arg3 = %26 to %2 step %c1_i32  : i32 {
2026-02-21T09:48:41.3335603Z       %38 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:48:41.3335727Z       %39 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:48:41.3335849Z       %40 = arith.muli %38, %c128_i32 : i32
2026-02-21T09:48:41.3336015Z       %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:41.3336238Z       %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:48:41.3336409Z       %43 = arith.muli %39, %c512_i32 : i32
2026-02-21T09:48:41.3336637Z       %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:48:41.3336890Z       %45 = arith.muli %44, %cst_12 : tensor<128x1xi32, #blocked2>
2026-02-21T09:48:41.3337079Z       %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3337254Z       %47 = arith.extsi %43 : i32 to i64
2026-02-21T09:48:41.3337415Z       %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:48:41.3337629Z       %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:48:41.3337897Z       %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3338187Z       %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3338384Z       %52 = arith.cmpi sge, %50, %cst_10 : tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3338557Z       %53 = arith.cmpi slt, %50, %cst_11 : tensor<1x512xi64, #blocked>
2026-02-21T09:48:41.3338733Z       %54 = arith.andi %52, %53 : tensor<1x512xi1, #blocked>
2026-02-21T09:48:41.3338910Z       %55 = tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3339176Z       %56 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_9) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:48:41.3339391Z         %104 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:41.3339562Z         %105 = tt.splat %104 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3339785Z         %106 = arith.addi %105, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3340061Z         %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3340345Z         %108 = tt.broadcast %107 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3340541Z         %109 = arith.addi %46, %108 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3340744Z         %110 = tt.addptr %6, %109 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3340953Z         %111 = tt.load %110 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3341175Z         %112 = ttg.local_alloc %111 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3341507Z         %113 = ttg.local_load %112 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3341930Z         %114 = arith.extf %113 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3350044Z         %115 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:41.3350309Z         %116 = tt.splat %115 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3350540Z         %117 = arith.addi %116, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3350828Z         %118 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3351077Z         %119 = arith.muli %118, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3351271Z         %120 = tt.broadcast %119 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3351467Z         %121 = arith.addi %120, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3351668Z         %122 = tt.addptr %7, %121 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3351884Z         %123 = arith.cmpi sge, %118, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3352053Z         %124 = arith.cmpi slt, %118, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3352222Z         %125 = arith.andi %123, %124 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3352408Z         %126 = tt.broadcast %125 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3352596Z         %127 = arith.andi %126, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3352766Z         %128 = tt.load %122, %127, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3353027Z         %129 = ttg.convert_layout %128 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3353314Z         %130 = arith.shli %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3353558Z         %131 = arith.shrsi %130, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3353827Z         %132 = arith.shrsi %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3354130Z         %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3354488Z         %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3354781Z         %135 = tt.broadcast %133 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3355031Z         %136 = arith.select %16, %135, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3355282Z         %137 = tt.broadcast %134 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3355527Z         %138 = arith.select %18, %137, %136 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3355762Z         %139 = tt.reshape %138 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3355994Z         %140 = arith.sitofp %139 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3356250Z         %141 = ttg.local_alloc %140 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3356581Z         %142 = ttg.local_load %141 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3357063Z         %143 = tt.dot %114, %142, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3357413Z         %144 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:48:41.3357540Z         %145 = arith.muli %144, %c2_i32 : i32
2026-02-21T09:48:41.3357730Z         %146 = tt.splat %145 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3357952Z         %147 = arith.addi %146, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3358247Z         %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3358523Z         %149 = tt.broadcast %148 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3358721Z         %150 = arith.addi %46, %149 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3358926Z         %151 = tt.addptr %6, %150 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3359133Z         %152 = tt.load %151 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3359361Z         %153 = ttg.local_alloc %152 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3359692Z         %154 = ttg.local_load %153 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3360099Z         %155 = arith.extf %154 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3360382Z         %156 = arith.extsi %144 : i32 to i64
2026-02-21T09:48:41.3360549Z         %157 = tt.splat %156 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3360768Z         %158 = arith.addi %157, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3361041Z         %159 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3361289Z         %160 = arith.muli %159, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3361482Z         %161 = tt.broadcast %160 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3361690Z         %162 = arith.addi %161, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3361888Z         %163 = tt.addptr %7, %162 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3362094Z         %164 = arith.cmpi sge, %159, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3362282Z         %165 = arith.cmpi slt, %159, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3362446Z         %166 = arith.andi %164, %165 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3362677Z         %167 = tt.broadcast %166 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3362867Z         %168 = arith.andi %167, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3363035Z         %169 = tt.load %163, %168, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3363299Z         %170 = ttg.convert_layout %169 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3363585Z         %171 = arith.shli %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3363832Z         %172 = arith.shrsi %171, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3364077Z         %173 = arith.shrsi %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3364377Z         %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3364726Z         %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3365021Z         %176 = tt.broadcast %174 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3365291Z         %177 = arith.select %16, %176, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3365541Z         %178 = tt.broadcast %175 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3365781Z         %179 = arith.select %18, %178, %177 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3366034Z         %180 = tt.reshape %179 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3366264Z         %181 = arith.sitofp %180 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3366517Z         %182 = ttg.local_alloc %181 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3366851Z         %183 = ttg.local_load %182 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3367326Z         %184 = tt.dot %155, %183, %143, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3367675Z         %185 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:41.3367805Z         %186 = arith.muli %185, %c2_i32 : i32
2026-02-21T09:48:41.3367975Z         %187 = tt.splat %186 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3368201Z         %188 = arith.addi %187, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:48:41.3368475Z         %189 = tt.expand_dims %188 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:48:41.3368754Z         %190 = tt.broadcast %189 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3368952Z         %191 = arith.addi %46, %190 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3369154Z         %192 = tt.addptr %6, %191 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3369388Z         %193 = tt.load %192 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3369618Z         %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3369961Z         %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3370386Z         %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3370667Z         %197 = arith.extsi %185 : i32 to i64
2026-02-21T09:48:41.3370838Z         %198 = tt.splat %197 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3371056Z         %199 = arith.addi %198, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:41.3371332Z         %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3371576Z         %201 = arith.muli %200, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3371766Z         %202 = tt.broadcast %201 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3371957Z         %203 = arith.addi %202, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3372152Z         %204 = tt.addptr %7, %203 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3372361Z         %205 = arith.cmpi sge, %200, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3372530Z         %206 = arith.cmpi slt, %200, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:48:41.3372691Z         %207 = arith.andi %205, %206 : tensor<2x1xi1, #blocked>
2026-02-21T09:48:41.3372876Z         %208 = tt.broadcast %207 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3373061Z         %209 = arith.andi %208, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3373247Z         %210 = tt.load %204, %209, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3373506Z         %211 = ttg.convert_layout %210 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3373815Z         %212 = arith.shli %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3374058Z         %213 = arith.shrsi %212, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3374302Z         %214 = arith.shrsi %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3374600Z         %215 = tt.expand_dims %213 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3374944Z         %216 = tt.expand_dims %214 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3375235Z         %217 = tt.broadcast %215 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3375485Z         %218 = arith.select %16, %217, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3375734Z         %219 = tt.broadcast %216 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3375977Z         %220 = arith.select %18, %219, %218 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3376214Z         %221 = tt.reshape %220 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3376439Z         %222 = arith.sitofp %221 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3376695Z         %223 = ttg.local_alloc %222 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3377042Z         %224 = ttg.local_load %223 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3377513Z         %225 = tt.dot %196, %224, %184, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3377892Z         scf.yield %225 : tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3378027Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:41.3378170Z       %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3378365Z       %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:48:41.3378565Z       %59 = tt.load %58 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:48:41.3378783Z       %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:48:41.3379108Z       %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3379509Z       %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3379804Z       %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3379994Z       %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:48:41.3380187Z       %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:48:41.3380347Z       %66 = tt.load %64, %65, %cst_15 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:48:41.3380601Z       %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3380878Z       %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3381131Z       %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3381369Z       %70 = arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:41.3381675Z       %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3382013Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:48:41.3382294Z       %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3382537Z       %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3382775Z       %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3383008Z       %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:48:41.3383235Z       %77 = tt.reshape %76 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked3>
2026-02-21T09:48:41.3383456Z       %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked3> to tensor<4x512xf32, #blocked3>
2026-02-21T09:48:41.3383707Z       %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked3>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:48:41.3384027Z       %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:41.3384487Z       %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:48:41.3384867Z       %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:48:41.3385040Z       %83 = arith.extsi %40 : i32 to i64
2026-02-21T09:48:41.3385217Z       %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:41.3385428Z       %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:41.3385691Z       %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3385943Z       %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3386118Z       %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3386321Z       %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:41.3386524Z       %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:41.3386780Z       %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3387040Z       %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3387216Z       %93 = arith.addi %88, %92 : tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3387409Z       %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:48:41.3387608Z       %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3387768Z       %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:48:41.3387920Z       %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma>
2026-02-21T09:48:41.3388087Z       %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3388268Z       %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3388432Z       %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:48:41.3388589Z       %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma>
2026-02-21T09:48:41.3388782Z       %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3388960Z       %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma>
2026-02-21T09:48:41.3389137Z       tt.store %94, %82, %103 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:41.3389280Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:48:41.3389385Z     tt.return
2026-02-21T09:48:41.3389463Z   }
2026-02-21T09:48:41.3389539Z }
2026-02-21T09:48:41.3389583Z 
2026-02-21T09:48:41.3389615Z {-#
2026-02-21T09:48:41.3389694Z   external_resources: {
2026-02-21T09:48:41.3389792Z     mlir_reproducer: {
2026-02-21T09:48:41.3390793Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:48:41.3391786Z       disable_threading: false,
2026-02-21T09:48:41.3391893Z       verify_each: true
2026-02-21T09:48:41.3391982Z     }
2026-02-21T09:48:41.3392055Z   }
2026-02-21T09:48:41.3392123Z #-}
2026-02-21T09:48:41.3392399Z /tmp/torchinductor_root/rq/crqlyd4egjayq53eu2nago7umu7wvbg3q2ja7qxq3ncocq5xs6tb.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:48:41.3393113Z /tmp/torchinductor_root/rq/crqlyd4egjayq53eu2nago7umu7wvbg3q2ja7qxq3ncocq5xs6tb.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:48:41.3393666Z [252s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:48:41.3394464Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 512], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:48:41.3395186Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:48:41.3395353Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:48:42.3209861Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:48:42.3224933Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:48:42.3225534Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:48:42.3226077Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:48:42.3226571Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:48:42.3227018Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:48:42.3227463Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:48:42.3227784Z #smem = #ttg.shared_memory
2026-02-21T09:48:42.3228196Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:48:42.3229270Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:48:42.3230089Z     %cst = arith.constant dense<0.000000e+00> : tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3230376Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:48:42.3230589Z     %c608_i32 = arith.constant 608 : i32
2026-02-21T09:48:42.3230793Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:48:42.3230994Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:48:42.3231200Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:48:42.3231466Z     %cst_0 = arith.constant dense<0> : tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3231730Z     %c1824_i32 = arith.constant 1824 : i32
2026-02-21T09:48:42.3231933Z     %c1216_i32 = arith.constant 1216 : i32
2026-02-21T09:48:42.3232132Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:48:42.3232331Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:48:42.3232532Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:48:42.3232733Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T09:48:42.3232948Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:48:42.3233145Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:48:42.3233339Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:48:42.3233542Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:48:42.3233883Z     %cst_1 = arith.constant dense<0> : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3234343Z     %cst_2 = arith.constant dense<8192> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3234795Z     %cst_3 = arith.constant dense<0> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3235126Z     %c16991_i32 = arith.constant 16991 : i32
2026-02-21T09:48:42.3235397Z     %cst_4 = arith.constant dense<1024> : tensor<256x1xi32, #blocked1>
2026-02-21T09:48:42.3235886Z     %cst_5 = arith.constant dense<4> : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3236267Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:42.3236570Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:42.3236941Z     %cst_8 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3237414Z     %cst_9 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3237736Z     %cst_10 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3238023Z     %cst_11 = arith.constant dense<0> : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3238245Z     %cst_12 = arith.constant dense<8192> : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3238468Z     %cst_13 = arith.constant dense<8192> : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3238696Z     %cst_14 = arith.constant dense<0> : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3238916Z     %cst_15 = arith.constant dense<16384> : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3239116Z     %0 = tt.get_program_id x : i32
2026-02-21T09:48:42.3239375Z     %1 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:42.3239751Z     %2 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3240105Z     %3 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3240434Z     %4 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3240751Z     %5 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3241171Z     %6 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3241738Z     %7 = arith.extsi %6 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3242331Z     %8 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3242943Z     %9 = arith.extsi %8 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3243506Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:48:42.3244050Z     %11 = tt.expand_dims %10 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:48:42.3244582Z     %12 = tt.expand_dims %11 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:42.3244913Z     %13 = arith.cmpi eq, %12, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:42.3245176Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x32xi1, #blocked>
2026-02-21T09:48:42.3245433Z     %15 = arith.cmpi eq, %12, %cst_7 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:42.3245679Z     %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x32xi1, #blocked>
2026-02-21T09:48:42.3245960Z     %17 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<256x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:42.3246321Z     %18 = arith.extsi %2 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3246723Z     %19 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3247151Z     %20 = arith.extsi %19 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3247396Z     %21 = arith.subi %c16991_i32, %0 : i32
2026-02-21T09:48:42.3247521Z     %22 = arith.divui %21, %c608_i32 : i32
2026-02-21T09:48:42.3247663Z     %23 = arith.remsi %22, %c4_i32 : i32
2026-02-21T09:48:42.3247784Z     %24 = arith.subi %22, %23 : i32
2026-02-21T09:48:42.3247908Z     %25 = arith.muli %24, %c608_i32 : i32
2026-02-21T09:48:42.3248029Z     %26 = arith.addi %0, %25 : i32
2026-02-21T09:48:42.3248164Z     scf.for %arg3 = %0 to %26 step %c2432_i32  : i32 {
2026-02-21T09:48:42.3248314Z       %27 = arith.divsi %arg3, %c1024_i32 : i32
2026-02-21T09:48:42.3248440Z       %28 = arith.muli %27, %c4_i32 : i32
2026-02-21T09:48:42.3248562Z       %29 = arith.subi %c64_i32, %28 : i32
2026-02-21T09:48:42.3248683Z       %30 = arith.minsi %29, %c4_i32 : i32
2026-02-21T09:48:42.3248809Z       %31 = arith.remsi %arg3, %c1024_i32 : i32
2026-02-21T09:48:42.3248934Z       %32 = arith.remsi %31, %30 : i32
2026-02-21T09:48:42.3249053Z       %33 = arith.addi %28, %32 : i32
2026-02-21T09:48:42.3249174Z       %34 = arith.divsi %31, %30 : i32
2026-02-21T09:48:42.3249294Z       %35 = arith.muli %33, %c256_i32 : i32
2026-02-21T09:48:42.3249477Z       %36 = tt.splat %35 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:42.3249715Z       %37 = arith.addi %36, %1 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:42.3249901Z       %38 = arith.muli %34, %c32_i32 : i32
2026-02-21T09:48:42.3250135Z       %39 = tt.expand_dims %37 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1>
2026-02-21T09:48:42.3250405Z       %40 = arith.muli %39, %cst_4 : tensor<256x1xi32, #blocked1>
2026-02-21T09:48:42.3250611Z       %41 = tt.broadcast %40 : tensor<256x1xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3250813Z       %42 = arith.extsi %38 : i32 to i64
2026-02-21T09:48:42.3251031Z       %43 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3251363Z       %44 = arith.addi %43, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3251773Z       %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3252224Z       %46 = tt.broadcast %45 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3252548Z       %47 = arith.cmpi sge, %45, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3252803Z       %48 = arith.cmpi slt, %45, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3253050Z       %49 = arith.andi %47, %48 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3253358Z       %50 = tt.broadcast %49 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3253721Z       %51 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg5 = %cst) -> (tensor<256x32xf32, #mma>)  : i32 {
2026-02-21T09:48:42.3253948Z         %218 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:42.3254132Z         %219 = tt.splat %218 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3254372Z         %220 = arith.addi %219, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3254669Z         %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3254966Z         %222 = tt.broadcast %221 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3255191Z         %223 = arith.addi %41, %222 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3255410Z         %224 = tt.addptr %4, %223 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3255632Z         %225 = tt.load %224 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3255890Z         %226 = ttg.local_alloc %225 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3256245Z         %227 = ttg.local_load %226 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3256684Z         %228 = arith.extf %227 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3256992Z         %229 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:42.3257208Z         %230 = tt.splat %229 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3257501Z         %231 = arith.addi %230, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3257888Z         %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3258243Z         %233 = arith.muli %232, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3258550Z         %234 = tt.broadcast %233 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3258854Z         %235 = arith.addi %234, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3259175Z         %236 = tt.addptr %5, %235 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3259490Z         %237 = arith.cmpi sge, %232, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3259754Z         %238 = arith.cmpi slt, %232, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3259991Z         %239 = arith.andi %237, %238 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3260294Z         %240 = tt.broadcast %239 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3260594Z         %241 = arith.andi %240, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3260840Z         %242 = tt.load %236, %241, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3261088Z         %243 = arith.shli %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3261325Z         %244 = arith.shrsi %243, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3261566Z         %245 = arith.shrsi %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3261859Z         %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3262225Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3262511Z         %248 = tt.broadcast %246 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3262750Z         %249 = arith.select %14, %248, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3262991Z         %250 = tt.broadcast %247 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3263260Z         %251 = arith.select %16, %250, %249 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3263491Z         %252 = tt.reshape %251 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3263716Z         %253 = arith.sitofp %252 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3263981Z         %254 = ttg.local_alloc %253 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3264304Z         %255 = ttg.local_load %254 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3264779Z         %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3265131Z         %257 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:48:42.3265256Z         %258 = arith.muli %257, %c2_i32 : i32
2026-02-21T09:48:42.3265426Z         %259 = tt.splat %258 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3265651Z         %260 = arith.addi %259, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3265927Z         %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3266202Z         %262 = tt.broadcast %261 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3266397Z         %263 = arith.addi %41, %262 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3266597Z         %264 = tt.addptr %4, %263 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3266808Z         %265 = tt.load %264 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3267049Z         %266 = ttg.local_alloc %265 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3267380Z         %267 = ttg.local_load %266 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3267805Z         %268 = arith.extf %267 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3268085Z         %269 = arith.extsi %257 : i32 to i64
2026-02-21T09:48:42.3268296Z         %270 = tt.splat %269 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3268595Z         %271 = arith.addi %270, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3268987Z         %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3269339Z         %273 = arith.muli %272, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3269650Z         %274 = tt.broadcast %273 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3269953Z         %275 = arith.addi %274, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3270260Z         %276 = tt.addptr %5, %275 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3270571Z         %277 = arith.cmpi sge, %272, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3270827Z         %278 = arith.cmpi slt, %272, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3271084Z         %279 = arith.andi %277, %278 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3271380Z         %280 = tt.broadcast %279 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3271679Z         %281 = arith.andi %280, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3271941Z         %282 = tt.load %276, %281, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3272186Z         %283 = arith.shli %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3272421Z         %284 = arith.shrsi %283, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3272654Z         %285 = arith.shrsi %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3272941Z         %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3273275Z         %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3273555Z         %288 = tt.broadcast %286 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3273793Z         %289 = arith.select %14, %288, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3274025Z         %290 = tt.broadcast %287 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3274254Z         %291 = arith.select %16, %290, %289 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3274484Z         %292 = tt.reshape %291 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3274703Z         %293 = arith.sitofp %292 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3274971Z         %294 = ttg.local_alloc %293 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3275292Z         %295 = ttg.local_load %294 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3275777Z         %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3276124Z         %297 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:42.3276246Z         %298 = arith.muli %297, %c2_i32 : i32
2026-02-21T09:48:42.3276418Z         %299 = tt.splat %298 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3276639Z         %300 = arith.addi %299, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3276918Z         %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3277196Z         %302 = tt.broadcast %301 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3277392Z         %303 = arith.addi %41, %302 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3277594Z         %304 = tt.addptr %4, %303 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3277800Z         %305 = tt.load %304 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3278027Z         %306 = ttg.local_alloc %305 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3278357Z         %307 = ttg.local_load %306 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3278764Z         %308 = arith.extf %307 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3279065Z         %309 = arith.extsi %297 : i32 to i64
2026-02-21T09:48:42.3279271Z         %310 = tt.splat %309 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3279570Z         %311 = arith.addi %310, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3279971Z         %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3280323Z         %313 = arith.muli %312, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3280629Z         %314 = tt.broadcast %313 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3280935Z         %315 = arith.addi %314, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3281243Z         %316 = tt.addptr %5, %315 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3281558Z         %317 = arith.cmpi sge, %312, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3281802Z         %318 = arith.cmpi slt, %312, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3282042Z         %319 = arith.andi %317, %318 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3282344Z         %320 = tt.broadcast %319 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3282703Z         %321 = arith.andi %320, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3282968Z         %322 = tt.load %316, %321, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3283218Z         %323 = arith.shli %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3283482Z         %324 = arith.shrsi %323, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3283725Z         %325 = arith.shrsi %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3284013Z         %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3284353Z         %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3284642Z         %328 = tt.broadcast %326 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3284880Z         %329 = arith.select %14, %328, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3285123Z         %330 = tt.broadcast %327 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3285357Z         %331 = arith.select %16, %330, %329 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3285593Z         %332 = tt.reshape %331 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3285821Z         %333 = arith.sitofp %332 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3286075Z         %334 = ttg.local_alloc %333 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3286402Z         %335 = ttg.local_load %334 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3286890Z         %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3287241Z         %337 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:48:42.3287374Z         %338 = arith.muli %337, %c2_i32 : i32
2026-02-21T09:48:42.3287550Z         %339 = tt.splat %338 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3287798Z         %340 = arith.addi %339, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3288076Z         %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3288361Z         %342 = tt.broadcast %341 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3288563Z         %343 = arith.addi %41, %342 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3288767Z         %344 = tt.addptr %4, %343 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3288982Z         %345 = tt.load %344 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3289210Z         %346 = ttg.local_alloc %345 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3289546Z         %347 = ttg.local_load %346 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3289961Z         %348 = arith.extf %347 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3290245Z         %349 = arith.extsi %337 : i32 to i64
2026-02-21T09:48:42.3290460Z         %350 = tt.splat %349 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3290761Z         %351 = arith.addi %350, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3291173Z         %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3291556Z         %353 = arith.muli %352, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3291864Z         %354 = tt.broadcast %353 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3292170Z         %355 = arith.addi %354, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3292482Z         %356 = tt.addptr %5, %355 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3292798Z         %357 = arith.cmpi sge, %352, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3293050Z         %358 = arith.cmpi slt, %352, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3293287Z         %359 = arith.andi %357, %358 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3293594Z         %360 = tt.broadcast %359 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3293897Z         %361 = arith.andi %360, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3294140Z         %362 = tt.load %356, %361, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3294393Z         %363 = arith.shli %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3294627Z         %364 = arith.shrsi %363, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3294867Z         %365 = arith.shrsi %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3308479Z         %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3308833Z         %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3309147Z         %368 = tt.broadcast %366 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3309388Z         %369 = arith.select %14, %368, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3309626Z         %370 = tt.broadcast %367 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3309863Z         %371 = arith.select %16, %370, %369 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3310096Z         %372 = tt.reshape %371 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3310326Z         %373 = arith.sitofp %372 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3310583Z         %374 = ttg.local_alloc %373 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3310909Z         %375 = ttg.local_load %374 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3311383Z         %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3311740Z         scf.yield %376 : tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3311879Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:42.3312053Z       %52 = arith.truncf %51 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma>
2026-02-21T09:48:42.3312224Z       %53 = arith.extsi %35 : i32 to i64
2026-02-21T09:48:42.3312409Z       %54 = tt.splat %53 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3312620Z       %55 = arith.addi %54, %18 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3312908Z       %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3313152Z       %57 = arith.muli %56, %cst_13 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3313331Z       %58 = tt.broadcast %57 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3313540Z       %59 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3313744Z       %60 = arith.addi %59, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3314006Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3314262Z       %62 = tt.broadcast %61 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3314444Z       %63 = arith.addi %58, %62 : tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3314633Z       %64 = tt.addptr %17, %63 : tensor<256x32x!tt.ptr<bf16>, #mma>, tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3314831Z       %65 = arith.cmpi sge, %56, %cst_14 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3315002Z       %66 = arith.cmpi slt, %56, %cst_15 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3315157Z       %67 = arith.andi %65, %66 : tensor<256x1xi1, #mma>
2026-02-21T09:48:42.3315329Z       %68 = tt.broadcast %67 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3315515Z       %69 = arith.cmpi sge, %61, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3315675Z       %70 = arith.cmpi slt, %61, %cst_12 : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3315833Z       %71 = arith.andi %69, %70 : tensor<1x32xi1, #mma>
2026-02-21T09:48:42.3316001Z       %72 = tt.broadcast %71 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3316194Z       %73 = arith.andi %68, %72 : tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3316347Z       tt.store %64, %52, %73 : tensor<256x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:42.3316500Z       %74 = arith.addi %arg3, %c608_i32 : i32
2026-02-21T09:48:42.3316628Z       %75 = arith.divsi %74, %c1024_i32 : i32
2026-02-21T09:48:42.3316762Z       %76 = arith.muli %75, %c4_i32 : i32
2026-02-21T09:48:42.3316886Z       %77 = arith.subi %c64_i32, %76 : i32
2026-02-21T09:48:42.3317004Z       %78 = arith.minsi %77, %c4_i32 : i32
2026-02-21T09:48:42.3317126Z       %79 = arith.remsi %74, %c1024_i32 : i32
2026-02-21T09:48:42.3317245Z       %80 = arith.remsi %79, %78 : i32
2026-02-21T09:48:42.3317363Z       %81 = arith.addi %76, %80 : i32
2026-02-21T09:48:42.3317476Z       %82 = arith.divsi %79, %78 : i32
2026-02-21T09:48:42.3317597Z       %83 = arith.muli %81, %c256_i32 : i32
2026-02-21T09:48:42.3317771Z       %84 = tt.splat %83 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:42.3317995Z       %85 = arith.addi %84, %1 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:42.3318171Z       %86 = arith.muli %82, %c32_i32 : i32
2026-02-21T09:48:42.3318397Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1>
2026-02-21T09:48:42.3318655Z       %88 = arith.muli %87, %cst_4 : tensor<256x1xi32, #blocked1>
2026-02-21T09:48:42.3318854Z       %89 = tt.broadcast %88 : tensor<256x1xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3319029Z       %90 = arith.extsi %86 : i32 to i64
2026-02-21T09:48:42.3319239Z       %91 = tt.splat %90 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3319535Z       %92 = arith.addi %91, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3319946Z       %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3320391Z       %94 = tt.broadcast %93 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3320701Z       %95 = arith.cmpi sge, %93, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3320949Z       %96 = arith.cmpi slt, %93, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3321179Z       %97 = arith.andi %95, %96 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3321480Z       %98 = tt.broadcast %97 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3321826Z       %99 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg5 = %cst) -> (tensor<256x32xf32, #mma>)  : i32 {
2026-02-21T09:48:42.3322041Z         %218 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:42.3322222Z         %219 = tt.splat %218 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3322452Z         %220 = arith.addi %219, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3322768Z         %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3323051Z         %222 = tt.broadcast %221 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3323249Z         %223 = arith.addi %89, %222 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3323460Z         %224 = tt.addptr %4, %223 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3323670Z         %225 = tt.load %224 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3323922Z         %226 = ttg.local_alloc %225 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3324258Z         %227 = ttg.local_load %226 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3324669Z         %228 = arith.extf %227 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3324980Z         %229 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:42.3325193Z         %230 = tt.splat %229 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3325496Z         %231 = arith.addi %230, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3325893Z         %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3326252Z         %233 = arith.muli %232, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3326564Z         %234 = tt.broadcast %233 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3326874Z         %235 = arith.addi %234, %94 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3327183Z         %236 = tt.addptr %5, %235 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3327502Z         %237 = arith.cmpi sge, %232, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3327752Z         %238 = arith.cmpi slt, %232, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3328010Z         %239 = arith.andi %237, %238 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3328317Z         %240 = tt.broadcast %239 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3328635Z         %241 = arith.andi %240, %98 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3328885Z         %242 = tt.load %236, %241, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3329132Z         %243 = arith.shli %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3329371Z         %244 = arith.shrsi %243, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3329614Z         %245 = arith.shrsi %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3329904Z         %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3330244Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3330536Z         %248 = tt.broadcast %246 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3330776Z         %249 = arith.select %14, %248, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3331015Z         %250 = tt.broadcast %247 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3331246Z         %251 = arith.select %16, %250, %249 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3331482Z         %252 = tt.reshape %251 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3331709Z         %253 = arith.sitofp %252 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3331982Z         %254 = ttg.local_alloc %253 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3332312Z         %255 = ttg.local_load %254 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3332782Z         %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3333149Z         %257 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:48:42.3333274Z         %258 = arith.muli %257, %c2_i32 : i32
2026-02-21T09:48:42.3333446Z         %259 = tt.splat %258 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3333670Z         %260 = arith.addi %259, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3333948Z         %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3334230Z         %262 = tt.broadcast %261 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3334428Z         %263 = arith.addi %89, %262 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3334629Z         %264 = tt.addptr %4, %263 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3334838Z         %265 = tt.load %264 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3335060Z         %266 = ttg.local_alloc %265 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3335392Z         %267 = ttg.local_load %266 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3335828Z         %268 = arith.extf %267 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3336109Z         %269 = arith.extsi %257 : i32 to i64
2026-02-21T09:48:42.3336318Z         %270 = tt.splat %269 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3336643Z         %271 = arith.addi %270, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3337032Z         %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3337388Z         %273 = arith.muli %272, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3337697Z         %274 = tt.broadcast %273 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3338003Z         %275 = arith.addi %274, %94 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3338311Z         %276 = tt.addptr %5, %275 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3338625Z         %277 = arith.cmpi sge, %272, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3338874Z         %278 = arith.cmpi slt, %272, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3339107Z         %279 = arith.andi %277, %278 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3339407Z         %280 = tt.broadcast %279 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3339702Z         %281 = arith.andi %280, %98 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3339960Z         %282 = tt.load %276, %281, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3340207Z         %283 = arith.shli %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3340439Z         %284 = arith.shrsi %283, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3340689Z         %285 = arith.shrsi %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3340974Z         %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3341306Z         %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3341589Z         %288 = tt.broadcast %286 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3341828Z         %289 = arith.select %14, %288, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3342066Z         %290 = tt.broadcast %287 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3342298Z         %291 = arith.select %16, %290, %289 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3342524Z         %292 = tt.reshape %291 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3342745Z         %293 = arith.sitofp %292 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3342993Z         %294 = ttg.local_alloc %293 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3343314Z         %295 = ttg.local_load %294 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3343803Z         %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3344144Z         %297 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:42.3344286Z         %298 = arith.muli %297, %c2_i32 : i32
2026-02-21T09:48:42.3344459Z         %299 = tt.splat %298 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3344681Z         %300 = arith.addi %299, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3344956Z         %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3345230Z         %302 = tt.broadcast %301 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3345425Z         %303 = arith.addi %89, %302 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3345627Z         %304 = tt.addptr %4, %303 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3345832Z         %305 = tt.load %304 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3346056Z         %306 = ttg.local_alloc %305 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3346388Z         %307 = ttg.local_load %306 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3346799Z         %308 = arith.extf %307 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3347083Z         %309 = arith.extsi %297 : i32 to i64
2026-02-21T09:48:42.3347289Z         %310 = tt.splat %309 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3347588Z         %311 = arith.addi %310, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3347997Z         %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3348350Z         %313 = arith.muli %312, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3348668Z         %314 = tt.broadcast %313 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3348966Z         %315 = arith.addi %314, %94 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3349275Z         %316 = tt.addptr %5, %315 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3349591Z         %317 = arith.cmpi sge, %312, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3349840Z         %318 = arith.cmpi slt, %312, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3350074Z         %319 = arith.andi %317, %318 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3350434Z         %320 = tt.broadcast %319 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3350733Z         %321 = arith.andi %320, %98 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3350977Z         %322 = tt.load %316, %321, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3351221Z         %323 = arith.shli %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3351458Z         %324 = arith.shrsi %323, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3351708Z         %325 = arith.shrsi %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3351995Z         %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3352347Z         %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3352630Z         %328 = tt.broadcast %326 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3352865Z         %329 = arith.select %14, %328, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3353100Z         %330 = tt.broadcast %327 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3353326Z         %331 = arith.select %16, %330, %329 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3353553Z         %332 = tt.reshape %331 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3353774Z         %333 = arith.sitofp %332 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3354025Z         %334 = ttg.local_alloc %333 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3354355Z         %335 = ttg.local_load %334 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3354820Z         %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3355164Z         %337 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:48:42.3355285Z         %338 = arith.muli %337, %c2_i32 : i32
2026-02-21T09:48:42.3355458Z         %339 = tt.splat %338 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3355700Z         %340 = arith.addi %339, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3355975Z         %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3356253Z         %342 = tt.broadcast %341 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3356462Z         %343 = arith.addi %89, %342 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3356662Z         %344 = tt.addptr %4, %343 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3356869Z         %345 = tt.load %344 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3357093Z         %346 = ttg.local_alloc %345 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3357422Z         %347 = ttg.local_load %346 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3357827Z         %348 = arith.extf %347 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3358111Z         %349 = arith.extsi %337 : i32 to i64
2026-02-21T09:48:42.3358320Z         %350 = tt.splat %349 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3358616Z         %351 = arith.addi %350, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3359010Z         %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3359367Z         %353 = arith.muli %352, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3359687Z         %354 = tt.broadcast %353 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3359987Z         %355 = arith.addi %354, %94 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3360311Z         %356 = tt.addptr %5, %355 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3360631Z         %357 = arith.cmpi sge, %352, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3360880Z         %358 = arith.cmpi slt, %352, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3361113Z         %359 = arith.andi %357, %358 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3361410Z         %360 = tt.broadcast %359 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3361704Z         %361 = arith.andi %360, %98 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3361947Z         %362 = tt.load %356, %361, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3362193Z         %363 = arith.shli %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3362426Z         %364 = arith.shrsi %363, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3362700Z         %365 = arith.shrsi %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3362985Z         %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3363323Z         %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3363628Z         %368 = tt.broadcast %366 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3363862Z         %369 = arith.select %14, %368, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3364102Z         %370 = tt.broadcast %367 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3364353Z         %371 = arith.select %16, %370, %369 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3364584Z         %372 = tt.reshape %371 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3364808Z         %373 = arith.sitofp %372 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3365058Z         %374 = ttg.local_alloc %373 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3365382Z         %375 = ttg.local_load %374 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3365854Z         %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3366200Z         scf.yield %376 : tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3366333Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:42.3366499Z       %100 = arith.truncf %99 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma>
2026-02-21T09:48:42.3366668Z       %101 = arith.extsi %83 : i32 to i64
2026-02-21T09:48:42.3366831Z       %102 = tt.splat %101 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3367045Z       %103 = arith.addi %102, %18 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3367312Z       %104 = tt.expand_dims %103 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3367572Z       %105 = arith.muli %104, %cst_13 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3367754Z       %106 = tt.broadcast %105 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3367978Z       %107 = tt.splat %90 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3368184Z       %108 = arith.addi %107, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3368446Z       %109 = tt.expand_dims %108 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3368700Z       %110 = tt.broadcast %109 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3368882Z       %111 = arith.addi %106, %110 : tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3369070Z       %112 = tt.addptr %17, %111 : tensor<256x32x!tt.ptr<bf16>, #mma>, tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3369270Z       %113 = arith.cmpi sge, %104, %cst_14 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3369439Z       %114 = arith.cmpi slt, %104, %cst_15 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3369598Z       %115 = arith.andi %113, %114 : tensor<256x1xi1, #mma>
2026-02-21T09:48:42.3369773Z       %116 = tt.broadcast %115 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3369956Z       %117 = arith.cmpi sge, %109, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3370123Z       %118 = arith.cmpi slt, %109, %cst_12 : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3370278Z       %119 = arith.andi %117, %118 : tensor<1x32xi1, #mma>
2026-02-21T09:48:42.3370449Z       %120 = tt.broadcast %119 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3370625Z       %121 = arith.andi %116, %120 : tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3370780Z       tt.store %112, %100, %121 : tensor<256x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:42.3370930Z       %122 = arith.addi %arg3, %c1216_i32 : i32
2026-02-21T09:48:42.3371054Z       %123 = arith.divsi %122, %c1024_i32 : i32
2026-02-21T09:48:42.3371177Z       %124 = arith.muli %123, %c4_i32 : i32
2026-02-21T09:48:42.3371311Z       %125 = arith.subi %c64_i32, %124 : i32
2026-02-21T09:48:42.3371462Z       %126 = arith.minsi %125, %c4_i32 : i32
2026-02-21T09:48:42.3371585Z       %127 = arith.remsi %122, %c1024_i32 : i32
2026-02-21T09:48:42.3371703Z       %128 = arith.remsi %127, %126 : i32
2026-02-21T09:48:42.3371854Z       %129 = arith.addi %124, %128 : i32
2026-02-21T09:48:42.3371967Z       %130 = arith.divsi %127, %126 : i32
2026-02-21T09:48:42.3372083Z       %131 = arith.muli %129, %c256_i32 : i32
2026-02-21T09:48:42.3372254Z       %132 = tt.splat %131 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:42.3372480Z       %133 = arith.addi %132, %1 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:42.3372654Z       %134 = arith.muli %130, %c32_i32 : i32
2026-02-21T09:48:42.3372879Z       %135 = tt.expand_dims %133 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1>
2026-02-21T09:48:42.3373133Z       %136 = arith.muli %135, %cst_4 : tensor<256x1xi32, #blocked1>
2026-02-21T09:48:42.3373329Z       %137 = tt.broadcast %136 : tensor<256x1xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3373508Z       %138 = arith.extsi %134 : i32 to i64
2026-02-21T09:48:42.3373713Z       %139 = tt.splat %138 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3374012Z       %140 = arith.addi %139, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3374401Z       %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3374843Z       %142 = tt.broadcast %141 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3375157Z       %143 = arith.cmpi sge, %141, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3375419Z       %144 = arith.cmpi slt, %141, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3375653Z       %145 = arith.andi %143, %144 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3375955Z       %146 = tt.broadcast %145 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3376290Z       %147 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg5 = %cst) -> (tensor<256x32xf32, #mma>)  : i32 {
2026-02-21T09:48:42.3376505Z         %218 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:42.3376679Z         %219 = tt.splat %218 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3376903Z         %220 = arith.addi %219, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3377183Z         %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3377459Z         %222 = tt.broadcast %221 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3377660Z         %223 = arith.addi %137, %222 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3377863Z         %224 = tt.addptr %4, %223 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3378069Z         %225 = tt.load %224 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3378293Z         %226 = ttg.local_alloc %225 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3378626Z         %227 = ttg.local_load %226 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3379059Z         %228 = arith.extf %227 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3379345Z         %229 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:42.3379555Z         %230 = tt.splat %229 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3379870Z         %231 = arith.addi %230, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3380260Z         %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3380619Z         %233 = arith.muli %232, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3380937Z         %234 = tt.broadcast %233 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3381242Z         %235 = arith.addi %234, %142 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3381557Z         %236 = tt.addptr %5, %235 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3381877Z         %237 = arith.cmpi sge, %232, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3382120Z         %238 = arith.cmpi slt, %232, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3382359Z         %239 = arith.andi %237, %238 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3382662Z         %240 = tt.broadcast %239 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3382977Z         %241 = arith.andi %240, %146 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3383220Z         %242 = tt.load %236, %241, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3383479Z         %243 = arith.shli %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3383715Z         %244 = arith.shrsi %243, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3383948Z         %245 = arith.shrsi %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3384235Z         %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3384568Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3384848Z         %248 = tt.broadcast %246 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3385087Z         %249 = arith.select %14, %248, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3385323Z         %250 = tt.broadcast %247 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3385552Z         %251 = arith.select %16, %250, %249 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3385781Z         %252 = tt.reshape %251 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3386000Z         %253 = arith.sitofp %252 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3386249Z         %254 = ttg.local_alloc %253 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3394790Z         %255 = ttg.local_load %254 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3395349Z         %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3395703Z         %257 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:48:42.3395854Z         %258 = arith.muli %257, %c2_i32 : i32
2026-02-21T09:48:42.3396025Z         %259 = tt.splat %258 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3396253Z         %260 = arith.addi %259, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3396533Z         %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3396809Z         %262 = tt.broadcast %261 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3397007Z         %263 = arith.addi %137, %262 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3397217Z         %264 = tt.addptr %4, %263 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3397427Z         %265 = tt.load %264 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3397653Z         %266 = ttg.local_alloc %265 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3397986Z         %267 = ttg.local_load %266 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3398396Z         %268 = arith.extf %267 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3398676Z         %269 = arith.extsi %257 : i32 to i64
2026-02-21T09:48:42.3398886Z         %270 = tt.splat %269 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3399198Z         %271 = arith.addi %270, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3399585Z         %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3399962Z         %273 = arith.muli %272, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3400268Z         %274 = tt.broadcast %273 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3400573Z         %275 = arith.addi %274, %142 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3400888Z         %276 = tt.addptr %5, %275 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3401206Z         %277 = arith.cmpi sge, %272, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3401453Z         %278 = arith.cmpi slt, %272, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3401691Z         %279 = arith.andi %277, %278 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3401997Z         %280 = tt.broadcast %279 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3402300Z         %281 = arith.andi %280, %146 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3402546Z         %282 = tt.load %276, %281, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3402844Z         %283 = arith.shli %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3403084Z         %284 = arith.shrsi %283, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3403347Z         %285 = arith.shrsi %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3403636Z         %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3403988Z         %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3404270Z         %288 = tt.broadcast %286 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3404506Z         %289 = arith.select %14, %288, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3404738Z         %290 = tt.broadcast %287 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3404970Z         %291 = arith.select %16, %290, %289 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3405196Z         %292 = tt.reshape %291 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3405423Z         %293 = arith.sitofp %292 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3405674Z         %294 = ttg.local_alloc %293 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3405996Z         %295 = ttg.local_load %294 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3406465Z         %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3406812Z         %297 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:42.3406936Z         %298 = arith.muli %297, %c2_i32 : i32
2026-02-21T09:48:42.3407130Z         %299 = tt.splat %298 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3407354Z         %300 = arith.addi %299, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3407650Z         %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3407926Z         %302 = tt.broadcast %301 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3408122Z         %303 = arith.addi %137, %302 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3408328Z         %304 = tt.addptr %4, %303 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3408533Z         %305 = tt.load %304 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3408759Z         %306 = ttg.local_alloc %305 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3409090Z         %307 = ttg.local_load %306 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3409496Z         %308 = arith.extf %307 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3409782Z         %309 = arith.extsi %297 : i32 to i64
2026-02-21T09:48:42.3409988Z         %310 = tt.splat %309 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3410287Z         %311 = arith.addi %310, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3410678Z         %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3411038Z         %313 = arith.muli %312, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3411367Z         %314 = tt.broadcast %313 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3411672Z         %315 = arith.addi %314, %142 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3411995Z         %316 = tt.addptr %5, %315 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3412310Z         %317 = arith.cmpi sge, %312, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3412553Z         %318 = arith.cmpi slt, %312, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3412790Z         %319 = arith.andi %317, %318 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3413091Z         %320 = tt.broadcast %319 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3413386Z         %321 = arith.andi %320, %146 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3413630Z         %322 = tt.load %316, %321, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3413873Z         %323 = arith.shli %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3414106Z         %324 = arith.shrsi %323, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3414341Z         %325 = arith.shrsi %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3414626Z         %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3414979Z         %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3415260Z         %328 = tt.broadcast %326 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3415512Z         %329 = arith.select %14, %328, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3415749Z         %330 = tt.broadcast %327 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3415977Z         %331 = arith.select %16, %330, %329 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3416206Z         %332 = tt.reshape %331 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3416425Z         %333 = arith.sitofp %332 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3416676Z         %334 = ttg.local_alloc %333 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3417003Z         %335 = ttg.local_load %334 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3417468Z         %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3417814Z         %337 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:48:42.3417938Z         %338 = arith.muli %337, %c2_i32 : i32
2026-02-21T09:48:42.3418110Z         %339 = tt.splat %338 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3418336Z         %340 = arith.addi %339, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3418610Z         %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3418890Z         %342 = tt.broadcast %341 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3419108Z         %343 = arith.addi %137, %342 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3419319Z         %344 = tt.addptr %4, %343 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3419530Z         %345 = tt.load %344 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3419773Z         %346 = ttg.local_alloc %345 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3420105Z         %347 = ttg.local_load %346 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3420515Z         %348 = arith.extf %347 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3420799Z         %349 = arith.extsi %337 : i32 to i64
2026-02-21T09:48:42.3421012Z         %350 = tt.splat %349 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3421307Z         %351 = arith.addi %350, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3421696Z         %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3422051Z         %353 = arith.muli %352, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3422356Z         %354 = tt.broadcast %353 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3422660Z         %355 = arith.addi %354, %142 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3422988Z         %356 = tt.addptr %5, %355 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3423303Z         %357 = arith.cmpi sge, %352, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3423567Z         %358 = arith.cmpi slt, %352, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3423803Z         %359 = arith.andi %357, %358 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3424105Z         %360 = tt.broadcast %359 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3424403Z         %361 = arith.andi %360, %146 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3424648Z         %362 = tt.load %356, %361, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3424896Z         %363 = arith.shli %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3425127Z         %364 = arith.shrsi %363, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3425363Z         %365 = arith.shrsi %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3425650Z         %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3425983Z         %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3426267Z         %368 = tt.broadcast %366 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3426503Z         %369 = arith.select %14, %368, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3426741Z         %370 = tt.broadcast %367 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3426987Z         %371 = arith.select %16, %370, %369 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3427215Z         %372 = tt.reshape %371 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3427439Z         %373 = arith.sitofp %372 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3427705Z         %374 = ttg.local_alloc %373 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3428027Z         %375 = ttg.local_load %374 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3428492Z         %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3428838Z         scf.yield %376 : tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3428974Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:42.3429144Z       %148 = arith.truncf %147 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma>
2026-02-21T09:48:42.3429318Z       %149 = arith.extsi %131 : i32 to i64
2026-02-21T09:48:42.3429487Z       %150 = tt.splat %149 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3429702Z       %151 = arith.addi %150, %18 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3429970Z       %152 = tt.expand_dims %151 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3430210Z       %153 = arith.muli %152, %cst_13 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3430395Z       %154 = tt.broadcast %153 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3430605Z       %155 = tt.splat %138 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3430834Z       %156 = arith.addi %155, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3431095Z       %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3431370Z       %158 = tt.broadcast %157 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3431555Z       %159 = arith.addi %154, %158 : tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3431744Z       %160 = tt.addptr %17, %159 : tensor<256x32x!tt.ptr<bf16>, #mma>, tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3431947Z       %161 = arith.cmpi sge, %152, %cst_14 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3432119Z       %162 = arith.cmpi slt, %152, %cst_15 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3432276Z       %163 = arith.andi %161, %162 : tensor<256x1xi1, #mma>
2026-02-21T09:48:42.3432452Z       %164 = tt.broadcast %163 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3432638Z       %165 = arith.cmpi sge, %157, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3432804Z       %166 = arith.cmpi slt, %157, %cst_12 : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3432965Z       %167 = arith.andi %165, %166 : tensor<1x32xi1, #mma>
2026-02-21T09:48:42.3433133Z       %168 = tt.broadcast %167 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3433311Z       %169 = arith.andi %164, %168 : tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3433471Z       tt.store %160, %148, %169 : tensor<256x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:42.3433626Z       %170 = arith.addi %arg3, %c1824_i32 : i32
2026-02-21T09:48:42.3433751Z       %171 = arith.divsi %170, %c1024_i32 : i32
2026-02-21T09:48:42.3433874Z       %172 = arith.muli %171, %c4_i32 : i32
2026-02-21T09:48:42.3433995Z       %173 = arith.subi %c64_i32, %172 : i32
2026-02-21T09:48:42.3434113Z       %174 = arith.minsi %173, %c4_i32 : i32
2026-02-21T09:48:42.3434234Z       %175 = arith.remsi %170, %c1024_i32 : i32
2026-02-21T09:48:42.3434354Z       %176 = arith.remsi %175, %174 : i32
2026-02-21T09:48:42.3434493Z       %177 = arith.addi %172, %176 : i32
2026-02-21T09:48:42.3434609Z       %178 = arith.divsi %175, %174 : i32
2026-02-21T09:48:42.3434730Z       %179 = arith.muli %177, %c256_i32 : i32
2026-02-21T09:48:42.3434903Z       %180 = tt.splat %179 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:42.3435146Z       %181 = arith.addi %180, %1 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:42.3435321Z       %182 = arith.muli %178, %c32_i32 : i32
2026-02-21T09:48:42.3435545Z       %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1>
2026-02-21T09:48:42.3435803Z       %184 = arith.muli %183, %cst_4 : tensor<256x1xi32, #blocked1>
2026-02-21T09:48:42.3435999Z       %185 = tt.broadcast %184 : tensor<256x1xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3436179Z       %186 = arith.extsi %182 : i32 to i64
2026-02-21T09:48:42.3436388Z       %187 = tt.splat %186 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3436686Z       %188 = arith.addi %187, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3437077Z       %189 = tt.expand_dims %188 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3437502Z       %190 = tt.broadcast %189 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3437814Z       %191 = arith.cmpi sge, %189, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3438058Z       %192 = arith.cmpi slt, %189, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3438308Z       %193 = arith.andi %191, %192 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3438608Z       %194 = tt.broadcast %193 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3438966Z       %195 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg5 = %cst) -> (tensor<256x32xf32, #mma>)  : i32 {
2026-02-21T09:48:42.3439181Z         %218 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:42.3439354Z         %219 = tt.splat %218 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3439575Z         %220 = arith.addi %219, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3439848Z         %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3440121Z         %222 = tt.broadcast %221 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3440319Z         %223 = arith.addi %185, %222 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3440521Z         %224 = tt.addptr %4, %223 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3440726Z         %225 = tt.load %224 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3440948Z         %226 = ttg.local_alloc %225 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3441275Z         %227 = ttg.local_load %226 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3441678Z         %228 = arith.extf %227 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3441959Z         %229 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:42.3442185Z         %230 = tt.splat %229 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3442482Z         %231 = arith.addi %230, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3442898Z         %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3443265Z         %233 = arith.muli %232, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3443567Z         %234 = tt.broadcast %233 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3443868Z         %235 = arith.addi %234, %190 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3444176Z         %236 = tt.addptr %5, %235 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3444492Z         %237 = arith.cmpi sge, %232, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3444734Z         %238 = arith.cmpi slt, %232, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3444968Z         %239 = arith.andi %237, %238 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3445264Z         %240 = tt.broadcast %239 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3445560Z         %241 = arith.andi %240, %194 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3445803Z         %242 = tt.load %236, %241, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3446063Z         %243 = arith.shli %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3446296Z         %244 = arith.shrsi %243, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3446548Z         %245 = arith.shrsi %242, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3446832Z         %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3447164Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3447442Z         %248 = tt.broadcast %246 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3447681Z         %249 = arith.select %14, %248, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3447915Z         %250 = tt.broadcast %247 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3448142Z         %251 = arith.select %16, %250, %249 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3448368Z         %252 = tt.reshape %251 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3448589Z         %253 = arith.sitofp %252 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3448839Z         %254 = ttg.local_alloc %253 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3449159Z         %255 = ttg.local_load %254 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3449627Z         %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3449974Z         %257 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:48:42.3450112Z         %258 = arith.muli %257, %c2_i32 : i32
2026-02-21T09:48:42.3450281Z         %259 = tt.splat %258 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3450503Z         %260 = arith.addi %259, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3450800Z         %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3451075Z         %262 = tt.broadcast %261 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3451271Z         %263 = arith.addi %185, %262 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3451472Z         %264 = tt.addptr %4, %263 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3451676Z         %265 = tt.load %264 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3451897Z         %266 = ttg.local_alloc %265 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3452224Z         %267 = ttg.local_load %266 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3452628Z         %268 = arith.extf %267 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3452909Z         %269 = arith.extsi %257 : i32 to i64
2026-02-21T09:48:42.3453116Z         %270 = tt.splat %269 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3453411Z         %271 = arith.addi %270, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3453814Z         %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3454164Z         %273 = arith.muli %272, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3454481Z         %274 = tt.broadcast %273 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3454783Z         %275 = arith.addi %274, %190 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3455087Z         %276 = tt.addptr %5, %275 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3455397Z         %277 = arith.cmpi sge, %272, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3455639Z         %278 = arith.cmpi slt, %272, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3455872Z         %279 = arith.andi %277, %278 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3456168Z         %280 = tt.broadcast %279 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3456464Z         %281 = arith.andi %280, %194 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3456705Z         %282 = tt.load %276, %281, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3456947Z         %283 = arith.shli %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3457175Z         %284 = arith.shrsi %283, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3457408Z         %285 = arith.shrsi %282, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3457690Z         %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3458037Z         %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3458317Z         %288 = tt.broadcast %286 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3458566Z         %289 = arith.select %14, %288, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3458797Z         %290 = tt.broadcast %287 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3459024Z         %291 = arith.select %16, %290, %289 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3459246Z         %292 = tt.reshape %291 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3459465Z         %293 = arith.sitofp %292 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3459713Z         %294 = ttg.local_alloc %293 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3460032Z         %295 = ttg.local_load %294 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3460496Z         %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3460839Z         %297 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:42.3460961Z         %298 = arith.muli %297, %c2_i32 : i32
2026-02-21T09:48:42.3461127Z         %299 = tt.splat %298 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3461346Z         %300 = arith.addi %299, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3461634Z         %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3461909Z         %302 = tt.broadcast %301 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3462124Z         %303 = arith.addi %185, %302 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3462322Z         %304 = tt.addptr %4, %303 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3462525Z         %305 = tt.load %304 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3462746Z         %306 = ttg.local_alloc %305 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3463073Z         %307 = ttg.local_load %306 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3463477Z         %308 = arith.extf %307 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3463754Z         %309 = arith.extsi %297 : i32 to i64
2026-02-21T09:48:42.3463960Z         %310 = tt.splat %309 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3464258Z         %311 = arith.addi %310, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3464641Z         %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3464993Z         %313 = arith.muli %312, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3465294Z         %314 = tt.broadcast %313 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3465613Z         %315 = arith.addi %314, %190 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3465919Z         %316 = tt.addptr %5, %315 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3466228Z         %317 = arith.cmpi sge, %312, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3466484Z         %318 = arith.cmpi slt, %312, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3466717Z         %319 = arith.andi %317, %318 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3467010Z         %320 = tt.broadcast %319 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3467304Z         %321 = arith.andi %320, %194 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3467543Z         %322 = tt.load %316, %321, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3467785Z         %323 = arith.shli %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3468015Z         %324 = arith.shrsi %323, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3468247Z         %325 = arith.shrsi %322, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3468530Z         %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3468860Z         %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3469138Z         %328 = tt.broadcast %326 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3469389Z         %329 = arith.select %14, %328, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3469620Z         %330 = tt.broadcast %327 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3469867Z         %331 = arith.select %16, %330, %329 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3470090Z         %332 = tt.reshape %331 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3470307Z         %333 = arith.sitofp %332 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3470554Z         %334 = ttg.local_alloc %333 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3470869Z         %335 = ttg.local_load %334 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3471336Z         %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3471675Z         %337 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:48:42.3471798Z         %338 = arith.muli %337, %c2_i32 : i32
2026-02-21T09:48:42.3471970Z         %339 = tt.splat %338 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3472191Z         %340 = arith.addi %339, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3472465Z         %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3472740Z         %342 = tt.broadcast %341 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3472935Z         %343 = arith.addi %185, %342 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3473136Z         %344 = tt.addptr %4, %343 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3473356Z         %345 = tt.load %344 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3473577Z         %346 = ttg.local_alloc %345 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3473903Z         %347 = ttg.local_load %346 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3474320Z         %348 = arith.extf %347 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3474597Z         %349 = arith.extsi %337 : i32 to i64
2026-02-21T09:48:42.3474803Z         %350 = tt.splat %349 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3475097Z         %351 = arith.addi %350, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3475483Z         %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3475832Z         %353 = arith.muli %352, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3476137Z         %354 = tt.broadcast %353 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3476437Z         %355 = arith.addi %354, %190 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3476740Z         %356 = tt.addptr %5, %355 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3477054Z         %357 = arith.cmpi sge, %352, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3477313Z         %358 = arith.cmpi slt, %352, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3477544Z         %359 = arith.andi %357, %358 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3477858Z         %360 = tt.broadcast %359 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3478151Z         %361 = arith.andi %360, %194 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3478391Z         %362 = tt.load %356, %361, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3478631Z         %363 = arith.shli %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3478861Z         %364 = arith.shrsi %363, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3479094Z         %365 = arith.shrsi %362, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3479379Z         %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3479713Z         %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3479992Z         %368 = tt.broadcast %366 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3480227Z         %369 = arith.select %14, %368, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3480461Z         %370 = tt.broadcast %367 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3480687Z         %371 = arith.select %16, %370, %369 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3480912Z         %372 = tt.reshape %371 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3481154Z         %373 = arith.sitofp %372 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3481400Z         %374 = ttg.local_alloc %373 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3481717Z         %375 = ttg.local_load %374 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3482191Z         %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3482536Z         scf.yield %376 : tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3482692Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:42.3482859Z       %196 = arith.truncf %195 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma>
2026-02-21T09:48:42.3483031Z       %197 = arith.extsi %179 : i32 to i64
2026-02-21T09:48:42.3483193Z       %198 = tt.splat %197 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3483402Z       %199 = arith.addi %198, %18 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3483669Z       %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3483910Z       %201 = arith.muli %200, %cst_13 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3484090Z       %202 = tt.broadcast %201 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3484291Z       %203 = tt.splat %186 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3484495Z       %204 = arith.addi %203, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3484749Z       %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3485021Z       %206 = tt.broadcast %205 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3485200Z       %207 = arith.addi %202, %206 : tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3485407Z       %208 = tt.addptr %17, %207 : tensor<256x32x!tt.ptr<bf16>, #mma>, tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3485608Z       %209 = arith.cmpi sge, %200, %cst_14 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3485773Z       %210 = arith.cmpi slt, %200, %cst_15 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3485929Z       %211 = arith.andi %209, %210 : tensor<256x1xi1, #mma>
2026-02-21T09:48:42.3486100Z       %212 = tt.broadcast %211 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3486281Z       %213 = arith.cmpi sge, %205, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3486444Z       %214 = arith.cmpi slt, %205, %cst_12 : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3486596Z       %215 = arith.andi %213, %214 : tensor<1x32xi1, #mma>
2026-02-21T09:48:42.3486764Z       %216 = tt.broadcast %215 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3486937Z       %217 = arith.andi %212, %216 : tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3487096Z       tt.store %208, %196, %217 : tensor<256x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:42.3487262Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:48:42.3487421Z     scf.for %arg3 = %26 to %c16384_i32 step %c608_i32  : i32 {
2026-02-21T09:48:42.3487565Z       %27 = arith.divsi %arg3, %c1024_i32 : i32
2026-02-21T09:48:42.3487684Z       %28 = arith.muli %27, %c4_i32 : i32
2026-02-21T09:48:42.3487800Z       %29 = arith.subi %c64_i32, %28 : i32
2026-02-21T09:48:42.3487913Z       %30 = arith.minsi %29, %c4_i32 : i32
2026-02-21T09:48:42.3488031Z       %31 = arith.remsi %arg3, %c1024_i32 : i32
2026-02-21T09:48:42.3488147Z       %32 = arith.remsi %31, %30 : i32
2026-02-21T09:48:42.3488256Z       %33 = arith.addi %28, %32 : i32
2026-02-21T09:48:42.3488364Z       %34 = arith.divsi %31, %30 : i32
2026-02-21T09:48:42.3488495Z       %35 = arith.muli %33, %c256_i32 : i32
2026-02-21T09:48:42.3488663Z       %36 = tt.splat %35 : i32 -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:42.3488884Z       %37 = arith.addi %36, %1 : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:42.3489054Z       %38 = arith.muli %34, %c32_i32 : i32
2026-02-21T09:48:42.3489292Z       %39 = tt.expand_dims %37 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<256x1xi32, #blocked1>
2026-02-21T09:48:42.3489540Z       %40 = arith.muli %39, %cst_4 : tensor<256x1xi32, #blocked1>
2026-02-21T09:48:42.3489735Z       %41 = tt.broadcast %40 : tensor<256x1xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3489905Z       %42 = arith.extsi %38 : i32 to i64
2026-02-21T09:48:42.3490107Z       %43 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3490397Z       %44 = arith.addi %43, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3490777Z       %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3491197Z       %46 = tt.broadcast %45 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3491498Z       %47 = arith.cmpi sge, %45, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3491736Z       %48 = arith.cmpi slt, %45, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3491961Z       %49 = arith.andi %47, %48 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3492263Z       %50 = tt.broadcast %49 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3492595Z       %51 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c8_i32 iter_args(%arg5 = %cst) -> (tensor<256x32xf32, #mma>)  : i32 {
2026-02-21T09:48:42.3492822Z         %74 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:42.3492990Z         %75 = tt.splat %74 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3493207Z         %76 = arith.addi %75, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3493474Z         %77 = tt.expand_dims %76 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3493745Z         %78 = tt.broadcast %77 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3493937Z         %79 = arith.addi %41, %78 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3494133Z         %80 = tt.addptr %4, %79 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3494336Z         %81 = tt.load %80 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3494551Z         %82 = ttg.local_alloc %81 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3494876Z         %83 = ttg.local_load %82 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3495281Z         %84 = arith.extf %83 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3495562Z         %85 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:42.3495769Z         %86 = tt.splat %85 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3496060Z         %87 = arith.addi %86, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3496459Z         %88 = tt.expand_dims %87 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3496804Z         %89 = arith.muli %88, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3497117Z         %90 = tt.broadcast %89 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3497413Z         %91 = arith.addi %90, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3497708Z         %92 = tt.addptr %5, %91 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3498014Z         %93 = arith.cmpi sge, %88, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3498250Z         %94 = arith.cmpi slt, %88, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3498473Z         %95 = arith.andi %93, %94 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3498763Z         %96 = tt.broadcast %95 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3499051Z         %97 = arith.andi %96, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3499283Z         %98 = tt.load %92, %97, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3499519Z         %99 = arith.shli %98, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3499747Z         %100 = arith.shrsi %99, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3499994Z         %101 = arith.shrsi %98, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3500282Z         %102 = tt.expand_dims %100 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3500642Z         %103 = tt.expand_dims %101 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3500926Z         %104 = tt.broadcast %102 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3501167Z         %105 = arith.select %14, %104, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3501408Z         %106 = tt.broadcast %103 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3501642Z         %107 = arith.select %16, %106, %105 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3501878Z         %108 = tt.reshape %107 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3502108Z         %109 = arith.sitofp %108 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3502359Z         %110 = ttg.local_alloc %109 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3502687Z         %111 = ttg.local_load %110 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3503158Z         %112 = tt.dot %84, %111, %arg5, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3503506Z         %113 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:48:42.3503634Z         %114 = arith.muli %113, %c2_i32 : i32
2026-02-21T09:48:42.3503806Z         %115 = tt.splat %114 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3504034Z         %116 = arith.addi %115, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3504336Z         %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3504617Z         %118 = tt.broadcast %117 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3504831Z         %119 = arith.addi %41, %118 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3505034Z         %120 = tt.addptr %4, %119 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3505242Z         %121 = tt.load %120 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3505469Z         %122 = ttg.local_alloc %121 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3505801Z         %123 = ttg.local_load %122 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3506208Z         %124 = arith.extf %123 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3506493Z         %125 = arith.extsi %113 : i32 to i64
2026-02-21T09:48:42.3506705Z         %126 = tt.splat %125 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3507003Z         %127 = arith.addi %126, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3507387Z         %128 = tt.expand_dims %127 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3507746Z         %129 = arith.muli %128, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3508066Z         %130 = tt.broadcast %129 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3508370Z         %131 = arith.addi %130, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3508705Z         %132 = tt.addptr %5, %131 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3509019Z         %133 = arith.cmpi sge, %128, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3509270Z         %134 = arith.cmpi slt, %128, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3509504Z         %135 = arith.andi %133, %134 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3509805Z         %136 = tt.broadcast %135 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3510107Z         %137 = arith.andi %136, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3510349Z         %138 = tt.load %132, %137, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3510599Z         %139 = arith.shli %138, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3510837Z         %140 = arith.shrsi %139, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3511077Z         %141 = arith.shrsi %138, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3511369Z         %142 = tt.expand_dims %140 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3511702Z         %143 = tt.expand_dims %141 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3511984Z         %144 = tt.broadcast %142 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3512241Z         %145 = arith.select %14, %144, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3512475Z         %146 = tt.broadcast %143 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3512706Z         %147 = arith.select %16, %146, %145 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3512951Z         %148 = tt.reshape %147 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3513175Z         %149 = arith.sitofp %148 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3513422Z         %150 = ttg.local_alloc %149 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3513740Z         %151 = ttg.local_load %150 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3514207Z         %152 = tt.dot %124, %151, %112, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3514553Z         %153 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:42.3514676Z         %154 = arith.muli %153, %c2_i32 : i32
2026-02-21T09:48:42.3514847Z         %155 = tt.splat %154 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3515067Z         %156 = arith.addi %155, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3515340Z         %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3515617Z         %158 = tt.broadcast %157 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3515821Z         %159 = arith.addi %41, %158 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3516047Z         %160 = tt.addptr %4, %159 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3516252Z         %161 = tt.load %160 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3516399Z         %162 = ttg.local_alloc %161 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3516571Z         %163 = ttg.local_load %162 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3516770Z         %164 = arith.extf %163 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3516817Z         %165 = arith.extsi %153 : i32 to i64
2026-02-21T09:48:42.3516946Z         %166 = tt.splat %165 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3517073Z         %167 = arith.addi %166, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3517292Z         %168 = tt.expand_dims %167 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3517392Z         %169 = arith.muli %168, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3517559Z         %170 = tt.broadcast %169 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3517653Z         %171 = arith.addi %170, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3517824Z         %172 = tt.addptr %5, %171 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3517942Z         %173 = arith.cmpi sge, %168, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3518047Z         %174 = arith.cmpi slt, %168, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3518142Z         %175 = arith.andi %173, %174 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3518321Z         %176 = tt.broadcast %175 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3518416Z         %177 = arith.andi %176, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3518524Z         %178 = tt.load %172, %177, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3518619Z         %179 = arith.shli %178, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3518718Z         %180 = arith.shrsi %179, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3518814Z         %181 = arith.shrsi %178, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3518961Z         %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3519109Z         %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3519202Z         %184 = tt.broadcast %182 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3519304Z         %185 = arith.select %14, %184, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3519396Z         %186 = tt.broadcast %183 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3519493Z         %187 = arith.select %16, %186, %185 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3519595Z         %188 = tt.reshape %187 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3519689Z         %189 = arith.sitofp %188 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3519824Z         %190 = ttg.local_alloc %189 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3519990Z         %191 = ttg.local_load %190 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3520253Z         %192 = tt.dot %164, %191, %152, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3520299Z         %193 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:48:42.3520344Z         %194 = arith.muli %193, %c2_i32 : i32
2026-02-21T09:48:42.3520438Z         %195 = tt.splat %194 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3520529Z         %196 = arith.addi %195, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:42.3520671Z         %197 = tt.expand_dims %196 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:48:42.3520769Z         %198 = tt.broadcast %197 : tensor<1x4xi32, #blocked1> -> tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3520832Z         %199 = arith.addi %41, %198 : tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3520935Z         %200 = tt.addptr %4, %199 : tensor<256x4x!tt.ptr<bf16>, #blocked1>, tensor<256x4xi32, #blocked1>
2026-02-21T09:48:42.3520999Z         %201 = tt.load %200 : tensor<256x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:42.3521124Z         %202 = ttg.local_alloc %201 : (tensor<256x4xbf16, #blocked1>) -> !ttg.memdesc<256x4xbf16, #shared, #smem>
2026-02-21T09:48:42.3521305Z         %203 = ttg.local_load %202 : !ttg.memdesc<256x4xbf16, #shared, #smem> -> tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3521502Z         %204 = arith.extf %203 : tensor<256x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3521565Z         %205 = arith.extsi %193 : i32 to i64
2026-02-21T09:48:42.3521694Z         %206 = tt.splat %205 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3521821Z         %207 = arith.addi %206, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:42.3522039Z         %208 = tt.expand_dims %207 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3522138Z         %209 = arith.muli %208, %cst_8 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3522307Z         %210 = tt.broadcast %209 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3522402Z         %211 = arith.addi %210, %46 : tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3522608Z         %212 = tt.addptr %5, %211 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3522713Z         %213 = arith.cmpi sge, %208, %cst_9 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3522815Z         %214 = arith.cmpi slt, %208, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3522908Z         %215 = arith.andi %213, %214 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3523092Z         %216 = tt.broadcast %215 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3523185Z         %217 = arith.andi %216, %50 : tensor<2x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3523311Z         %218 = tt.load %212, %217, %cst_1 : tensor<2x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3523407Z         %219 = arith.shli %218, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3523506Z         %220 = arith.shrsi %219, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3523601Z         %221 = arith.shrsi %218, %cst_5 : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:42.3523750Z         %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3523899Z         %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x32xi8, #blocked>
2026-02-21T09:48:42.3523993Z         %224 = tt.broadcast %222 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3524094Z         %225 = arith.select %14, %224, %cst_0 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3524190Z         %226 = tt.broadcast %223 : tensor<2x1x32xi8, #blocked> -> tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3524288Z         %227 = arith.select %16, %226, %225 : tensor<2x2x32xi1, #blocked>, tensor<2x2x32xi8, #blocked>
2026-02-21T09:48:42.3524375Z         %228 = tt.reshape %227 : tensor<2x2x32xi8, #blocked> -> tensor<4x32xi8, #blocked2>
2026-02-21T09:48:42.3524464Z         %229 = arith.sitofp %228 : tensor<4x32xi8, #blocked2> to tensor<4x32xf32, #blocked2>
2026-02-21T09:48:42.3524579Z         %230 = ttg.local_alloc %229 : (tensor<4x32xf32, #blocked2>) -> !ttg.memdesc<4x32xf32, #shared1, #smem>
2026-02-21T09:48:42.3524760Z         %231 = ttg.local_load %230 : !ttg.memdesc<4x32xf32, #shared1, #smem> -> tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:42.3525018Z         %232 = tt.dot %204, %231, %192, inputPrecision = tf32 : tensor<256x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3525086Z         scf.yield %232 : tensor<256x32xf32, #mma>
2026-02-21T09:48:42.3525132Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:42.3525219Z       %52 = arith.truncf %51 : tensor<256x32xf32, #mma> to tensor<256x32xbf16, #mma>
2026-02-21T09:48:42.3525261Z       %53 = arith.extsi %35 : i32 to i64
2026-02-21T09:48:42.3525346Z       %54 = tt.splat %53 : i64 -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3525430Z       %55 = arith.addi %54, %18 : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:42.3525573Z       %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3525633Z       %57 = arith.muli %56, %cst_13 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3525718Z       %58 = tt.broadcast %57 : tensor<256x1xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3525798Z       %59 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3525882Z       %60 = arith.addi %59, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:42.3526014Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3526091Z       %62 = tt.broadcast %61 : tensor<1x32xi64, #mma> -> tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3526147Z       %63 = arith.addi %58, %62 : tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3526240Z       %64 = tt.addptr %17, %63 : tensor<256x32x!tt.ptr<bf16>, #mma>, tensor<256x32xi64, #mma>
2026-02-21T09:48:42.3526326Z       %65 = arith.cmpi sge, %56, %cst_14 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3526389Z       %66 = arith.cmpi slt, %56, %cst_15 : tensor<256x1xi64, #mma>
2026-02-21T09:48:42.3526461Z       %67 = arith.andi %65, %66 : tensor<256x1xi1, #mma>
2026-02-21T09:48:42.3526540Z       %68 = tt.broadcast %67 : tensor<256x1xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3526605Z       %69 = arith.cmpi sge, %61, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3526664Z       %70 = arith.cmpi slt, %61, %cst_12 : tensor<1x32xi64, #mma>
2026-02-21T09:48:42.3526719Z       %71 = arith.andi %69, %70 : tensor<1x32xi1, #mma>
2026-02-21T09:48:42.3526796Z       %72 = tt.broadcast %71 : tensor<1x32xi1, #mma> -> tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3526849Z       %73 = arith.andi %68, %72 : tensor<256x32xi1, #mma>
2026-02-21T09:48:42.3526912Z       tt.store %64, %52, %73 : tensor<256x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:42.3526977Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:48:42.3527013Z     tt.return
2026-02-21T09:48:42.3527050Z   }
2026-02-21T09:48:42.3527087Z }
2026-02-21T09:48:42.3527091Z 
2026-02-21T09:48:42.3527122Z {-#
2026-02-21T09:48:42.3527165Z   external_resources: {
2026-02-21T09:48:42.3527206Z     mlir_reproducer: {
2026-02-21T09:48:42.3528143Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:48:42.3528192Z       disable_threading: false,
2026-02-21T09:48:42.3528230Z       verify_each: true
2026-02-21T09:48:42.3528262Z     }
2026-02-21T09:48:42.3528311Z   }
2026-02-21T09:48:42.3528347Z #-}
2026-02-21T09:48:42.3528583Z /tmp/torchinductor_root/vg/cvgiz5oieiqzkpc2xx5tqhdu2sgedy4mj66kjm2f2poq4hiiezf3.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:48:42.3528995Z /tmp/torchinductor_root/vg/cvgiz5oieiqzkpc2xx5tqhdu2sgedy4mj66kjm2f2poq4hiiezf3.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:48:42.3529128Z [253s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:48:42.3529761Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 256, 32], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[4, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:48:42.3529821Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:48:42.3529903Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:48:43.9219616Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:48:43.9234491Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:48:43.9235112Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:48:43.9235896Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:48:43.9236420Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:48:43.9236993Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:48:43.9237420Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:48:43.9237753Z #smem = #ttg.shared_memory
2026-02-21T09:48:43.9238177Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:48:43.9239035Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:48:43.9239740Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9240035Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:48:43.9240261Z     %c608_i32 = arith.constant 608 : i32
2026-02-21T09:48:43.9240471Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:48:43.9240680Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:48:43.9240893Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T09:48:43.9241165Z     %cst_0 = arith.constant dense<0> : tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9241433Z     %c1824_i32 = arith.constant 1824 : i32
2026-02-21T09:48:43.9241646Z     %c1216_i32 = arith.constant 1216 : i32
2026-02-21T09:48:43.9241857Z     %c12_i32 = arith.constant 12 : i32
2026-02-21T09:48:43.9242059Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:48:43.9242263Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T09:48:43.9242467Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T09:48:43.9242798Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T09:48:43.9243011Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:48:43.9243223Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:48:43.9243543Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:48:43.9243878Z     %cst_1 = arith.constant dense<0> : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9244350Z     %cst_2 = arith.constant dense<8192> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9244881Z     %cst_3 = arith.constant dense<0> : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9245226Z     %c33375_i32 = arith.constant 33375 : i32
2026-02-21T09:48:43.9245498Z     %cst_4 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:48:43.9245884Z     %cst_5 = arith.constant dense<4> : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9246269Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:43.9246606Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:43.9246981Z     %cst_8 = arith.constant dense<8192> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9247314Z     %cst_9 = arith.constant dense<0> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9247658Z     %cst_10 = arith.constant dense<512> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9247939Z     %cst_11 = arith.constant dense<0> : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9248172Z     %cst_12 = arith.constant dense<8192> : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9248395Z     %cst_13 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9248624Z     %cst_14 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9248846Z     %cst_15 = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9249046Z     %0 = tt.get_program_id x : i32
2026-02-21T09:48:43.9249308Z     %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:43.9249713Z     %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9250076Z     %3 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9250427Z     %4 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9250745Z     %5 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9251154Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9251721Z     %7 = arith.extsi %6 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9252295Z     %8 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9252868Z     %9 = arith.extsi %8 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9253447Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:48:43.9254006Z     %11 = tt.expand_dims %10 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:48:43.9254537Z     %12 = tt.expand_dims %11 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:43.9254884Z     %13 = arith.cmpi eq, %12, %cst_6 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:43.9255173Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked>
2026-02-21T09:48:43.9255434Z     %15 = arith.cmpi eq, %12, %cst_7 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:48:43.9255688Z     %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x32xi1, #blocked>
2026-02-21T09:48:43.9255963Z     %17 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:43.9256352Z     %18 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9256754Z     %19 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9257158Z     %20 = arith.extsi %19 : tensor<32xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9257406Z     %21 = arith.subi %c33375_i32, %0 : i32
2026-02-21T09:48:43.9257535Z     %22 = arith.divui %21, %c608_i32 : i32
2026-02-21T09:48:43.9257663Z     %23 = arith.remsi %22, %c4_i32 : i32
2026-02-21T09:48:43.9257785Z     %24 = arith.subi %22, %23 : i32
2026-02-21T09:48:43.9257910Z     %25 = arith.muli %24, %c608_i32 : i32
2026-02-21T09:48:43.9258035Z     %26 = arith.addi %0, %25 : i32
2026-02-21T09:48:43.9258170Z     scf.for %arg3 = %0 to %26 step %c2432_i32  : i32 {
2026-02-21T09:48:43.9258324Z       %27 = arith.divsi %arg3, %c1024_i32 : i32
2026-02-21T09:48:43.9258454Z       %28 = arith.muli %27, %c4_i32 : i32
2026-02-21T09:48:43.9258582Z       %29 = arith.subi %c128_i32, %28 : i32
2026-02-21T09:48:43.9258703Z       %30 = arith.minsi %29, %c4_i32 : i32
2026-02-21T09:48:43.9258835Z       %31 = arith.remsi %arg3, %c1024_i32 : i32
2026-02-21T09:48:43.9258966Z       %32 = arith.remsi %31, %30 : i32
2026-02-21T09:48:43.9259087Z       %33 = arith.addi %28, %32 : i32
2026-02-21T09:48:43.9259206Z       %34 = arith.divsi %31, %30 : i32
2026-02-21T09:48:43.9259328Z       %35 = arith.muli %33, %c128_i32 : i32
2026-02-21T09:48:43.9259527Z       %36 = tt.splat %35 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:43.9259767Z       %37 = arith.addi %36, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:43.9259971Z       %38 = arith.muli %34, %c32_i32 : i32
2026-02-21T09:48:43.9260216Z       %39 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:48:43.9260483Z       %40 = arith.muli %39, %cst_4 : tensor<128x1xi32, #blocked1>
2026-02-21T09:48:43.9260691Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9260878Z       %42 = arith.extsi %38 : i32 to i64
2026-02-21T09:48:43.9261099Z       %43 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9261413Z       %44 = arith.addi %43, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9261824Z       %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9262285Z       %46 = tt.broadcast %45 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9262620Z       %47 = arith.cmpi sge, %45, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9262878Z       %48 = arith.cmpi slt, %45, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9263126Z       %49 = arith.andi %47, %48 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9263442Z       %50 = tt.broadcast %49 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9263824Z       %51 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c16_i32 iter_args(%arg5 = %cst) -> (tensor<128x32xf32, #mma>)  : i32 {
2026-02-21T09:48:43.9264057Z         %218 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:43.9264251Z         %219 = tt.splat %218 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9264507Z         %220 = arith.addi %219, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9264800Z         %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9265099Z         %222 = tt.broadcast %221 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9265308Z         %223 = arith.addi %41, %222 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9265532Z         %224 = tt.addptr %4, %223 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9265758Z         %225 = tt.load %224 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9266001Z         %226 = ttg.local_alloc %225 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9266367Z         %227 = ttg.local_load %226 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9266809Z         %228 = arith.extf %227 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9267096Z         %229 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:43.9267311Z         %230 = tt.splat %229 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9267607Z         %231 = arith.addi %230, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9268018Z         %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9268391Z         %233 = arith.muli %232, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9268696Z         %234 = tt.broadcast %233 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9269001Z         %235 = arith.addi %234, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9269306Z         %236 = tt.addptr %5, %235 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9269625Z         %237 = arith.cmpi sge, %232, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9269875Z         %238 = arith.cmpi slt, %232, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9270111Z         %239 = arith.andi %237, %238 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9270416Z         %240 = tt.broadcast %239 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9270712Z         %241 = arith.andi %240, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9270958Z         %242 = tt.load %236, %241, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9271205Z         %243 = arith.shli %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9271437Z         %244 = arith.shrsi %243, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9271672Z         %245 = arith.shrsi %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9271990Z         %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9272325Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9272622Z         %248 = tt.broadcast %246 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9272860Z         %249 = arith.select %14, %248, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9273098Z         %250 = tt.broadcast %247 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9273331Z         %251 = arith.select %16, %250, %249 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9273559Z         %252 = tt.reshape %251 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9273788Z         %253 = arith.sitofp %252 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9274037Z         %254 = ttg.local_alloc %253 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9274363Z         %255 = ttg.local_load %254 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9274838Z         %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9275188Z         %257 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:43.9275313Z         %258 = arith.muli %257, %c2_i32 : i32
2026-02-21T09:48:43.9275483Z         %259 = tt.splat %258 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9275708Z         %260 = arith.addi %259, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9276000Z         %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9276293Z         %262 = tt.broadcast %261 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9276489Z         %263 = arith.addi %41, %262 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9276689Z         %264 = tt.addptr %4, %263 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9276898Z         %265 = tt.load %264 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9277124Z         %266 = ttg.local_alloc %265 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9277453Z         %267 = ttg.local_load %266 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9277862Z         %268 = arith.extf %267 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9278147Z         %269 = arith.extsi %257 : i32 to i64
2026-02-21T09:48:43.9278359Z         %270 = tt.splat %269 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9278656Z         %271 = arith.addi %270, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9279043Z         %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9279398Z         %273 = arith.muli %272, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9279705Z         %274 = tt.broadcast %273 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9280027Z         %275 = arith.addi %274, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9280336Z         %276 = tt.addptr %5, %275 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9280667Z         %277 = arith.cmpi sge, %272, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9280914Z         %278 = arith.cmpi slt, %272, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9281152Z         %279 = arith.andi %277, %278 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9281453Z         %280 = tt.broadcast %279 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9281754Z         %281 = arith.andi %280, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9281999Z         %282 = tt.load %276, %281, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9282252Z         %283 = arith.shli %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9282489Z         %284 = arith.shrsi %283, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9282773Z         %285 = arith.shrsi %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9283061Z         %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9283391Z         %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9283675Z         %288 = tt.broadcast %286 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9283932Z         %289 = arith.select %14, %288, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9284166Z         %290 = tt.broadcast %287 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9284424Z         %291 = arith.select %16, %290, %289 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9284660Z         %292 = tt.reshape %291 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9284883Z         %293 = arith.sitofp %292 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9285137Z         %294 = ttg.local_alloc %293 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9285463Z         %295 = ttg.local_load %294 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9285939Z         %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9286286Z         %297 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:48:43.9286410Z         %298 = arith.muli %297, %c2_i32 : i32
2026-02-21T09:48:43.9286583Z         %299 = tt.splat %298 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9286804Z         %300 = arith.addi %299, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9287081Z         %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9287364Z         %302 = tt.broadcast %301 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9287558Z         %303 = arith.addi %41, %302 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9287785Z         %304 = tt.addptr %4, %303 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9287992Z         %305 = tt.load %304 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9288219Z         %306 = ttg.local_alloc %305 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9288572Z         %307 = ttg.local_load %306 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9288974Z         %308 = arith.extf %307 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9289258Z         %309 = arith.extsi %297 : i32 to i64
2026-02-21T09:48:43.9289503Z         %310 = tt.splat %309 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9289800Z         %311 = arith.addi %310, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9290186Z         %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9290546Z         %313 = arith.muli %312, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9290856Z         %314 = tt.broadcast %313 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9291159Z         %315 = arith.addi %314, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9291467Z         %316 = tt.addptr %5, %315 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9291806Z         %317 = arith.cmpi sge, %312, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9292056Z         %318 = arith.cmpi slt, %312, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9292308Z         %319 = arith.andi %317, %318 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9292609Z         %320 = tt.broadcast %319 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9292904Z         %321 = arith.andi %320, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9293147Z         %322 = tt.load %316, %321, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9293392Z         %323 = arith.shli %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9293634Z         %324 = arith.shrsi %323, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9301921Z         %325 = arith.shrsi %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9302405Z         %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9302834Z         %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9303122Z         %328 = tt.broadcast %326 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9303393Z         %329 = arith.select %14, %328, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9303780Z         %330 = tt.broadcast %327 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9304100Z         %331 = arith.select %16, %330, %329 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9304388Z         %332 = tt.reshape %331 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9304662Z         %333 = arith.sitofp %332 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9305084Z         %334 = ttg.local_alloc %333 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9305617Z         %335 = ttg.local_load %334 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9306129Z         %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9306580Z         %337 = arith.addi %arg4, %c12_i32 : i32
2026-02-21T09:48:43.9306713Z         %338 = arith.muli %337, %c2_i32 : i32
2026-02-21T09:48:43.9306892Z         %339 = tt.splat %338 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9307275Z         %340 = arith.addi %339, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9307748Z         %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9308207Z         %342 = tt.broadcast %341 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9308524Z         %343 = arith.addi %41, %342 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9308864Z         %344 = tt.addptr %4, %343 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9309207Z         %345 = tt.load %344 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9309578Z         %346 = ttg.local_alloc %345 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9310168Z         %347 = ttg.local_load %346 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9310589Z         %348 = arith.extf %347 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9311019Z         %349 = arith.extsi %337 : i32 to i64
2026-02-21T09:48:43.9311364Z         %350 = tt.splat %349 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9311664Z         %351 = arith.addi %350, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9312179Z         %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9312801Z         %353 = arith.muli %352, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9313337Z         %354 = tt.broadcast %353 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9313842Z         %355 = arith.addi %354, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9314154Z         %356 = tt.addptr %5, %355 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9314469Z         %357 = arith.cmpi sge, %352, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9314716Z         %358 = arith.cmpi slt, %352, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9314953Z         %359 = arith.andi %357, %358 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9315253Z         %360 = tt.broadcast %359 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9315579Z         %361 = arith.andi %360, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9315824Z         %362 = tt.load %356, %361, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9316088Z         %363 = arith.shli %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9316323Z         %364 = arith.shrsi %363, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9316557Z         %365 = arith.shrsi %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9316846Z         %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9317186Z         %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9317470Z         %368 = tt.broadcast %366 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9317709Z         %369 = arith.select %14, %368, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9317943Z         %370 = tt.broadcast %367 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9318215Z         %371 = arith.select %16, %370, %369 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9318446Z         %372 = tt.reshape %371 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9318679Z         %373 = arith.sitofp %372 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9319089Z         %374 = ttg.local_alloc %373 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9319583Z         %375 = ttg.local_load %374 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9320370Z         %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9320741Z         scf.yield %376 : tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9320920Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:43.9321255Z       %52 = arith.truncf %51 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma>
2026-02-21T09:48:43.9321537Z       %53 = arith.extsi %35 : i32 to i64
2026-02-21T09:48:43.9321801Z       %54 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9322146Z       %55 = arith.addi %54, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9322628Z       %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9323027Z       %57 = arith.muli %56, %cst_13 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9323315Z       %58 = tt.broadcast %57 : tensor<128x1xi64, #mma> -> tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9323644Z       %59 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9323911Z       %60 = arith.addi %59, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9324167Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9324481Z       %62 = tt.broadcast %61 : tensor<1x32xi64, #mma> -> tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9324768Z       %63 = arith.addi %58, %62 : tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9325069Z       %64 = tt.addptr %17, %63 : tensor<128x32x!tt.ptr<bf16>, #mma>, tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9325392Z       %65 = arith.cmpi sge, %56, %cst_14 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9325699Z       %66 = arith.cmpi slt, %56, %cst_15 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9325956Z       %67 = arith.andi %65, %66 : tensor<128x1xi1, #mma>
2026-02-21T09:48:43.9326236Z       %68 = tt.broadcast %67 : tensor<128x1xi1, #mma> -> tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9326567Z       %69 = arith.cmpi sge, %61, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9326831Z       %70 = arith.cmpi slt, %61, %cst_12 : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9327078Z       %71 = arith.andi %69, %70 : tensor<1x32xi1, #mma>
2026-02-21T09:48:43.9327344Z       %72 = tt.broadcast %71 : tensor<1x32xi1, #mma> -> tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9327626Z       %73 = arith.andi %68, %72 : tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9327867Z       tt.store %64, %52, %73 : tensor<128x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:43.9328104Z       %74 = arith.addi %arg3, %c608_i32 : i32
2026-02-21T09:48:43.9328261Z       %75 = arith.divsi %74, %c1024_i32 : i32
2026-02-21T09:48:43.9328382Z       %76 = arith.muli %75, %c4_i32 : i32
2026-02-21T09:48:43.9328503Z       %77 = arith.subi %c128_i32, %76 : i32
2026-02-21T09:48:43.9328684Z       %78 = arith.minsi %77, %c4_i32 : i32
2026-02-21T09:48:43.9328842Z       %79 = arith.remsi %74, %c1024_i32 : i32
2026-02-21T09:48:43.9328960Z       %80 = arith.remsi %79, %78 : i32
2026-02-21T09:48:43.9329075Z       %81 = arith.addi %76, %80 : i32
2026-02-21T09:48:43.9329189Z       %82 = arith.divsi %79, %78 : i32
2026-02-21T09:48:43.9329303Z       %83 = arith.muli %81, %c128_i32 : i32
2026-02-21T09:48:43.9329472Z       %84 = tt.splat %83 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:43.9329693Z       %85 = arith.addi %84, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:43.9329866Z       %86 = arith.muli %82, %c32_i32 : i32
2026-02-21T09:48:43.9330112Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:48:43.9330362Z       %88 = arith.muli %87, %cst_4 : tensor<128x1xi32, #blocked1>
2026-02-21T09:48:43.9330574Z       %89 = tt.broadcast %88 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9330752Z       %90 = arith.extsi %86 : i32 to i64
2026-02-21T09:48:43.9330959Z       %91 = tt.splat %90 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9331250Z       %92 = arith.addi %91, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9331679Z       %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9332398Z       %94 = tt.broadcast %93 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9332772Z       %95 = arith.cmpi sge, %93, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9333013Z       %96 = arith.cmpi slt, %93, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9333313Z       %97 = arith.andi %95, %96 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9333800Z       %98 = tt.broadcast %97 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9334358Z       %99 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c16_i32 iter_args(%arg5 = %cst) -> (tensor<128x32xf32, #mma>)  : i32 {
2026-02-21T09:48:43.9334714Z         %218 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:43.9334953Z         %219 = tt.splat %218 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9335228Z         %220 = arith.addi %219, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9335524Z         %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9335850Z         %222 = tt.broadcast %221 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9336185Z         %223 = arith.addi %89, %222 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9336516Z         %224 = tt.addptr %4, %223 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9336860Z         %225 = tt.load %224 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9337128Z         %226 = ttg.local_alloc %225 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9337463Z         %227 = ttg.local_load %226 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9337873Z         %228 = arith.extf %227 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9338161Z         %229 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:43.9338374Z         %230 = tt.splat %229 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9338671Z         %231 = arith.addi %230, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9339061Z         %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9339420Z         %233 = arith.muli %232, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9339755Z         %234 = tt.broadcast %233 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9340060Z         %235 = arith.addi %234, %94 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9340383Z         %236 = tt.addptr %5, %235 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9340703Z         %237 = arith.cmpi sge, %232, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9340954Z         %238 = arith.cmpi slt, %232, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9341193Z         %239 = arith.andi %237, %238 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9341498Z         %240 = tt.broadcast %239 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9341802Z         %241 = arith.andi %240, %98 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9342050Z         %242 = tt.load %236, %241, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9342297Z         %243 = arith.shli %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9342538Z         %244 = arith.shrsi %243, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9342774Z         %245 = arith.shrsi %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9343058Z         %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9343392Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9343676Z         %248 = tt.broadcast %246 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9343929Z         %249 = arith.select %14, %248, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9344166Z         %250 = tt.broadcast %247 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9344410Z         %251 = arith.select %16, %250, %249 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9344641Z         %252 = tt.reshape %251 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9344864Z         %253 = arith.sitofp %252 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9345114Z         %254 = ttg.local_alloc %253 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9345440Z         %255 = ttg.local_load %254 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9345919Z         %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9346269Z         %257 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:43.9346397Z         %258 = arith.muli %257, %c2_i32 : i32
2026-02-21T09:48:43.9346568Z         %259 = tt.splat %258 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9346795Z         %260 = arith.addi %259, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9347073Z         %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9347349Z         %262 = tt.broadcast %261 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9347549Z         %263 = arith.addi %89, %262 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9347769Z         %264 = tt.addptr %4, %263 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9347980Z         %265 = tt.load %264 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9348222Z         %266 = ttg.local_alloc %265 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9348553Z         %267 = ttg.local_load %266 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9348960Z         %268 = arith.extf %267 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9349242Z         %269 = arith.extsi %257 : i32 to i64
2026-02-21T09:48:43.9349453Z         %270 = tt.splat %269 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9349750Z         %271 = arith.addi %270, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9350135Z         %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9350492Z         %273 = arith.muli %272, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9350796Z         %274 = tt.broadcast %273 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9351102Z         %275 = arith.addi %274, %94 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9351416Z         %276 = tt.addptr %5, %275 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9351754Z         %277 = arith.cmpi sge, %272, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9352001Z         %278 = arith.cmpi slt, %272, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9352244Z         %279 = arith.andi %277, %278 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9352555Z         %280 = tt.broadcast %279 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9352851Z         %281 = arith.andi %280, %98 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9353092Z         %282 = tt.load %276, %281, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9353339Z         %283 = arith.shli %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9353575Z         %284 = arith.shrsi %283, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9353810Z         %285 = arith.shrsi %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9354102Z         %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9354440Z         %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9354722Z         %288 = tt.broadcast %286 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9354958Z         %289 = arith.select %14, %288, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9355190Z         %290 = tt.broadcast %287 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9355424Z         %291 = arith.select %16, %290, %289 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9355668Z         %292 = tt.reshape %291 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9355890Z         %293 = arith.sitofp %292 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9356151Z         %294 = ttg.local_alloc %293 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9356476Z         %295 = ttg.local_load %294 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9356946Z         %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9357286Z         %297 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:48:43.9357408Z         %298 = arith.muli %297, %c2_i32 : i32
2026-02-21T09:48:43.9357577Z         %299 = tt.splat %298 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9357801Z         %300 = arith.addi %299, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9358077Z         %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9358352Z         %302 = tt.broadcast %301 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9358547Z         %303 = arith.addi %89, %302 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9358746Z         %304 = tt.addptr %4, %303 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9358952Z         %305 = tt.load %304 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9359174Z         %306 = ttg.local_alloc %305 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9359534Z         %307 = ttg.local_load %306 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9359941Z         %308 = arith.extf %307 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9360233Z         %309 = arith.extsi %297 : i32 to i64
2026-02-21T09:48:43.9360442Z         %310 = tt.splat %309 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9360740Z         %311 = arith.addi %310, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9361124Z         %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9361481Z         %313 = arith.muli %312, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9361786Z         %314 = tt.broadcast %313 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9362086Z         %315 = arith.addi %314, %94 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9362397Z         %316 = tt.addptr %5, %315 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9362774Z         %317 = arith.cmpi sge, %312, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9363019Z         %318 = arith.cmpi slt, %312, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9363253Z         %319 = arith.andi %317, %318 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9363571Z         %320 = tt.broadcast %319 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9363865Z         %321 = arith.andi %320, %98 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9364124Z         %322 = tt.load %316, %321, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9364369Z         %323 = arith.shli %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9364600Z         %324 = arith.shrsi %323, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9364835Z         %325 = arith.shrsi %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9365120Z         %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9365453Z         %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9365735Z         %328 = tt.broadcast %326 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9365971Z         %329 = arith.select %14, %328, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9366203Z         %330 = tt.broadcast %327 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9366432Z         %331 = arith.select %16, %330, %329 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9366658Z         %332 = tt.reshape %331 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9366879Z         %333 = arith.sitofp %332 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9367128Z         %334 = ttg.local_alloc %333 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9367468Z         %335 = ttg.local_load %334 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9367933Z         %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9368303Z         %337 = arith.addi %arg4, %c12_i32 : i32
2026-02-21T09:48:43.9368427Z         %338 = arith.muli %337, %c2_i32 : i32
2026-02-21T09:48:43.9368599Z         %339 = tt.splat %338 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9368822Z         %340 = arith.addi %339, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9369097Z         %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9369374Z         %342 = tt.broadcast %341 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9369568Z         %343 = arith.addi %89, %342 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9369769Z         %344 = tt.addptr %4, %343 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9369974Z         %345 = tt.load %344 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9370199Z         %346 = ttg.local_alloc %345 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9370528Z         %347 = ttg.local_load %346 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9370931Z         %348 = arith.extf %347 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9371210Z         %349 = arith.extsi %337 : i32 to i64
2026-02-21T09:48:43.9371431Z         %350 = tt.splat %349 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9371726Z         %351 = arith.addi %350, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9372123Z         %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9372474Z         %353 = arith.muli %352, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9372777Z         %354 = tt.broadcast %353 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9373075Z         %355 = arith.addi %354, %94 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9373380Z         %356 = tt.addptr %5, %355 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9373695Z         %357 = arith.cmpi sge, %352, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9373939Z         %358 = arith.cmpi slt, %352, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9374175Z         %359 = arith.andi %357, %358 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9374473Z         %360 = tt.broadcast %359 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9374766Z         %361 = arith.andi %360, %98 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9375008Z         %362 = tt.load %356, %361, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9375252Z         %363 = arith.shli %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9375503Z         %364 = arith.shrsi %363, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9375738Z         %365 = arith.shrsi %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9376035Z         %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9376366Z         %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9376644Z         %368 = tt.broadcast %366 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9376879Z         %369 = arith.select %14, %368, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9377114Z         %370 = tt.broadcast %367 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9377344Z         %371 = arith.select %16, %370, %369 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9377571Z         %372 = tt.reshape %371 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9377791Z         %373 = arith.sitofp %372 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9378041Z         %374 = ttg.local_alloc %373 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9378363Z         %375 = ttg.local_load %374 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9378830Z         %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9379192Z         scf.yield %376 : tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9379358Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:43.9379560Z       %100 = arith.truncf %99 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma>
2026-02-21T09:48:43.9379745Z       %101 = arith.extsi %83 : i32 to i64
2026-02-21T09:48:43.9379910Z       %102 = tt.splat %101 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9380123Z       %103 = arith.addi %102, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9380389Z       %104 = tt.expand_dims %103 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9380634Z       %105 = arith.muli %104, %cst_13 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9380817Z       %106 = tt.broadcast %105 : tensor<128x1xi64, #mma> -> tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9381023Z       %107 = tt.splat %90 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9381231Z       %108 = arith.addi %107, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9381494Z       %109 = tt.expand_dims %108 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9381755Z       %110 = tt.broadcast %109 : tensor<1x32xi64, #mma> -> tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9381938Z       %111 = arith.addi %106, %110 : tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9382126Z       %112 = tt.addptr %17, %111 : tensor<128x32x!tt.ptr<bf16>, #mma>, tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9382326Z       %113 = arith.cmpi sge, %104, %cst_14 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9382493Z       %114 = arith.cmpi slt, %104, %cst_15 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9382652Z       %115 = arith.andi %113, %114 : tensor<128x1xi1, #mma>
2026-02-21T09:48:43.9382826Z       %116 = tt.broadcast %115 : tensor<128x1xi1, #mma> -> tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9383029Z       %117 = arith.cmpi sge, %109, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9383195Z       %118 = arith.cmpi slt, %109, %cst_12 : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9383350Z       %119 = arith.andi %117, %118 : tensor<1x32xi1, #mma>
2026-02-21T09:48:43.9383520Z       %120 = tt.broadcast %119 : tensor<1x32xi1, #mma> -> tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9383715Z       %121 = arith.andi %116, %120 : tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9383874Z       tt.store %112, %100, %121 : tensor<128x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:43.9384025Z       %122 = arith.addi %arg3, %c1216_i32 : i32
2026-02-21T09:48:43.9384150Z       %123 = arith.divsi %122, %c1024_i32 : i32
2026-02-21T09:48:43.9384272Z       %124 = arith.muli %123, %c4_i32 : i32
2026-02-21T09:48:43.9384393Z       %125 = arith.subi %c128_i32, %124 : i32
2026-02-21T09:48:43.9384512Z       %126 = arith.minsi %125, %c4_i32 : i32
2026-02-21T09:48:43.9384630Z       %127 = arith.remsi %122, %c1024_i32 : i32
2026-02-21T09:48:43.9384752Z       %128 = arith.remsi %127, %126 : i32
2026-02-21T09:48:43.9384865Z       %129 = arith.addi %124, %128 : i32
2026-02-21T09:48:43.9384978Z       %130 = arith.divsi %127, %126 : i32
2026-02-21T09:48:43.9385096Z       %131 = arith.muli %129, %c128_i32 : i32
2026-02-21T09:48:43.9385267Z       %132 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:43.9385493Z       %133 = arith.addi %132, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:43.9385667Z       %134 = arith.muli %130, %c32_i32 : i32
2026-02-21T09:48:43.9385894Z       %135 = tt.expand_dims %133 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:48:43.9386150Z       %136 = arith.muli %135, %cst_4 : tensor<128x1xi32, #blocked1>
2026-02-21T09:48:43.9386347Z       %137 = tt.broadcast %136 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9386541Z       %138 = arith.extsi %134 : i32 to i64
2026-02-21T09:48:43.9386746Z       %139 = tt.splat %138 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9387060Z       %140 = arith.addi %139, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9387450Z       %141 = tt.expand_dims %140 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9387876Z       %142 = tt.broadcast %141 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9388185Z       %143 = arith.cmpi sge, %141, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9388428Z       %144 = arith.cmpi slt, %141, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9388664Z       %145 = arith.andi %143, %144 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9388964Z       %146 = tt.broadcast %145 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9389306Z       %147 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c16_i32 iter_args(%arg5 = %cst) -> (tensor<128x32xf32, #mma>)  : i32 {
2026-02-21T09:48:43.9389523Z         %218 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:43.9389695Z         %219 = tt.splat %218 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9389916Z         %220 = arith.addi %219, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9390193Z         %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9390487Z         %222 = tt.broadcast %221 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9390683Z         %223 = arith.addi %137, %222 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9390888Z         %224 = tt.addptr %4, %223 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9391095Z         %225 = tt.load %224 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9391338Z         %226 = ttg.local_alloc %225 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9391665Z         %227 = ttg.local_load %226 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9392072Z         %228 = arith.extf %227 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9392351Z         %229 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:43.9392562Z         %230 = tt.splat %229 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9392857Z         %231 = arith.addi %230, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9393240Z         %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9393590Z         %233 = arith.muli %232, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9393896Z         %234 = tt.broadcast %233 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9394199Z         %235 = arith.addi %234, %142 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9394531Z         %236 = tt.addptr %5, %235 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9394842Z         %237 = arith.cmpi sge, %232, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9395101Z         %238 = arith.cmpi slt, %232, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9395337Z         %239 = arith.andi %237, %238 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9395633Z         %240 = tt.broadcast %239 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9395928Z         %241 = arith.andi %240, %146 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9396169Z         %242 = tt.load %236, %241, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9396414Z         %243 = arith.shli %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9396647Z         %244 = arith.shrsi %243, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9396879Z         %245 = arith.shrsi %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9397166Z         %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9397502Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9397781Z         %248 = tt.broadcast %246 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9398015Z         %249 = arith.select %14, %248, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9398248Z         %250 = tt.broadcast %247 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9398496Z         %251 = arith.select %16, %250, %249 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9398724Z         %252 = tt.reshape %251 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9398944Z         %253 = arith.sitofp %252 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9399209Z         %254 = ttg.local_alloc %253 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9399530Z         %255 = ttg.local_load %254 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9400000Z         %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9400355Z         %257 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:43.9400478Z         %258 = arith.muli %257, %c2_i32 : i32
2026-02-21T09:48:43.9400647Z         %259 = tt.splat %258 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9400870Z         %260 = arith.addi %259, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9401148Z         %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9401425Z         %262 = tt.broadcast %261 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9401621Z         %263 = arith.addi %137, %262 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9401824Z         %264 = tt.addptr %4, %263 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9402029Z         %265 = tt.load %264 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9402267Z         %266 = ttg.local_alloc %265 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9402643Z         %267 = ttg.local_load %266 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9403068Z         %268 = arith.extf %267 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9403349Z         %269 = arith.extsi %257 : i32 to i64
2026-02-21T09:48:43.9403557Z         %270 = tt.splat %269 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9403851Z         %271 = arith.addi %270, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9404235Z         %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9404584Z         %273 = arith.muli %272, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9404890Z         %274 = tt.broadcast %273 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9405194Z         %275 = arith.addi %274, %142 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9405499Z         %276 = tt.addptr %5, %275 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9405815Z         %277 = arith.cmpi sge, %272, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9406058Z         %278 = arith.cmpi slt, %272, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9406317Z         %279 = arith.andi %277, %278 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9406617Z         %280 = tt.broadcast %279 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9406914Z         %281 = arith.andi %280, %146 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9407174Z         %282 = tt.load %276, %281, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9407418Z         %283 = arith.shli %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9407651Z         %284 = arith.shrsi %283, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9407886Z         %285 = arith.shrsi %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9408174Z         %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9408509Z         %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9408789Z         %288 = tt.broadcast %286 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9409026Z         %289 = arith.select %14, %288, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9409264Z         %290 = tt.broadcast %287 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9409493Z         %291 = arith.select %16, %290, %289 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9409723Z         %292 = tt.reshape %291 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9409943Z         %293 = arith.sitofp %292 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9410212Z         %294 = ttg.local_alloc %293 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9410538Z         %295 = ttg.local_load %294 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9411018Z         %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9411367Z         %297 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:48:43.9411494Z         %298 = arith.muli %297, %c2_i32 : i32
2026-02-21T09:48:43.9411666Z         %299 = tt.splat %298 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9411892Z         %300 = arith.addi %299, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9412172Z         %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9412451Z         %302 = tt.broadcast %301 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9412649Z         %303 = arith.addi %137, %302 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9412851Z         %304 = tt.addptr %4, %303 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9413061Z         %305 = tt.load %304 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9413285Z         %306 = ttg.local_alloc %305 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9413621Z         %307 = ttg.local_load %306 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9414033Z         %308 = arith.extf %307 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9414333Z         %309 = arith.extsi %297 : i32 to i64
2026-02-21T09:48:43.9414543Z         %310 = tt.splat %309 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9414842Z         %311 = arith.addi %310, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9415242Z         %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9415599Z         %313 = arith.muli %312, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9415904Z         %314 = tt.broadcast %313 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9416212Z         %315 = arith.addi %314, %142 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9416521Z         %316 = tt.addptr %5, %315 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9416838Z         %317 = arith.cmpi sge, %312, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9417088Z         %318 = arith.cmpi slt, %312, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9417326Z         %319 = arith.andi %317, %318 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9417632Z         %320 = tt.broadcast %319 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9417940Z         %321 = arith.andi %320, %146 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9418204Z         %322 = tt.load %316, %321, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9418455Z         %323 = arith.shli %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9418707Z         %324 = arith.shrsi %323, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9418951Z         %325 = arith.shrsi %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9419248Z         %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9419588Z         %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9419877Z         %328 = tt.broadcast %326 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9420117Z         %329 = arith.select %14, %328, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9420360Z         %330 = tt.broadcast %327 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9420600Z         %331 = arith.select %16, %330, %329 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9420831Z         %332 = tt.reshape %331 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9421060Z         %333 = arith.sitofp %332 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9421315Z         %334 = ttg.local_alloc %333 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9421644Z         %335 = ttg.local_load %334 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9422135Z         %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9422484Z         %337 = arith.addi %arg4, %c12_i32 : i32
2026-02-21T09:48:43.9422618Z         %338 = arith.muli %337, %c2_i32 : i32
2026-02-21T09:48:43.9422798Z         %339 = tt.splat %338 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9423039Z         %340 = arith.addi %339, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9423322Z         %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9423600Z         %342 = tt.broadcast %341 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9423803Z         %343 = arith.addi %137, %342 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9424008Z         %344 = tt.addptr %4, %343 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9424223Z         %345 = tt.load %344 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9424453Z         %346 = ttg.local_alloc %345 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9424786Z         %347 = ttg.local_load %346 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9425203Z         %348 = arith.extf %347 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9425494Z         %349 = arith.extsi %337 : i32 to i64
2026-02-21T09:48:43.9425706Z         %350 = tt.splat %349 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9426010Z         %351 = arith.addi %350, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9426460Z         %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9426842Z         %353 = arith.muli %352, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9427158Z         %354 = tt.broadcast %353 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9427466Z         %355 = arith.addi %354, %142 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9427783Z         %356 = tt.addptr %5, %355 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9428106Z         %357 = arith.cmpi sge, %352, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9428353Z         %358 = arith.cmpi slt, %352, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9428596Z         %359 = arith.andi %357, %358 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9428901Z         %360 = tt.broadcast %359 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9429207Z         %361 = arith.andi %360, %146 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9429459Z         %362 = tt.load %356, %361, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9429708Z         %363 = arith.shli %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9429948Z         %364 = arith.shrsi %363, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9430187Z         %365 = arith.shrsi %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9430500Z         %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9430842Z         %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9431139Z         %368 = tt.broadcast %366 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9431380Z         %369 = arith.select %14, %368, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9431617Z         %370 = tt.broadcast %367 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9431854Z         %371 = arith.select %16, %370, %369 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9432088Z         %372 = tt.reshape %371 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9432313Z         %373 = arith.sitofp %372 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9432570Z         %374 = ttg.local_alloc %373 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9432895Z         %375 = ttg.local_load %374 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9433366Z         %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9433721Z         scf.yield %376 : tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9433892Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:43.9434104Z       %148 = arith.truncf %147 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma>
2026-02-21T09:48:43.9434301Z       %149 = arith.extsi %131 : i32 to i64
2026-02-21T09:48:43.9434471Z       %150 = tt.splat %149 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9434706Z       %151 = arith.addi %150, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9434978Z       %152 = tt.expand_dims %151 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9435230Z       %153 = arith.muli %152, %cst_13 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9435415Z       %154 = tt.broadcast %153 : tensor<128x1xi64, #mma> -> tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9435632Z       %155 = tt.splat %138 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9435852Z       %156 = arith.addi %155, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9436116Z       %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9436381Z       %158 = tt.broadcast %157 : tensor<1x32xi64, #mma> -> tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9436566Z       %159 = arith.addi %154, %158 : tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9436765Z       %160 = tt.addptr %17, %159 : tensor<128x32x!tt.ptr<bf16>, #mma>, tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9436977Z       %161 = arith.cmpi sge, %152, %cst_14 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9437146Z       %162 = arith.cmpi slt, %152, %cst_15 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9437312Z       %163 = arith.andi %161, %162 : tensor<128x1xi1, #mma>
2026-02-21T09:48:43.9437492Z       %164 = tt.broadcast %163 : tensor<128x1xi1, #mma> -> tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9437684Z       %165 = arith.cmpi sge, %157, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9437853Z       %166 = arith.cmpi slt, %157, %cst_12 : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9438020Z       %167 = arith.andi %165, %166 : tensor<1x32xi1, #mma>
2026-02-21T09:48:43.9438217Z       %168 = tt.broadcast %167 : tensor<1x32xi1, #mma> -> tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9438398Z       %169 = arith.andi %164, %168 : tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9438566Z       tt.store %160, %148, %169 : tensor<128x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:43.9438736Z       %170 = arith.addi %arg3, %c1824_i32 : i32
2026-02-21T09:48:43.9438873Z       %171 = arith.divsi %170, %c1024_i32 : i32
2026-02-21T09:48:43.9439000Z       %172 = arith.muli %171, %c4_i32 : i32
2026-02-21T09:48:43.9439126Z       %173 = arith.subi %c128_i32, %172 : i32
2026-02-21T09:48:43.9439251Z       %174 = arith.minsi %173, %c4_i32 : i32
2026-02-21T09:48:43.9439375Z       %175 = arith.remsi %170, %c1024_i32 : i32
2026-02-21T09:48:43.9439500Z       %176 = arith.remsi %175, %174 : i32
2026-02-21T09:48:43.9439619Z       %177 = arith.addi %172, %176 : i32
2026-02-21T09:48:43.9439741Z       %178 = arith.divsi %175, %174 : i32
2026-02-21T09:48:43.9439863Z       %179 = arith.muli %177, %c128_i32 : i32
2026-02-21T09:48:43.9440043Z       %180 = tt.splat %179 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:43.9440279Z       %181 = arith.addi %180, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:43.9440458Z       %182 = arith.muli %178, %c32_i32 : i32
2026-02-21T09:48:43.9440691Z       %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:48:43.9440948Z       %184 = arith.muli %183, %cst_4 : tensor<128x1xi32, #blocked1>
2026-02-21T09:48:43.9441151Z       %185 = tt.broadcast %184 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9441335Z       %186 = arith.extsi %182 : i32 to i64
2026-02-21T09:48:43.9441547Z       %187 = tt.splat %186 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9441870Z       %188 = arith.addi %187, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9442261Z       %189 = tt.expand_dims %188 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9442990Z       %190 = tt.broadcast %189 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9443399Z       %191 = arith.cmpi sge, %189, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9443648Z       %192 = arith.cmpi slt, %189, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9443892Z       %193 = arith.andi %191, %192 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9444199Z       %194 = tt.broadcast %193 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9444553Z       %195 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c16_i32 iter_args(%arg5 = %cst) -> (tensor<128x32xf32, #mma>)  : i32 {
2026-02-21T09:48:43.9444780Z         %218 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:43.9444958Z         %219 = tt.splat %218 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9445196Z         %220 = arith.addi %219, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9445484Z         %221 = tt.expand_dims %220 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9445771Z         %222 = tt.broadcast %221 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9445977Z         %223 = arith.addi %185, %222 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9446185Z         %224 = tt.addptr %4, %223 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9446475Z         %225 = tt.load %224 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9446705Z         %226 = ttg.local_alloc %225 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9447044Z         %227 = ttg.local_load %226 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9447483Z         %228 = arith.extf %227 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9447771Z         %229 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:43.9447986Z         %230 = tt.splat %229 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9448284Z         %231 = arith.addi %230, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9448676Z         %232 = tt.expand_dims %231 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9449037Z         %233 = arith.muli %232, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9449348Z         %234 = tt.broadcast %233 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9449661Z         %235 = arith.addi %234, %190 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9449976Z         %236 = tt.addptr %5, %235 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9450292Z         %237 = arith.cmpi sge, %232, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9450564Z         %238 = arith.cmpi slt, %232, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9450801Z         %239 = arith.andi %237, %238 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9451126Z         %240 = tt.broadcast %239 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9451430Z         %241 = arith.andi %240, %194 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9451676Z         %242 = tt.load %236, %241, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9451927Z         %243 = arith.shli %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9452166Z         %244 = arith.shrsi %243, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9452408Z         %245 = arith.shrsi %242, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9452701Z         %246 = tt.expand_dims %244 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9453038Z         %247 = tt.expand_dims %245 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9453324Z         %248 = tt.broadcast %246 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9453565Z         %249 = arith.select %14, %248, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9453803Z         %250 = tt.broadcast %247 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9454039Z         %251 = arith.select %16, %250, %249 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9454269Z         %252 = tt.reshape %251 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9454514Z         %253 = arith.sitofp %252 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9454769Z         %254 = ttg.local_alloc %253 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9455092Z         %255 = ttg.local_load %254 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9455583Z         %256 = tt.dot %228, %255, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9455934Z         %257 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:43.9456064Z         %258 = arith.muli %257, %c2_i32 : i32
2026-02-21T09:48:43.9456241Z         %259 = tt.splat %258 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9456470Z         %260 = arith.addi %259, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9456753Z         %261 = tt.expand_dims %260 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9457036Z         %262 = tt.broadcast %261 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9457236Z         %263 = arith.addi %185, %262 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9457440Z         %264 = tt.addptr %4, %263 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9457647Z         %265 = tt.load %264 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9457872Z         %266 = ttg.local_alloc %265 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9458215Z         %267 = ttg.local_load %266 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9458623Z         %268 = arith.extf %267 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9458924Z         %269 = arith.extsi %257 : i32 to i64
2026-02-21T09:48:43.9459134Z         %270 = tt.splat %269 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9459433Z         %271 = arith.addi %270, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9459818Z         %272 = tt.expand_dims %271 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9460170Z         %273 = arith.muli %272, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9460477Z         %274 = tt.broadcast %273 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9460779Z         %275 = arith.addi %274, %190 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9461091Z         %276 = tt.addptr %5, %275 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9461410Z         %277 = arith.cmpi sge, %272, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9461652Z         %278 = arith.cmpi slt, %272, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9461889Z         %279 = arith.andi %277, %278 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9462187Z         %280 = tt.broadcast %279 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9462501Z         %281 = arith.andi %280, %194 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9462746Z         %282 = tt.load %276, %281, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9462992Z         %283 = arith.shli %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9463241Z         %284 = arith.shrsi %283, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9463475Z         %285 = arith.shrsi %282, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9463760Z         %286 = tt.expand_dims %284 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9464093Z         %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9464374Z         %288 = tt.broadcast %286 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9464611Z         %289 = arith.select %14, %288, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9464847Z         %290 = tt.broadcast %287 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9465077Z         %291 = arith.select %16, %290, %289 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9465307Z         %292 = tt.reshape %291 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9465527Z         %293 = arith.sitofp %292 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9465776Z         %294 = ttg.local_alloc %293 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9466114Z         %295 = ttg.local_load %294 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9466579Z         %296 = tt.dot %268, %295, %256, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9466939Z         %297 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:48:43.9467063Z         %298 = arith.muli %297, %c2_i32 : i32
2026-02-21T09:48:43.9467235Z         %299 = tt.splat %298 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9467460Z         %300 = arith.addi %299, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9467738Z         %301 = tt.expand_dims %300 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9468017Z         %302 = tt.broadcast %301 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9468216Z         %303 = arith.addi %185, %302 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9468420Z         %304 = tt.addptr %4, %303 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9468631Z         %305 = tt.load %304 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9468857Z         %306 = ttg.local_alloc %305 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9469190Z         %307 = ttg.local_load %306 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9469593Z         %308 = arith.extf %307 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9469877Z         %309 = arith.extsi %297 : i32 to i64
2026-02-21T09:48:43.9470088Z         %310 = tt.splat %309 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9470396Z         %311 = arith.addi %310, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9470781Z         %312 = tt.expand_dims %311 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9471155Z         %313 = arith.muli %312, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9471457Z         %314 = tt.broadcast %313 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9471760Z         %315 = arith.addi %314, %190 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9472068Z         %316 = tt.addptr %5, %315 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9472389Z         %317 = arith.cmpi sge, %312, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9472635Z         %318 = arith.cmpi slt, %312, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9472872Z         %319 = arith.andi %317, %318 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9473176Z         %320 = tt.broadcast %319 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9473476Z         %321 = arith.andi %320, %194 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9473723Z         %322 = tt.load %316, %321, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9473973Z         %323 = arith.shli %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9474229Z         %324 = arith.shrsi %323, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9474467Z         %325 = arith.shrsi %322, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9474766Z         %326 = tt.expand_dims %324 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9475100Z         %327 = tt.expand_dims %325 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9475382Z         %328 = tt.broadcast %326 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9475616Z         %329 = arith.select %14, %328, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9475850Z         %330 = tt.broadcast %327 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9476079Z         %331 = arith.select %16, %330, %329 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9476307Z         %332 = tt.reshape %331 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9476530Z         %333 = arith.sitofp %332 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9476777Z         %334 = ttg.local_alloc %333 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9477100Z         %335 = ttg.local_load %334 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9477567Z         %336 = tt.dot %308, %335, %296, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9477911Z         %337 = arith.addi %arg4, %c12_i32 : i32
2026-02-21T09:48:43.9478038Z         %338 = arith.muli %337, %c2_i32 : i32
2026-02-21T09:48:43.9478224Z         %339 = tt.splat %338 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9478447Z         %340 = arith.addi %339, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9478724Z         %341 = tt.expand_dims %340 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9479011Z         %342 = tt.broadcast %341 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9479208Z         %343 = arith.addi %185, %342 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9479409Z         %344 = tt.addptr %4, %343 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9479617Z         %345 = tt.load %344 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9479841Z         %346 = ttg.local_alloc %345 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9480169Z         %347 = ttg.local_load %346 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9480577Z         %348 = arith.extf %347 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9480858Z         %349 = arith.extsi %337 : i32 to i64
2026-02-21T09:48:43.9481067Z         %350 = tt.splat %349 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9481366Z         %351 = arith.addi %350, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9481752Z         %352 = tt.expand_dims %351 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9482119Z         %353 = arith.muli %352, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9482427Z         %354 = tt.broadcast %353 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9482824Z         %355 = arith.addi %354, %190 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9483134Z         %356 = tt.addptr %5, %355 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9483447Z         %357 = arith.cmpi sge, %352, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9483693Z         %358 = arith.cmpi slt, %352, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9483929Z         %359 = arith.andi %357, %358 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9484227Z         %360 = tt.broadcast %359 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9484524Z         %361 = arith.andi %360, %194 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9484766Z         %362 = tt.load %356, %361, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9485013Z         %363 = arith.shli %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9485246Z         %364 = arith.shrsi %363, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9485480Z         %365 = arith.shrsi %362, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9485766Z         %366 = tt.expand_dims %364 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9486118Z         %367 = tt.expand_dims %365 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9486400Z         %368 = tt.broadcast %366 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9486637Z         %369 = arith.select %14, %368, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9486888Z         %370 = tt.broadcast %367 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9487118Z         %371 = arith.select %16, %370, %369 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9487344Z         %372 = tt.reshape %371 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9487566Z         %373 = arith.sitofp %372 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9487814Z         %374 = ttg.local_alloc %373 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9488136Z         %375 = ttg.local_load %374 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9488609Z         %376 = tt.dot %348, %375, %336, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9488960Z         scf.yield %376 : tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9489124Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:43.9489332Z       %196 = arith.truncf %195 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma>
2026-02-21T09:48:43.9489503Z       %197 = arith.extsi %179 : i32 to i64
2026-02-21T09:48:43.9489669Z       %198 = tt.splat %197 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9489897Z       %199 = arith.addi %198, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9490167Z       %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9490428Z       %201 = arith.muli %200, %cst_13 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9490610Z       %202 = tt.broadcast %201 : tensor<128x1xi64, #mma> -> tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9490818Z       %203 = tt.splat %186 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9491024Z       %204 = arith.addi %203, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9491284Z       %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9491548Z       %206 = tt.broadcast %205 : tensor<1x32xi64, #mma> -> tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9491732Z       %207 = arith.addi %202, %206 : tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9491927Z       %208 = tt.addptr %17, %207 : tensor<128x32x!tt.ptr<bf16>, #mma>, tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9492128Z       %209 = arith.cmpi sge, %200, %cst_14 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9492298Z       %210 = arith.cmpi slt, %200, %cst_15 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9492460Z       %211 = arith.andi %209, %210 : tensor<128x1xi1, #mma>
2026-02-21T09:48:43.9492635Z       %212 = tt.broadcast %211 : tensor<128x1xi1, #mma> -> tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9492821Z       %213 = arith.cmpi sge, %205, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9492985Z       %214 = arith.cmpi slt, %205, %cst_12 : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9493142Z       %215 = arith.andi %213, %214 : tensor<1x32xi1, #mma>
2026-02-21T09:48:43.9493313Z       %216 = tt.broadcast %215 : tensor<1x32xi1, #mma> -> tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9493492Z       %217 = arith.andi %212, %216 : tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9493655Z       tt.store %208, %196, %217 : tensor<128x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:43.9493839Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:48:43.9494005Z     scf.for %arg3 = %26 to %c32768_i32 step %c608_i32  : i32 {
2026-02-21T09:48:43.9494151Z       %27 = arith.divsi %arg3, %c1024_i32 : i32
2026-02-21T09:48:43.9494276Z       %28 = arith.muli %27, %c4_i32 : i32
2026-02-21T09:48:43.9494408Z       %29 = arith.subi %c128_i32, %28 : i32
2026-02-21T09:48:43.9494528Z       %30 = arith.minsi %29, %c4_i32 : i32
2026-02-21T09:48:43.9494650Z       %31 = arith.remsi %arg3, %c1024_i32 : i32
2026-02-21T09:48:43.9494769Z       %32 = arith.remsi %31, %30 : i32
2026-02-21T09:48:43.9494883Z       %33 = arith.addi %28, %32 : i32
2026-02-21T09:48:43.9494992Z       %34 = arith.divsi %31, %30 : i32
2026-02-21T09:48:43.9495105Z       %35 = arith.muli %33, %c128_i32 : i32
2026-02-21T09:48:43.9495272Z       %36 = tt.splat %35 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:43.9495495Z       %37 = arith.addi %36, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:48:43.9495668Z       %38 = arith.muli %34, %c32_i32 : i32
2026-02-21T09:48:43.9495890Z       %39 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:48:43.9496144Z       %40 = arith.muli %39, %cst_4 : tensor<128x1xi32, #blocked1>
2026-02-21T09:48:43.9496336Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9496511Z       %42 = arith.extsi %38 : i32 to i64
2026-02-21T09:48:43.9496713Z       %43 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9497007Z       %44 = arith.addi %43, %9 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9497405Z       %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9497821Z       %46 = tt.broadcast %45 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9498142Z       %47 = arith.cmpi sge, %45, %cst_3 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9498382Z       %48 = arith.cmpi slt, %45, %cst_2 : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9498608Z       %49 = arith.andi %47, %48 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9498900Z       %50 = tt.broadcast %49 : tensor<1x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9499232Z       %51 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c16_i32 iter_args(%arg5 = %cst) -> (tensor<128x32xf32, #mma>)  : i32 {
2026-02-21T09:48:43.9499451Z         %74 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:48:43.9499620Z         %75 = tt.splat %74 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9499835Z         %76 = arith.addi %75, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9500107Z         %77 = tt.expand_dims %76 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9500379Z         %78 = tt.broadcast %77 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9500574Z         %79 = arith.addi %41, %78 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9500773Z         %80 = tt.addptr %4, %79 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9500974Z         %81 = tt.load %80 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9501196Z         %82 = ttg.local_alloc %81 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9501535Z         %83 = ttg.local_load %82 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9501939Z         %84 = arith.extf %83 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9502240Z         %85 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:48:43.9502445Z         %86 = tt.splat %85 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9502737Z         %87 = arith.addi %86, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9503115Z         %88 = tt.expand_dims %87 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9503463Z         %89 = arith.muli %88, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9503765Z         %90 = tt.broadcast %89 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9504061Z         %91 = arith.addi %90, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9504361Z         %92 = tt.addptr %5, %91 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9504671Z         %93 = arith.cmpi sge, %88, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9504909Z         %94 = arith.cmpi slt, %88, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9505135Z         %95 = arith.andi %93, %94 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9505441Z         %96 = tt.broadcast %95 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9505747Z         %97 = arith.andi %96, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9505981Z         %98 = tt.load %92, %97, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9506217Z         %99 = arith.shli %98, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9506446Z         %100 = arith.shrsi %99, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9506677Z         %101 = arith.shrsi %98, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9506963Z         %102 = tt.expand_dims %100 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9507298Z         %103 = tt.expand_dims %101 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9507576Z         %104 = tt.broadcast %102 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9507814Z         %105 = arith.select %14, %104, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9508046Z         %106 = tt.broadcast %103 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9508277Z         %107 = arith.select %16, %106, %105 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9508503Z         %108 = tt.reshape %107 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9508722Z         %109 = arith.sitofp %108 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9508971Z         %110 = ttg.local_alloc %109 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9509304Z         %111 = ttg.local_load %110 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9509773Z         %112 = tt.dot %84, %111, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9510131Z         %113 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:48:43.9510255Z         %114 = arith.muli %113, %c2_i32 : i32
2026-02-21T09:48:43.9510425Z         %115 = tt.splat %114 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9510650Z         %116 = arith.addi %115, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9515604Z         %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9515893Z         %118 = tt.broadcast %117 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9516092Z         %119 = arith.addi %41, %118 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9516297Z         %120 = tt.addptr %4, %119 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9516509Z         %121 = tt.load %120 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9516736Z         %122 = ttg.local_alloc %121 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9517071Z         %123 = ttg.local_load %122 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9517485Z         %124 = arith.extf %123 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9517809Z         %125 = arith.extsi %113 : i32 to i64
2026-02-21T09:48:43.9518019Z         %126 = tt.splat %125 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9518330Z         %127 = arith.addi %126, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9518716Z         %128 = tt.expand_dims %127 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9519071Z         %129 = arith.muli %128, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9519375Z         %130 = tt.broadcast %129 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9519676Z         %131 = arith.addi %130, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9519985Z         %132 = tt.addptr %5, %131 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9520299Z         %133 = arith.cmpi sge, %128, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9520546Z         %134 = arith.cmpi slt, %128, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9520780Z         %135 = arith.andi %133, %134 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9521080Z         %136 = tt.broadcast %135 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9521377Z         %137 = arith.andi %136, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9521619Z         %138 = tt.load %132, %137, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9521883Z         %139 = arith.shli %138, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9522115Z         %140 = arith.shrsi %139, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9522350Z         %141 = arith.shrsi %138, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9522700Z         %142 = tt.expand_dims %140 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9523034Z         %143 = tt.expand_dims %141 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9523316Z         %144 = tt.broadcast %142 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9523552Z         %145 = arith.select %14, %144, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9523786Z         %146 = tt.broadcast %143 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9524017Z         %147 = arith.select %16, %146, %145 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9524243Z         %148 = tt.reshape %147 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9524464Z         %149 = arith.sitofp %148 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9524715Z         %150 = ttg.local_alloc %149 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9525036Z         %151 = ttg.local_load %150 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9525527Z         %152 = tt.dot %124, %151, %112, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9525873Z         %153 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:48:43.9526000Z         %154 = arith.muli %153, %c2_i32 : i32
2026-02-21T09:48:43.9526188Z         %155 = tt.splat %154 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9526407Z         %156 = arith.addi %155, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9526681Z         %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9526956Z         %158 = tt.broadcast %157 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9527151Z         %159 = arith.addi %41, %158 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9527353Z         %160 = tt.addptr %4, %159 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9527558Z         %161 = tt.load %160 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9527684Z         %162 = ttg.local_alloc %161 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9527855Z         %163 = ttg.local_load %162 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9528055Z         %164 = arith.extf %163 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9528100Z         %165 = arith.extsi %153 : i32 to i64
2026-02-21T09:48:43.9528228Z         %166 = tt.splat %165 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9528354Z         %167 = arith.addi %166, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9528591Z         %168 = tt.expand_dims %167 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9528688Z         %169 = arith.muli %168, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9528857Z         %170 = tt.broadcast %169 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9528967Z         %171 = arith.addi %170, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9529142Z         %172 = tt.addptr %5, %171 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9529245Z         %173 = arith.cmpi sge, %168, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9529347Z         %174 = arith.cmpi slt, %168, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9529442Z         %175 = arith.andi %173, %174 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9529605Z         %176 = tt.broadcast %175 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9529699Z         %177 = arith.andi %176, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9529808Z         %178 = tt.load %172, %177, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9529902Z         %179 = arith.shli %178, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9529999Z         %180 = arith.shrsi %179, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9530094Z         %181 = arith.shrsi %178, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9530255Z         %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9530403Z         %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9530509Z         %184 = tt.broadcast %182 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9530611Z         %185 = arith.select %14, %184, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9530703Z         %186 = tt.broadcast %183 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9530799Z         %187 = arith.select %16, %186, %185 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9530887Z         %188 = tt.reshape %187 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9530978Z         %189 = arith.sitofp %188 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9531097Z         %190 = ttg.local_alloc %189 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9531263Z         %191 = ttg.local_load %190 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9531523Z         %192 = tt.dot %164, %191, %152, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9531571Z         %193 = arith.addi %arg4, %c12_i32 : i32
2026-02-21T09:48:43.9531616Z         %194 = arith.muli %193, %c2_i32 : i32
2026-02-21T09:48:43.9531707Z         %195 = tt.splat %194 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9531799Z         %196 = arith.addi %195, %3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:48:43.9531962Z         %197 = tt.expand_dims %196 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:48:43.9532054Z         %198 = tt.broadcast %197 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9532119Z         %199 = arith.addi %41, %198 : tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9532234Z         %200 = tt.addptr %4, %199 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:48:43.9532298Z         %201 = tt.load %200 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:48:43.9532422Z         %202 = ttg.local_alloc %201 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:48:43.9532591Z         %203 = ttg.local_load %202 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9532789Z         %204 = arith.extf %203 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9532833Z         %205 = arith.extsi %193 : i32 to i64
2026-02-21T09:48:43.9532964Z         %206 = tt.splat %205 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9533091Z         %207 = arith.addi %206, %7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:48:43.9533312Z         %208 = tt.expand_dims %207 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9533409Z         %209 = arith.muli %208, %cst_8 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9533577Z         %210 = tt.broadcast %209 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9533693Z         %211 = arith.addi %210, %46 : tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9533865Z         %212 = tt.addptr %5, %211 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x32xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9533983Z         %213 = arith.cmpi sge, %208, %cst_9 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9534087Z         %214 = arith.cmpi slt, %208, %cst_10 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9534178Z         %215 = arith.andi %213, %214 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9534340Z         %216 = tt.broadcast %215 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9534432Z         %217 = arith.andi %216, %50 : tensor<4x32xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9534541Z         %218 = tt.load %212, %217, %cst_1 : tensor<4x32x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9534637Z         %219 = arith.shli %218, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9534737Z         %220 = arith.shrsi %219, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9534831Z         %221 = arith.shrsi %218, %cst_5 : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:48:43.9534978Z         %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9535126Z         %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<4x32xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x32xi8, #blocked>
2026-02-21T09:48:43.9535219Z         %224 = tt.broadcast %222 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9535336Z         %225 = arith.select %14, %224, %cst_0 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9535429Z         %226 = tt.broadcast %223 : tensor<4x1x32xi8, #blocked> -> tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9535526Z         %227 = arith.select %16, %226, %225 : tensor<4x2x32xi1, #blocked>, tensor<4x2x32xi8, #blocked>
2026-02-21T09:48:43.9535629Z         %228 = tt.reshape %227 : tensor<4x2x32xi8, #blocked> -> tensor<8x32xi8, #blocked2>
2026-02-21T09:48:43.9535722Z         %229 = arith.sitofp %228 : tensor<8x32xi8, #blocked2> to tensor<8x32xf32, #blocked2>
2026-02-21T09:48:43.9535839Z         %230 = ttg.local_alloc %229 : (tensor<8x32xf32, #blocked2>) -> !ttg.memdesc<8x32xf32, #shared1, #smem>
2026-02-21T09:48:43.9536003Z         %231 = ttg.local_load %230 : !ttg.memdesc<8x32xf32, #shared1, #smem> -> tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:48:43.9536267Z         %232 = tt.dot %204, %231, %192, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x32xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9536319Z         scf.yield %232 : tensor<128x32xf32, #mma>
2026-02-21T09:48:43.9536401Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:48:43.9536489Z       %52 = arith.truncf %51 : tensor<128x32xf32, #mma> to tensor<128x32xbf16, #mma>
2026-02-21T09:48:43.9536532Z       %53 = arith.extsi %35 : i32 to i64
2026-02-21T09:48:43.9536615Z       %54 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9536698Z       %55 = arith.addi %54, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:48:43.9536835Z       %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9536896Z       %57 = arith.muli %56, %cst_13 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9536991Z       %58 = tt.broadcast %57 : tensor<128x1xi64, #mma> -> tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9537071Z       %59 = tt.splat %42 : i64 -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9537165Z       %60 = arith.addi %59, %20 : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:48:43.9537298Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9537376Z       %62 = tt.broadcast %61 : tensor<1x32xi64, #mma> -> tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9537432Z       %63 = arith.addi %58, %62 : tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9537523Z       %64 = tt.addptr %17, %63 : tensor<128x32x!tt.ptr<bf16>, #mma>, tensor<128x32xi64, #mma>
2026-02-21T09:48:43.9537587Z       %65 = arith.cmpi sge, %56, %cst_14 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9537649Z       %66 = arith.cmpi slt, %56, %cst_15 : tensor<128x1xi64, #mma>
2026-02-21T09:48:43.9537706Z       %67 = arith.andi %65, %66 : tensor<128x1xi1, #mma>
2026-02-21T09:48:43.9537786Z       %68 = tt.broadcast %67 : tensor<128x1xi1, #mma> -> tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9537849Z       %69 = arith.cmpi sge, %61, %cst_11 : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9537910Z       %70 = arith.cmpi slt, %61, %cst_12 : tensor<1x32xi64, #mma>
2026-02-21T09:48:43.9537965Z       %71 = arith.andi %69, %70 : tensor<1x32xi1, #mma>
2026-02-21T09:48:43.9538041Z       %72 = tt.broadcast %71 : tensor<1x32xi1, #mma> -> tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9538095Z       %73 = arith.andi %68, %72 : tensor<128x32xi1, #mma>
2026-02-21T09:48:43.9538158Z       tt.store %64, %52, %73 : tensor<128x32x!tt.ptr<bf16>, #mma>
2026-02-21T09:48:43.9538221Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:48:43.9538257Z     tt.return
2026-02-21T09:48:43.9538288Z   }
2026-02-21T09:48:43.9538324Z }
2026-02-21T09:48:43.9538329Z 
2026-02-21T09:48:43.9538359Z {-#
2026-02-21T09:48:43.9538401Z   external_resources: {
2026-02-21T09:48:43.9538464Z     mlir_reproducer: {
2026-02-21T09:48:43.9539393Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:48:43.9539455Z       disable_threading: false,
2026-02-21T09:48:43.9539492Z       verify_each: true
2026-02-21T09:48:43.9539524Z     }
2026-02-21T09:48:43.9539554Z   }
2026-02-21T09:48:43.9539586Z #-}
2026-02-21T09:48:43.9539829Z /tmp/torchinductor_root/hq/chqcwabtsnwtngk7ihgmoab5jfdhtejdjrvnzsagkuazwdcogvdr.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:48:43.9540254Z /tmp/torchinductor_root/hq/chqcwabtsnwtngk7ihgmoab5jfdhtejdjrvnzsagkuazwdcogvdr.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:48:43.9540372Z [254s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:48:43.9541018Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 32], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[4, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:48:43.9541076Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:48:43.9541159Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:48:52.3663457Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 111/111 9.1 configs/s
2026-02-21T09:48:54.2141902Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━ 155/155 80.1 configs/s
2026-02-21T09:48:57.8750449Z [268s] Generation 1 complete: 
2026-02-21T09:48:57.8750818Z error=15
2026-02-21T09:48:57.8751028Z ok=99
2026-02-21T09:48:57.8751229Z min=1.2871
2026-02-21T09:48:57.8751435Z mid=2.7963
2026-02-21T09:48:57.8751635Z max=72.1609
2026-02-21T09:48:57.8751873Z best={'block_sizes': [8, 128, 512],
2026-02-21T09:48:57.8752251Z  'indexing': ['pointer', 'block_ptr', 'block_ptr'],
2026-02-21T09:48:57.8752616Z  'l2_groupings': [1],
2026-02-21T09:48:57.8752896Z  'load_eviction_policies': ['', ''],
2026-02-21T09:48:57.8753212Z  'loop_orders': [[0, 1]],
2026-02-21T09:48:57.8753524Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:48:57.8753821Z  'num_sm_multiplier': 128,
2026-02-21T09:48:57.8754092Z  'num_stages': 4,
2026-02-21T09:48:57.8754321Z  'num_warps': 8,
2026-02-21T09:48:57.8754604Z  'pid_type': 'persistent_blocked',
2026-02-21T09:48:57.8754918Z  'range_flattens': [False, True],
2026-02-21T09:48:57.8755239Z  'range_multi_buffers': [True, True],
2026-02-21T09:48:57.8755545Z  'range_num_stages': [2, 1],
2026-02-21T09:48:57.8755826Z  'range_unroll_factors': [3, 3],
2026-02-21T09:48:57.8756154Z  'range_warp_specializes': [],
2026-02-21T09:48:57.8756373Z  'waves_per_eu': 1}
2026-02-21T09:48:57.8801599Z [268s] Fitting surrogate: 214 points, 214 targets
2026-02-21T09:48:58.9930334Z [269s] Generation 2 starting: 103 neighbors, 5 active search path(s)
2026-02-21T09:49:34.9814951Z [305s] Timeout after 30s compiling Config(block_sizes=[32, 32, 512], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:49:34.9838909Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 103/103 0.7 configs/s
2026-02-21T09:49:35.8807396Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:49:35.8881397Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 8], order = [1, 0]}>
2026-02-21T09:49:35.8889535Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 8], order = [2, 1, 0]}>
2026-02-21T09:49:35.8893392Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:49:35.8894984Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 8], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:49:35.8895579Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:49:35.8895923Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:49:35.8896218Z #smem = #ttg.shared_memory
2026-02-21T09:49:35.8896564Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:49:35.8897254Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:49:35.8898059Z     %cst = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.8898696Z     %cst_0 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.8898955Z     %cst_1 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.8899211Z     %cst_2 = arith.constant dense<8192> : tensor<1x512xi64, #mma>
2026-02-21T09:49:35.8899595Z     %cst_3 = arith.constant dense<0> : tensor<1x512xi64, #mma>
2026-02-21T09:49:35.8899857Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8900116Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8900371Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8900633Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:49:35.8900865Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:49:35.8901050Z     %cst_9 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:49:35.8901206Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:49:35.8901373Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x512xf32, #mma>
2026-02-21T09:49:35.8901612Z     %cst_11 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.8901876Z     %cst_12 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.8902071Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:49:35.8902192Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:49:35.8902321Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T09:49:35.8902440Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:49:35.8902563Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T09:49:35.8902692Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:49:35.8902811Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:49:35.8902961Z     %cst_13 = arith.constant dense<0> : tensor<2x512xi8, #blocked>
2026-02-21T09:49:35.8903138Z     %cst_14 = arith.constant dense<8192> : tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.8903323Z     %cst_15 = arith.constant dense<0> : tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.8903593Z     %cst_16 = arith.constant dense<0> : tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8903748Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:49:35.8903874Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:49:35.8904058Z     %cst_17 = arith.constant dense<4> : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8904333Z     %0 = tt.get_program_id x : i32
2026-02-21T09:49:35.8904449Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:49:35.8904571Z     %2 = arith.minsi %1, %c2048_i32 : i32
2026-02-21T09:49:35.8904782Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:35.8905064Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:35.8905339Z     %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.8905583Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.8905788Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.8906025Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.8906391Z     %9 = arith.extsi %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.8906851Z     %10 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:35.8907317Z     %11 = arith.extsi %10 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:35.8907875Z     %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:49:35.8908511Z     %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:49:35.8909122Z     %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:49:35.8909513Z     %15 = arith.cmpi eq, %14, %cst_8 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:49:35.8909802Z     %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1>
2026-02-21T09:49:35.8910085Z     %17 = arith.cmpi eq, %14, %cst_7 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:49:35.8910366Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1>
2026-02-21T09:49:35.8910667Z     %19 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:35.8911058Z     %20 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:35.8911493Z     %21 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:35.8911936Z     %22 = arith.extsi %21 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:35.8912270Z     %23 = arith.subi %2, %0 : i32
2026-02-21T09:49:35.8912428Z     %24 = arith.remsi %23, %c3_i32 : i32
2026-02-21T09:49:35.8912596Z     %25 = arith.subi %23, %24 : i32
2026-02-21T09:49:35.8912750Z     %26 = arith.addi %0, %25 : i32
2026-02-21T09:49:35.8912985Z     %27 = arith.addi %5, %cst_11 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.8913382Z     %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.8913770Z     %29 = tt.broadcast %28 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8914132Z     %30 = arith.addi %9, %cst_12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.8914518Z     %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8914898Z     %32 = arith.muli %31, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8915164Z     %33 = tt.broadcast %32 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8915438Z     %34 = arith.cmpi sge, %31, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8915676Z     %35 = arith.cmpi slt, %31, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8915900Z     %36 = arith.andi %34, %35 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.8916159Z     %37 = tt.broadcast %36 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.8916410Z     scf.for %arg3 = %0 to %26 step %c3_i32  : i32 {
2026-02-21T09:49:35.8916612Z       %38 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:49:35.8916796Z       %39 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:49:35.8916969Z       %40 = arith.muli %38, %c128_i32 : i32
2026-02-21T09:49:35.8917216Z       %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:35.8917552Z       %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:35.8917803Z       %43 = arith.muli %39, %c512_i32 : i32
2026-02-21T09:49:35.8918125Z       %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:49:35.8918492Z       %45 = arith.muli %44, %cst_9 : tensor<128x1xi32, #blocked2>
2026-02-21T09:49:35.8918773Z       %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8919023Z       %47 = arith.extsi %43 : i32 to i64
2026-02-21T09:49:35.8919290Z       %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:35.8919604Z       %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:35.8920020Z       %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.8920420Z       %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8920704Z       %52 = arith.cmpi sge, %50, %cst_15 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.8920956Z       %53 = arith.cmpi slt, %50, %cst_14 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.8921186Z       %54 = arith.andi %52, %53 : tensor<1x512xi1, #blocked>
2026-02-21T09:49:35.8921446Z       %55 = tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.8921827Z       %56 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:49:35.8922138Z         %238 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:49:35.8922391Z         %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.8922766Z         %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.8923169Z         %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.8923571Z         %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8923903Z         %243 = arith.addi %46, %242 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8924212Z         %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8924514Z         %245 = tt.load %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.8924873Z         %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.8925367Z         %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8925981Z         %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8926447Z         %249 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:35.8926695Z         %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.8927016Z         %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.8927415Z         %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8927771Z         %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8928050Z         %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8928331Z         %255 = arith.addi %254, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8928613Z         %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8928917Z         %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8929162Z         %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8929400Z         %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.8929675Z         %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.8929950Z         %261 = arith.andi %260, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.8930217Z         %262 = tt.load %256, %261, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.8930618Z         %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8931037Z         %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8931412Z         %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8931774Z         %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8932208Z         %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.8932709Z         %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.8933143Z         %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8933511Z         %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8933869Z         %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8934228Z         %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8934570Z         %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.8934895Z         %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.8935259Z         %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.8935759Z         %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8936502Z         %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.8937025Z         %278 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:49:35.8937205Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:49:35.8937472Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.8937793Z         %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.8938197Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.8938596Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8938883Z         %284 = arith.addi %46, %283 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8939178Z         %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8939477Z         %286 = tt.load %285 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.8939805Z         %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.8940291Z         %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8940918Z         %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8941332Z         %290 = arith.extsi %278 : i32 to i64
2026-02-21T09:49:35.8941571Z         %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.8941913Z         %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.8942307Z         %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8942685Z         %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8942965Z         %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8943244Z         %296 = arith.addi %295, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8943552Z         %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8943856Z         %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8944104Z         %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8944344Z         %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.8944610Z         %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.8944885Z         %302 = arith.andi %301, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.8945124Z         %303 = tt.load %297, %302, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.8945505Z         %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8945854Z         %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8946101Z         %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8946436Z         %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8946868Z         %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.8947401Z         %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.8947857Z         %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8948324Z         %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8948718Z         %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8949085Z         %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8949428Z         %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.8949757Z         %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.8950121Z         %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.8950601Z         %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8951295Z         %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.8951818Z         %319 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:35.8951999Z         %320 = arith.muli %319, %c2_i32 : i32
2026-02-21T09:49:35.8952243Z         %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.8952570Z         %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.8952978Z         %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.8953403Z         %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8953687Z         %325 = arith.addi %46, %324 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8953996Z         %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8954301Z         %327 = tt.load %326 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.8954627Z         %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.8955107Z         %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8955705Z         %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8956113Z         %331 = arith.extsi %319 : i32 to i64
2026-02-21T09:49:35.8956360Z         %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.8956681Z         %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.8957075Z         %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8957433Z         %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8957706Z         %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8957985Z         %337 = arith.addi %336, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8958268Z         %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8958566Z         %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8958834Z         %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.8959068Z         %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.8959338Z         %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.8959606Z         %343 = arith.andi %342, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.8959867Z         %344 = tt.load %338, %343, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.8960244Z         %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8960659Z         %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8961013Z         %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8961367Z         %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8961801Z         %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.8962306Z         %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.8962762Z         %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8963121Z         %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8963480Z         %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8963829Z         %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8964170Z         %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.8964511Z         %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.8964872Z         %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.8965373Z         %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8966061Z         %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.8966573Z         scf.yield %359 : tensor<128x512xf32, #mma>
2026-02-21T09:49:35.8966759Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:49:35.8966959Z       %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8967241Z       %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8967526Z       %59 = tt.load %58 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.8967836Z       %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.8968301Z       %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8968885Z       %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8969305Z       %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8969571Z       %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8969846Z       %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.8970075Z       %66 = tt.load %64, %65, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.8970461Z       %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8970863Z       %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8971215Z       %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8971553Z       %70 = arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.8971970Z       %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.8972462Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.8972872Z       %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8973217Z       %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8973561Z       %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8973893Z       %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.8974220Z       %77 = tt.reshape %76 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.8974529Z       %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.8974873Z       %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.8975332Z       %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.8989184Z       %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.8989970Z       %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:49:35.8990220Z       %83 = arith.extsi %40 : i32 to i64
2026-02-21T09:49:35.8990445Z       %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:35.8990738Z       %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:35.8991109Z       %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:49:35.8991443Z       %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.8991698Z       %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:35.8991995Z       %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:35.8992289Z       %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:35.8992670Z       %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:49:35.8993068Z       %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:35.8993328Z       %93 = arith.addi %88, %92 : tensor<128x512xi64, #mma>
2026-02-21T09:49:35.8993592Z       %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:49:35.8993874Z       %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.8994098Z       %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.8994314Z       %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma>
2026-02-21T09:49:35.8994556Z       %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:35.8994850Z       %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:49:35.8995081Z       %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:49:35.8995303Z       %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma>
2026-02-21T09:49:35.8995573Z       %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:35.8995828Z       %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma>
2026-02-21T09:49:35.8996055Z       tt.store %94, %82, %103 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:35.8996265Z       %104 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:49:35.8996437Z       %105 = arith.remsi %104, %c128_i32 : i32
2026-02-21T09:49:35.8996611Z       %106 = arith.divsi %104, %c128_i32 : i32
2026-02-21T09:49:35.8996780Z       %107 = arith.muli %105, %c128_i32 : i32
2026-02-21T09:49:35.8997026Z       %108 = tt.splat %107 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:35.8997353Z       %109 = arith.addi %108, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:35.8997621Z       %110 = arith.muli %106, %c512_i32 : i32
2026-02-21T09:49:35.8997852Z       %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:49:35.8998108Z       %112 = arith.muli %111, %cst_9 : tensor<128x1xi32, #blocked2>
2026-02-21T09:49:35.8998307Z       %113 = tt.broadcast %112 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.8998486Z       %114 = arith.extsi %110 : i32 to i64
2026-02-21T09:49:35.8998660Z       %115 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:35.8998883Z       %116 = arith.addi %115, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:35.8999183Z       %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.8999466Z       %118 = tt.broadcast %117 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.8999684Z       %119 = arith.cmpi sge, %117, %cst_15 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.8999864Z       %120 = arith.cmpi slt, %117, %cst_14 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.9000032Z       %121 = arith.andi %119, %120 : tensor<1x512xi1, #blocked>
2026-02-21T09:49:35.9000227Z       %122 = tt.broadcast %121 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9000498Z       %123 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:49:35.9000716Z         %238 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:49:35.9000890Z         %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9001114Z         %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9001392Z         %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.9001673Z         %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9001871Z         %243 = arith.addi %113, %242 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9002079Z         %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9002286Z         %245 = tt.load %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9002511Z         %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9002908Z         %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9003344Z         %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9003630Z         %249 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:35.9003799Z         %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9004037Z         %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9004316Z         %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9004562Z         %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9004757Z         %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9004951Z         %255 = arith.addi %254, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9005150Z         %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9005358Z         %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9005528Z         %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9005693Z         %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.9005879Z         %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9006070Z         %261 = arith.andi %260, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9006244Z         %262 = tt.load %256, %261, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9006507Z         %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9006804Z         %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9007063Z         %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9007310Z         %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9007628Z         %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9007975Z         %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9008273Z         %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9008526Z         %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9008778Z         %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9009025Z         %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9009266Z         %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9009494Z         %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9009753Z         %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9010086Z         %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9010572Z         %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9010930Z         %278 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:49:35.9011076Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:49:35.9011247Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9011477Z         %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9011768Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.9012045Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9012244Z         %284 = arith.addi %113, %283 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9012446Z         %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9012660Z         %286 = tt.load %285 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9012889Z         %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9013223Z         %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9013635Z         %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9013922Z         %290 = arith.extsi %278 : i32 to i64
2026-02-21T09:49:35.9014090Z         %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9014314Z         %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9014585Z         %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9014845Z         %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9015036Z         %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9015231Z         %296 = arith.addi %295, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9015443Z         %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9015649Z         %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9015820Z         %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9015981Z         %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.9016172Z         %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9016361Z         %302 = arith.andi %301, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9016530Z         %303 = tt.load %297, %302, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9016794Z         %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9017079Z         %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9017324Z         %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9017573Z         %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9017871Z         %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9018228Z         %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9018524Z         %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9018812Z         %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9019069Z         %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9019318Z         %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9019579Z         %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9019808Z         %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9020066Z         %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9020400Z         %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9020878Z         %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9021237Z         %319 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:35.9021369Z         %320 = arith.muli %319, %c2_i32 : i32
2026-02-21T09:49:35.9021543Z         %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9021773Z         %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9022050Z         %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.9022340Z         %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9022543Z         %325 = arith.addi %113, %324 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9022774Z         %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9022991Z         %327 = tt.load %326 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9023233Z         %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9023577Z         %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9023990Z         %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9024275Z         %331 = arith.extsi %319 : i32 to i64
2026-02-21T09:49:35.9024452Z         %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9024675Z         %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9024956Z         %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9025210Z         %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9025406Z         %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9025606Z         %337 = arith.addi %336, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9025804Z         %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9026019Z         %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9026190Z         %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9026360Z         %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.9026553Z         %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9026759Z         %343 = arith.andi %342, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9026936Z         %344 = tt.load %338, %343, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9027201Z         %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9027511Z         %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9027759Z         %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9028007Z         %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9028311Z         %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9028658Z         %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9028956Z         %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9029212Z         %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9029462Z         %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9029709Z         %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9029953Z         %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9030183Z         %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9030443Z         %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9030792Z         %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9031287Z         %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9031648Z         scf.yield %359 : tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9031785Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:49:35.9031937Z       %124 = arith.addi %113, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9032143Z       %125 = tt.addptr %6, %124 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9032359Z       %126 = tt.load %125 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9032588Z       %127 = ttg.local_alloc %126 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9032926Z       %128 = ttg.local_load %127 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9033342Z       %129 = arith.extf %128 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9033647Z       %130 = arith.addi %33, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9033851Z       %131 = tt.addptr %7, %130 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9034058Z       %132 = arith.andi %37, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9034228Z       %133 = tt.load %131, %132, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9034500Z       %134 = ttg.convert_layout %133 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9034811Z       %135 = arith.shli %134, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9035061Z       %136 = arith.shrsi %135, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9035309Z       %137 = arith.shrsi %134, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9035624Z       %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9035974Z       %139 = tt.expand_dims %137 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9036271Z       %140 = tt.broadcast %138 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9036524Z       %141 = arith.select %16, %140, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9036779Z       %142 = tt.broadcast %139 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9037022Z       %143 = arith.select %18, %142, %141 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9037264Z       %144 = tt.reshape %143 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9037494Z       %145 = arith.sitofp %144 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9037752Z       %146 = ttg.local_alloc %145 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9038087Z       %147 = ttg.local_load %146 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9038561Z       %148 = tt.dot %129, %147, %123, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9038974Z       %149 = arith.truncf %148 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:49:35.9039154Z       %150 = arith.extsi %107 : i32 to i64
2026-02-21T09:49:35.9039345Z       %151 = tt.splat %150 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:35.9039565Z       %152 = arith.addi %151, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:35.9039836Z       %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9040083Z       %154 = arith.muli %153, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9040269Z       %155 = tt.broadcast %154 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9040482Z       %156 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:35.9040699Z       %157 = arith.addi %156, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:35.9040966Z       %158 = tt.expand_dims %157 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:49:35.9041236Z       %159 = tt.broadcast %158 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9041419Z       %160 = arith.addi %155, %159 : tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9041619Z       %161 = tt.addptr %19, %160 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9041827Z       %162 = arith.cmpi sge, %153, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9041994Z       %163 = arith.cmpi slt, %153, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9042156Z       %164 = arith.andi %162, %163 : tensor<128x1xi1, #mma>
2026-02-21T09:49:35.9042334Z       %165 = tt.broadcast %164 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:35.9042527Z       %166 = arith.cmpi sge, %158, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:49:35.9042767Z       %167 = arith.cmpi slt, %158, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:49:35.9042933Z       %168 = arith.andi %166, %167 : tensor<1x512xi1, #mma>
2026-02-21T09:49:35.9043114Z       %169 = tt.broadcast %168 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:35.9043298Z       %170 = arith.andi %165, %169 : tensor<128x512xi1, #mma>
2026-02-21T09:49:35.9043606Z       tt.store %161, %149, %170 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:35.9043758Z       %171 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:49:35.9043889Z       %172 = arith.remsi %171, %c128_i32 : i32
2026-02-21T09:49:35.9044015Z       %173 = arith.divsi %171, %c128_i32 : i32
2026-02-21T09:49:35.9044148Z       %174 = arith.muli %172, %c128_i32 : i32
2026-02-21T09:49:35.9044329Z       %175 = tt.splat %174 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:35.9044559Z       %176 = arith.addi %175, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:35.9044743Z       %177 = arith.muli %173, %c512_i32 : i32
2026-02-21T09:49:35.9044972Z       %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:49:35.9045236Z       %179 = arith.muli %178, %cst_9 : tensor<128x1xi32, #blocked2>
2026-02-21T09:49:35.9045440Z       %180 = tt.broadcast %179 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9045620Z       %181 = arith.extsi %177 : i32 to i64
2026-02-21T09:49:35.9045794Z       %182 = tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:35.9046019Z       %183 = arith.addi %182, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:35.9046306Z       %184 = tt.expand_dims %183 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.9046610Z       %185 = tt.broadcast %184 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9046818Z       %186 = arith.cmpi sge, %184, %cst_15 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.9047003Z       %187 = arith.cmpi slt, %184, %cst_14 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.9047196Z       %188 = arith.andi %186, %187 : tensor<1x512xi1, #blocked>
2026-02-21T09:49:35.9047389Z       %189 = tt.broadcast %188 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9047660Z       %190 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:49:35.9047886Z         %238 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:49:35.9048066Z         %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9048293Z         %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9048579Z         %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.9048862Z         %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9049068Z         %243 = arith.addi %180, %242 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9049278Z         %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9049530Z         %245 = tt.load %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9049856Z         %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9050341Z         %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9050941Z         %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9051382Z         %249 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:35.9051627Z         %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9051950Z         %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9052379Z         %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9052737Z         %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9053013Z         %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9053289Z         %255 = arith.addi %254, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9053574Z         %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9053871Z         %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9054120Z         %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9054356Z         %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.9054620Z         %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9054895Z         %261 = arith.andi %260, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9055135Z         %262 = tt.load %256, %261, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9055557Z         %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9055985Z         %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9056336Z         %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9056719Z         %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9057150Z         %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9057676Z         %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9058102Z         %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9058461Z         %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9058823Z         %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9059173Z         %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9059522Z         %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9059847Z         %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9060207Z         %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9060687Z         %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9061382Z         %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9061895Z         %278 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:49:35.9062075Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:49:35.9062320Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9062667Z         %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9063066Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.9063493Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9063781Z         %284 = arith.addi %180, %283 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9064071Z         %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9064375Z         %286 = tt.load %285 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9064695Z         %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9065182Z         %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9065781Z         %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9066206Z         %290 = arith.extsi %278 : i32 to i64
2026-02-21T09:49:35.9066464Z         %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9066783Z         %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9067184Z         %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9067542Z         %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9067815Z         %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9068126Z         %296 = arith.addi %295, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9068406Z         %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9068725Z         %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9068975Z         %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9069211Z         %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.9069479Z         %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9069748Z         %302 = arith.andi %301, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9069994Z         %303 = tt.load %297, %302, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9070371Z         %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9070794Z         %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9071148Z         %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9071501Z         %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9071937Z         %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9072444Z         %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9072865Z         %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9073230Z         %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9073607Z         %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9073975Z         %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9074321Z         %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9074659Z         %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9075025Z         %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9075505Z         %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9076206Z         %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9076718Z         %319 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:35.9076900Z         %320 = arith.muli %319, %c2_i32 : i32
2026-02-21T09:49:35.9077170Z         %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9077492Z         %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9077896Z         %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.9078304Z         %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9078646Z         %325 = arith.addi %180, %324 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9078945Z         %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9079266Z         %327 = tt.load %326 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9079590Z         %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9080099Z         %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9080699Z         %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9081114Z         %331 = arith.extsi %319 : i32 to i64
2026-02-21T09:49:35.9081360Z         %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9081678Z         %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9093267Z         %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9093639Z         %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9093918Z         %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9094196Z         %337 = arith.addi %336, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9094485Z         %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9094786Z         %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9095032Z         %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9095268Z         %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.9095531Z         %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9095805Z         %343 = arith.andi %342, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9096047Z         %344 = tt.load %338, %343, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9096494Z         %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9096913Z         %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9097288Z         %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9097641Z         %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9098070Z         %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9098573Z         %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9098998Z         %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9099357Z         %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9099720Z         %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9100069Z         %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9100412Z         %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9100735Z         %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9101090Z         %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9101567Z         %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9102281Z         %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9102837Z         scf.yield %359 : tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9103033Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:49:35.9103235Z       %191 = arith.addi %180, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9103527Z       %192 = tt.addptr %6, %191 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9103825Z       %193 = tt.load %192 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9104143Z       %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9104627Z       %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9105219Z       %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9105650Z       %197 = arith.addi %33, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9105935Z       %198 = tt.addptr %7, %197 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9106216Z       %199 = arith.andi %37, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9106460Z       %200 = tt.load %198, %199, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9106829Z       %201 = ttg.convert_layout %200 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9107242Z       %202 = arith.shli %201, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9107642Z       %203 = arith.shrsi %202, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9107990Z       %204 = arith.shrsi %201, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9108421Z       %205 = tt.expand_dims %203 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9108935Z       %206 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9109358Z       %207 = tt.broadcast %205 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9109719Z       %208 = arith.select %16, %207, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9110073Z       %209 = tt.broadcast %206 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9110422Z       %210 = arith.select %18, %209, %208 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9110759Z       %211 = tt.reshape %210 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9111081Z       %212 = arith.sitofp %211 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9111448Z       %213 = ttg.local_alloc %212 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9111915Z       %214 = ttg.local_load %213 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9112622Z       %215 = tt.dot %196, %214, %190, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9113194Z       %216 = arith.truncf %215 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:49:35.9113461Z       %217 = arith.extsi %174 : i32 to i64
2026-02-21T09:49:35.9113701Z       %218 = tt.splat %217 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:35.9114006Z       %219 = arith.addi %218, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:35.9114409Z       %220 = tt.expand_dims %219 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9114753Z       %221 = arith.muli %220, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9115013Z       %222 = tt.broadcast %221 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9115315Z       %223 = tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:35.9115614Z       %224 = arith.addi %223, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:35.9116001Z       %225 = tt.expand_dims %224 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:49:35.9116381Z       %226 = tt.broadcast %225 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9116643Z       %227 = arith.addi %222, %226 : tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9116920Z       %228 = tt.addptr %19, %227 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9117209Z       %229 = arith.cmpi sge, %220, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9117449Z       %230 = arith.cmpi slt, %220, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9117672Z       %231 = arith.andi %229, %230 : tensor<128x1xi1, #mma>
2026-02-21T09:49:35.9117926Z       %232 = tt.broadcast %231 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:35.9118195Z       %233 = arith.cmpi sge, %225, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:49:35.9118429Z       %234 = arith.cmpi slt, %225, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:49:35.9118655Z       %235 = arith.andi %233, %234 : tensor<1x512xi1, #mma>
2026-02-21T09:49:35.9118919Z       %236 = tt.broadcast %235 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:35.9119180Z       %237 = arith.andi %232, %236 : tensor<128x512xi1, #mma>
2026-02-21T09:49:35.9119408Z       tt.store %228, %216, %237 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:35.9119620Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:49:35.9119811Z     scf.for %arg3 = %26 to %2 step %c1_i32  : i32 {
2026-02-21T09:49:35.9120001Z       %38 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:49:35.9120183Z       %39 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:49:35.9120355Z       %40 = arith.muli %38, %c128_i32 : i32
2026-02-21T09:49:35.9120596Z       %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:35.9120911Z       %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:35.9121160Z       %43 = arith.muli %39, %c512_i32 : i32
2026-02-21T09:49:35.9121485Z       %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:49:35.9121842Z       %45 = arith.muli %44, %cst_9 : tensor<128x1xi32, #blocked2>
2026-02-21T09:49:35.9122121Z       %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9122368Z       %47 = arith.extsi %43 : i32 to i64
2026-02-21T09:49:35.9122646Z       %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:35.9122953Z       %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:35.9123343Z       %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.9123735Z       %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9124034Z       %52 = arith.cmpi sge, %50, %cst_15 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.9124284Z       %53 = arith.cmpi slt, %50, %cst_14 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:35.9124513Z       %54 = arith.andi %52, %53 : tensor<1x512xi1, #blocked>
2026-02-21T09:49:35.9124789Z       %55 = tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9125169Z       %56 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:49:35.9125476Z         %104 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:49:35.9125726Z         %105 = tt.splat %104 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9126049Z         %106 = arith.addi %105, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9126449Z         %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.9126855Z         %108 = tt.broadcast %107 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9127132Z         %109 = arith.addi %46, %108 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9127425Z         %110 = tt.addptr %6, %109 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9127720Z         %111 = tt.load %110 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9128043Z         %112 = ttg.local_alloc %111 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9128531Z         %113 = ttg.local_load %112 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9129124Z         %114 = arith.extf %113 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9129537Z         %115 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:35.9129796Z         %116 = tt.splat %115 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9130114Z         %117 = arith.addi %116, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9130508Z         %118 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9130882Z         %119 = arith.muli %118, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9131156Z         %120 = tt.broadcast %119 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9131429Z         %121 = arith.addi %120, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9131711Z         %122 = tt.addptr %7, %121 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9132010Z         %123 = arith.cmpi sge, %118, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9132260Z         %124 = arith.cmpi slt, %118, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9132521Z         %125 = arith.andi %123, %124 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.9132782Z         %126 = tt.broadcast %125 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9133054Z         %127 = arith.andi %126, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9133297Z         %128 = tt.load %122, %127, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9133672Z         %129 = ttg.convert_layout %128 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9134089Z         %130 = arith.shli %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9134437Z         %131 = arith.shrsi %130, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9134791Z         %132 = arith.shrsi %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9135247Z         %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9135750Z         %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9136198Z         %135 = tt.broadcast %133 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9136558Z         %136 = arith.select %16, %135, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9136917Z         %137 = tt.broadcast %134 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9137267Z         %138 = arith.select %18, %137, %136 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9137604Z         %139 = tt.reshape %138 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9137928Z         %140 = arith.sitofp %139 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9138284Z         %141 = ttg.local_alloc %140 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9138758Z         %142 = ttg.local_load %141 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9139451Z         %143 = tt.dot %114, %142, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9139955Z         %144 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:49:35.9140133Z         %145 = arith.muli %144, %c2_i32 : i32
2026-02-21T09:49:35.9140374Z         %146 = tt.splat %145 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9140698Z         %147 = arith.addi %146, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9141115Z         %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.9141513Z         %149 = tt.broadcast %148 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9141812Z         %150 = arith.addi %46, %149 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9142098Z         %151 = tt.addptr %6, %150 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9142397Z         %152 = tt.load %151 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9142717Z         %153 = ttg.local_alloc %152 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9143191Z         %154 = ttg.local_load %153 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9143786Z         %155 = arith.extf %154 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9144191Z         %156 = arith.extsi %144 : i32 to i64
2026-02-21T09:49:35.9144436Z         %157 = tt.splat %156 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9144755Z         %158 = arith.addi %157, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9145145Z         %159 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9145520Z         %160 = arith.muli %159, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9145805Z         %161 = tt.broadcast %160 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9146091Z         %162 = arith.addi %161, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9146397Z         %163 = tt.addptr %7, %162 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9146691Z         %164 = arith.cmpi sge, %159, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9146958Z         %165 = arith.cmpi slt, %159, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9147189Z         %166 = arith.andi %164, %165 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.9147457Z         %167 = tt.broadcast %166 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9147722Z         %168 = arith.andi %167, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9147966Z         %169 = tt.load %163, %168, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9148340Z         %170 = ttg.convert_layout %169 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9148750Z         %171 = arith.shli %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9149105Z         %172 = arith.shrsi %171, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9149455Z         %173 = arith.shrsi %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9149888Z         %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9150392Z         %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9150810Z         %176 = tt.broadcast %174 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9151170Z         %177 = arith.select %16, %176, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9151529Z         %178 = tt.broadcast %175 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9151892Z         %179 = arith.select %18, %178, %177 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9152233Z         %180 = tt.reshape %179 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9152553Z         %181 = arith.sitofp %180 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9152929Z         %182 = ttg.local_alloc %181 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9153407Z         %183 = ttg.local_load %182 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9154089Z         %184 = tt.dot %155, %183, %143, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9154592Z         %185 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:35.9154767Z         %186 = arith.muli %185, %c2_i32 : i32
2026-02-21T09:49:35.9155014Z         %187 = tt.splat %186 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9155333Z         %188 = arith.addi %187, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:35.9155748Z         %189 = tt.expand_dims %188 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:35.9156159Z         %190 = tt.broadcast %189 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9156439Z         %191 = arith.addi %46, %190 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9156721Z         %192 = tt.addptr %6, %191 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9157017Z         %193 = tt.load %192 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9157355Z         %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9157828Z         %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9158435Z         %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9158838Z         %197 = arith.extsi %185 : i32 to i64
2026-02-21T09:49:35.9159075Z         %198 = tt.splat %197 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9159388Z         %199 = arith.addi %198, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:35.9159778Z         %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9160126Z         %201 = arith.muli %200, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9160397Z         %202 = tt.broadcast %201 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9160671Z         %203 = arith.addi %202, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9160947Z         %204 = tt.addptr %7, %203 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9161239Z         %205 = arith.cmpi sge, %200, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9161481Z         %206 = arith.cmpi slt, %200, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:35.9161709Z         %207 = arith.andi %205, %206 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:35.9161972Z         %208 = tt.broadcast %207 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9162234Z         %209 = arith.andi %208, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9162470Z         %210 = tt.load %204, %209, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9162891Z         %211 = ttg.convert_layout %210 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9163299Z         %212 = arith.shli %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9163648Z         %213 = arith.shrsi %212, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9164012Z         %214 = arith.shrsi %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9164438Z         %215 = tt.expand_dims %213 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9164946Z         %216 = tt.expand_dims %214 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9165382Z         %217 = tt.broadcast %215 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9165742Z         %218 = arith.select %16, %217, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9166100Z         %219 = tt.broadcast %216 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9166447Z         %220 = arith.select %18, %219, %218 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9166788Z         %221 = tt.reshape %220 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9167102Z         %222 = arith.sitofp %221 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9167454Z         %223 = ttg.local_alloc %222 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9167922Z         %224 = ttg.local_load %223 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9168625Z         %225 = tt.dot %196, %224, %184, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9169131Z         scf.yield %225 : tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9169336Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:49:35.9169533Z       %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9169809Z       %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:35.9170089Z       %59 = tt.load %58 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:35.9170396Z       %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:35.9170861Z       %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9171441Z       %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9171860Z       %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9172125Z       %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:35.9172399Z       %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:35.9172623Z       %66 = tt.load %64, %65, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:35.9172989Z       %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9173402Z       %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9173734Z       %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9174073Z       %70 = arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:35.9174512Z       %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9175000Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:35.9175433Z       %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9175775Z       %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9176113Z       %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9176440Z       %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:35.9176769Z       %77 = tt.reshape %76 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:35.9177077Z       %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:35.9177418Z       %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:35.9177875Z       %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:35.9178540Z       %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:35.9179088Z       %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:49:35.9179330Z       %83 = arith.extsi %40 : i32 to i64
2026-02-21T09:49:35.9179553Z       %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:35.9179862Z       %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:35.9180232Z       %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9180587Z       %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9180839Z       %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9181123Z       %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:35.9181417Z       %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:35.9181783Z       %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:49:35.9182152Z       %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9182404Z       %93 = arith.addi %88, %92 : tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9182669Z       %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:49:35.9182945Z       %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9183168Z       %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:49:35.9183384Z       %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma>
2026-02-21T09:49:35.9183617Z       %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:35.9183872Z       %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:49:35.9184102Z       %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:49:35.9184320Z       %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma>
2026-02-21T09:49:35.9184562Z       %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:35.9184813Z       %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma>
2026-02-21T09:49:35.9185080Z       tt.store %94, %82, %103 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:35.9185284Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:49:35.9185423Z     tt.return
2026-02-21T09:49:35.9185537Z   }
2026-02-21T09:49:35.9185642Z }
2026-02-21T09:49:35.9185702Z 
2026-02-21T09:49:35.9185748Z {-#
2026-02-21T09:49:35.9185859Z   external_resources: {
2026-02-21T09:49:35.9186016Z     mlir_reproducer: {
2026-02-21T09:49:35.9187448Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:49:35.9188893Z       disable_threading: false,
2026-02-21T09:49:35.9189039Z       verify_each: true
2026-02-21T09:49:35.9189167Z     }
2026-02-21T09:49:35.9189271Z   }
2026-02-21T09:49:35.9189373Z #-}
2026-02-21T09:49:35.9189769Z /tmp/torchinductor_root/lv/clvxvl5r5sksdzmg7eet3qcymrlwe633kmps7qxlh4cpuwd6dba6.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:49:35.9190780Z /tmp/torchinductor_root/lv/clvxvl5r5sksdzmg7eet3qcymrlwe633kmps7qxlh4cpuwd6dba6.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:49:35.9191578Z [306s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:49:35.9192733Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 512], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[2, 1], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:49:35.9193784Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:49:35.9194022Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:49:36.6643709Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:49:36.6653643Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 8], order = [1, 0]}>
2026-02-21T09:49:36.6654036Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 8], order = [2, 1, 0]}>
2026-02-21T09:49:36.6654421Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:49:36.6654721Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 8], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:49:36.6654996Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:49:36.6655238Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:49:36.6655421Z #smem = #ttg.shared_memory
2026-02-21T09:49:36.6655662Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:49:36.6656136Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:49:36.6657190Z     %cst = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6657374Z     %cst_0 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6657556Z     %cst_1 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6657733Z     %cst_2 = arith.constant dense<8192> : tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6658016Z     %cst_3 = arith.constant dense<0> : tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6658188Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6658359Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6658534Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6658722Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:49:36.6658895Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:49:36.6659079Z     %cst_9 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:49:36.6659236Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:49:36.6659397Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6659634Z     %cst_11 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6659894Z     %cst_12 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6660092Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:49:36.6660209Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:49:36.6660336Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T09:49:36.6660455Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:49:36.6660578Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T09:49:36.6660698Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:49:36.6660821Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:49:36.6661041Z     %cst_13 = arith.constant dense<0> : tensor<2x512xi8, #blocked>
2026-02-21T09:49:36.6661220Z     %cst_14 = arith.constant dense<8192> : tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6661407Z     %cst_15 = arith.constant dense<0> : tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6661675Z     %cst_16 = arith.constant dense<0> : tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6661833Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:49:36.6661946Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:49:36.6662135Z     %cst_17 = arith.constant dense<4> : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6662336Z     %0 = tt.get_program_id x : i32
2026-02-21T09:49:36.6662451Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:49:36.6662573Z     %2 = arith.minsi %1, %c2048_i32 : i32
2026-02-21T09:49:36.6662782Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:36.6663126Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:36.6663398Z     %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6663647Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6663847Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6664087Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6664400Z     %9 = arith.extsi %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6664721Z     %10 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:36.6665055Z     %11 = arith.extsi %10 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:36.6665436Z     %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:49:36.6665862Z     %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:49:36.6666287Z     %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:49:36.6666544Z     %15 = arith.cmpi eq, %14, %cst_8 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:49:36.6666753Z     %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1>
2026-02-21T09:49:36.6666953Z     %17 = arith.cmpi eq, %14, %cst_7 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:49:36.6667153Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x512xi1, #blocked1>
2026-02-21T09:49:36.6667376Z     %19 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:36.6667649Z     %20 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:36.6667963Z     %21 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:36.6668271Z     %22 = arith.extsi %21 : tensor<512xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:36.6668504Z     %23 = arith.subi %2, %0 : i32
2026-02-21T09:49:36.6668621Z     %24 = arith.remsi %23, %c3_i32 : i32
2026-02-21T09:49:36.6668761Z     %25 = arith.subi %23, %24 : i32
2026-02-21T09:49:36.6668877Z     %26 = arith.addi %0, %25 : i32
2026-02-21T09:49:36.6669044Z     %27 = arith.addi %5, %cst_11 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6669343Z     %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6669623Z     %29 = tt.broadcast %28 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6669872Z     %30 = arith.addi %9, %cst_12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6670148Z     %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6670388Z     %32 = arith.muli %31, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6670577Z     %33 = tt.broadcast %32 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6670777Z     %34 = arith.cmpi sge, %31, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6670948Z     %35 = arith.cmpi slt, %31, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6671115Z     %36 = arith.andi %34, %35 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6671293Z     %37 = tt.broadcast %36 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6671479Z     scf.for %arg3 = %0 to %26 step %c3_i32  : i32 {
2026-02-21T09:49:36.6671618Z       %38 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:49:36.6671750Z       %39 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:49:36.6671875Z       %40 = arith.muli %38, %c128_i32 : i32
2026-02-21T09:49:36.6672054Z       %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:36.6672280Z       %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:36.6672453Z       %43 = arith.muli %39, %c512_i32 : i32
2026-02-21T09:49:36.6672683Z       %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:49:36.6672935Z       %45 = arith.muli %44, %cst_9 : tensor<128x1xi32, #blocked2>
2026-02-21T09:49:36.6673134Z       %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6673331Z       %47 = arith.extsi %43 : i32 to i64
2026-02-21T09:49:36.6673505Z       %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:36.6673729Z       %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:36.6674019Z       %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6674301Z       %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6674504Z       %52 = arith.cmpi sge, %50, %cst_15 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6674682Z       %53 = arith.cmpi slt, %50, %cst_14 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6674854Z       %54 = arith.andi %52, %53 : tensor<1x512xi1, #blocked>
2026-02-21T09:49:36.6675037Z       %55 = tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6675308Z       %56 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:49:36.6675526Z         %238 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:49:36.6675708Z         %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6675940Z         %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6676223Z         %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6676508Z         %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6676707Z         %243 = arith.addi %46, %242 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6676931Z         %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6677145Z         %245 = tt.load %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6677369Z         %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6677730Z         %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6678144Z         %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6678430Z         %249 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:36.6678603Z         %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6678818Z         %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6679099Z         %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6679343Z         %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6679536Z         %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6679729Z         %255 = arith.addi %254, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6679927Z         %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6680140Z         %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6680317Z         %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6680483Z         %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6680670Z         %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6680861Z         %261 = arith.andi %260, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6681050Z         %262 = tt.load %256, %261, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6681315Z         %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6681627Z         %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6681868Z         %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6682114Z         %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6682415Z         %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6682837Z         %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6683139Z         %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6683391Z         %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6683638Z         %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6683887Z         %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6684124Z         %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6684349Z         %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6684608Z         %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6684988Z         %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6685469Z         %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6685851Z         %278 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:49:36.6685979Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:49:36.6686154Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6686376Z         %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6686651Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6686931Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6687128Z         %284 = arith.addi %46, %283 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6687329Z         %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6687537Z         %286 = tt.load %285 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6687760Z         %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6688094Z         %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6688505Z         %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6688787Z         %290 = arith.extsi %278 : i32 to i64
2026-02-21T09:49:36.6688976Z         %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6689196Z         %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6689468Z         %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6689738Z         %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6689931Z         %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6690120Z         %296 = arith.addi %295, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6690318Z         %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6690523Z         %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6690694Z         %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6690862Z         %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6691055Z         %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6691243Z         %302 = arith.andi %301, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6691411Z         %303 = tt.load %297, %302, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6691672Z         %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6691957Z         %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6692198Z         %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6692447Z         %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6692764Z         %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6693110Z         %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6693423Z         %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6693675Z         %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6693921Z         %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6694162Z         %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6694397Z         %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6694624Z         %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6694871Z         %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6695197Z         %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6695670Z         %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6696020Z         %319 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:36.6696147Z         %320 = arith.muli %319, %c2_i32 : i32
2026-02-21T09:49:36.6696319Z         %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6696545Z         %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6696840Z         %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6697119Z         %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6697314Z         %325 = arith.addi %46, %324 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6697529Z         %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6697742Z         %327 = tt.load %326 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6697963Z         %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6698298Z         %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6698705Z         %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6698984Z         %331 = arith.extsi %319 : i32 to i64
2026-02-21T09:49:36.6699154Z         %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6699374Z         %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6699648Z         %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6699895Z         %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6700085Z         %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6700277Z         %337 = arith.addi %336, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6700485Z         %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6700694Z         %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6700864Z         %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6701047Z         %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6701233Z         %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6701417Z         %343 = arith.andi %342, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6701585Z         %344 = tt.load %338, %343, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6701848Z         %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6702133Z         %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6702378Z         %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6702621Z         %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6702923Z         %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6703274Z         %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6703564Z         %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6703821Z         %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6704066Z         %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6704312Z         %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6704566Z         %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6704788Z         %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6705040Z         %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6705381Z         %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6705854Z         %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6706203Z         scf.yield %359 : tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6706336Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:49:36.6706484Z       %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6706678Z       %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6706882Z       %59 = tt.load %58 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6707102Z       %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6707429Z       %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6707830Z       %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6708127Z       %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6708330Z       %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6708525Z       %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6708686Z       %66 = tt.load %64, %65, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6708958Z       %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6709234Z       %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6709469Z       %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6709705Z       %70 = arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6709999Z       %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6710340Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6710752Z       %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6710995Z       %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6711236Z       %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6711470Z       %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6711700Z       %77 = tt.reshape %76 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6711917Z       %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6712159Z       %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6712498Z       %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6712956Z       %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6713360Z       %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:49:36.6713533Z       %83 = arith.extsi %40 : i32 to i64
2026-02-21T09:49:36.6713694Z       %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:36.6713904Z       %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:36.6714167Z       %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6714406Z       %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6714587Z       %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6714800Z       %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:36.6715011Z       %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:36.6715273Z       %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6715541Z       %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6715727Z       %93 = arith.addi %88, %92 : tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6715920Z       %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6716125Z       %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6716305Z       %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6716466Z       %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma>
2026-02-21T09:49:36.6716640Z       %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6716859Z       %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6717033Z       %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6717190Z       %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma>
2026-02-21T09:49:36.6717368Z       %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6717550Z       %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6717716Z       tt.store %94, %82, %103 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:36.6717865Z       %104 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:49:36.6717995Z       %105 = arith.remsi %104, %c128_i32 : i32
2026-02-21T09:49:36.6718123Z       %106 = arith.divsi %104, %c128_i32 : i32
2026-02-21T09:49:36.6718250Z       %107 = arith.muli %105, %c128_i32 : i32
2026-02-21T09:49:36.6718427Z       %108 = tt.splat %107 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:36.6718658Z       %109 = arith.addi %108, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:36.6718841Z       %110 = arith.muli %106, %c512_i32 : i32
2026-02-21T09:49:36.6719076Z       %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:49:36.6719335Z       %112 = arith.muli %111, %cst_9 : tensor<128x1xi32, #blocked2>
2026-02-21T09:49:36.6719537Z       %113 = tt.broadcast %112 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6719718Z       %114 = arith.extsi %110 : i32 to i64
2026-02-21T09:49:36.6719894Z       %115 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:36.6720141Z       %116 = arith.addi %115, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:36.6720423Z       %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6720712Z       %118 = tt.broadcast %117 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6720933Z       %119 = arith.cmpi sge, %117, %cst_15 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6721116Z       %120 = arith.cmpi slt, %117, %cst_14 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6721285Z       %121 = arith.andi %119, %120 : tensor<1x512xi1, #blocked>
2026-02-21T09:49:36.6721479Z       %122 = tt.broadcast %121 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6721757Z       %123 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:49:36.6721978Z         %238 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:49:36.6722161Z         %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6722387Z         %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6722710Z         %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6722997Z         %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6723197Z         %243 = arith.addi %113, %242 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6723405Z         %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6723617Z         %245 = tt.load %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6723848Z         %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6724210Z         %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6724804Z         %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6725097Z         %249 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:36.6725271Z         %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6725498Z         %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6725776Z         %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6726024Z         %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6726228Z         %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6726424Z         %255 = arith.addi %254, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6726629Z         %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6726844Z         %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6727017Z         %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6727185Z         %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6727371Z         %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6727565Z         %261 = arith.andi %260, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6727737Z         %262 = tt.load %256, %261, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6728005Z         %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6728326Z         %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6728575Z         %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6728828Z         %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6729164Z         %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6729517Z         %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6729815Z         %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6730069Z         %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6730325Z         %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6730575Z         %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6730812Z         %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6731043Z         %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6731297Z         %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6731630Z         %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6732130Z         %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6732484Z         %278 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:49:36.6732636Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:49:36.6732809Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6733039Z         %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6733320Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6733602Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6733806Z         %284 = arith.addi %113, %283 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6734015Z         %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6734236Z         %286 = tt.load %285 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6734469Z         %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6734804Z         %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6735217Z         %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6735500Z         %290 = arith.extsi %278 : i32 to i64
2026-02-21T09:49:36.6735676Z         %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6735903Z         %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6736198Z         %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6736451Z         %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6736645Z         %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6736859Z         %296 = arith.addi %295, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6737059Z         %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6737272Z         %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6737447Z         %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6737615Z         %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6737808Z         %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6738007Z         %302 = arith.andi %301, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6738181Z         %303 = tt.load %297, %302, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6738449Z         %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6738738Z         %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6738989Z         %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6739239Z         %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6739545Z         %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6739913Z         %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6740208Z         %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6740468Z         %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6740739Z         %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6740984Z         %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6741225Z         %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6741451Z         %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6741708Z         %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6742040Z         %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6742515Z         %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6742874Z         %319 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:36.6743002Z         %320 = arith.muli %319, %c2_i32 : i32
2026-02-21T09:49:36.6743180Z         %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6743408Z         %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6743688Z         %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6743975Z         %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6744199Z         %325 = arith.addi %113, %324 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6744413Z         %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6744630Z         %327 = tt.load %326 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6744880Z         %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6745222Z         %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6745637Z         %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6745926Z         %331 = arith.extsi %319 : i32 to i64
2026-02-21T09:49:36.6746103Z         %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6746325Z         %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6746605Z         %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6746854Z         %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6747053Z         %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6747251Z         %337 = arith.addi %336, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6747449Z         %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6747663Z         %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6747836Z         %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6748023Z         %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6748217Z         %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6748420Z         %343 = arith.andi %342, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6748594Z         %344 = tt.load %338, %343, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6748860Z         %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6749155Z         %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6749403Z         %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6749650Z         %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6749958Z         %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6750307Z         %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6750610Z         %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6750872Z         %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6751123Z         %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6751373Z         %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6751611Z         %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6751844Z         %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6752141Z         %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6752471Z         %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6752968Z         %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6753327Z         scf.yield %359 : tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6753462Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:49:36.6753612Z       %124 = arith.addi %113, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6753816Z       %125 = tt.addptr %6, %124 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6754029Z       %126 = tt.load %125 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6754255Z       %127 = ttg.local_alloc %126 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6754594Z       %128 = ttg.local_load %127 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6755006Z       %129 = arith.extf %128 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6755307Z       %130 = arith.addi %33, %118 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6755511Z       %131 = tt.addptr %7, %130 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6755711Z       %132 = arith.andi %37, %122 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6755886Z       %133 = tt.load %131, %132, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6756175Z       %134 = ttg.convert_layout %133 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6756462Z       %135 = arith.shli %134, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6756729Z       %136 = arith.shrsi %135, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6756977Z       %137 = arith.shrsi %134, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6757282Z       %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6757633Z       %139 = tt.expand_dims %137 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6757927Z       %140 = tt.broadcast %138 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6758183Z       %141 = arith.select %16, %140, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6758438Z       %142 = tt.broadcast %139 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6758684Z       %143 = arith.select %18, %142, %141 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6758928Z       %144 = tt.reshape %143 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6759154Z       %145 = arith.sitofp %144 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6759413Z       %146 = ttg.local_alloc %145 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6759742Z       %147 = ttg.local_load %146 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6760241Z       %148 = tt.dot %129, %147, %123, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6760637Z       %149 = arith.truncf %148 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:49:36.6760814Z       %150 = arith.extsi %107 : i32 to i64
2026-02-21T09:49:36.6760999Z       %151 = tt.splat %150 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:36.6761217Z       %152 = arith.addi %151, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:36.6761485Z       %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6761732Z       %154 = arith.muli %153, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6761915Z       %155 = tt.broadcast %154 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6762135Z       %156 = tt.splat %114 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:36.6762353Z       %157 = arith.addi %156, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:36.6762654Z       %158 = tt.expand_dims %157 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6762924Z       %159 = tt.broadcast %158 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6763113Z       %160 = arith.addi %155, %159 : tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6763320Z       %161 = tt.addptr %19, %160 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6763535Z       %162 = arith.cmpi sge, %153, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6763705Z       %163 = arith.cmpi slt, %153, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6763868Z       %164 = arith.andi %162, %163 : tensor<128x1xi1, #mma>
2026-02-21T09:49:36.6764073Z       %165 = tt.broadcast %164 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6764269Z       %166 = arith.cmpi sge, %158, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6764434Z       %167 = arith.cmpi slt, %158, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6764618Z       %168 = arith.andi %166, %167 : tensor<1x512xi1, #mma>
2026-02-21T09:49:36.6764807Z       %169 = tt.broadcast %168 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6764996Z       %170 = arith.andi %165, %169 : tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6765169Z       tt.store %161, %149, %170 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:36.6765325Z       %171 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:49:36.6765459Z       %172 = arith.remsi %171, %c128_i32 : i32
2026-02-21T09:49:36.6765586Z       %173 = arith.divsi %171, %c128_i32 : i32
2026-02-21T09:49:36.6765718Z       %174 = arith.muli %172, %c128_i32 : i32
2026-02-21T09:49:36.6777311Z       %175 = tt.splat %174 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:36.6777566Z       %176 = arith.addi %175, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:36.6777751Z       %177 = arith.muli %173, %c512_i32 : i32
2026-02-21T09:49:36.6777994Z       %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:49:36.6778262Z       %179 = arith.muli %178, %cst_9 : tensor<128x1xi32, #blocked2>
2026-02-21T09:49:36.6778470Z       %180 = tt.broadcast %179 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6778656Z       %181 = arith.extsi %177 : i32 to i64
2026-02-21T09:49:36.6778830Z       %182 = tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:36.6779060Z       %183 = arith.addi %182, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:36.6779417Z       %184 = tt.expand_dims %183 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6779704Z       %185 = tt.broadcast %184 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6779918Z       %186 = arith.cmpi sge, %184, %cst_15 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6780121Z       %187 = arith.cmpi slt, %184, %cst_14 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6780295Z       %188 = arith.andi %186, %187 : tensor<1x512xi1, #blocked>
2026-02-21T09:49:36.6780485Z       %189 = tt.broadcast %188 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6780760Z       %190 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:49:36.6780985Z         %238 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:49:36.6781161Z         %239 = tt.splat %238 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6781394Z         %240 = arith.addi %239, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6781673Z         %241 = tt.expand_dims %240 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6781958Z         %242 = tt.broadcast %241 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6782163Z         %243 = arith.addi %180, %242 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6782371Z         %244 = tt.addptr %6, %243 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6782588Z         %245 = tt.load %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6782817Z         %246 = ttg.local_alloc %245 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6783163Z         %247 = ttg.local_load %246 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6783610Z         %248 = arith.extf %247 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6783917Z         %249 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:36.6784095Z         %250 = tt.splat %249 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6784318Z         %251 = arith.addi %250, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6784598Z         %252 = tt.expand_dims %251 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6784853Z         %253 = arith.muli %252, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6785048Z         %254 = tt.broadcast %253 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6785250Z         %255 = arith.addi %254, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6785451Z         %256 = tt.addptr %7, %255 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6785664Z         %257 = arith.cmpi sge, %252, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6785841Z         %258 = arith.cmpi slt, %252, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6786009Z         %259 = arith.andi %257, %258 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6786201Z         %260 = tt.broadcast %259 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6786393Z         %261 = arith.andi %260, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6786572Z         %262 = tt.load %256, %261, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6786836Z         %263 = ttg.convert_layout %262 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6787132Z         %264 = arith.shli %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6787403Z         %265 = arith.shrsi %264, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6787652Z         %266 = arith.shrsi %263, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6787960Z         %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6788329Z         %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6788626Z         %269 = tt.broadcast %267 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6788886Z         %270 = arith.select %16, %269, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6789139Z         %271 = tt.broadcast %268 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6789393Z         %272 = arith.select %18, %271, %270 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6789635Z         %273 = tt.reshape %272 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6789861Z         %274 = arith.sitofp %273 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6790119Z         %275 = ttg.local_alloc %274 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6790448Z         %276 = ttg.local_load %275 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6790939Z         %277 = tt.dot %248, %276, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6791325Z         %278 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:49:36.6791454Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:49:36.6791631Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6791875Z         %281 = arith.addi %280, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6792158Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6792441Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6792641Z         %284 = arith.addi %180, %283 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6792850Z         %285 = tt.addptr %6, %284 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6793058Z         %286 = tt.load %285 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6793292Z         %287 = ttg.local_alloc %286 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6793630Z         %288 = ttg.local_load %287 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6794040Z         %289 = arith.extf %288 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6794331Z         %290 = arith.extsi %278 : i32 to i64
2026-02-21T09:49:36.6794503Z         %291 = tt.splat %290 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6794730Z         %292 = arith.addi %291, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6795008Z         %293 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6795256Z         %294 = arith.muli %293, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6795480Z         %295 = tt.broadcast %294 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6795675Z         %296 = arith.addi %295, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6795879Z         %297 = tt.addptr %7, %296 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6796106Z         %298 = arith.cmpi sge, %293, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6796279Z         %299 = arith.cmpi slt, %293, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6796449Z         %300 = arith.andi %298, %299 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6796638Z         %301 = tt.broadcast %300 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6796835Z         %302 = arith.andi %301, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6797008Z         %303 = tt.load %297, %302, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6797279Z         %304 = ttg.convert_layout %303 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6797570Z         %305 = arith.shli %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6797814Z         %306 = arith.shrsi %305, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6798066Z         %307 = arith.shrsi %304, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6798371Z         %308 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6798720Z         %309 = tt.expand_dims %307 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6799019Z         %310 = tt.broadcast %308 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6799295Z         %311 = arith.select %16, %310, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6799548Z         %312 = tt.broadcast %309 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6799815Z         %313 = arith.select %18, %312, %311 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6800051Z         %314 = tt.reshape %313 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6800279Z         %315 = arith.sitofp %314 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6800533Z         %316 = ttg.local_alloc %315 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6800866Z         %317 = ttg.local_load %316 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6801345Z         %318 = tt.dot %289, %317, %277, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6801697Z         %319 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:36.6801830Z         %320 = arith.muli %319, %c2_i32 : i32
2026-02-21T09:49:36.6802008Z         %321 = tt.splat %320 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6802240Z         %322 = arith.addi %321, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6802525Z         %323 = tt.expand_dims %322 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6802870Z         %324 = tt.broadcast %323 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6803075Z         %325 = arith.addi %180, %324 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6803305Z         %326 = tt.addptr %6, %325 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6803520Z         %327 = tt.load %326 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6803755Z         %328 = ttg.local_alloc %327 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6804107Z         %329 = ttg.local_load %328 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6804519Z         %330 = arith.extf %329 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6804805Z         %331 = arith.extsi %319 : i32 to i64
2026-02-21T09:49:36.6804979Z         %332 = tt.splat %331 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6805205Z         %333 = arith.addi %332, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6805485Z         %334 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6805736Z         %335 = arith.muli %334, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6805929Z         %336 = tt.broadcast %335 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6806127Z         %337 = arith.addi %336, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6806328Z         %338 = tt.addptr %7, %337 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6806535Z         %339 = arith.cmpi sge, %334, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6806713Z         %340 = arith.cmpi slt, %334, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6806877Z         %341 = arith.andi %339, %340 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6807090Z         %342 = tt.broadcast %341 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6807287Z         %343 = arith.andi %342, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6807459Z         %344 = tt.load %338, %343, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6807745Z         %345 = ttg.convert_layout %344 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6808034Z         %346 = arith.shli %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6808283Z         %347 = arith.shrsi %346, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6808533Z         %348 = arith.shrsi %345, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6808831Z         %349 = tt.expand_dims %347 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6809186Z         %350 = tt.expand_dims %348 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6809478Z         %351 = tt.broadcast %349 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6809735Z         %352 = arith.select %16, %351, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6809988Z         %353 = tt.broadcast %350 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6810234Z         %354 = arith.select %18, %353, %352 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6810478Z         %355 = tt.reshape %354 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6810702Z         %356 = arith.sitofp %355 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6810957Z         %357 = ttg.local_alloc %356 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6811309Z         %358 = ttg.local_load %357 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6811784Z         %359 = tt.dot %330, %358, %318, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6812158Z         scf.yield %359 : tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6812294Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:49:36.6812443Z       %191 = arith.addi %180, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6812649Z       %192 = tt.addptr %6, %191 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6812853Z       %193 = tt.load %192 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6813082Z       %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6813416Z       %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6813827Z       %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6814137Z       %197 = arith.addi %33, %185 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6814336Z       %198 = tt.addptr %7, %197 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6814541Z       %199 = arith.andi %37, %189 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6814710Z       %200 = tt.load %198, %199, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6814977Z       %201 = ttg.convert_layout %200 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6815286Z       %202 = arith.shli %201, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6815530Z       %203 = arith.shrsi %202, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6815793Z       %204 = arith.shrsi %201, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6816095Z       %205 = tt.expand_dims %203 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6816445Z       %206 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6816742Z       %207 = tt.broadcast %205 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6816990Z       %208 = arith.select %16, %207, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6817242Z       %209 = tt.broadcast %206 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6817486Z       %210 = arith.select %18, %209, %208 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6817724Z       %211 = tt.reshape %210 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6817952Z       %212 = arith.sitofp %211 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6818202Z       %213 = ttg.local_alloc %212 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6818534Z       %214 = ttg.local_load %213 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6819009Z       %215 = tt.dot %196, %214, %190, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6819420Z       %216 = arith.truncf %215 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:49:36.6819602Z       %217 = arith.extsi %174 : i32 to i64
2026-02-21T09:49:36.6819769Z       %218 = tt.splat %217 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:36.6820004Z       %219 = arith.addi %218, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:36.6820282Z       %220 = tt.expand_dims %219 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6820525Z       %221 = arith.muli %220, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6820711Z       %222 = tt.broadcast %221 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6820922Z       %223 = tt.splat %181 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:36.6821139Z       %224 = arith.addi %223, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:36.6821409Z       %225 = tt.expand_dims %224 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6821674Z       %226 = tt.broadcast %225 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6821863Z       %227 = arith.addi %222, %226 : tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6822059Z       %228 = tt.addptr %19, %227 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6822267Z       %229 = arith.cmpi sge, %220, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6822434Z       %230 = arith.cmpi slt, %220, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6822599Z       %231 = arith.andi %229, %230 : tensor<128x1xi1, #mma>
2026-02-21T09:49:36.6822784Z       %232 = tt.broadcast %231 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6822972Z       %233 = arith.cmpi sge, %225, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6823159Z       %234 = arith.cmpi slt, %225, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6823318Z       %235 = arith.andi %233, %234 : tensor<1x512xi1, #mma>
2026-02-21T09:49:36.6823511Z       %236 = tt.broadcast %235 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6823695Z       %237 = arith.andi %232, %236 : tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6823858Z       tt.store %228, %216, %237 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:36.6824031Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:49:36.6824182Z     scf.for %arg3 = %26 to %2 step %c1_i32  : i32 {
2026-02-21T09:49:36.6824320Z       %38 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:49:36.6824447Z       %39 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:49:36.6824575Z       %40 = arith.muli %38, %c128_i32 : i32
2026-02-21T09:49:36.6824752Z       %41 = tt.splat %40 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:36.6824978Z       %42 = arith.addi %41, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:36.6825157Z       %43 = arith.muli %39, %c512_i32 : i32
2026-02-21T09:49:36.6825307Z       %44 = tt.expand_dims %42 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:49:36.6825372Z       %45 = arith.muli %44, %cst_9 : tensor<128x1xi32, #blocked2>
2026-02-21T09:49:36.6825471Z       %46 = tt.broadcast %45 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6825516Z       %47 = arith.extsi %43 : i32 to i64
2026-02-21T09:49:36.6825605Z       %48 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:36.6825698Z       %49 = arith.addi %48, %11 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:49:36.6825841Z       %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6825950Z       %51 = tt.broadcast %50 : tensor<1x512xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6826024Z       %52 = arith.cmpi sge, %50, %cst_15 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6826096Z       %53 = arith.cmpi slt, %50, %cst_14 : tensor<1x512xi64, #blocked>
2026-02-21T09:49:36.6826170Z       %54 = arith.andi %52, %53 : tensor<1x512xi1, #blocked>
2026-02-21T09:49:36.6826257Z       %55 = tt.broadcast %54 : tensor<1x512xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6826398Z       %56 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst_10) -> (tensor<128x512xf32, #mma>)  : i32 {
2026-02-21T09:49:36.6826444Z         %104 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:49:36.6826540Z         %105 = tt.splat %104 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6826637Z         %106 = arith.addi %105, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6826785Z         %107 = tt.expand_dims %106 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6826879Z         %108 = tt.broadcast %107 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6826946Z         %109 = arith.addi %46, %108 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6827052Z         %110 = tt.addptr %6, %109 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6827118Z         %111 = tt.load %110 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6827246Z         %112 = ttg.local_alloc %111 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6827420Z         %113 = ttg.local_load %112 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6827642Z         %114 = arith.extf %113 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6827694Z         %115 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:36.6827805Z         %116 = tt.splat %115 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6827895Z         %117 = arith.addi %116, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6828042Z         %118 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6828107Z         %119 = arith.muli %118, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6828197Z         %120 = tt.broadcast %119 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6828262Z         %121 = arith.addi %120, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6828362Z         %122 = tt.addptr %7, %121 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6828433Z         %123 = arith.cmpi sge, %118, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6828504Z         %124 = arith.cmpi slt, %118, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6828568Z         %125 = arith.andi %123, %124 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6828658Z         %126 = tt.broadcast %125 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6828722Z         %127 = arith.andi %126, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6828796Z         %128 = tt.load %122, %127, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6828944Z         %129 = ttg.convert_layout %128 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6829047Z         %130 = arith.shli %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6829154Z         %131 = arith.shrsi %130, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6829273Z         %132 = arith.shrsi %129, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6829429Z         %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6829605Z         %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6829706Z         %135 = tt.broadcast %133 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6829819Z         %136 = arith.select %16, %135, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6829921Z         %137 = tt.broadcast %134 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6830027Z         %138 = arith.select %18, %137, %136 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6830119Z         %139 = tt.reshape %138 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6830216Z         %140 = arith.sitofp %139 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6830336Z         %141 = ttg.local_alloc %140 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6830508Z         %142 = ttg.local_load %141 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6830781Z         %143 = tt.dot %114, %142, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6830830Z         %144 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:49:36.6830876Z         %145 = arith.muli %144, %c2_i32 : i32
2026-02-21T09:49:36.6830986Z         %146 = tt.splat %145 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6831080Z         %147 = arith.addi %146, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6831243Z         %148 = tt.expand_dims %147 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6831341Z         %149 = tt.broadcast %148 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6831406Z         %150 = arith.addi %46, %149 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6831510Z         %151 = tt.addptr %6, %150 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6831580Z         %152 = tt.load %151 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6831704Z         %153 = ttg.local_alloc %152 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6831881Z         %154 = ttg.local_load %153 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6832084Z         %155 = arith.extf %154 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6832130Z         %156 = arith.extsi %144 : i32 to i64
2026-02-21T09:49:36.6832223Z         %157 = tt.splat %156 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6832316Z         %158 = arith.addi %157, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6832459Z         %159 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6832523Z         %160 = arith.muli %159, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6832620Z         %161 = tt.broadcast %160 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6832701Z         %162 = arith.addi %161, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6832802Z         %163 = tt.addptr %7, %162 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6832877Z         %164 = arith.cmpi sge, %159, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6832961Z         %165 = arith.cmpi slt, %159, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6833022Z         %166 = arith.andi %164, %165 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6833115Z         %167 = tt.broadcast %166 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6833175Z         %168 = arith.andi %167, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6833248Z         %169 = tt.load %163, %168, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6833396Z         %170 = ttg.convert_layout %169 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6833501Z         %171 = arith.shli %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6833603Z         %172 = arith.shrsi %171, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6833705Z         %173 = arith.shrsi %170, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6833865Z         %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6834016Z         %175 = tt.expand_dims %173 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6834116Z         %176 = tt.broadcast %174 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6834231Z         %177 = arith.select %16, %176, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6834344Z         %178 = tt.broadcast %175 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6834450Z         %179 = arith.select %18, %178, %177 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6834560Z         %180 = tt.reshape %179 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6834653Z         %181 = arith.sitofp %180 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6834772Z         %182 = ttg.local_alloc %181 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6834946Z         %183 = ttg.local_load %182 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6835210Z         %184 = tt.dot %155, %183, %143, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6835260Z         %185 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:36.6835309Z         %186 = arith.muli %185, %c2_i32 : i32
2026-02-21T09:49:36.6835401Z         %187 = tt.splat %186 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6835495Z         %188 = arith.addi %187, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:36.6835644Z         %189 = tt.expand_dims %188 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:49:36.6835738Z         %190 = tt.broadcast %189 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6835800Z         %191 = arith.addi %46, %190 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6835908Z         %192 = tt.addptr %6, %191 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6835998Z         %193 = tt.load %192 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6836122Z         %194 = ttg.local_alloc %193 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6836298Z         %195 = ttg.local_load %194 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6836516Z         %196 = arith.extf %195 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6836562Z         %197 = arith.extsi %185 : i32 to i64
2026-02-21T09:49:36.6836657Z         %198 = tt.splat %197 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6836747Z         %199 = arith.addi %198, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:36.6836892Z         %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6836959Z         %201 = arith.muli %200, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6837053Z         %202 = tt.broadcast %201 : tensor<2x1xi64, #blocked> -> tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6837115Z         %203 = arith.addi %202, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6837220Z         %204 = tt.addptr %7, %203 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6837289Z         %205 = arith.cmpi sge, %200, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6837356Z         %206 = arith.cmpi slt, %200, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:49:36.6837418Z         %207 = arith.andi %205, %206 : tensor<2x1xi1, #blocked>
2026-02-21T09:49:36.6837507Z         %208 = tt.broadcast %207 : tensor<2x1xi1, #blocked> -> tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6837567Z         %209 = arith.andi %208, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6837656Z         %210 = tt.load %204, %209, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6837808Z         %211 = ttg.convert_layout %210 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6837923Z         %212 = arith.shli %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6838026Z         %213 = arith.shrsi %212, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6838131Z         %214 = arith.shrsi %211, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6838283Z         %215 = tt.expand_dims %213 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6838435Z         %216 = tt.expand_dims %214 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6838540Z         %217 = tt.broadcast %215 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6838651Z         %218 = arith.select %16, %217, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6838749Z         %219 = tt.broadcast %216 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6838858Z         %220 = arith.select %18, %219, %218 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6838949Z         %221 = tt.reshape %220 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6839041Z         %222 = arith.sitofp %221 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6839163Z         %223 = ttg.local_alloc %222 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6839335Z         %224 = ttg.local_load %223 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6839616Z         %225 = tt.dot %196, %224, %184, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6839693Z         scf.yield %225 : tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6839741Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:49:36.6839804Z       %57 = arith.addi %46, %29 : tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6839912Z       %58 = tt.addptr %6, %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:49:36.6839975Z       %59 = tt.load %58 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:49:36.6840093Z       %60 = ttg.local_alloc %59 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:36.6840266Z       %61 = ttg.local_load %60 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6840462Z       %62 = arith.extf %61 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6840524Z       %63 = arith.addi %33, %51 : tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6840628Z       %64 = tt.addptr %7, %63 : tensor<2x512x!tt.ptr<i8>, #blocked>, tensor<2x512xi64, #blocked>
2026-02-21T09:49:36.6840688Z       %65 = arith.andi %37, %55 : tensor<2x512xi1, #blocked>
2026-02-21T09:49:36.6840759Z       %66 = tt.load %64, %65, %cst_13 : tensor<2x512x!tt.ptr<i8>, #blocked>
2026-02-21T09:49:36.6840910Z       %67 = ttg.convert_layout %66 : tensor<2x512xi8, #blocked> -> tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6841008Z       %68 = arith.shli %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6841292Z       %69 = arith.shrsi %68, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6841396Z       %70 = arith.shrsi %67, %cst_17 : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:36.6841562Z       %71 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6841711Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x512xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x512xi8, #blocked1>
2026-02-21T09:49:36.6841815Z       %73 = tt.broadcast %71 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6841921Z       %74 = arith.select %16, %73, %cst_16 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6842014Z       %75 = tt.broadcast %72 : tensor<2x1x512xi8, #blocked1> -> tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6842119Z       %76 = arith.select %18, %75, %74 : tensor<2x2x512xi1, #blocked1>, tensor<2x2x512xi8, #blocked1>
2026-02-21T09:49:36.6842207Z       %77 = tt.reshape %76 : tensor<2x2x512xi8, #blocked1> -> tensor<4x512xi8, #blocked>
2026-02-21T09:49:36.6842296Z       %78 = arith.sitofp %77 : tensor<4x512xi8, #blocked> to tensor<4x512xf32, #blocked>
2026-02-21T09:49:36.6842415Z       %79 = ttg.local_alloc %78 : (tensor<4x512xf32, #blocked>) -> !ttg.memdesc<4x512xf32, #shared1, #smem>
2026-02-21T09:49:36.6842623Z       %80 = ttg.local_load %79 : !ttg.memdesc<4x512xf32, #shared1, #smem> -> tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:36.6842884Z       %81 = tt.dot %62, %80, %56, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x512xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x512xf32, #mma>
2026-02-21T09:49:36.6842976Z       %82 = arith.truncf %81 : tensor<128x512xf32, #mma> to tensor<128x512xbf16, #mma>
2026-02-21T09:49:36.6843021Z       %83 = arith.extsi %40 : i32 to i64
2026-02-21T09:49:36.6843131Z       %84 = tt.splat %83 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:36.6843223Z       %85 = arith.addi %84, %20 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:36.6843362Z       %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6843440Z       %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6843528Z       %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6843616Z       %89 = tt.splat %47 : i64 -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:36.6843699Z       %90 = arith.addi %89, %22 : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:36.6843834Z       %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6843920Z       %92 = tt.broadcast %91 : tensor<1x512xi64, #mma> -> tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6843977Z       %93 = arith.addi %88, %92 : tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6844075Z       %94 = tt.addptr %19, %93 : tensor<128x512x!tt.ptr<bf16>, #mma>, tensor<128x512xi64, #mma>
2026-02-21T09:49:36.6844144Z       %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6844207Z       %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:49:36.6844264Z       %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma>
2026-02-21T09:49:36.6844350Z       %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6844414Z       %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6844481Z       %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x512xi64, #mma>
2026-02-21T09:49:36.6844543Z       %101 = arith.andi %99, %100 : tensor<1x512xi1, #mma>
2026-02-21T09:49:36.6844642Z       %102 = tt.broadcast %101 : tensor<1x512xi1, #mma> -> tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6844701Z       %103 = arith.andi %98, %102 : tensor<128x512xi1, #mma>
2026-02-21T09:49:36.6844792Z       tt.store %94, %82, %103 : tensor<128x512x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:36.6844856Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:49:36.6844895Z     tt.return
2026-02-21T09:49:36.6844929Z   }
2026-02-21T09:49:36.6844973Z }
2026-02-21T09:49:36.6844977Z 
2026-02-21T09:49:36.6845009Z {-#
2026-02-21T09:49:36.6845052Z   external_resources: {
2026-02-21T09:49:36.6845096Z     mlir_reproducer: {
2026-02-21T09:49:36.6846035Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:49:36.6846081Z       disable_threading: false,
2026-02-21T09:49:36.6846124Z       verify_each: true
2026-02-21T09:49:36.6846158Z     }
2026-02-21T09:49:36.6846190Z   }
2026-02-21T09:49:36.6846226Z #-}
2026-02-21T09:49:36.6846472Z /tmp/torchinductor_root/wk/cwktnkmg5v3y2kafb7nz4m3aoac7nvgsfypwxtyj22sga5evnyda.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:49:36.6846898Z /tmp/torchinductor_root/wk/cwktnkmg5v3y2kafb7nz4m3aoac7nvgsfypwxtyj22sga5evnyda.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:49:36.6847037Z [307s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:49:36.6847678Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 512], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[2, 1], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:49:36.6847749Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:49:36.6847835Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:49:46.6744086Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:49:46.6753083Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:49:46.6753866Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:49:46.6754506Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:49:46.6755139Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:49:46.6755723Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:49:46.6756257Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:49:46.6756637Z #smem = #ttg.shared_memory
2026-02-21T09:49:46.6757115Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:49:46.6758576Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:49:46.6759503Z     %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma>
2026-02-21T09:49:46.6759884Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:46.6760244Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:46.6760619Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6760962Z     %c37631_i32 = arith.constant 37631 : i32
2026-02-21T09:49:46.6761213Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:49:46.6761453Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:49:46.6761689Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:49:46.6762108Z     %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6762733Z     %cst_4 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6763167Z     %cst_5 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6763528Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6763832Z     %cst_7 = arith.constant dense<0> : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6764123Z     %cst_8 = arith.constant dense<512> : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6764408Z     %cst_9 = arith.constant dense<0> : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6764711Z     %cst_10 = arith.constant dense<8192> : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6765014Z     %cst_11 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1>
2026-02-21T09:49:46.6765270Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:49:46.6765462Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:49:46.6765659Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:49:46.6765976Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T09:49:46.6766179Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:49:46.6766375Z     %c14592_i32 = arith.constant 14592 : i32
2026-02-21T09:49:46.6766616Z     %cst_12 = arith.constant dense<0> : tensor<2x64xi8, #blocked2>
2026-02-21T09:49:46.6766950Z     %c9728_i32 = arith.constant 9728 : i32
2026-02-21T09:49:46.6767191Z     %cst_13 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6767441Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:49:46.6767636Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:49:46.6767834Z     %c4864_i32 = arith.constant 4864 : i32
2026-02-21T09:49:46.6768145Z     %cst_14 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6768460Z     %0 = tt.get_program_id x : i32
2026-02-21T09:49:46.6768791Z     %1 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:46.6769253Z     %2 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:46.6769705Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:46.6770154Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:46.6770591Z     %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6771001Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6771331Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6771721Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6772291Z     %9 = arith.extsi %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6772903Z     %10 = arith.extsi %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:46.6773422Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:49:46.6773945Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:49:46.6774454Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:46.6774774Z     %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:46.6775021Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:49:46.6775271Z     %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:46.6775508Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:49:46.6775767Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:46.6775970Z     %19 = arith.subi %c37631_i32, %0 : i32
2026-02-21T09:49:46.6776117Z     %20 = arith.divui %19, %c4864_i32 : i32
2026-02-21T09:49:46.6776263Z     %21 = arith.remsi %20, %c3_i32 : i32
2026-02-21T09:49:46.6776404Z     %22 = arith.subi %20, %21 : i32
2026-02-21T09:49:46.6776547Z     %23 = arith.muli %22, %c4864_i32 : i32
2026-02-21T09:49:46.6776689Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:49:46.6776849Z     scf.for %arg3 = %0 to %24 step %c14592_i32  : i32 {
2026-02-21T09:49:46.6777027Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:49:46.6777176Z       %26 = arith.muli %25, %c2_i32 : i32
2026-02-21T09:49:46.6777348Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:49:46.6777491Z       %28 = arith.minsi %27, %c2_i32 : i32
2026-02-21T09:49:46.6777641Z       %29 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:49:46.6777787Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:49:46.6777927Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:49:46.6778085Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:49:46.6778232Z       %33 = arith.muli %31, %c64_i32 : i32
2026-02-21T09:49:46.6778433Z       %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:46.6778685Z       %35 = arith.addi %34, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:46.6778888Z       %36 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:49:46.6779092Z       %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:46.6779359Z       %38 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:46.6779618Z       %39 = arith.addi %37, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:46.6779890Z       %40 = arith.addi %38, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:46.6780229Z       %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:49:46.6780541Z       %42 = arith.muli %41, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:49:46.6780782Z       %43 = tt.broadcast %42 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6780995Z       %44 = arith.extsi %33 : i32 to i64
2026-02-21T09:49:46.6781207Z       %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:46.6781478Z       %46 = arith.addi %45, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:46.6781838Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6782182Z       %48 = tt.broadcast %47 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6782444Z       %49 = arith.cmpi sge, %47, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6782657Z       %50 = arith.cmpi slt, %47, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6782869Z       %51 = arith.andi %49, %50 : tensor<1x64xi1, #blocked2>
2026-02-21T09:49:46.6783091Z       %52 = tt.broadcast %51 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6783357Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:46.6783688Z       %54 = tt.expand_dims %5 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:46.6784030Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6784273Z       %56 = arith.addi %43, %55 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6784478Z       %57 = tt.addptr %6, %56 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6784686Z       %58 = tt.load %57 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6784968Z       %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6785422Z       ttg.local_store %58, %59 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6785697Z       %60 = arith.addi %5, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6785981Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:46.6786256Z       %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6786463Z       %63 = arith.addi %43, %62 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6786663Z       %64 = tt.addptr %6, %63 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6786864Z       %65 = tt.load %64 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6787156Z       %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6787509Z       ttg.local_store %65, %66 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6788030Z       %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:49:46.6788456Z         %306 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:46.6788588Z         %307 = arith.muli %306, %c2_i32 : i32
2026-02-21T09:49:46.6788761Z         %308 = tt.splat %307 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6788994Z         %309 = arith.addi %308, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6789274Z         %310 = tt.expand_dims %309 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:46.6789561Z         %311 = tt.broadcast %310 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6789758Z         %312 = arith.addi %43, %311 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6789958Z         %313 = tt.addptr %6, %312 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6790166Z         %314 = tt.load %313 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6790480Z         %315 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6790936Z         %316 = arith.extf %315 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6791223Z         %317 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:46.6791397Z         %318 = tt.splat %317 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6791624Z         %319 = arith.addi %318, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6791902Z         %320 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6792153Z         %321 = arith.muli %320, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6792352Z         %322 = tt.broadcast %321 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6792547Z         %323 = arith.addi %322, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6792745Z         %324 = tt.addptr %7, %323 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6792955Z         %325 = arith.cmpi sge, %320, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6793130Z         %326 = arith.cmpi slt, %320, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6793293Z         %327 = arith.andi %325, %326 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:46.6793477Z         %328 = tt.broadcast %327 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6793668Z         %329 = arith.andi %328, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6793833Z         %330 = tt.load %324, %329, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6794092Z         %331 = ttg.convert_layout %330 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6794411Z         %332 = arith.shli %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6794649Z         %333 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6794885Z         %334 = arith.shrsi %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6795187Z         %335 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6795521Z         %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6795803Z         %337 = tt.broadcast %335 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6796040Z         %338 = arith.select %15, %337, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6796277Z         %339 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6796505Z         %340 = arith.select %17, %339, %338 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6796735Z         %341 = tt.reshape %340 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6796954Z         %342 = arith.sitofp %341 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6797246Z         %343 = ttg.convert_layout %342 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6797713Z         %344 = tt.dot %316, %343, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6798060Z         %345 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:49:46.6798204Z         %346 = arith.cmpi slt, %345, %c2_i32 : i32
2026-02-21T09:49:46.6798341Z         %347 = arith.select %346, %345, %c0_i32 : i32
2026-02-21T09:49:46.6798601Z         %348 = ttg.memdesc_index %53[%347] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6798971Z         ttg.local_store %314, %348 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6799357Z         scf.yield %344, %347, %arg8, %348 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6799651Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:49:46.6799924Z       %68 = ttg.local_load %67#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6800340Z       %69 = arith.extf %68 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6800664Z       %70 = arith.addi %9, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6800939Z       %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6801178Z       %72 = arith.muli %71, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6801366Z       %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6801549Z       %74 = arith.addi %73, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6801738Z       %75 = tt.addptr %7, %74 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6801937Z       %76 = arith.cmpi sge, %71, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6802102Z       %77 = arith.cmpi slt, %71, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6802280Z       %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:46.6802455Z       %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6802681Z       %80 = arith.andi %79, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6802859Z       %81 = tt.load %75, %80, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6803108Z       %82 = ttg.convert_layout %81 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6803385Z       %83 = arith.shli %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6803612Z       %84 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6803842Z       %85 = arith.shrsi %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6804122Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6804445Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6804717Z       %88 = tt.broadcast %86 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6804944Z       %89 = arith.select %15, %88, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6805172Z       %90 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6805392Z       %91 = arith.select %17, %90, %89 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6805610Z       %92 = tt.reshape %91 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6805821Z       %93 = arith.sitofp %92 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6806121Z       %94 = ttg.convert_layout %93 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6806572Z       %95 = tt.dot %69, %94, %67#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6807069Z       %96 = ttg.local_load %67#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6807479Z       %97 = arith.extf %96 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6807800Z       %98 = arith.addi %9, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6808071Z       %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6808315Z       %100 = arith.muli %99, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6808508Z       %101 = tt.broadcast %100 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6808696Z       %102 = arith.addi %101, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6808893Z       %103 = tt.addptr %7, %102 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6809095Z       %104 = arith.cmpi sge, %99, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6809270Z       %105 = arith.cmpi slt, %99, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6809435Z       %106 = arith.andi %104, %105 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:46.6809615Z       %107 = tt.broadcast %106 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6809805Z       %108 = arith.andi %107, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6809990Z       %109 = tt.load %103, %108, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6810244Z       %110 = ttg.convert_layout %109 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6810523Z       %111 = arith.shli %110, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6810772Z       %112 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6811009Z       %113 = arith.shrsi %110, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6811290Z       %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6811620Z       %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6811899Z       %116 = tt.broadcast %114 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6812134Z       %117 = arith.select %15, %116, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6812371Z       %118 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6812610Z       %119 = arith.select %17, %118, %117 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6812846Z       %120 = tt.reshape %119 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6813071Z       %121 = arith.sitofp %120 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6813365Z       %122 = ttg.convert_layout %121 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6813844Z       %123 = tt.dot %97, %122, %95, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6814224Z       ttg.local_dealloc %53 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:46.6814456Z       %124 = arith.truncf %123 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:49:46.6814725Z       %125 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:49:46.6814965Z       %126 = arith.muli %125, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:49:46.6815198Z       %127 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:49:46.6815457Z       %128 = tt.broadcast %126 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6815664Z       %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6815852Z       %130 = arith.addi %128, %129 : tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6816050Z       %131 = tt.addptr %18, %130 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6816253Z       tt.store %131, %124 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:46.6816400Z       %132 = arith.addi %arg3, %c4864_i32 : i32
2026-02-21T09:49:46.6816532Z       %133 = arith.divsi %132, %c512_i32 : i32
2026-02-21T09:49:46.6816657Z       %134 = arith.muli %133, %c2_i32 : i32
2026-02-21T09:49:46.6816783Z       %135 = arith.subi %c128_i32, %134 : i32
2026-02-21T09:49:46.6816908Z       %136 = arith.minsi %135, %c2_i32 : i32
2026-02-21T09:49:46.6817030Z       %137 = arith.remsi %132, %c512_i32 : i32
2026-02-21T09:49:46.6817154Z       %138 = arith.remsi %137, %136 : i32
2026-02-21T09:49:46.6817271Z       %139 = arith.addi %134, %138 : i32
2026-02-21T09:49:46.6817391Z       %140 = arith.divsi %137, %136 : i32
2026-02-21T09:49:46.6817508Z       %141 = arith.muli %139, %c64_i32 : i32
2026-02-21T09:49:46.6817682Z       %142 = tt.splat %141 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:46.6817916Z       %143 = arith.addi %142, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:46.6818087Z       %144 = arith.muli %140, %c64_i32 : i32
2026-02-21T09:49:46.6818259Z       %145 = tt.splat %144 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:46.6818491Z       %146 = tt.splat %144 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:46.6818710Z       %147 = arith.addi %145, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:46.6818924Z       %148 = arith.addi %146, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:46.6819200Z       %149 = tt.expand_dims %147 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:49:46.6819461Z       %150 = arith.muli %149, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:49:46.6819660Z       %151 = tt.broadcast %150 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6819841Z       %152 = arith.extsi %141 : i32 to i64
2026-02-21T09:49:46.6820011Z       %153 = tt.splat %152 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:46.6820241Z       %154 = arith.addi %153, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:46.6820525Z       %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6820809Z       %156 = tt.broadcast %155 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6821017Z       %157 = arith.cmpi sge, %155, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6821193Z       %158 = arith.cmpi slt, %155, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6821367Z       %159 = arith.andi %157, %158 : tensor<1x64xi1, #blocked2>
2026-02-21T09:49:46.6821579Z       %160 = tt.broadcast %159 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6821794Z       %161 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:46.6822007Z       %162 = arith.addi %151, %55 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6822203Z       %163 = tt.addptr %6, %162 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6822412Z       %164 = tt.load %163 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6822696Z       %165 = ttg.memdesc_index %161[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6823054Z       ttg.local_store %164, %165 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6823295Z       %166 = arith.addi %151, %62 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6823494Z       %167 = tt.addptr %6, %166 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6823703Z       %168 = tt.load %167 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6823988Z       %169 = ttg.memdesc_index %161[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6824347Z       ttg.local_store %168, %169 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6824867Z       %170:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %165, %arg8 = %169) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:49:46.6825285Z         %306 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:46.6825418Z         %307 = arith.muli %306, %c2_i32 : i32
2026-02-21T09:49:46.6825616Z         %308 = tt.splat %307 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6825843Z         %309 = arith.addi %308, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6826128Z         %310 = tt.expand_dims %309 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:46.6826422Z         %311 = tt.broadcast %310 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6826628Z         %312 = arith.addi %151, %311 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6826834Z         %313 = tt.addptr %6, %312 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6827042Z         %314 = tt.load %313 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6827346Z         %315 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6827782Z         %316 = arith.extf %315 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6828073Z         %317 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:46.6828253Z         %318 = tt.splat %317 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6828478Z         %319 = arith.addi %318, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6828762Z         %320 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6829015Z         %321 = arith.muli %320, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6829217Z         %322 = tt.broadcast %321 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6829418Z         %323 = arith.addi %322, %156 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6829644Z         %324 = tt.addptr %7, %323 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6829858Z         %325 = arith.cmpi sge, %320, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6830050Z         %326 = arith.cmpi slt, %320, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6830222Z         %327 = arith.andi %325, %326 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:46.6830413Z         %328 = tt.broadcast %327 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6830605Z         %329 = arith.andi %328, %160 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6830783Z         %330 = tt.load %324, %329, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6831040Z         %331 = ttg.convert_layout %330 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6831325Z         %332 = arith.shli %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6831580Z         %333 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6831825Z         %334 = arith.shrsi %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6832124Z         %335 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6832464Z         %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6832752Z         %337 = tt.broadcast %335 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6832994Z         %338 = arith.select %15, %337, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6833235Z         %339 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6833495Z         %340 = arith.select %17, %339, %338 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6833723Z         %341 = tt.reshape %340 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6833948Z         %342 = arith.sitofp %341 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6834265Z         %343 = ttg.convert_layout %342 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6834733Z         %344 = tt.dot %316, %343, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6835083Z         %345 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:49:46.6835213Z         %346 = arith.cmpi slt, %345, %c2_i32 : i32
2026-02-21T09:49:46.6835353Z         %347 = arith.select %346, %345, %c0_i32 : i32
2026-02-21T09:49:46.6835631Z         %348 = ttg.memdesc_index %161[%347] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6835985Z         ttg.local_store %314, %348 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6836378Z         scf.yield %344, %347, %arg8, %348 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6836680Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:49:46.6836961Z       %171 = ttg.local_load %170#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6837392Z       %172 = arith.extf %171 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6837706Z       %173 = arith.addi %73, %156 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6837910Z       %174 = tt.addptr %7, %173 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6838129Z       %175 = arith.andi %79, %160 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6838300Z       %176 = tt.load %174, %175, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6838569Z       %177 = ttg.convert_layout %176 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6838859Z       %178 = arith.shli %177, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6839105Z       %179 = arith.shrsi %178, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6839352Z       %180 = arith.shrsi %177, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6839649Z       %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6839992Z       %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6840274Z       %183 = tt.broadcast %181 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6840518Z       %184 = arith.select %15, %183, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6840760Z       %185 = tt.broadcast %182 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6840991Z       %186 = arith.select %17, %185, %184 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6841224Z       %187 = tt.reshape %186 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6841449Z       %188 = arith.sitofp %187 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6841909Z       %189 = ttg.convert_layout %188 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6842375Z       %190 = tt.dot %172, %189, %170#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6842914Z       %191 = ttg.local_load %170#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6843346Z       %192 = arith.extf %191 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6843646Z       %193 = arith.addi %101, %156 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6843845Z       %194 = tt.addptr %7, %193 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6844051Z       %195 = arith.andi %107, %160 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6844221Z       %196 = tt.load %194, %195, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6844484Z       %197 = ttg.convert_layout %196 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6844767Z       %198 = arith.shli %197, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6845003Z       %199 = arith.shrsi %198, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6845246Z       %200 = arith.shrsi %197, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6845531Z       %201 = tt.expand_dims %199 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6845892Z       %202 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6846176Z       %203 = tt.broadcast %201 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6846436Z       %204 = arith.select %15, %203, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6846681Z       %205 = tt.broadcast %202 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6846911Z       %206 = arith.select %17, %205, %204 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6847142Z       %207 = tt.reshape %206 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6847366Z       %208 = arith.sitofp %207 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6847662Z       %209 = ttg.convert_layout %208 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6848134Z       %210 = tt.dot %192, %209, %190, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6848525Z       ttg.local_dealloc %161 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:46.6848744Z       %211 = arith.truncf %210 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:49:46.6849017Z       %212 = tt.expand_dims %148 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:49:46.6849257Z       %213 = arith.muli %212, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:49:46.6849493Z       %214 = tt.expand_dims %143 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:49:46.6849754Z       %215 = tt.broadcast %213 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6849963Z       %216 = tt.broadcast %214 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6850168Z       %217 = arith.addi %215, %216 : tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6850356Z       %218 = tt.addptr %18, %217 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6850556Z       tt.store %218, %211 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:46.6850719Z       %219 = arith.addi %arg3, %c9728_i32 : i32
2026-02-21T09:49:46.6850852Z       %220 = arith.divsi %219, %c512_i32 : i32
2026-02-21T09:49:46.6850981Z       %221 = arith.muli %220, %c2_i32 : i32
2026-02-21T09:49:46.6851107Z       %222 = arith.subi %c128_i32, %221 : i32
2026-02-21T09:49:46.6851233Z       %223 = arith.minsi %222, %c2_i32 : i32
2026-02-21T09:49:46.6851353Z       %224 = arith.remsi %219, %c512_i32 : i32
2026-02-21T09:49:46.6851477Z       %225 = arith.remsi %224, %223 : i32
2026-02-21T09:49:46.6851595Z       %226 = arith.addi %221, %225 : i32
2026-02-21T09:49:46.6851716Z       %227 = arith.divsi %224, %223 : i32
2026-02-21T09:49:46.6851835Z       %228 = arith.muli %226, %c64_i32 : i32
2026-02-21T09:49:46.6852004Z       %229 = tt.splat %228 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:46.6852220Z       %230 = arith.addi %229, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:46.6852389Z       %231 = arith.muli %227, %c64_i32 : i32
2026-02-21T09:49:46.6852564Z       %232 = tt.splat %231 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:46.6852780Z       %233 = tt.splat %231 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:46.6853002Z       %234 = arith.addi %232, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:46.6853216Z       %235 = arith.addi %233, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:46.6853501Z       %236 = tt.expand_dims %234 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:49:46.6853786Z       %237 = arith.muli %236, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:49:46.6853985Z       %238 = tt.broadcast %237 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6854195Z       %239 = arith.extsi %228 : i32 to i64
2026-02-21T09:49:46.6854365Z       %240 = tt.splat %239 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:46.6854592Z       %241 = arith.addi %240, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:46.6854874Z       %242 = tt.expand_dims %241 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6855153Z       %243 = tt.broadcast %242 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6855363Z       %244 = arith.cmpi sge, %242, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6855540Z       %245 = arith.cmpi slt, %242, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6855718Z       %246 = arith.andi %244, %245 : tensor<1x64xi1, #blocked2>
2026-02-21T09:49:46.6855910Z       %247 = tt.broadcast %246 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6856129Z       %248 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:46.6856323Z       %249 = arith.addi %238, %55 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6856521Z       %250 = tt.addptr %6, %249 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6856730Z       %251 = tt.load %250 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6857012Z       %252 = ttg.memdesc_index %248[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6857374Z       ttg.local_store %251, %252 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6857619Z       %253 = arith.addi %238, %62 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6857841Z       %254 = tt.addptr %6, %253 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6858050Z       %255 = tt.load %254 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6858331Z       %256 = ttg.memdesc_index %248[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6858700Z       ttg.local_store %255, %256 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6859217Z       %257:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %252, %arg8 = %256) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:49:46.6859631Z         %306 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:46.6859769Z         %307 = arith.muli %306, %c2_i32 : i32
2026-02-21T09:49:46.6859950Z         %308 = tt.splat %307 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6860182Z         %309 = arith.addi %308, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6860470Z         %310 = tt.expand_dims %309 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:46.6860750Z         %311 = tt.broadcast %310 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6860954Z         %312 = arith.addi %238, %311 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6861162Z         %313 = tt.addptr %6, %312 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6861369Z         %314 = tt.load %313 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6861685Z         %315 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6862118Z         %316 = arith.extf %315 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6862420Z         %317 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:46.6862599Z         %318 = tt.splat %317 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6862823Z         %319 = arith.addi %318, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6863103Z         %320 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6863357Z         %321 = arith.muli %320, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6863556Z         %322 = tt.broadcast %321 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6863758Z         %323 = arith.addi %322, %243 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6863959Z         %324 = tt.addptr %7, %323 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6864176Z         %325 = arith.cmpi sge, %320, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6864352Z         %326 = arith.cmpi slt, %320, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6864525Z         %327 = arith.andi %325, %326 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:46.6864723Z         %328 = tt.broadcast %327 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6864914Z         %329 = arith.andi %328, %247 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6865094Z         %330 = tt.load %324, %329, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6865359Z         %331 = ttg.convert_layout %330 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6865672Z         %332 = arith.shli %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6865913Z         %333 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6866236Z         %334 = arith.shrsi %331, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6866576Z         %335 = tt.expand_dims %333 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6875121Z         %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6875423Z         %337 = tt.broadcast %335 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6875669Z         %338 = arith.select %15, %337, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6875922Z         %339 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6876160Z         %340 = arith.select %17, %339, %338 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6876393Z         %341 = tt.reshape %340 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6876625Z         %342 = arith.sitofp %341 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6876921Z         %343 = ttg.convert_layout %342 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6877396Z         %344 = tt.dot %316, %343, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6877750Z         %345 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:49:46.6877929Z         %346 = arith.cmpi slt, %345, %c2_i32 : i32
2026-02-21T09:49:46.6878074Z         %347 = arith.select %346, %345, %c0_i32 : i32
2026-02-21T09:49:46.6878347Z         %348 = ttg.memdesc_index %248[%347] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6878728Z         ttg.local_store %314, %348 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6879127Z         scf.yield %344, %347, %arg8, %348 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6879427Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:49:46.6879709Z       %258 = ttg.local_load %257#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6880145Z       %259 = arith.extf %258 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6880445Z       %260 = arith.addi %73, %243 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6880652Z       %261 = tt.addptr %7, %260 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6880855Z       %262 = arith.andi %79, %247 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6881030Z       %263 = tt.load %261, %262, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6881292Z       %264 = ttg.convert_layout %263 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6881575Z       %265 = arith.shli %264, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6881818Z       %266 = arith.shrsi %265, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6882057Z       %267 = arith.shrsi %264, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6882369Z       %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6882761Z       %269 = tt.expand_dims %267 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6883063Z       %270 = tt.broadcast %268 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6883307Z       %271 = arith.select %15, %270, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6883547Z       %272 = tt.broadcast %269 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6883787Z       %273 = arith.select %17, %272, %271 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6884026Z       %274 = tt.reshape %273 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6884255Z       %275 = arith.sitofp %274 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6884558Z       %276 = ttg.convert_layout %275 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6885029Z       %277 = tt.dot %259, %276, %257#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6885532Z       %278 = ttg.local_load %257#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6885969Z       %279 = arith.extf %278 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6886292Z       %280 = arith.addi %101, %243 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6886503Z       %281 = tt.addptr %7, %280 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6886708Z       %282 = arith.andi %107, %247 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6886898Z       %283 = tt.load %281, %282, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6887160Z       %284 = ttg.convert_layout %283 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6887442Z       %285 = arith.shli %284, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6887683Z       %286 = arith.shrsi %285, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6887928Z       %287 = arith.shrsi %284, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6888218Z       %288 = tt.expand_dims %286 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6888558Z       %289 = tt.expand_dims %287 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6888841Z       %290 = tt.broadcast %288 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6889086Z       %291 = arith.select %15, %290, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6889326Z       %292 = tt.broadcast %289 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6889559Z       %293 = arith.select %17, %292, %291 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6889793Z       %294 = tt.reshape %293 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6890013Z       %295 = arith.sitofp %294 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6890331Z       %296 = ttg.convert_layout %295 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6890795Z       %297 = tt.dot %279, %296, %277, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6891197Z       ttg.local_dealloc %248 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:46.6891414Z       %298 = arith.truncf %297 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:49:46.6891683Z       %299 = tt.expand_dims %235 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:49:46.6891921Z       %300 = arith.muli %299, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:49:46.6892156Z       %301 = tt.expand_dims %230 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:49:46.6892417Z       %302 = tt.broadcast %300 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6892625Z       %303 = tt.broadcast %301 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6892808Z       %304 = arith.addi %302, %303 : tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6893004Z       %305 = tt.addptr %18, %304 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6893208Z       tt.store %305, %298 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:46.6893351Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:49:46.6893490Z     scf.for %arg3 = %24 to %c32768_i32 step %c4864_i32  : i32 {
2026-02-21T09:49:46.6893640Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:49:46.6893768Z       %26 = arith.muli %25, %c2_i32 : i32
2026-02-21T09:49:46.6893890Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:49:46.6894013Z       %28 = arith.minsi %27, %c2_i32 : i32
2026-02-21T09:49:46.6894138Z       %29 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:49:46.6894275Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:49:46.6894394Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:49:46.6894506Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:49:46.6894638Z       %33 = arith.muli %31, %c64_i32 : i32
2026-02-21T09:49:46.6894797Z       %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:46.6895008Z       %35 = arith.addi %34, %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:46.6895176Z       %36 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:49:46.6895343Z       %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:46.6895558Z       %38 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:46.6895766Z       %39 = arith.addi %37, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:46.6895979Z       %40 = arith.addi %38, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:46.6896248Z       %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:49:46.6896503Z       %42 = arith.muli %41, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:49:46.6896699Z       %43 = tt.broadcast %42 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6896872Z       %44 = arith.extsi %33 : i32 to i64
2026-02-21T09:49:46.6897041Z       %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:46.6897261Z       %46 = arith.addi %45, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:46.6897537Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6897817Z       %48 = tt.broadcast %47 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6898037Z       %49 = arith.cmpi sge, %47, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6898214Z       %50 = arith.cmpi slt, %47, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:46.6898381Z       %51 = arith.andi %49, %50 : tensor<1x64xi1, #blocked2>
2026-02-21T09:49:46.6898571Z       %52 = tt.broadcast %51 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6898805Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:46.6899072Z       %54 = tt.expand_dims %5 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:46.6899347Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6899535Z       %56 = arith.addi %43, %55 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6899734Z       %57 = tt.addptr %6, %56 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6899942Z       %58 = tt.load %57 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6900223Z       %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6900581Z       ttg.local_store %58, %59 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6900853Z       %60 = arith.addi %5, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6901132Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:46.6901406Z       %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6901593Z       %63 = arith.addi %43, %62 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6901790Z       %64 = tt.addptr %6, %63 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6902004Z       %65 = tt.load %64 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6902284Z       %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6902651Z       ttg.local_store %65, %66 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6903165Z       %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:49:46.6903582Z         %132 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:46.6903711Z         %133 = arith.muli %132, %c2_i32 : i32
2026-02-21T09:49:46.6903889Z         %134 = tt.splat %133 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6904120Z         %135 = arith.addi %134, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:46.6904400Z         %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:46.6904689Z         %137 = tt.broadcast %136 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6904885Z         %138 = arith.addi %43, %137 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6905096Z         %139 = tt.addptr %6, %138 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:46.6905307Z         %140 = tt.load %139 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:46.6905609Z         %141 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6906074Z         %142 = arith.extf %141 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6906355Z         %143 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:46.6906534Z         %144 = tt.splat %143 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6906775Z         %145 = arith.addi %144, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6907053Z         %146 = tt.expand_dims %145 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6907308Z         %147 = arith.muli %146, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6907502Z         %148 = tt.broadcast %147 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6907701Z         %149 = arith.addi %148, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6907906Z         %150 = tt.addptr %7, %149 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6908118Z         %151 = arith.cmpi sge, %146, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6908297Z         %152 = arith.cmpi slt, %146, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6908465Z         %153 = arith.andi %151, %152 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:46.6908659Z         %154 = tt.broadcast %153 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6908856Z         %155 = arith.andi %154, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6909029Z         %156 = tt.load %150, %155, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6909297Z         %157 = ttg.convert_layout %156 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6909583Z         %158 = arith.shli %157, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6909844Z         %159 = arith.shrsi %158, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6910086Z         %160 = arith.shrsi %157, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6910400Z         %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6910743Z         %162 = tt.expand_dims %160 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6911027Z         %163 = tt.broadcast %161 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6911275Z         %164 = arith.select %15, %163, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6911520Z         %165 = tt.broadcast %162 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6911754Z         %166 = arith.select %17, %165, %164 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6911989Z         %167 = tt.reshape %166 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6912211Z         %168 = arith.sitofp %167 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6912512Z         %169 = ttg.convert_layout %168 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6912986Z         %170 = tt.dot %142, %169, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6913335Z         %171 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:49:46.6913474Z         %172 = arith.cmpi slt, %171, %c2_i32 : i32
2026-02-21T09:49:46.6913609Z         %173 = arith.select %172, %171, %c0_i32 : i32
2026-02-21T09:49:46.6913896Z         %174 = ttg.memdesc_index %53[%173] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6914253Z         ttg.local_store %140, %174 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6914644Z         scf.yield %170, %173, %arg8, %174 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:46.6914960Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:49:46.6915237Z       %68 = ttg.local_load %67#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6915659Z       %69 = arith.extf %68 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6915991Z       %70 = arith.addi %9, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6916268Z       %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6916517Z       %72 = arith.muli %71, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6916710Z       %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6916901Z       %74 = arith.addi %73, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6917094Z       %75 = tt.addptr %7, %74 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6917294Z       %76 = arith.cmpi sge, %71, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6917467Z       %77 = arith.cmpi slt, %71, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6917626Z       %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:46.6917811Z       %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6918017Z       %80 = arith.andi %79, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6918181Z       %81 = tt.load %75, %80, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6918452Z       %82 = ttg.convert_layout %81 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6918727Z       %83 = arith.shli %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6918961Z       %84 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6919195Z       %85 = arith.shrsi %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6919475Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6919809Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6920083Z       %88 = tt.broadcast %86 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6920323Z       %89 = arith.select %15, %88, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6920558Z       %90 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6920788Z       %91 = arith.select %17, %90, %89 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6921003Z       %92 = tt.reshape %91 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6921216Z       %93 = arith.sitofp %92 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6921498Z       %94 = ttg.convert_layout %93 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6921966Z       %95 = tt.dot %69, %94, %67#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6922447Z       %96 = ttg.local_load %67#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6922913Z       %97 = arith.extf %96 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6923232Z       %98 = arith.addi %9, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:46.6923509Z       %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6923752Z       %100 = arith.muli %99, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6923945Z       %101 = tt.broadcast %100 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6924137Z       %102 = arith.addi %101, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6924331Z       %103 = tt.addptr %7, %102 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:46.6924539Z       %104 = arith.cmpi sge, %99, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6924710Z       %105 = arith.cmpi slt, %99, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:46.6924875Z       %106 = arith.andi %104, %105 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:46.6925056Z       %107 = tt.broadcast %106 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6925243Z       %108 = arith.andi %107, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:46.6925414Z       %109 = tt.load %103, %108, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:46.6925668Z       %110 = ttg.convert_layout %109 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6925969Z       %111 = arith.shli %110, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6926204Z       %112 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6926456Z       %113 = arith.shrsi %110, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:46.6926742Z       %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6927069Z       %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:46.6927347Z       %116 = tt.broadcast %114 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6927581Z       %117 = arith.select %15, %116, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6927821Z       %118 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6928051Z       %119 = arith.select %17, %118, %117 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:46.6928277Z       %120 = tt.reshape %119 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:49:46.6928498Z       %121 = arith.sitofp %120 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:49:46.6928785Z       %122 = ttg.convert_layout %121 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:46.6929237Z       %123 = tt.dot %97, %122, %95, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:46.6929615Z       ttg.local_dealloc %53 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:46.6929843Z       %124 = arith.truncf %123 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:49:46.6930104Z       %125 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:49:46.6930339Z       %126 = arith.muli %125, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:49:46.6930578Z       %127 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:49:46.6930836Z       %128 = tt.broadcast %126 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6931033Z       %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6931210Z       %130 = arith.addi %128, %129 : tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6931394Z       %131 = tt.addptr %18, %130 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:49:46.6931588Z       tt.store %131, %124 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:46.6931727Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:49:46.6931830Z     tt.return
2026-02-21T09:49:46.6931914Z   }
2026-02-21T09:49:46.6931990Z }
2026-02-21T09:49:46.6932035Z 
2026-02-21T09:49:46.6932066Z {-#
2026-02-21T09:49:46.6932145Z   external_resources: {
2026-02-21T09:49:46.6932247Z     mlir_reproducer: {
2026-02-21T09:49:46.6933236Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:49:46.6934229Z       disable_threading: false,
2026-02-21T09:49:46.6934349Z       verify_each: true
2026-02-21T09:49:46.6934440Z     }
2026-02-21T09:49:46.6934510Z   }
2026-02-21T09:49:46.6934579Z #-}
2026-02-21T09:49:46.6934873Z /tmp/torchinductor_root/jc/cjc6bibmitqfnd7irt5tdsf256e6evk3ednxdyldf7plkxktmfxs.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:49:46.6935556Z /tmp/torchinductor_root/jc/cjc6bibmitqfnd7irt5tdsf256e6evk3ednxdyldf7plkxktmfxs.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:49:46.6936112Z [317s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:49:46.6936890Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[1, 3], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:49:46.6937598Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:49:46.6937767Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:49:47.0202034Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:49:47.0211908Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:49:47.0212772Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:49:47.0213411Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:49:47.0213846Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:49:47.0214238Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:49:47.0214638Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:49:47.0214909Z #smem = #ttg.shared_memory
2026-02-21T09:49:47.0215264Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:49:47.0215983Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:49:47.0216557Z     %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma>
2026-02-21T09:49:47.0216826Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:47.0217089Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:47.0217358Z     %cst_2 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1>
2026-02-21T09:49:47.0217639Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0217893Z     %c42495_i32 = arith.constant 42495 : i32
2026-02-21T09:49:47.0218081Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:49:47.0218254Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:49:47.0218430Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:49:47.0218709Z     %cst_4 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0219099Z     %cst_5 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0219522Z     %cst_6 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0219849Z     %cst_7 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0220117Z     %cst_8 = arith.constant dense<0> : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0220406Z     %cst_9 = arith.constant dense<512> : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0220673Z     %cst_10 = arith.constant dense<0> : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0220945Z     %cst_11 = arith.constant dense<8192> : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0221215Z     %cst_12 = arith.constant dense<0> : tensor<2x64xi8, #blocked2>
2026-02-21T09:49:47.0221441Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:49:47.0221615Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:49:47.0221793Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:49:47.0221975Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T09:49:47.0222153Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:49:47.0222326Z     %c29184_i32 = arith.constant 29184 : i32
2026-02-21T09:49:47.0222514Z     %c19456_i32 = arith.constant 19456 : i32
2026-02-21T09:49:47.0222732Z     %cst_13 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0222958Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:49:47.0223123Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:49:47.0223270Z     %c9728_i32 = arith.constant 9728 : i32
2026-02-21T09:49:47.0223515Z     %cst_14 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0223751Z     %0 = tt.get_program_id x : i32
2026-02-21T09:49:47.0223996Z     %1 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:47.0224340Z     %2 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:47.0224672Z     %3 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:47.0225036Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:47.0225375Z     %5 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0225681Z     %6 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0225956Z     %7 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0226240Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0226641Z     %9 = arith.extsi %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0227093Z     %10 = arith.extsi %4 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:47.0227549Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:49:47.0228073Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:49:47.0228582Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:47.0228912Z     %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:47.0229158Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:49:47.0229403Z     %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:47.0229639Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:49:47.0229913Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:47.0230121Z     %19 = arith.subi %c42495_i32, %0 : i32
2026-02-21T09:49:47.0230273Z     %20 = arith.divui %19, %c9728_i32 : i32
2026-02-21T09:49:47.0230443Z     %21 = arith.remsi %20, %c3_i32 : i32
2026-02-21T09:49:47.0230588Z     %22 = arith.subi %20, %21 : i32
2026-02-21T09:49:47.0230729Z     %23 = arith.muli %22, %c9728_i32 : i32
2026-02-21T09:49:47.0230874Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:49:47.0231034Z     scf.for %arg3 = %0 to %24 step %c29184_i32  : i32 {
2026-02-21T09:49:47.0231210Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:49:47.0231359Z       %26 = arith.muli %25, %c2_i32 : i32
2026-02-21T09:49:47.0231509Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:49:47.0231652Z       %28 = arith.minsi %27, %c2_i32 : i32
2026-02-21T09:49:47.0231803Z       %29 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:49:47.0231951Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:49:47.0232091Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:49:47.0232234Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:49:47.0232373Z       %33 = arith.muli %31, %c64_i32 : i32
2026-02-21T09:49:47.0232572Z       %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:47.0232832Z       %35 = arith.addi %34, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:47.0233037Z       %36 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:49:47.0233247Z       %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:47.0233513Z       %38 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:47.0233780Z       %39 = arith.addi %37, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:47.0234043Z       %40 = arith.addi %38, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:47.0234401Z       %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:49:47.0234779Z       %42 = arith.muli %41, %cst_2 : tensor<64x1xi32, #blocked1>
2026-02-21T09:49:47.0235019Z       %43 = tt.broadcast %42 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0235218Z       %44 = arith.extsi %33 : i32 to i64
2026-02-21T09:49:47.0235408Z       %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:47.0235634Z       %46 = arith.addi %45, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:47.0235908Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0236191Z       %48 = tt.broadcast %47 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0236393Z       %49 = arith.cmpi sge, %47, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0236568Z       %50 = arith.cmpi slt, %47, %cst_11 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0236742Z       %51 = arith.andi %49, %50 : tensor<1x64xi1, #blocked2>
2026-02-21T09:49:47.0236928Z       %52 = tt.broadcast %51 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0237146Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:47.0237420Z       %54 = tt.expand_dims %5 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:47.0237690Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0237881Z       %56 = arith.addi %43, %55 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0238077Z       %57 = tt.addptr %6, %56 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0238289Z       %58 = tt.load %57 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0238589Z       %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0238946Z       ttg.local_store %58, %59 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0239238Z       %60 = arith.addi %5, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0239524Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:47.0239804Z       %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0239993Z       %63 = arith.addi %43, %62 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0240200Z       %64 = tt.addptr %6, %63 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0240405Z       %65 = tt.load %64 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0240687Z       %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0241051Z       ttg.local_store %65, %66 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0241576Z       %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:49:47.0242003Z         %312 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:47.0242136Z         %313 = arith.muli %312, %c2_i32 : i32
2026-02-21T09:49:47.0242313Z         %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0242552Z         %315 = arith.addi %314, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0243054Z         %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:47.0243341Z         %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0243554Z         %318 = arith.addi %43, %317 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0243753Z         %319 = tt.addptr %6, %318 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0243962Z         %320 = tt.load %319 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0244259Z         %321 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0244696Z         %322 = arith.extf %321 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0244981Z         %323 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:47.0245154Z         %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0245378Z         %325 = arith.addi %324, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0245652Z         %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0245905Z         %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0246099Z         %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0246290Z         %329 = arith.addi %328, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0246491Z         %330 = tt.addptr %7, %329 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0246729Z         %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0246910Z         %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0247077Z         %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:47.0247285Z         %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0247481Z         %335 = arith.andi %334, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0247649Z         %336 = tt.load %330, %335, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0247906Z         %337 = ttg.convert_layout %336 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0248189Z         %338 = arith.shli %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0248428Z         %339 = arith.shrsi %338, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0248668Z         %340 = arith.shrsi %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0248959Z         %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0249302Z         %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0249585Z         %343 = tt.broadcast %341 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0249823Z         %344 = arith.select %15, %343, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0250069Z         %345 = tt.broadcast %342 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0250301Z         %346 = arith.select %17, %345, %344 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0250534Z         %347 = tt.reshape %346 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0250775Z         %348 = arith.sitofp %347 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0251021Z         %349 = ttg.local_alloc %348 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0251347Z         %350 = ttg.local_load %349 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0251830Z         %351 = tt.dot %322, %350, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0252180Z         %352 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:49:47.0252307Z         %353 = arith.cmpi slt, %352, %c2_i32 : i32
2026-02-21T09:49:47.0252438Z         %354 = arith.select %353, %352, %c0_i32 : i32
2026-02-21T09:49:47.0252699Z         %355 = ttg.memdesc_index %53[%354] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0253048Z         ttg.local_store %320, %355 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0253439Z         scf.yield %351, %354, %arg8, %355 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0253738Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:49:47.0254007Z       %68 = ttg.local_load %67#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0254425Z       %69 = arith.extf %68 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0254768Z       %70 = arith.addi %9, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0255040Z       %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0255297Z       %72 = arith.muli %71, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0255480Z       %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0255666Z       %74 = arith.addi %73, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0255853Z       %75 = tt.addptr %7, %74 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0256052Z       %76 = arith.cmpi sge, %71, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0256216Z       %77 = arith.cmpi slt, %71, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0256371Z       %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:47.0256549Z       %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0256731Z       %80 = arith.andi %79, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0256891Z       %81 = tt.load %75, %80, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0257142Z       %82 = ttg.convert_layout %81 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0257415Z       %83 = arith.shli %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0257641Z       %84 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0257866Z       %85 = arith.shrsi %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0258146Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0258497Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0258768Z       %88 = tt.broadcast %86 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0259001Z       %89 = arith.select %15, %88, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0259239Z       %90 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0259458Z       %91 = arith.select %17, %90, %89 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0259673Z       %92 = tt.reshape %91 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0259883Z       %93 = arith.sitofp %92 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0260122Z       %94 = ttg.local_alloc %93 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0260435Z       %95 = ttg.local_load %94 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0260894Z       %96 = tt.dot %69, %95, %67#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0261377Z       %97 = ttg.local_load %67#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0261790Z       %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0262112Z       %99 = arith.addi %9, %cst_6 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0262397Z       %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0262645Z       %101 = arith.muli %100, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0262836Z       %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0263037Z       %103 = arith.addi %102, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0263232Z       %104 = tt.addptr %7, %103 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0263439Z       %105 = arith.cmpi sge, %100, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0263608Z       %106 = arith.cmpi slt, %100, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0263771Z       %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:47.0263951Z       %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0264136Z       %109 = arith.andi %108, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0264302Z       %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0264555Z       %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0264837Z       %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0265071Z       %113 = arith.shrsi %112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0265303Z       %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0265589Z       %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0265918Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0266198Z       %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0266450Z       %118 = arith.select %15, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0266685Z       %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0266913Z       %120 = arith.select %17, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0267151Z       %121 = tt.reshape %120 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0267369Z       %122 = arith.sitofp %121 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0267611Z       %123 = ttg.local_alloc %122 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0267931Z       %124 = ttg.local_load %123 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0268393Z       %125 = tt.dot %98, %124, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0268765Z       ttg.local_dealloc %53 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:47.0268972Z       %126 = arith.truncf %125 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:49:47.0269230Z       %127 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:49:47.0269462Z       %128 = arith.muli %127, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:49:47.0269688Z       %129 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:49:47.0269936Z       %130 = tt.broadcast %128 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0270150Z       %131 = tt.broadcast %129 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0270324Z       %132 = arith.addi %130, %131 : tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0270507Z       %133 = tt.addptr %18, %132 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0270715Z       tt.store %133, %126 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:47.0270858Z       %134 = arith.addi %arg3, %c9728_i32 : i32
2026-02-21T09:49:47.0270984Z       %135 = arith.divsi %134, %c512_i32 : i32
2026-02-21T09:49:47.0271103Z       %136 = arith.muli %135, %c2_i32 : i32
2026-02-21T09:49:47.0271224Z       %137 = arith.subi %c128_i32, %136 : i32
2026-02-21T09:49:47.0271340Z       %138 = arith.minsi %137, %c2_i32 : i32
2026-02-21T09:49:47.0271457Z       %139 = arith.remsi %134, %c512_i32 : i32
2026-02-21T09:49:47.0271573Z       %140 = arith.remsi %139, %138 : i32
2026-02-21T09:49:47.0271684Z       %141 = arith.addi %136, %140 : i32
2026-02-21T09:49:47.0271799Z       %142 = arith.divsi %139, %138 : i32
2026-02-21T09:49:47.0271913Z       %143 = arith.muli %141, %c64_i32 : i32
2026-02-21T09:49:47.0272073Z       %144 = tt.splat %143 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:47.0272286Z       %145 = arith.addi %144, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:47.0272451Z       %146 = arith.muli %142, %c64_i32 : i32
2026-02-21T09:49:47.0272617Z       %147 = tt.splat %146 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:47.0272826Z       %148 = tt.splat %146 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:47.0273039Z       %149 = arith.addi %147, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:47.0273248Z       %150 = arith.addi %148, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:47.0273517Z       %151 = tt.expand_dims %149 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:49:47.0273783Z       %152 = arith.muli %151, %cst_2 : tensor<64x1xi32, #blocked1>
2026-02-21T09:49:47.0273974Z       %153 = tt.broadcast %152 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0274149Z       %154 = arith.extsi %143 : i32 to i64
2026-02-21T09:49:47.0274311Z       %155 = tt.splat %154 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:47.0274556Z       %156 = arith.addi %155, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:47.0274830Z       %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0275109Z       %158 = tt.broadcast %157 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0275310Z       %159 = arith.cmpi sge, %157, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0275488Z       %160 = arith.cmpi slt, %157, %cst_11 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0275657Z       %161 = arith.andi %159, %160 : tensor<1x64xi1, #blocked2>
2026-02-21T09:49:47.0275845Z       %162 = tt.broadcast %161 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0276060Z       %163 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:47.0276244Z       %164 = arith.addi %153, %55 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0276440Z       %165 = tt.addptr %6, %164 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0276645Z       %166 = tt.load %165 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0276925Z       %167 = ttg.memdesc_index %163[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0277280Z       ttg.local_store %166, %167 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0277544Z       %168 = arith.addi %153, %62 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0277738Z       %169 = tt.addptr %6, %168 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0277958Z       %170 = tt.load %169 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0278234Z       %171 = ttg.memdesc_index %163[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0278586Z       ttg.local_store %170, %171 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0279098Z       %172:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %167, %arg8 = %171) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:49:47.0279508Z         %312 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:47.0279635Z         %313 = arith.muli %312, %c2_i32 : i32
2026-02-21T09:49:47.0279805Z         %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0280028Z         %315 = arith.addi %314, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0280304Z         %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:47.0280580Z         %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0280772Z         %318 = arith.addi %153, %317 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0280968Z         %319 = tt.addptr %6, %318 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0281170Z         %320 = tt.load %319 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0281481Z         %321 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0281909Z         %322 = arith.extf %321 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0282188Z         %323 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:47.0282374Z         %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0282633Z         %325 = arith.addi %324, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0282909Z         %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0283155Z         %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0283346Z         %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0283538Z         %329 = arith.addi %328, %158 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0283733Z         %330 = tt.addptr %7, %329 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0283941Z         %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0284109Z         %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0284274Z         %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:47.0284459Z         %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0284647Z         %335 = arith.andi %334, %162 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0284812Z         %336 = tt.load %330, %335, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0285066Z         %337 = ttg.convert_layout %336 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0285365Z         %338 = arith.shli %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0285598Z         %339 = arith.shrsi %338, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0285851Z         %340 = arith.shrsi %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0286138Z         %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0286469Z         %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0286748Z         %343 = tt.broadcast %341 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0286984Z         %344 = arith.select %15, %343, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0287221Z         %345 = tt.broadcast %342 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0287451Z         %346 = arith.select %17, %345, %344 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0287689Z         %347 = tt.reshape %346 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0287915Z         %348 = arith.sitofp %347 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0288167Z         %349 = ttg.local_alloc %348 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0288498Z         %350 = ttg.local_load %349 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0288976Z         %351 = tt.dot %322, %350, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0289325Z         %352 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:49:47.0289485Z         %353 = arith.cmpi slt, %352, %c2_i32 : i32
2026-02-21T09:49:47.0289620Z         %354 = arith.select %353, %352, %c0_i32 : i32
2026-02-21T09:49:47.0289891Z         %355 = ttg.memdesc_index %163[%354] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0290268Z         ttg.local_store %320, %355 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0290656Z         scf.yield %351, %354, %arg8, %355 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0290958Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:49:47.0291239Z       %173 = ttg.local_load %172#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0291672Z       %174 = arith.extf %173 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0291974Z       %175 = arith.addi %73, %158 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0292177Z       %176 = tt.addptr %7, %175 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0292380Z       %177 = arith.andi %79, %162 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0292553Z       %178 = tt.load %176, %177, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0292809Z       %179 = ttg.convert_layout %178 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0293092Z       %180 = arith.shli %179, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0293344Z       %181 = arith.shrsi %180, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0293585Z       %182 = arith.shrsi %179, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0293894Z       %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0294230Z       %184 = tt.expand_dims %182 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0294514Z       %185 = tt.broadcast %183 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0294753Z       %186 = arith.select %15, %185, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0294995Z       %187 = tt.broadcast %184 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0295233Z       %188 = arith.select %17, %187, %186 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0295462Z       %189 = tt.reshape %188 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0295688Z       %190 = arith.sitofp %189 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0295939Z       %191 = ttg.local_alloc %190 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0296264Z       %192 = ttg.local_load %191 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0296735Z       %193 = tt.dot %174, %192, %172#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0297225Z       %194 = ttg.local_load %172#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0297674Z       %195 = arith.extf %194 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0297977Z       %196 = arith.addi %102, %158 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0298177Z       %197 = tt.addptr %7, %196 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0298397Z       %198 = arith.andi %108, %162 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0298569Z       %199 = tt.load %197, %198, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0298831Z       %200 = ttg.convert_layout %199 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0299117Z       %201 = arith.shli %200, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0299356Z       %202 = arith.shrsi %201, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0299600Z       %203 = arith.shrsi %200, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0299892Z       %204 = tt.expand_dims %202 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0300236Z       %205 = tt.expand_dims %203 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0300526Z       %206 = tt.broadcast %204 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0300764Z       %207 = arith.select %15, %206, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0301006Z       %208 = tt.broadcast %205 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0301238Z       %209 = arith.select %17, %208, %207 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0301491Z       %210 = tt.reshape %209 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0301719Z       %211 = arith.sitofp %210 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0301985Z       %212 = ttg.local_alloc %211 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0302315Z       %213 = ttg.local_load %212 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0302779Z       %214 = tt.dot %195, %213, %193, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0303163Z       ttg.local_dealloc %163 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:47.0303378Z       %215 = arith.truncf %214 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:49:47.0303642Z       %216 = tt.expand_dims %150 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:49:47.0303884Z       %217 = arith.muli %216, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:49:47.0304116Z       %218 = tt.expand_dims %145 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:49:47.0304378Z       %219 = tt.broadcast %217 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0304583Z       %220 = tt.broadcast %218 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0304762Z       %221 = arith.addi %219, %220 : tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0304957Z       %222 = tt.addptr %18, %221 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0305157Z       tt.store %222, %215 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:47.0305307Z       %223 = arith.addi %arg3, %c19456_i32 : i32
2026-02-21T09:49:47.0305442Z       %224 = arith.divsi %223, %c512_i32 : i32
2026-02-21T09:49:47.0305590Z       %225 = arith.muli %224, %c2_i32 : i32
2026-02-21T09:49:47.0305715Z       %226 = arith.subi %c128_i32, %225 : i32
2026-02-21T09:49:47.0305837Z       %227 = arith.minsi %226, %c2_i32 : i32
2026-02-21T09:49:47.0305961Z       %228 = arith.remsi %223, %c512_i32 : i32
2026-02-21T09:49:47.0306105Z       %229 = arith.remsi %228, %227 : i32
2026-02-21T09:49:47.0306228Z       %230 = arith.addi %225, %229 : i32
2026-02-21T09:49:47.0306343Z       %231 = arith.divsi %228, %227 : i32
2026-02-21T09:49:47.0306465Z       %232 = arith.muli %230, %c64_i32 : i32
2026-02-21T09:49:47.0306631Z       %233 = tt.splat %232 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:47.0306844Z       %234 = arith.addi %233, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:47.0307020Z       %235 = arith.muli %231, %c64_i32 : i32
2026-02-21T09:49:47.0307191Z       %236 = tt.splat %235 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:47.0307413Z       %237 = tt.splat %235 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:47.0307633Z       %238 = arith.addi %236, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:47.0307856Z       %239 = arith.addi %237, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:47.0308133Z       %240 = tt.expand_dims %238 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:49:47.0308388Z       %241 = arith.muli %240, %cst_2 : tensor<64x1xi32, #blocked1>
2026-02-21T09:49:47.0308588Z       %242 = tt.broadcast %241 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0308764Z       %243 = arith.extsi %232 : i32 to i64
2026-02-21T09:49:47.0308937Z       %244 = tt.splat %243 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:47.0309182Z       %245 = arith.addi %244, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:47.0309460Z       %246 = tt.expand_dims %245 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0309761Z       %247 = tt.broadcast %246 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0309965Z       %248 = arith.cmpi sge, %246, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0310147Z       %249 = arith.cmpi slt, %246, %cst_11 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0310319Z       %250 = arith.andi %248, %249 : tensor<1x64xi1, #blocked2>
2026-02-21T09:49:47.0310507Z       %251 = tt.broadcast %250 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0310726Z       %252 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:47.0310911Z       %253 = arith.addi %242, %55 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0311120Z       %254 = tt.addptr %6, %253 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0311330Z       %255 = tt.load %254 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0311613Z       %256 = ttg.memdesc_index %252[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0311982Z       ttg.local_store %255, %256 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0312220Z       %257 = arith.addi %242, %62 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0312422Z       %258 = tt.addptr %6, %257 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0312631Z       %259 = tt.load %258 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0312909Z       %260 = ttg.memdesc_index %252[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0313283Z       ttg.local_store %259, %260 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0313795Z       %261:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %256, %arg8 = %260) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:49:47.0314230Z         %312 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:47.0314362Z         %313 = arith.muli %312, %c2_i32 : i32
2026-02-21T09:49:47.0314536Z         %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0314766Z         %315 = arith.addi %314, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0315045Z         %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:47.0315331Z         %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0315533Z         %318 = arith.addi %242, %317 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0315733Z         %319 = tt.addptr %6, %318 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0315944Z         %320 = tt.load %319 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0316242Z         %321 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0316677Z         %322 = arith.extf %321 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0316965Z         %323 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:47.0317151Z         %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0317383Z         %325 = arith.addi %324, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0317676Z         %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0317930Z         %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0318128Z         %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0318323Z         %329 = arith.addi %328, %247 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0318526Z         %330 = tt.addptr %7, %329 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0318734Z         %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0318913Z         %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0319084Z         %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:47.0319274Z         %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0319472Z         %335 = arith.andi %334, %251 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0319641Z         %336 = tt.load %330, %335, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0319907Z         %337 = ttg.convert_layout %336 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0320194Z         %338 = arith.shli %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0320433Z         %339 = arith.shrsi %338, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0320682Z         %340 = arith.shrsi %337, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0321001Z         %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0321343Z         %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0321631Z         %343 = tt.broadcast %341 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0321896Z         %344 = arith.select %15, %343, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0322136Z         %345 = tt.broadcast %342 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0322369Z         %346 = arith.select %17, %345, %344 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0322642Z         %347 = tt.reshape %346 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0322872Z         %348 = arith.sitofp %347 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0323126Z         %349 = ttg.local_alloc %348 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0323455Z         %350 = ttg.local_load %349 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0323929Z         %351 = tt.dot %322, %350, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0324287Z         %352 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:49:47.0324422Z         %353 = arith.cmpi slt, %352, %c2_i32 : i32
2026-02-21T09:49:47.0324557Z         %354 = arith.select %353, %352, %c0_i32 : i32
2026-02-21T09:49:47.0324831Z         %355 = ttg.memdesc_index %252[%354] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0325212Z         ttg.local_store %320, %355 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0325609Z         scf.yield %351, %354, %arg8, %355 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0325930Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:49:47.0326207Z       %262 = ttg.local_load %261#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0326638Z       %263 = arith.extf %262 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0326939Z       %264 = arith.addi %73, %247 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0327139Z       %265 = tt.addptr %7, %264 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0327348Z       %266 = arith.andi %79, %251 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0327517Z       %267 = tt.load %265, %266, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0327781Z       %268 = ttg.convert_layout %267 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0328068Z       %269 = arith.shli %268, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0328304Z       %270 = arith.shrsi %269, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0328546Z       %271 = arith.shrsi %268, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0328832Z       %272 = tt.expand_dims %270 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0329175Z       %273 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0329485Z       %274 = tt.broadcast %272 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0329724Z       %275 = arith.select %15, %274, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0329981Z       %276 = tt.broadcast %273 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0330210Z       %277 = arith.select %17, %276, %275 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0330444Z       %278 = tt.reshape %277 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0330672Z       %279 = arith.sitofp %278 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0330931Z       %280 = ttg.local_alloc %279 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0331263Z       %281 = ttg.local_load %280 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0331731Z       %282 = tt.dot %263, %281, %261#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0332228Z       %283 = ttg.local_load %261#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0332656Z       %284 = arith.extf %283 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0332951Z       %285 = arith.addi %102, %247 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0333154Z       %286 = tt.addptr %7, %285 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0333374Z       %287 = arith.andi %108, %251 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0333543Z       %288 = tt.load %286, %287, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0333829Z       %289 = ttg.convert_layout %288 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0334111Z       %290 = arith.shli %289, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0334348Z       %291 = arith.shrsi %290, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0334584Z       %292 = arith.shrsi %289, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0334869Z       %293 = tt.expand_dims %291 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0335203Z       %294 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0335483Z       %295 = tt.broadcast %293 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0335721Z       %296 = arith.select %15, %295, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0335958Z       %297 = tt.broadcast %294 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0336191Z       %298 = arith.select %17, %297, %296 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0336418Z       %299 = tt.reshape %298 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0336637Z       %300 = arith.sitofp %299 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0336888Z       %301 = ttg.local_alloc %300 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0337213Z       %302 = ttg.local_load %301 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0337692Z       %303 = tt.dot %284, %302, %282, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0338075Z       ttg.local_dealloc %252 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:47.0338303Z       %304 = arith.truncf %303 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:49:47.0338568Z       %305 = tt.expand_dims %239 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:49:47.0338804Z       %306 = arith.muli %305, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:49:47.0339035Z       %307 = tt.expand_dims %234 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:49:47.0339295Z       %308 = tt.broadcast %306 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0339496Z       %309 = tt.broadcast %307 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0339674Z       %310 = arith.addi %308, %309 : tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0339864Z       %311 = tt.addptr %18, %310 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0340063Z       tt.store %311, %304 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:47.0340207Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:49:47.0340340Z     scf.for %arg3 = %24 to %c32768_i32 step %c9728_i32  : i32 {
2026-02-21T09:49:47.0340489Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:49:47.0340612Z       %26 = arith.muli %25, %c2_i32 : i32
2026-02-21T09:49:47.0340733Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:49:47.0340853Z       %28 = arith.minsi %27, %c2_i32 : i32
2026-02-21T09:49:47.0346666Z       %29 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:49:47.0346796Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:49:47.0346963Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:49:47.0347074Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:49:47.0347187Z       %33 = arith.muli %31, %c64_i32 : i32
2026-02-21T09:49:47.0347367Z       %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:47.0347580Z       %35 = arith.addi %34, %3 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:47.0347745Z       %36 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:49:47.0347911Z       %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:47.0348123Z       %38 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:47.0348333Z       %39 = arith.addi %37, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:47.0348544Z       %40 = arith.addi %38, %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:47.0348835Z       %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:49:47.0349087Z       %42 = arith.muli %41, %cst_2 : tensor<64x1xi32, #blocked1>
2026-02-21T09:49:47.0349279Z       %43 = tt.broadcast %42 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0349454Z       %44 = arith.extsi %33 : i32 to i64
2026-02-21T09:49:47.0349618Z       %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:47.0349835Z       %46 = arith.addi %45, %10 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:49:47.0350108Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0350379Z       %48 = tt.broadcast %47 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0350579Z       %49 = arith.cmpi sge, %47, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0350769Z       %50 = arith.cmpi slt, %47, %cst_11 : tensor<1x64xi64, #blocked2>
2026-02-21T09:49:47.0350933Z       %51 = arith.andi %49, %50 : tensor<1x64xi1, #blocked2>
2026-02-21T09:49:47.0351113Z       %52 = tt.broadcast %51 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0351340Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:47.0351607Z       %54 = tt.expand_dims %5 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:47.0351875Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0352063Z       %56 = arith.addi %43, %55 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0352259Z       %57 = tt.addptr %6, %56 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0352465Z       %58 = tt.load %57 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0352747Z       %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0353094Z       ttg.local_store %58, %59 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0353367Z       %60 = arith.addi %5, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0353642Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:47.0353913Z       %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0354099Z       %63 = arith.addi %43, %62 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0354290Z       %64 = tt.addptr %6, %63 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0354504Z       %65 = tt.load %64 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0354780Z       %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0355145Z       ttg.local_store %65, %66 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0355656Z       %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:49:47.0356073Z         %134 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:49:47.0356200Z         %135 = arith.muli %134, %c2_i32 : i32
2026-02-21T09:49:47.0356372Z         %136 = tt.splat %135 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0356599Z         %137 = arith.addi %136, %5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:47.0356876Z         %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:47.0357152Z         %139 = tt.broadcast %138 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0357347Z         %140 = arith.addi %43, %139 : tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0357547Z         %141 = tt.addptr %6, %140 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:49:47.0357752Z         %142 = tt.load %141 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:47.0358052Z         %143 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0358499Z         %144 = arith.extf %143 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0358782Z         %145 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:49:47.0358957Z         %146 = tt.splat %145 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0359182Z         %147 = arith.addi %146, %9 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0359472Z         %148 = tt.expand_dims %147 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0359721Z         %149 = arith.muli %148, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0359915Z         %150 = tt.broadcast %149 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0360109Z         %151 = arith.addi %150, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0360302Z         %152 = tt.addptr %7, %151 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0360516Z         %153 = arith.cmpi sge, %148, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0360689Z         %154 = arith.cmpi slt, %148, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0360856Z         %155 = arith.andi %153, %154 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:47.0361044Z         %156 = tt.broadcast %155 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0361239Z         %157 = arith.andi %156, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0361409Z         %158 = tt.load %152, %157, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0361666Z         %159 = ttg.convert_layout %158 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0361949Z         %160 = arith.shli %159, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0362187Z         %161 = arith.shrsi %160, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0362443Z         %162 = arith.shrsi %159, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0362777Z         %163 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0363135Z         %164 = tt.expand_dims %162 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0363419Z         %165 = tt.broadcast %163 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0363659Z         %166 = arith.select %15, %165, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0363898Z         %167 = tt.broadcast %164 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0364134Z         %168 = arith.select %17, %167, %166 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0364366Z         %169 = tt.reshape %168 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0364589Z         %170 = arith.sitofp %169 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0364840Z         %171 = ttg.local_alloc %170 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0365165Z         %172 = ttg.local_load %171 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0365639Z         %173 = tt.dot %144, %172, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0365982Z         %174 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:49:47.0366112Z         %175 = arith.cmpi slt, %174, %c2_i32 : i32
2026-02-21T09:49:47.0366248Z         %176 = arith.select %175, %174, %c0_i32 : i32
2026-02-21T09:49:47.0366537Z         %177 = ttg.memdesc_index %53[%176] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0366891Z         ttg.local_store %142, %177 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0367294Z         scf.yield %173, %176, %arg8, %177 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:49:47.0367596Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:49:47.0367872Z       %68 = ttg.local_load %67#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0368296Z       %69 = arith.extf %68 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0368622Z       %70 = arith.addi %9, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0368895Z       %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0369143Z       %72 = arith.muli %71, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0369335Z       %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0369521Z       %74 = arith.addi %73, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0369713Z       %75 = tt.addptr %7, %74 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0369913Z       %76 = arith.cmpi sge, %71, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0370083Z       %77 = arith.cmpi slt, %71, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0370243Z       %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:47.0370440Z       %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0370625Z       %80 = arith.andi %79, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0370949Z       %81 = tt.load %75, %80, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0371198Z       %82 = ttg.convert_layout %81 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0371477Z       %83 = arith.shli %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0371706Z       %84 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0371938Z       %85 = arith.shrsi %82, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0372215Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0372545Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0372819Z       %88 = tt.broadcast %86 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0373052Z       %89 = arith.select %15, %88, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0373283Z       %90 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0373503Z       %91 = arith.select %17, %90, %89 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0373722Z       %92 = tt.reshape %91 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0373933Z       %93 = arith.sitofp %92 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0374173Z       %94 = ttg.local_alloc %93 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0374511Z       %95 = ttg.local_load %94 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0374970Z       %96 = tt.dot %69, %95, %67#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0375465Z       %97 = ttg.local_load %67#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0375884Z       %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0376206Z       %99 = arith.addi %9, %cst_6 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:49:47.0376484Z       %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0376734Z       %101 = arith.muli %100, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0376925Z       %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0377118Z       %103 = arith.addi %102, %48 : tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0377313Z       %104 = tt.addptr %7, %103 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:49:47.0377521Z       %105 = arith.cmpi sge, %100, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0377690Z       %106 = arith.cmpi slt, %100, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:49:47.0377856Z       %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2>
2026-02-21T09:49:47.0378041Z       %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0378228Z       %109 = arith.andi %108, %52 : tensor<2x64xi1, #blocked2>
2026-02-21T09:49:47.0378411Z       %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:49:47.0378665Z       %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0378963Z       %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0379205Z       %113 = arith.shrsi %112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0379440Z       %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:47.0379729Z       %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0380060Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:47.0380340Z       %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0380579Z       %118 = arith.select %15, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0380816Z       %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0381049Z       %120 = arith.select %17, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:47.0381276Z       %121 = tt.reshape %120 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:47.0381496Z       %122 = arith.sitofp %121 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:47.0381746Z       %123 = ttg.local_alloc %122 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:47.0382069Z       %124 = ttg.local_load %123 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:47.0382550Z       %125 = tt.dot %98, %124, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:49:47.0382930Z       ttg.local_dealloc %53 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:49:47.0383154Z       %126 = arith.truncf %125 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:49:47.0383418Z       %127 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:49:47.0383650Z       %128 = arith.muli %127, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:49:47.0383875Z       %129 = tt.expand_dims %35 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:49:47.0384130Z       %130 = tt.broadcast %128 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0384333Z       %131 = tt.broadcast %129 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0384514Z       %132 = arith.addi %130, %131 : tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0384698Z       %133 = tt.addptr %18, %132 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:49:47.0384891Z       tt.store %133, %126 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:47.0385030Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:49:47.0385133Z     tt.return
2026-02-21T09:49:47.0385214Z   }
2026-02-21T09:49:47.0385285Z }
2026-02-21T09:49:47.0385328Z 
2026-02-21T09:49:47.0385363Z {-#
2026-02-21T09:49:47.0385443Z   external_resources: {
2026-02-21T09:49:47.0385544Z     mlir_reproducer: {
2026-02-21T09:49:47.0386562Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:49:47.0387566Z       disable_threading: false,
2026-02-21T09:49:47.0387671Z       verify_each: true
2026-02-21T09:49:47.0387758Z     }
2026-02-21T09:49:47.0387833Z   }
2026-02-21T09:49:47.0387900Z #-}
2026-02-21T09:49:47.0388178Z /tmp/torchinductor_root/jq/cjqqzavugghhhp6334vkuw67nomwtxn6gnd35x5ktrwiogdnqiuk.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:49:47.0388863Z /tmp/torchinductor_root/jq/cjqqzavugghhhp6334vkuw67nomwtxn6gnd35x5ktrwiogdnqiuk.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:49:47.0389408Z [317s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:49:47.0390190Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:49:47.0390901Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:49:47.0391068Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:49:49.4794708Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:49:49.4797009Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [8, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:49:49.4797894Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:49:49.4798815Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:49:49.4799646Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:49:49.4800343Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:49:49.4800989Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:49:49.4801400Z #smem = #ttg.shared_memory
2026-02-21T09:49:49.4801888Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:49:49.4803021Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:49:49.4803840Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma>
2026-02-21T09:49:49.4804178Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:49:49.4804426Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:49:49.4804672Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:49:49.4804978Z     %cst_0 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:49.4805290Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:49:49.4805526Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:49:49.4805772Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:49:49.4806226Z     %cst_1 = arith.constant dense<0> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4806775Z     %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4807377Z     %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4807906Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4808436Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4808967Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4809429Z     %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:49:49.4809876Z     %cst_8 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4810316Z     %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:49.4810697Z     %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:49.4811058Z     %cst_11 = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:49:49.4811360Z     %0 = tt.get_program_id x : i32
2026-02-21T09:49:49.4811541Z     %1 = arith.divsi %0, %c256_i32 : i32
2026-02-21T09:49:49.4811712Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:49:49.4811894Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:49:49.4812060Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:49:49.4812233Z     %5 = arith.remsi %0, %c256_i32 : i32
2026-02-21T09:49:49.4812405Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:49:49.4812571Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:49:49.4812724Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:49:49.4812891Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:49:49.4813199Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:49.4813670Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:49.4814047Z     %12 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:49.4814362Z     %13 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:49.4814712Z     %14 = arith.addi %12, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:49.4815032Z     %15 = arith.addi %13, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:49.4815274Z     %16 = arith.muli %8, %c64_i32 : i32
2026-02-21T09:49:49.4815628Z     %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:49.4816089Z     %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:49.4816437Z     %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:49.4816736Z     %20 = arith.addi %19, %18 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:49.4817087Z     %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:49.4817553Z     %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:49:49.4817934Z     %23 = arith.muli %22, %cst_7 : tensor<128x1xi32, #blocked1>
2026-02-21T09:49:49.4818222Z     %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:49:49.4818548Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:49.4818795Z     %26 = arith.extsi %16 : i32 to i64
2026-02-21T09:49:49.4819081Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4819571Z     %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:49.4820238Z     %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:49.4820831Z     %30 = tt.splat %26 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:49.4821428Z     %31 = arith.extsi %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:49.4822021Z     %32 = arith.addi %30, %31 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:49.4822521Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4823006Z     %34 = tt.broadcast %33 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4823366Z     %35 = arith.cmpi sge, %33, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4823637Z     %36 = arith.cmpi slt, %33, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4823899Z     %37 = arith.andi %35, %36 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4824237Z     %38 = tt.broadcast %37 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4824644Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:49:49.4825144Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:49:49.4825605Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:49.4825908Z     %42 = arith.cmpi eq, %41, %cst_9 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:49.4826135Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:49:49.4826362Z     %44 = arith.cmpi eq, %41, %cst_10 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:49.4826581Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:49:49.4826875Z     %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst) -> (tensor<128x64xf32, #mma>)  : i32 {
2026-02-21T09:49:49.4827124Z       %56 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:49:49.4827324Z       %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:49.4827574Z       %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:49.4827890Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:49.4828202Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:49:49.4828425Z       %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked1>
2026-02-21T09:49:49.4828654Z       %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:49:49.4828884Z       %63 = tt.load %62 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:49.4829139Z       %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:49.4829531Z       %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:49.4830017Z       %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:49.4830343Z       %67 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:49:49.4830581Z       %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:49.4830917Z       %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:49.4831359Z       %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4831763Z       %71 = arith.muli %70, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4832095Z       %72 = tt.broadcast %71 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4832392Z       %73 = arith.addi %72, %34 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4832700Z       %74 = tt.addptr %27, %73 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4833016Z       %75 = arith.cmpi sge, %70, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4833255Z       %76 = arith.cmpi slt, %70, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4833484Z       %77 = arith.andi %75, %76 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4833799Z       %78 = tt.broadcast %77 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4834092Z       %79 = arith.andi %78, %38 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4834336Z       %80 = tt.load %74, %79, %cst_1 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4834592Z       %81 = arith.shli %80, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4834826Z       %82 = arith.shrsi %81, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4835055Z       %83 = arith.shrsi %80, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:49.4835336Z       %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:49.4835666Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:49.4835940Z       %86 = tt.broadcast %84 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:49.4836194Z       %87 = arith.select %43, %86, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:49.4836425Z       %88 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:49.4836647Z       %89 = arith.select %45, %88, %87 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:49.4836872Z       %90 = tt.reshape %89 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:49.4837083Z       %91 = arith.sitofp %90 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:49.4837330Z       %92 = ttg.local_alloc %91 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:49.4837685Z       %93 = ttg.local_load %92 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:49.4838156Z       %94 = tt.dot %66, %93, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:49:49.4838527Z       scf.yield %94 : tensor<128x64xf32, #mma>
2026-02-21T09:49:49.4838653Z     } {tt.num_stages = 3 : i32}
2026-02-21T09:49:49.4838810Z     %47 = arith.truncf %46 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:49:49.4839079Z     %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:49:49.4839315Z     %49 = arith.muli %48, %cst_11 : tensor<128x1xi32, #mma>
2026-02-21T09:49:49.4839546Z     %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:49:49.4839800Z     %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:49:49.4840002Z     %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:49:49.4840179Z     %53 = arith.addi %51, %52 : tensor<128x64xi32, #mma>
2026-02-21T09:49:49.4840348Z     %54 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:49.4840562Z     %55 = tt.addptr %54, %53 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:49:49.4840753Z     tt.store %55, %47 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:49.4840886Z     tt.return
2026-02-21T09:49:49.4840968Z   }
2026-02-21T09:49:49.4841048Z }
2026-02-21T09:49:49.4841093Z 
2026-02-21T09:49:49.4841130Z {-#
2026-02-21T09:49:49.4841212Z   external_resources: {
2026-02-21T09:49:49.4841317Z     mlir_reproducer: {
2026-02-21T09:49:49.4842363Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:49:49.4843419Z       disable_threading: false,
2026-02-21T09:49:49.4843530Z       verify_each: true
2026-02-21T09:49:49.4843620Z     }
2026-02-21T09:49:49.4843699Z   }
2026-02-21T09:49:49.4843771Z #-}
2026-02-21T09:49:49.4844056Z /tmp/torchinductor_root/xc/cxc3puzgq6o6wvakcbzttp54fjlz6dp2bmya5kqyaks6j4rqzuvd.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:49:49.4844737Z /tmp/torchinductor_root/xc/cxc3puzgq6o6wvakcbzttp54fjlz6dp2bmya5kqyaks6j4rqzuvd.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:49:49.4845285Z [320s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:49:49.4846006Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:49:49.4846657Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:49:49.4846823Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:49:50.6785251Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:49:50.6786774Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [8, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:49:50.6787657Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:49:50.6788498Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:49:50.6789277Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:49:50.6790011Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:49:50.6790668Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:49:50.6791076Z #smem = #ttg.shared_memory
2026-02-21T09:49:50.6791552Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:49:50.6792510Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:49:50.6793294Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma>
2026-02-21T09:49:50.6793623Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:49:50.6793861Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:49:50.6794098Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:49:50.6794391Z     %cst_0 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:50.6794693Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:49:50.6794917Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:49:50.6795151Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:49:50.6795604Z     %cst_1 = arith.constant dense<0> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6796126Z     %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6796646Z     %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6797196Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6797703Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6798204Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6798654Z     %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:49:50.6799089Z     %cst_8 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6799515Z     %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:50.6799881Z     %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:50.6800231Z     %cst_11 = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:49:50.6800530Z     %0 = tt.get_program_id x : i32
2026-02-21T09:49:50.6800757Z     %1 = arith.divsi %0, %c256_i32 : i32
2026-02-21T09:49:50.6800956Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:49:50.6801134Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:49:50.6801295Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:49:50.6801491Z     %5 = arith.remsi %0, %c256_i32 : i32
2026-02-21T09:49:50.6801655Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:49:50.6801815Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:49:50.6801975Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:49:50.6802142Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:49:50.6802465Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:50.6802969Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:50.6803360Z     %12 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:50.6803669Z     %13 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:50.6803980Z     %14 = arith.addi %12, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:49:50.6804293Z     %15 = arith.addi %13, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:49:50.6804522Z     %16 = arith.muli %8, %c64_i32 : i32
2026-02-21T09:49:50.6804874Z     %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:50.6805326Z     %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:50.6805655Z     %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:50.6805955Z     %20 = arith.addi %19, %18 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:49:50.6806305Z     %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:50.6806755Z     %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:49:50.6807124Z     %23 = arith.muli %22, %cst_7 : tensor<128x1xi32, #blocked1>
2026-02-21T09:49:50.6807397Z     %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:49:50.6807717Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:50.6807948Z     %26 = arith.extsi %16 : i32 to i64
2026-02-21T09:49:50.6808257Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6808702Z     %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:50.6809323Z     %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:50.6809937Z     %30 = tt.splat %26 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:50.6810512Z     %31 = arith.extsi %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:50.6811092Z     %32 = arith.addi %30, %31 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:50.6811612Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6812086Z     %34 = tt.broadcast %33 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6812432Z     %35 = arith.cmpi sge, %33, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6812704Z     %36 = arith.cmpi slt, %33, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6812956Z     %37 = arith.andi %35, %36 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6813281Z     %38 = tt.broadcast %37 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6813699Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:49:50.6814157Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:49:50.6814623Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:50.6814907Z     %42 = arith.cmpi eq, %41, %cst_9 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:50.6815128Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:49:50.6815346Z     %44 = arith.cmpi eq, %41, %cst_10 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:49:50.6815555Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:49:50.6815850Z     %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst) -> (tensor<128x64xf32, #mma>)  : i32 {
2026-02-21T09:49:50.6816090Z       %56 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:49:50.6816278Z       %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:50.6816519Z       %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:49:50.6816822Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:49:50.6817125Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:49:50.6817337Z       %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked1>
2026-02-21T09:49:50.6817557Z       %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:49:50.6817784Z       %63 = tt.load %62 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:49:50.6818049Z       %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:49:50.6818416Z       %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:50.6818872Z       %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:50.6819209Z       %67 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:49:50.6819439Z       %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:50.6819771Z       %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:49:50.6820202Z       %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6820601Z       %71 = arith.muli %70, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6820944Z       %72 = tt.broadcast %71 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6821269Z       %73 = arith.addi %72, %34 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6821570Z       %74 = tt.addptr %27, %73 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6821883Z       %75 = arith.cmpi sge, %70, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6822119Z       %76 = arith.cmpi slt, %70, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6822344Z       %77 = arith.andi %75, %76 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6822650Z       %78 = tt.broadcast %77 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6822955Z       %79 = arith.andi %78, %38 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6823189Z       %80 = tt.load %74, %79, %cst_1 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6823425Z       %81 = arith.shli %80, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6823654Z       %82 = arith.shrsi %81, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6823886Z       %83 = arith.shrsi %80, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:49:50.6824163Z       %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:50.6824494Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:49:50.6824770Z       %86 = tt.broadcast %84 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:50.6824997Z       %87 = arith.select %43, %86, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:50.6825228Z       %88 = tt.broadcast %85 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:50.6825448Z       %89 = arith.select %45, %88, %87 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:49:50.6825668Z       %90 = tt.reshape %89 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:49:50.6825879Z       %91 = arith.sitofp %90 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:49:50.6826122Z       %92 = ttg.local_alloc %91 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:49:50.6826473Z       %93 = ttg.local_load %92 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:49:50.6826936Z       %94 = tt.dot %66, %93, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:49:50.6827304Z       scf.yield %94 : tensor<128x64xf32, #mma>
2026-02-21T09:49:50.6827453Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32}
2026-02-21T09:49:50.6827635Z     %47 = arith.truncf %46 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:49:50.6827897Z     %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:49:50.6828132Z     %49 = arith.muli %48, %cst_11 : tensor<128x1xi32, #mma>
2026-02-21T09:49:50.6828361Z     %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:49:50.6828616Z     %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:49:50.6828812Z     %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:49:50.6828984Z     %53 = arith.addi %51, %52 : tensor<128x64xi32, #mma>
2026-02-21T09:49:50.6829155Z     %54 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:50.6829363Z     %55 = tt.addptr %54, %53 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:49:50.6829550Z     tt.store %55, %47 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:49:50.6829682Z     tt.return
2026-02-21T09:49:50.6829764Z   }
2026-02-21T09:49:50.6829835Z }
2026-02-21T09:49:50.6829877Z 
2026-02-21T09:49:50.6829910Z {-#
2026-02-21T09:49:50.6829988Z   external_resources: {
2026-02-21T09:49:50.6830087Z     mlir_reproducer: {
2026-02-21T09:49:50.6831088Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:49:50.6832094Z       disable_threading: false,
2026-02-21T09:49:50.6832211Z       verify_each: true
2026-02-21T09:49:50.6832315Z     }
2026-02-21T09:49:50.6832387Z   }
2026-02-21T09:49:50.6832453Z #-}
2026-02-21T09:49:50.6832730Z /tmp/torchinductor_root/e2/ce2keubxat3ohvkvcbr4g2kv75vu3b4vxfk6rolgmc3rzn6ikxlu.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:49:50.6833413Z /tmp/torchinductor_root/e2/ce2keubxat3ohvkvcbr4g2kv75vu3b4vxfk6rolgmc3rzn6ikxlu.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:49:50.6833955Z [321s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:49:50.6834679Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:49:50.6835342Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:49:50.6835511Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:49:52.0443776Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 103/103 6.0 configs/s
2026-02-21T09:49:56.6689545Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 179/179 22.1 configs/s
2026-02-21T09:49:59.9110415Z [330s] Generation 2 complete: 
2026-02-21T09:49:59.9110655Z error=19
2026-02-21T09:49:59.9111254Z timeout=1
2026-02-21T09:49:59.9111367Z ok=88
2026-02-21T09:49:59.9111507Z min=1.1387
2026-02-21T09:49:59.9111620Z mid=1.8622
2026-02-21T09:49:59.9111729Z max=438.1238
2026-02-21T09:49:59.9111860Z best={'block_sizes': [8, 128, 256],
2026-02-21T09:49:59.9112062Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:49:59.9112257Z  'l2_groupings': [2],
2026-02-21T09:49:59.9112404Z  'load_eviction_policies': ['', ''],
2026-02-21T09:49:59.9112565Z  'loop_orders': [[0, 1]],
2026-02-21T09:49:59.9112712Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:49:59.9112853Z  'num_stages': 1,
2026-02-21T09:49:59.9112973Z  'num_warps': 4,
2026-02-21T09:49:59.9113112Z  'pid_type': 'flat',
2026-02-21T09:49:59.9113256Z  'range_flattens': [None, None],
2026-02-21T09:49:59.9113418Z  'range_multi_buffers': [None, False],
2026-02-21T09:49:59.9113592Z  'range_num_stages': [0, 3],
2026-02-21T09:49:59.9113738Z  'range_unroll_factors': [0, 0],
2026-02-21T09:49:59.9113893Z  'range_warp_specializes': [],
2026-02-21T09:49:59.9114042Z  'waves_per_eu': 2}
2026-02-21T09:49:59.9160444Z [330s] Fitting surrogate: 322 points, 322 targets
2026-02-21T09:50:00.9209802Z [331s] Generation 3 starting: 98 neighbors, 5 active search path(s)
2026-02-21T09:50:24.1191574Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 1.2 configs/s
2026-02-21T09:50:29.6463916Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:50:29.6474809Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:29.6478046Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:50:29.6479313Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:29.6480116Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:29.6480812Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:50:29.6481450Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:50:29.6481896Z #smem = #ttg.shared_memory
2026-02-21T09:50:29.6482479Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:50:29.6483809Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:50:29.6484795Z     %cst = arith.constant dense<4> : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6485189Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:50:29.6485478Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:50:29.6485765Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:50:29.6486219Z     %cst_0 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6486510Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:50:29.6486706Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:50:29.6486910Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T09:50:29.6487124Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:50:29.6487328Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:50:29.6487529Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:50:29.6487739Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:50:29.6488079Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:50:29.6488341Z     %cst_1 = arith.constant dense<1024> : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.6488662Z     %cst_2 = arith.constant dense<8192> : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6489029Z     %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6489455Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:50:29.6489768Z     %cst_4 = arith.constant dense<2> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6490152Z     %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6490469Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:50:29.6490771Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:50:29.6491073Z     %cst_8 = arith.constant dense<8192> : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6491377Z     %cst_9 = arith.constant dense<0> : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6491669Z     %cst_10 = arith.constant dense<16384> : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6492047Z     %cst_11 = arith.constant dense<0> : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6492351Z     %cst_12 = arith.constant dense<8192> : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6492613Z     %0 = tt.get_program_id x : i32
2026-02-21T09:50:29.6492878Z     %1 = arith.muli %0, %c2_i32 : i32
2026-02-21T09:50:29.6493082Z     %2 = arith.addi %1, %c2_i32 : i32
2026-02-21T09:50:29.6493286Z     %3 = arith.minsi %2, %c16384_i32 : i32
2026-02-21T09:50:29.6493714Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.6494196Z     %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6494716Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.6495206Z     %7 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6495697Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6496164Z     %9 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6496506Z     %10 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6496776Z     %11 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6497140Z     %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:50:29.6497696Z     %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:50:29.6498240Z     %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:50:29.6498583Z     %15 = arith.cmpi eq, %14, %cst_6 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:50:29.6498857Z     %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1>
2026-02-21T09:50:29.6499121Z     %17 = arith.cmpi eq, %14, %cst_7 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:50:29.6499376Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1>
2026-02-21T09:50:29.6499650Z     %19 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.6500005Z     %20 = arith.extsi %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6500483Z     %21 = arith.extsi %7 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6500791Z     %22 = arith.subi %3, %1 : i32
2026-02-21T09:50:29.6500942Z     %23 = arith.remsi %22, %c4_i32 : i32
2026-02-21T09:50:29.6501103Z     %24 = arith.subi %22, %23 : i32
2026-02-21T09:50:29.6501244Z     %25 = arith.addi %1, %24 : i32
2026-02-21T09:50:29.6501429Z     scf.for %arg3 = %1 to %25 step %c4_i32  : i32 {
2026-02-21T09:50:29.6501614Z       %26 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:50:29.6501780Z       %27 = arith.muli %26, %c8_i32 : i32
2026-02-21T09:50:29.6501940Z       %28 = arith.subi %c256_i32, %27 : i32
2026-02-21T09:50:29.6502091Z       %29 = arith.minsi %28, %c8_i32 : i32
2026-02-21T09:50:29.6502254Z       %30 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:50:29.6502412Z       %31 = arith.remsi %30, %29 : i32
2026-02-21T09:50:29.6502564Z       %32 = arith.addi %27, %31 : i32
2026-02-21T09:50:29.6502709Z       %33 = arith.divsi %30, %29 : i32
2026-02-21T09:50:29.6502867Z       %34 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:50:29.6503094Z       %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.6503391Z       %36 = arith.addi %35, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.6503633Z       %37 = arith.muli %33, %c128_i32 : i32
2026-02-21T09:50:29.6503860Z       %38 = tt.splat %37 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.6504151Z       %39 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.6504511Z       %40 = tt.expand_dims %36 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.6504850Z       %41 = arith.muli %40, %cst_1 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.6505113Z       %42 = tt.broadcast %41 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6505509Z       %43 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked>
2026-02-21T09:50:29.6505866Z       %44 = tt.broadcast %43 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6506179Z       %45 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.6506537Z       %46 = tt.expand_dims %9 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.6506892Z       %47 = tt.broadcast %46 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6507155Z       %48 = arith.addi %42, %47 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6507363Z       %49 = tt.addptr %10, %48 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6507575Z       %50 = tt.load %49 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6507832Z       %51 = tt.expand_dims %8 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6508092Z       %52 = arith.muli %51, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6508288Z       %53 = tt.broadcast %52 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6508488Z       %54 = arith.addi %53, %44 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6508690Z       %55 = tt.addptr %11, %54 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6508901Z       %56 = tt.load %55 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6509191Z       %57 = ttg.memdesc_index %45[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6509568Z       ttg.local_store %50, %57 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6509859Z       %58 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6510128Z       %59 = arith.addi %9, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6510419Z       %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.6510732Z       %61 = tt.broadcast %60 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6510927Z       %62 = arith.addi %42, %61 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6511136Z       %63 = tt.addptr %10, %62 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6511349Z       %64 = tt.load %63 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6511605Z       %65 = tt.expand_dims %58 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6511860Z       %66 = arith.muli %65, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6512060Z       %67 = tt.broadcast %66 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6512263Z       %68 = arith.addi %67, %44 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6512464Z       %69 = tt.addptr %11, %68 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6512672Z       %70 = tt.load %69 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6512959Z       %71 = ttg.memdesc_index %45[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6513329Z       ttg.local_store %64, %71 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6514017Z       %72:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %57, %arg8 = %71, %arg9 = %56, %arg10 = %70) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:50:29.6514569Z         %409 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:29.6514776Z         %410 = tt.splat %409 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6515016Z         %411 = arith.addi %410, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6515201Z         %412 = arith.muli %409, %c2_i32 : i32
2026-02-21T09:50:29.6515383Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6515618Z         %414 = arith.addi %413, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6515915Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.6516213Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6516412Z         %417 = arith.addi %42, %416 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6516621Z         %418 = tt.addptr %10, %417 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6516832Z         %419 = tt.load %418 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6517143Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6517596Z         %421 = arith.extf %420 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6517986Z         %422 = tt.expand_dims %411 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6518241Z         %423 = arith.muli %422, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6518461Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6518663Z         %425 = arith.addi %424, %44 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6518873Z         %426 = tt.addptr %11, %425 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6519121Z         %427 = tt.load %426 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6519291Z         %428 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6519455Z         %429 = arith.shrsi %428, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6519711Z         %430 = ttg.convert_layout %429 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6519972Z         %431 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6520224Z         %432 = ttg.convert_layout %431 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6520573Z         %433 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6520925Z         %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6521232Z         %435 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6521493Z         %436 = arith.select %16, %435, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6521746Z         %437 = tt.broadcast %434 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6521999Z         %438 = arith.select %18, %437, %436 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6522254Z         %439 = tt.reshape %438 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6522494Z         %440 = arith.sitofp %439 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6522868Z         %441 = ttg.convert_layout %440 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6523378Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6523742Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:29.6523876Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:50:29.6524014Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:50:29.6524287Z         %446 = ttg.memdesc_index %45[%445] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6524647Z         ttg.local_store %419, %446 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6525139Z         scf.yield %442, %445, %arg8, %446, %arg10, %427 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6525536Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:29.6525812Z       %73 = ttg.local_load %72#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6526244Z       %74 = arith.extf %73 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6526545Z       %75 = arith.shli %72#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6526738Z       %76 = arith.shrsi %75, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6526985Z       %77 = ttg.convert_layout %76 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6527232Z       %78 = arith.shrsi %72#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6527502Z       %79 = ttg.convert_layout %78 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6527839Z       %80 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6528187Z       %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6528483Z       %82 = tt.broadcast %80 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6528730Z       %83 = arith.select %16, %82, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6528979Z       %84 = tt.broadcast %81 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6529222Z       %85 = arith.select %18, %84, %83 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6529455Z       %86 = tt.reshape %85 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6529686Z       %87 = arith.sitofp %86 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6529985Z       %88 = ttg.convert_layout %87 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6530455Z       %89 = tt.dot %74, %88, %72#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6530968Z       %90 = ttg.local_load %72#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6531391Z       %91 = arith.extf %90 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6531709Z       %92 = arith.shli %72#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6531871Z       %93 = arith.shrsi %92, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6532116Z       %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6532365Z       %95 = arith.shrsi %72#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6532606Z       %96 = ttg.convert_layout %95 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6532945Z       %97 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6533291Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6533585Z       %99 = tt.broadcast %97 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6533838Z       %100 = arith.select %16, %99, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6534082Z       %101 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6534327Z       %102 = arith.select %18, %101, %100 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6534570Z       %103 = tt.reshape %102 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6534801Z       %104 = arith.sitofp %103 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6535124Z       %105 = ttg.convert_layout %104 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6535587Z       %106 = tt.dot %91, %105, %89, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6535991Z       ttg.local_dealloc %45 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.6536222Z       %107 = arith.truncf %106 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T09:50:29.6536398Z       %108 = arith.extsi %34 : i32 to i64
2026-02-21T09:50:29.6536522Z       %109 = arith.extsi %37 : i32 to i64
2026-02-21T09:50:29.6536688Z       %110 = tt.splat %108 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6536906Z       %111 = arith.addi %110, %20 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6537180Z       %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6537423Z       %113 = arith.muli %112, %cst_8 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6537608Z       %114 = tt.broadcast %113 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6537822Z       %115 = tt.splat %109 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6538041Z       %116 = arith.addi %115, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6538312Z       %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6538582Z       %118 = tt.broadcast %117 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6538772Z       %119 = arith.addi %114, %118 : tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6538991Z       %120 = tt.addptr %19, %119 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6539199Z       %121 = arith.cmpi sge, %112, %cst_9 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6539387Z       %122 = arith.cmpi slt, %112, %cst_10 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6539553Z       %123 = arith.andi %121, %122 : tensor<64x1xi1, #mma>
2026-02-21T09:50:29.6539733Z       %124 = tt.broadcast %123 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6539922Z       %125 = arith.cmpi sge, %117, %cst_11 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6540097Z       %126 = arith.cmpi slt, %117, %cst_12 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6540259Z       %127 = arith.andi %125, %126 : tensor<1x128xi1, #mma>
2026-02-21T09:50:29.6540441Z       %128 = tt.broadcast %127 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6540622Z       %129 = arith.andi %124, %128 : tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6549481Z       tt.store %120, %107, %129 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.6549649Z       %130 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:50:29.6549786Z       %131 = arith.divsi %130, %c512_i32 : i32
2026-02-21T09:50:29.6549925Z       %132 = arith.muli %131, %c8_i32 : i32
2026-02-21T09:50:29.6550056Z       %133 = arith.subi %c256_i32, %132 : i32
2026-02-21T09:50:29.6550190Z       %134 = arith.minsi %133, %c8_i32 : i32
2026-02-21T09:50:29.6550318Z       %135 = arith.remsi %130, %c512_i32 : i32
2026-02-21T09:50:29.6550438Z       %136 = arith.remsi %135, %134 : i32
2026-02-21T09:50:29.6550564Z       %137 = arith.addi %132, %136 : i32
2026-02-21T09:50:29.6550681Z       %138 = arith.divsi %135, %134 : i32
2026-02-21T09:50:29.6550806Z       %139 = arith.muli %137, %c64_i32 : i32
2026-02-21T09:50:29.6550982Z       %140 = tt.splat %139 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.6551215Z       %141 = arith.addi %140, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.6551398Z       %142 = arith.muli %138, %c128_i32 : i32
2026-02-21T09:50:29.6551624Z       %143 = tt.splat %142 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.6551854Z       %144 = arith.addi %143, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.6552138Z       %145 = tt.expand_dims %141 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.6552418Z       %146 = arith.muli %145, %cst_1 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.6552620Z       %147 = tt.broadcast %146 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6552908Z       %148 = tt.expand_dims %144 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked>
2026-02-21T09:50:29.6553194Z       %149 = tt.broadcast %148 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6553421Z       %150 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.6553617Z       %151 = arith.addi %147, %47 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6553879Z       %152 = tt.addptr %10, %151 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6554088Z       %153 = tt.load %152 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6554255Z       %154 = arith.addi %53, %149 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6554455Z       %155 = tt.addptr %11, %154 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6554661Z       %156 = tt.load %155 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6554944Z       %157 = ttg.memdesc_index %150[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6555326Z       ttg.local_store %153, %157 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6555575Z       %158 = arith.addi %147, %61 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6555775Z       %159 = tt.addptr %10, %158 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6556005Z       %160 = tt.load %159 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6556166Z       %161 = arith.addi %67, %149 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6556365Z       %162 = tt.addptr %11, %161 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6556570Z       %163 = tt.load %162 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6556847Z       %164 = ttg.memdesc_index %150[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6557211Z       ttg.local_store %160, %164 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6557849Z       %165:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %157, %arg8 = %164, %arg9 = %156, %arg10 = %163) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:50:29.6558380Z         %409 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:29.6558561Z         %410 = tt.splat %409 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6558787Z         %411 = arith.addi %410, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6558965Z         %412 = arith.muli %409, %c2_i32 : i32
2026-02-21T09:50:29.6559141Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6559363Z         %414 = arith.addi %413, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6559665Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.6559947Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6560150Z         %417 = arith.addi %147, %416 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6560377Z         %418 = tt.addptr %10, %417 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6560587Z         %419 = tt.load %418 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6560894Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6561334Z         %421 = arith.extf %420 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6561726Z         %422 = tt.expand_dims %411 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6561981Z         %423 = arith.muli %422, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6562177Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6562379Z         %425 = arith.addi %424, %149 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6562640Z         %426 = tt.addptr %11, %425 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6562852Z         %427 = tt.load %426 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6563018Z         %428 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6563179Z         %429 = arith.shrsi %428, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6563453Z         %430 = ttg.convert_layout %429 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6563711Z         %431 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6563964Z         %432 = ttg.convert_layout %431 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6564333Z         %433 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6564683Z         %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6564986Z         %435 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6565239Z         %436 = arith.select %16, %435, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6565494Z         %437 = tt.broadcast %434 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6565750Z         %438 = arith.select %18, %437, %436 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6565990Z         %439 = tt.reshape %438 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6566228Z         %440 = arith.sitofp %439 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6566534Z         %441 = ttg.convert_layout %440 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6567017Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6567378Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:29.6567510Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:50:29.6567655Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:50:29.6567946Z         %446 = ttg.memdesc_index %150[%445] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6568305Z         ttg.local_store %419, %446 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6568810Z         scf.yield %442, %445, %arg8, %446, %arg10, %427 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6569196Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:29.6569479Z       %166 = ttg.local_load %165#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6569913Z       %167 = arith.extf %166 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6570214Z       %168 = arith.shli %165#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6570387Z       %169 = arith.shrsi %168, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6570635Z       %170 = ttg.convert_layout %169 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6570891Z       %171 = arith.shrsi %165#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6571145Z       %172 = ttg.convert_layout %171 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6571491Z       %173 = tt.expand_dims %170 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6571864Z       %174 = tt.expand_dims %172 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6572159Z       %175 = tt.broadcast %173 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6572425Z       %176 = arith.select %16, %175, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6572675Z       %177 = tt.broadcast %174 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6572915Z       %178 = arith.select %18, %177, %176 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6573156Z       %179 = tt.reshape %178 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6573387Z       %180 = arith.sitofp %179 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6573688Z       %181 = ttg.convert_layout %180 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6574163Z       %182 = tt.dot %167, %181, %165#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6574656Z       %183 = ttg.local_load %165#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6575087Z       %184 = arith.extf %183 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6575390Z       %185 = arith.shli %165#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6575553Z       %186 = arith.shrsi %185, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6575803Z       %187 = ttg.convert_layout %186 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6576059Z       %188 = arith.shrsi %165#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6576330Z       %189 = ttg.convert_layout %188 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6576674Z       %190 = tt.expand_dims %187 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6577032Z       %191 = tt.expand_dims %189 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6577327Z       %192 = tt.broadcast %190 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6577580Z       %193 = arith.select %16, %192, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6577825Z       %194 = tt.broadcast %191 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6578070Z       %195 = arith.select %18, %194, %193 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6578310Z       %196 = tt.reshape %195 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6578544Z       %197 = arith.sitofp %196 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6578848Z       %198 = ttg.convert_layout %197 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6579313Z       %199 = tt.dot %184, %198, %182, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6579706Z       ttg.local_dealloc %150 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.6579922Z       %200 = arith.truncf %199 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T09:50:29.6580100Z       %201 = arith.extsi %139 : i32 to i64
2026-02-21T09:50:29.6580264Z       %202 = arith.extsi %142 : i32 to i64
2026-02-21T09:50:29.6580428Z       %203 = tt.splat %201 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6580664Z       %204 = arith.addi %203, %20 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6580930Z       %205 = tt.expand_dims %204 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6581175Z       %206 = arith.muli %205, %cst_8 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6581360Z       %207 = tt.broadcast %206 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6581569Z       %208 = tt.splat %202 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6581786Z       %209 = arith.addi %208, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6582056Z       %210 = tt.expand_dims %209 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6582322Z       %211 = tt.broadcast %210 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6582513Z       %212 = arith.addi %207, %211 : tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6582707Z       %213 = tt.addptr %19, %212 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6582917Z       %214 = arith.cmpi sge, %205, %cst_9 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6583085Z       %215 = arith.cmpi slt, %205, %cst_10 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6583249Z       %216 = arith.andi %214, %215 : tensor<64x1xi1, #mma>
2026-02-21T09:50:29.6583422Z       %217 = tt.broadcast %216 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6583613Z       %218 = arith.cmpi sge, %210, %cst_11 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6583786Z       %219 = arith.cmpi slt, %210, %cst_12 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6583945Z       %220 = arith.andi %218, %219 : tensor<1x128xi1, #mma>
2026-02-21T09:50:29.6584142Z       %221 = tt.broadcast %220 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6584323Z       %222 = arith.andi %217, %221 : tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6584489Z       tt.store %213, %200, %222 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.6584653Z       %223 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:50:29.6584781Z       %224 = arith.divsi %223, %c512_i32 : i32
2026-02-21T09:50:29.6584909Z       %225 = arith.muli %224, %c8_i32 : i32
2026-02-21T09:50:29.6585030Z       %226 = arith.subi %c256_i32, %225 : i32
2026-02-21T09:50:29.6585156Z       %227 = arith.minsi %226, %c8_i32 : i32
2026-02-21T09:50:29.6585278Z       %228 = arith.remsi %223, %c512_i32 : i32
2026-02-21T09:50:29.6585402Z       %229 = arith.remsi %228, %227 : i32
2026-02-21T09:50:29.6585520Z       %230 = arith.addi %225, %229 : i32
2026-02-21T09:50:29.6585642Z       %231 = arith.divsi %228, %227 : i32
2026-02-21T09:50:29.6585760Z       %232 = arith.muli %230, %c64_i32 : i32
2026-02-21T09:50:29.6585939Z       %233 = tt.splat %232 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.6586169Z       %234 = arith.addi %233, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.6586344Z       %235 = arith.muli %231, %c128_i32 : i32
2026-02-21T09:50:29.6586521Z       %236 = tt.splat %235 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.6586742Z       %237 = arith.addi %236, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.6587025Z       %238 = tt.expand_dims %234 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.6587281Z       %239 = arith.muli %238, %cst_1 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.6587477Z       %240 = tt.broadcast %239 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6587789Z       %241 = tt.expand_dims %237 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked>
2026-02-21T09:50:29.6588070Z       %242 = tt.broadcast %241 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6588314Z       %243 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.6588509Z       %244 = arith.addi %240, %47 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6588710Z       %245 = tt.addptr %10, %244 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6588922Z       %246 = tt.load %245 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6589081Z       %247 = arith.addi %53, %242 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6589282Z       %248 = tt.addptr %11, %247 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6589487Z       %249 = tt.load %248 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6589768Z       %250 = ttg.memdesc_index %243[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6590132Z       ttg.local_store %246, %250 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6590371Z       %251 = arith.addi %240, %61 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6590574Z       %252 = tt.addptr %10, %251 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6590783Z       %253 = tt.load %252 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6590940Z       %254 = arith.addi %67, %242 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6591136Z       %255 = tt.addptr %11, %254 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6591335Z       %256 = tt.load %255 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6591646Z       %257 = ttg.memdesc_index %243[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6592000Z       ttg.local_store %253, %257 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6592627Z       %258:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %250, %arg8 = %257, %arg9 = %249, %arg10 = %256) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:50:29.6593168Z         %409 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:29.6593344Z         %410 = tt.splat %409 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6593571Z         %411 = arith.addi %410, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6593751Z         %412 = arith.muli %409, %c2_i32 : i32
2026-02-21T09:50:29.6593924Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6594142Z         %414 = arith.addi %413, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6594418Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.6594692Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6594889Z         %417 = arith.addi %240, %416 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6595092Z         %418 = tt.addptr %10, %417 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6595300Z         %419 = tt.load %418 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6595618Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6596052Z         %421 = arith.extf %420 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6596448Z         %422 = tt.expand_dims %411 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6596693Z         %423 = arith.muli %422, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6596881Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6597068Z         %425 = arith.addi %424, %242 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6597260Z         %426 = tt.addptr %11, %425 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6597464Z         %427 = tt.load %426 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6597621Z         %428 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6597775Z         %429 = arith.shrsi %428, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6598016Z         %430 = ttg.convert_layout %429 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6598262Z         %431 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6598499Z         %432 = ttg.convert_layout %431 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6598836Z         %433 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6599177Z         %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6599472Z         %435 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6599740Z         %436 = arith.select %16, %435, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6599984Z         %437 = tt.broadcast %434 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6600237Z         %438 = arith.select %18, %437, %436 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6600471Z         %439 = tt.reshape %438 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6600693Z         %440 = arith.sitofp %439 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6600993Z         %441 = ttg.convert_layout %440 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6601459Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6601803Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:29.6601930Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:50:29.6602061Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:50:29.6602324Z         %446 = ttg.memdesc_index %243[%445] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6602947Z         ttg.local_store %419, %446 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6603448Z         scf.yield %442, %445, %arg8, %446, %arg10, %427 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6603932Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:29.6604212Z       %259 = ttg.local_load %258#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6604667Z       %260 = arith.extf %259 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6604967Z       %261 = arith.shli %258#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6605130Z       %262 = arith.shrsi %261, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6605376Z       %263 = ttg.convert_layout %262 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6605619Z       %264 = arith.shrsi %258#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6605858Z       %265 = ttg.convert_layout %264 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6606198Z       %266 = tt.expand_dims %263 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6606542Z       %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6606831Z       %268 = tt.broadcast %266 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6607072Z       %269 = arith.select %16, %268, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6607313Z       %270 = tt.broadcast %267 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6607551Z       %271 = arith.select %18, %270, %269 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6607783Z       %272 = tt.reshape %271 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6608034Z       %273 = arith.sitofp %272 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6608327Z       %274 = ttg.convert_layout %273 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6608790Z       %275 = tt.dot %260, %274, %258#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6609301Z       %276 = ttg.local_load %258#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6609722Z       %277 = arith.extf %276 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6610014Z       %278 = arith.shli %258#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6610174Z       %279 = arith.shrsi %278, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6610413Z       %280 = ttg.convert_layout %279 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6610666Z       %281 = arith.shrsi %258#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6610907Z       %282 = ttg.convert_layout %281 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6611241Z       %283 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6611579Z       %284 = tt.expand_dims %282 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6611864Z       %285 = tt.broadcast %283 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6612131Z       %286 = arith.select %16, %285, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6612370Z       %287 = tt.broadcast %284 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6612621Z       %288 = arith.select %18, %287, %286 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6612859Z       %289 = tt.reshape %288 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6613083Z       %290 = arith.sitofp %289 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6613378Z       %291 = ttg.convert_layout %290 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6613833Z       %292 = tt.dot %277, %291, %275, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6614216Z       ttg.local_dealloc %243 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.6614426Z       %293 = arith.truncf %292 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T09:50:29.6614595Z       %294 = arith.extsi %232 : i32 to i64
2026-02-21T09:50:29.6614709Z       %295 = arith.extsi %235 : i32 to i64
2026-02-21T09:50:29.6614868Z       %296 = tt.splat %294 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6615074Z       %297 = arith.addi %296, %20 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6615340Z       %298 = tt.expand_dims %297 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6615574Z       %299 = arith.muli %298, %cst_8 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6615748Z       %300 = tt.broadcast %299 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6615975Z       %301 = tt.splat %295 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6616184Z       %302 = arith.addi %301, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6616446Z       %303 = tt.expand_dims %302 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6616717Z       %304 = tt.broadcast %303 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6616900Z       %305 = arith.addi %300, %304 : tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6617088Z       %306 = tt.addptr %19, %305 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6617283Z       %307 = arith.cmpi sge, %298, %cst_9 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6617446Z       %308 = arith.cmpi slt, %298, %cst_10 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6617598Z       %309 = arith.andi %307, %308 : tensor<64x1xi1, #mma>
2026-02-21T09:50:29.6617773Z       %310 = tt.broadcast %309 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6617955Z       %311 = arith.cmpi sge, %303, %cst_11 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6618120Z       %312 = arith.cmpi slt, %303, %cst_12 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6618277Z       %313 = arith.andi %311, %312 : tensor<1x128xi1, #mma>
2026-02-21T09:50:29.6618445Z       %314 = tt.broadcast %313 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6618618Z       %315 = arith.andi %310, %314 : tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6618774Z       tt.store %306, %293, %315 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.6618918Z       %316 = arith.addi %arg3, %c3_i32 : i32
2026-02-21T09:50:29.6619042Z       %317 = arith.divsi %316, %c512_i32 : i32
2026-02-21T09:50:29.6619163Z       %318 = arith.muli %317, %c8_i32 : i32
2026-02-21T09:50:29.6619280Z       %319 = arith.subi %c256_i32, %318 : i32
2026-02-21T09:50:29.6619394Z       %320 = arith.minsi %319, %c8_i32 : i32
2026-02-21T09:50:29.6619525Z       %321 = arith.remsi %316, %c512_i32 : i32
2026-02-21T09:50:29.6619638Z       %322 = arith.remsi %321, %320 : i32
2026-02-21T09:50:29.6619749Z       %323 = arith.addi %318, %322 : i32
2026-02-21T09:50:29.6619880Z       %324 = arith.divsi %321, %320 : i32
2026-02-21T09:50:29.6619995Z       %325 = arith.muli %323, %c64_i32 : i32
2026-02-21T09:50:29.6620164Z       %326 = tt.splat %325 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.6620383Z       %327 = arith.addi %326, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.6620553Z       %328 = arith.muli %324, %c128_i32 : i32
2026-02-21T09:50:29.6620714Z       %329 = tt.splat %328 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.6620930Z       %330 = arith.addi %329, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.6621203Z       %331 = tt.expand_dims %327 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.6621456Z       %332 = arith.muli %331, %cst_1 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.6621647Z       %333 = tt.broadcast %332 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6621921Z       %334 = tt.expand_dims %330 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked>
2026-02-21T09:50:29.6622194Z       %335 = tt.broadcast %334 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6622409Z       %336 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.6622594Z       %337 = arith.addi %333, %47 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6622796Z       %338 = tt.addptr %10, %337 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6623000Z       %339 = tt.load %338 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6623176Z       %340 = arith.addi %53, %335 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6623366Z       %341 = tt.addptr %11, %340 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6623562Z       %342 = tt.load %341 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6623836Z       %343 = ttg.memdesc_index %336[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6624204Z       ttg.local_store %339, %343 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6624443Z       %344 = arith.addi %333, %61 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6624635Z       %345 = tt.addptr %10, %344 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6624834Z       %346 = tt.load %345 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6624987Z       %347 = arith.addi %67, %335 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6625175Z       %348 = tt.addptr %11, %347 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6625372Z       %349 = tt.load %348 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6625649Z       %350 = ttg.memdesc_index %336[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6626000Z       ttg.local_store %346, %350 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6626612Z       %351:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %343, %arg8 = %350, %arg9 = %342, %arg10 = %349) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:50:29.6627142Z         %409 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:29.6627313Z         %410 = tt.splat %409 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6627549Z         %411 = arith.addi %410, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6627718Z         %412 = arith.muli %409, %c2_i32 : i32
2026-02-21T09:50:29.6627890Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6628109Z         %414 = arith.addi %413, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6628379Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.6628651Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6628842Z         %417 = arith.addi %333, %416 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6629043Z         %418 = tt.addptr %10, %417 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6629244Z         %419 = tt.load %418 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6629540Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6629968Z         %421 = arith.extf %420 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6630339Z         %422 = tt.expand_dims %411 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6630581Z         %423 = arith.muli %422, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6630766Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6630979Z         %425 = arith.addi %424, %335 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6631175Z         %426 = tt.addptr %11, %425 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6631369Z         %427 = tt.load %426 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6631523Z         %428 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6631690Z         %429 = arith.shrsi %428, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6631931Z         %430 = ttg.convert_layout %429 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6632176Z         %431 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6632417Z         %432 = ttg.convert_layout %431 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6632752Z         %433 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6633092Z         %434 = tt.expand_dims %432 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6633381Z         %435 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6633629Z         %436 = arith.select %16, %435, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6633868Z         %437 = tt.broadcast %434 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6634105Z         %438 = arith.select %18, %437, %436 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6634337Z         %439 = tt.reshape %438 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6634561Z         %440 = arith.sitofp %439 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6634876Z         %441 = ttg.convert_layout %440 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6635354Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6635701Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:29.6635824Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:50:29.6635953Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:50:29.6636214Z         %446 = ttg.memdesc_index %336[%445] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6636564Z         ttg.local_store %419, %446 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6637045Z         scf.yield %442, %445, %arg8, %446, %arg10, %427 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6637428Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:29.6637700Z       %352 = ttg.local_load %351#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6638126Z       %353 = arith.extf %352 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6638418Z       %354 = arith.shli %351#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6638578Z       %355 = arith.shrsi %354, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6638823Z       %356 = ttg.convert_layout %355 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6639084Z       %357 = arith.shrsi %351#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6639324Z       %358 = ttg.convert_layout %357 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6639678Z       %359 = tt.expand_dims %356 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6640022Z       %360 = tt.expand_dims %358 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6640311Z       %361 = tt.broadcast %359 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6640553Z       %362 = arith.select %16, %361, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6640794Z       %363 = tt.broadcast %360 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6641029Z       %364 = arith.select %18, %363, %362 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6641260Z       %365 = tt.reshape %364 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6641483Z       %366 = arith.sitofp %365 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6641776Z       %367 = ttg.convert_layout %366 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6642239Z       %368 = tt.dot %353, %367, %351#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6642808Z       %369 = ttg.local_load %351#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6643256Z       %370 = arith.extf %369 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6643569Z       %371 = arith.shli %351#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6643727Z       %372 = arith.shrsi %371, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6643976Z       %373 = ttg.convert_layout %372 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6644230Z       %374 = arith.shrsi %351#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6644476Z       %375 = ttg.convert_layout %374 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6644822Z       %376 = tt.expand_dims %373 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6645171Z       %377 = tt.expand_dims %375 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6645476Z       %378 = tt.broadcast %376 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6645731Z       %379 = arith.select %16, %378, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6645979Z       %380 = tt.broadcast %377 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6646223Z       %381 = arith.select %18, %380, %379 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6646463Z       %382 = tt.reshape %381 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6646696Z       %383 = arith.sitofp %382 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6647001Z       %384 = ttg.convert_layout %383 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6647492Z       %385 = tt.dot %370, %384, %368, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6647898Z       ttg.local_dealloc %336 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.6648117Z       %386 = arith.truncf %385 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T09:50:29.6648290Z       %387 = arith.extsi %325 : i32 to i64
2026-02-21T09:50:29.6648416Z       %388 = arith.extsi %328 : i32 to i64
2026-02-21T09:50:29.6648579Z       %389 = tt.splat %387 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6648796Z       %390 = arith.addi %389, %20 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6649060Z       %391 = tt.expand_dims %390 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6649303Z       %392 = arith.muli %391, %cst_8 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6649488Z       %393 = tt.broadcast %392 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6649698Z       %394 = tt.splat %388 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6649916Z       %395 = arith.addi %394, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6650185Z       %396 = tt.expand_dims %395 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6650450Z       %397 = tt.broadcast %396 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6650638Z       %398 = arith.addi %393, %397 : tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6650827Z       %399 = tt.addptr %19, %398 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6651053Z       %400 = arith.cmpi sge, %391, %cst_9 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6651224Z       %401 = arith.cmpi slt, %391, %cst_10 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6651403Z       %402 = arith.andi %400, %401 : tensor<64x1xi1, #mma>
2026-02-21T09:50:29.6651580Z       %403 = tt.broadcast %402 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6651770Z       %404 = arith.cmpi sge, %396, %cst_11 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6651946Z       %405 = arith.cmpi slt, %396, %cst_12 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6652106Z       %406 = arith.andi %404, %405 : tensor<1x128xi1, #mma>
2026-02-21T09:50:29.6652285Z       %407 = tt.broadcast %406 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6652466Z       %408 = arith.andi %403, %407 : tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6652633Z       tt.store %399, %386, %408 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.6652784Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:50:29.6652908Z     scf.for %arg3 = %25 to %3 step %c1_i32  : i32 {
2026-02-21T09:50:29.6653051Z       %26 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:50:29.6653174Z       %27 = arith.muli %26, %c8_i32 : i32
2026-02-21T09:50:29.6653296Z       %28 = arith.subi %c256_i32, %27 : i32
2026-02-21T09:50:29.6653413Z       %29 = arith.minsi %28, %c8_i32 : i32
2026-02-21T09:50:29.6653534Z       %30 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:50:29.6653651Z       %31 = arith.remsi %30, %29 : i32
2026-02-21T09:50:29.6653765Z       %32 = arith.addi %27, %31 : i32
2026-02-21T09:50:29.6653876Z       %33 = arith.divsi %30, %29 : i32
2026-02-21T09:50:29.6653987Z       %34 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:50:29.6654153Z       %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.6654370Z       %36 = arith.addi %35, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.6654542Z       %37 = arith.muli %33, %c128_i32 : i32
2026-02-21T09:50:29.6654723Z       %38 = tt.splat %37 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.6654938Z       %39 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.6655212Z       %40 = tt.expand_dims %36 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.6655472Z       %41 = arith.muli %40, %cst_1 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.6655660Z       %42 = tt.broadcast %41 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6655930Z       %43 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked>
2026-02-21T09:50:29.6656201Z       %44 = tt.broadcast %43 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6656413Z       %45 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.6656678Z       %46 = tt.expand_dims %9 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.6656945Z       %47 = tt.broadcast %46 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6657130Z       %48 = arith.addi %42, %47 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6657325Z       %49 = tt.addptr %10, %48 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6657524Z       %50 = tt.load %49 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6657754Z       %51 = tt.expand_dims %8 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6657994Z       %52 = arith.muli %51, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6658174Z       %53 = tt.broadcast %52 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6658384Z       %54 = arith.addi %53, %44 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6658574Z       %55 = tt.addptr %11, %54 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6658785Z       %56 = tt.load %55 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6659058Z       %57 = ttg.memdesc_index %45[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6659406Z       ttg.local_store %50, %57 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6659671Z       %58 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6659891Z       %59 = arith.addi %9, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6660161Z       %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.6660428Z       %61 = tt.broadcast %60 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6660611Z       %62 = arith.addi %42, %61 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6660804Z       %63 = tt.addptr %10, %62 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6661002Z       %64 = tt.load %63 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6661234Z       %65 = tt.expand_dims %58 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6661471Z       %66 = arith.muli %65, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6661648Z       %67 = tt.broadcast %66 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6661830Z       %68 = arith.addi %67, %44 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6662014Z       %69 = tt.addptr %11, %68 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6662205Z       %70 = tt.load %69 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6662493Z       %71 = ttg.memdesc_index %45[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6662839Z       ttg.local_store %64, %71 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6663465Z       %72:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %57, %arg8 = %71, %arg9 = %56, %arg10 = %70) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:50:29.6663979Z         %130 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:29.6664149Z         %131 = tt.splat %130 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6664372Z         %132 = arith.addi %131, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.6664544Z         %133 = arith.muli %130, %c2_i32 : i32
2026-02-21T09:50:29.6664714Z         %134 = tt.splat %133 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6664935Z         %135 = arith.addi %134, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.6665209Z         %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.6665483Z         %137 = tt.broadcast %136 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6665676Z         %138 = arith.addi %42, %137 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6665877Z         %139 = tt.addptr %10, %138 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.6666082Z         %140 = tt.load %139 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.6666394Z         %141 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6666839Z         %142 = arith.extf %141 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6667217Z         %143 = tt.expand_dims %132 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6667464Z         %144 = arith.muli %143, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.6667655Z         %145 = tt.broadcast %144 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6667844Z         %146 = arith.addi %145, %44 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6668038Z         %147 = tt.addptr %11, %146 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.6668238Z         %148 = tt.load %147 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.6668398Z         %149 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6668558Z         %150 = arith.shrsi %149, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6668801Z         %151 = ttg.convert_layout %150 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6669050Z         %152 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6669291Z         %153 = ttg.convert_layout %152 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6669627Z         %154 = tt.expand_dims %151 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6669973Z         %155 = tt.expand_dims %153 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6670276Z         %156 = tt.broadcast %154 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6670526Z         %157 = arith.select %16, %156, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6670769Z         %158 = tt.broadcast %155 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6671026Z         %159 = arith.select %18, %158, %157 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6671262Z         %160 = tt.reshape %159 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6671486Z         %161 = arith.sitofp %160 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6671786Z         %162 = ttg.convert_layout %161 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6672256Z         %163 = tt.dot %142, %162, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6672601Z         %164 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:29.6672730Z         %165 = arith.cmpi slt, %164, %c2_i32 : i32
2026-02-21T09:50:29.6672862Z         %166 = arith.select %165, %164, %c0_i32 : i32
2026-02-21T09:50:29.6673125Z         %167 = ttg.memdesc_index %45[%166] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6673480Z         ttg.local_store %140, %167 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.6673974Z         scf.yield %163, %166, %arg8, %167, %arg10, %148 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6674357Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:29.6674625Z       %73 = ttg.local_load %72#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6675058Z       %74 = arith.extf %73 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6675353Z       %75 = arith.shli %72#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6675510Z       %76 = arith.shrsi %75, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6675751Z       %77 = ttg.convert_layout %76 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6675990Z       %78 = arith.shrsi %72#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6676230Z       %79 = ttg.convert_layout %78 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6676566Z       %80 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6676902Z       %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6677186Z       %82 = tt.broadcast %80 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6677426Z       %83 = arith.select %16, %82, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6677661Z       %84 = tt.broadcast %81 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6677892Z       %85 = arith.select %18, %84, %83 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6678117Z       %86 = tt.reshape %85 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6678353Z       %87 = arith.sitofp %86 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6678644Z       %88 = ttg.convert_layout %87 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6679113Z       %89 = tt.dot %74, %88, %72#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6679593Z       %90 = ttg.local_load %72#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6680012Z       %91 = arith.extf %90 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6680303Z       %92 = arith.shli %72#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6680462Z       %93 = arith.shrsi %92, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6680701Z       %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6680944Z       %95 = arith.shrsi %72#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.6681179Z       %96 = ttg.convert_layout %95 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.6681509Z       %97 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6681848Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.6682130Z       %99 = tt.broadcast %97 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6682392Z       %100 = arith.select %16, %99, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6682685Z       %101 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6682945Z       %102 = arith.select %18, %101, %100 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.6683183Z       %103 = tt.reshape %102 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.6683409Z       %104 = arith.sitofp %103 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.6683706Z       %105 = ttg.convert_layout %104 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.6684162Z       %106 = tt.dot %91, %105, %89, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.6684538Z       ttg.local_dealloc %45 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.6684750Z       %107 = arith.truncf %106 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T09:50:29.6684919Z       %108 = arith.extsi %34 : i32 to i64
2026-02-21T09:50:29.6685037Z       %109 = arith.extsi %37 : i32 to i64
2026-02-21T09:50:29.6685198Z       %110 = tt.splat %108 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6685407Z       %111 = arith.addi %110, %20 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.6685668Z       %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6685901Z       %113 = arith.muli %112, %cst_8 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6686079Z       %114 = tt.broadcast %113 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6686308Z       %115 = tt.splat %109 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6686517Z       %116 = arith.addi %115, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.6686787Z       %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6687066Z       %118 = tt.broadcast %117 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6687248Z       %119 = arith.addi %114, %118 : tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6687437Z       %120 = tt.addptr %19, %119 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi64, #mma>
2026-02-21T09:50:29.6687637Z       %121 = arith.cmpi sge, %112, %cst_9 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6687802Z       %122 = arith.cmpi slt, %112, %cst_10 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.6687957Z       %123 = arith.andi %121, %122 : tensor<64x1xi1, #mma>
2026-02-21T09:50:29.6688128Z       %124 = tt.broadcast %123 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6688309Z       %125 = arith.cmpi sge, %117, %cst_11 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6688478Z       %126 = arith.cmpi slt, %117, %cst_12 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.6688635Z       %127 = arith.andi %125, %126 : tensor<1x128xi1, #mma>
2026-02-21T09:50:29.6688806Z       %128 = tt.broadcast %127 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6688983Z       %129 = arith.andi %124, %128 : tensor<64x128xi1, #mma>
2026-02-21T09:50:29.6689141Z       tt.store %120, %107, %129 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.6689287Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:50:29.6689387Z     tt.return
2026-02-21T09:50:29.6689464Z   }
2026-02-21T09:50:29.6689541Z }
2026-02-21T09:50:29.6689584Z 
2026-02-21T09:50:29.6689613Z {-#
2026-02-21T09:50:29.6689693Z   external_resources: {
2026-02-21T09:50:29.6689791Z     mlir_reproducer: {
2026-02-21T09:50:29.6690811Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:50:29.6691833Z       disable_threading: false,
2026-02-21T09:50:29.6691941Z       verify_each: true
2026-02-21T09:50:29.6692039Z     }
2026-02-21T09:50:29.6692114Z   }
2026-02-21T09:50:29.6692192Z #-}
2026-02-21T09:50:29.6692476Z /tmp/torchinductor_root/3a/c3ahk2rjgdqgfrjhf4gky4xzjvueqvkmdafctmimwymr2l4aohzg.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:50:29.6693171Z /tmp/torchinductor_root/3a/c3ahk2rjgdqgfrjhf4gky4xzjvueqvkmdafctmimwymr2l4aohzg.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:50:29.6693725Z [360s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:50:29.6694504Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 128], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:50:29.6695211Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:50:29.6695400Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:50:29.8729448Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:50:29.8738693Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:29.8739239Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:50:29.8739727Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:29.8740193Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:29.8740648Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:50:29.8741061Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:50:29.8741354Z #smem = #ttg.shared_memory
2026-02-21T09:50:29.8741721Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:50:29.8742466Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:50:29.8743156Z     %cst = arith.constant dense<4> : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8743406Z     %c9728_i32 = arith.constant 9728 : i32
2026-02-21T09:50:29.8743596Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:50:29.8743784Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:50:29.8744128Z     %cst_0 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8744408Z     %c29184_i32 = arith.constant 29184 : i32
2026-02-21T09:50:29.8744592Z     %c19456_i32 = arith.constant 19456 : i32
2026-02-21T09:50:29.8744881Z     %c38912_i32 = arith.constant 38912 : i32
2026-02-21T09:50:29.8745046Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:50:29.8745207Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T09:50:29.8745372Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:50:29.8745537Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:50:29.8745702Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:50:29.8745861Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:50:29.8746019Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:50:29.8746224Z     %cst_1 = arith.constant dense<1024> : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.8746478Z     %cst_2 = arith.constant dense<8192> : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8746783Z     %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8747047Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:50:29.8747298Z     %cst_4 = arith.constant dense<2> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8747555Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:50:29.8747718Z     %c26111_i32 = arith.constant 26111 : i32
2026-02-21T09:50:29.8747936Z     %cst_5 = arith.constant dense<0.000000e+00> : tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8748200Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:50:29.8748446Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:50:29.8748686Z     %cst_8 = arith.constant dense<8192> : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8748928Z     %cst_9 = arith.constant dense<0> : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8749155Z     %cst_10 = arith.constant dense<16384> : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8749400Z     %cst_11 = arith.constant dense<0> : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8749767Z     %cst_12 = arith.constant dense<8192> : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8749979Z     %0 = tt.get_program_id x : i32
2026-02-21T09:50:29.8750264Z     %1 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.8750678Z     %2 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8751060Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.8751450Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8751823Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8752201Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8752579Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8752862Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8753242Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:50:29.8753838Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:50:29.8754396Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:50:29.8754684Z     %12 = arith.cmpi eq, %11, %cst_6 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:50:29.8754907Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1>
2026-02-21T09:50:29.8755150Z     %14 = arith.cmpi eq, %11, %cst_7 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:50:29.8755366Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1>
2026-02-21T09:50:29.8755616Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.8755919Z     %17 = arith.extsi %2 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8756293Z     %18 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8756550Z     %19 = arith.subi %c26111_i32, %0 : i32
2026-02-21T09:50:29.8756689Z     %20 = arith.divui %19, %c9728_i32 : i32
2026-02-21T09:50:29.8756822Z     %21 = arith.remsi %20, %c4_i32 : i32
2026-02-21T09:50:29.8756948Z     %22 = arith.subi %20, %21 : i32
2026-02-21T09:50:29.8757077Z     %23 = arith.muli %22, %c9728_i32 : i32
2026-02-21T09:50:29.8757205Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:50:29.8757350Z     scf.for %arg3 = %0 to %24 step %c38912_i32  : i32 {
2026-02-21T09:50:29.8757504Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:50:29.8757647Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:50:29.8757771Z       %27 = arith.subi %c256_i32, %26 : i32
2026-02-21T09:50:29.8757904Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:50:29.8758032Z       %29 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:50:29.8758170Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:50:29.8758294Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:50:29.8758417Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:50:29.8758542Z       %33 = arith.muli %31, %c64_i32 : i32
2026-02-21T09:50:29.8758731Z       %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.8758979Z       %35 = arith.addi %34, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.8759167Z       %36 = arith.muli %32, %c128_i32 : i32
2026-02-21T09:50:29.8759384Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.8759623Z       %38 = arith.addi %37, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.8759923Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.8760219Z       %40 = arith.muli %39, %cst_1 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.8760428Z       %41 = tt.broadcast %40 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8760734Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked>
2026-02-21T09:50:29.8761035Z       %43 = tt.broadcast %42 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8761277Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.8761579Z       %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.8761874Z       %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8762090Z       %47 = arith.addi %41, %46 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8762304Z       %48 = tt.addptr %7, %47 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8762525Z       %49 = tt.load %48 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8762843Z       %50 = tt.expand_dims %5 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8763108Z       %51 = arith.muli %50, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8763317Z       %52 = tt.broadcast %51 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8763548Z       %53 = arith.addi %52, %43 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8763762Z       %54 = tt.addptr %8, %53 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8764004Z       %55 = tt.load %54 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8764302Z       %56 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8764658Z       ttg.local_store %49, %56 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8764929Z       %57 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8765150Z       %58 = arith.addi %6, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8765424Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.8765692Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8765879Z       %61 = arith.addi %41, %60 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8766071Z       %62 = tt.addptr %7, %61 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8766271Z       %63 = tt.load %62 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8766510Z       %64 = tt.expand_dims %57 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8766749Z       %65 = arith.muli %64, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8766936Z       %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8767118Z       %67 = arith.addi %66, %43 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8767308Z       %68 = tt.addptr %8, %67 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8767520Z       %69 = tt.load %68 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8767788Z       %70 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8768134Z       ttg.local_store %63, %70 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8768769Z       %71:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %56, %arg8 = %70, %arg9 = %55, %arg10 = %69) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:50:29.8769289Z         %408 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:29.8769466Z         %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8769687Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8769863Z         %411 = arith.muli %408, %c2_i32 : i32
2026-02-21T09:50:29.8770033Z         %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8770253Z         %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8770530Z         %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.8770808Z         %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8771006Z         %416 = arith.addi %41, %415 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8771205Z         %417 = tt.addptr %7, %416 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8771429Z         %418 = tt.load %417 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8771731Z         %419 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8772202Z         %420 = arith.extf %419 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8772583Z         %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8772831Z         %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8773020Z         %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8773214Z         %424 = arith.addi %423, %43 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8773409Z         %425 = tt.addptr %8, %424 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8773611Z         %426 = tt.load %425 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8773769Z         %427 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8773934Z         %428 = arith.shrsi %427, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8774184Z         %429 = ttg.convert_layout %428 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8774435Z         %430 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8774682Z         %431 = ttg.convert_layout %430 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8775025Z         %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8775393Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8775688Z         %434 = tt.broadcast %432 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8775938Z         %435 = arith.select %13, %434, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8776217Z         %436 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8776462Z         %437 = arith.select %15, %436, %435 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8776699Z         %438 = tt.reshape %437 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8776932Z         %439 = arith.sitofp %438 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8777232Z         %440 = ttg.convert_layout %439 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8777709Z         %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8778060Z         %442 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:29.8778190Z         %443 = arith.cmpi slt, %442, %c2_i32 : i32
2026-02-21T09:50:29.8778327Z         %444 = arith.select %443, %442, %c0_i32 : i32
2026-02-21T09:50:29.8778586Z         %445 = ttg.memdesc_index %44[%444] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8778941Z         ttg.local_store %418, %445 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8779439Z         scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8779824Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:29.8780149Z       %72 = ttg.local_load %71#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8780574Z       %73 = arith.extf %72 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8780877Z       %74 = arith.shli %71#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8781043Z       %75 = arith.shrsi %74, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8781293Z       %76 = ttg.convert_layout %75 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8781545Z       %77 = arith.shrsi %71#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8781787Z       %78 = ttg.convert_layout %77 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8782131Z       %79 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8782472Z       %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8782767Z       %81 = tt.broadcast %79 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8783017Z       %82 = arith.select %13, %81, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8783261Z       %83 = tt.broadcast %80 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8783503Z       %84 = arith.select %15, %83, %82 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8794109Z       %85 = tt.reshape %84 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8794373Z       %86 = arith.sitofp %85 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8794684Z       %87 = ttg.convert_layout %86 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8795170Z       %88 = tt.dot %73, %87, %71#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8795665Z       %89 = ttg.local_load %71#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8796093Z       %90 = arith.extf %89 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8796394Z       %91 = arith.shli %71#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8796565Z       %92 = arith.shrsi %91, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8796813Z       %93 = ttg.convert_layout %92 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8797067Z       %94 = arith.shrsi %71#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8797315Z       %95 = ttg.convert_layout %94 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8797651Z       %96 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8797998Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8798302Z       %98 = tt.broadcast %96 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8798558Z       %99 = arith.select %13, %98, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8798828Z       %100 = tt.broadcast %97 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8799071Z       %101 = arith.select %15, %100, %99 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8799316Z       %102 = tt.reshape %101 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8799550Z       %103 = arith.sitofp %102 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8799858Z       %104 = ttg.convert_layout %103 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8800331Z       %105 = tt.dot %90, %104, %88, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8800715Z       ttg.local_dealloc %44 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.8800943Z       %106 = arith.truncf %105 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T09:50:29.8801125Z       %107 = arith.extsi %33 : i32 to i64
2026-02-21T09:50:29.8801248Z       %108 = arith.extsi %36 : i32 to i64
2026-02-21T09:50:29.8801419Z       %109 = tt.splat %107 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8801633Z       %110 = arith.addi %109, %17 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8801906Z       %111 = tt.expand_dims %110 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8802149Z       %112 = arith.muli %111, %cst_8 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8802337Z       %113 = tt.broadcast %112 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8802637Z       %114 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8802856Z       %115 = arith.addi %114, %18 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8803131Z       %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8803416Z       %117 = tt.broadcast %116 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8803608Z       %118 = arith.addi %113, %117 : tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8803809Z       %119 = tt.addptr %16, %118 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8804014Z       %120 = arith.cmpi sge, %111, %cst_9 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8804191Z       %121 = arith.cmpi slt, %111, %cst_10 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8804357Z       %122 = arith.andi %120, %121 : tensor<64x1xi1, #mma>
2026-02-21T09:50:29.8804542Z       %123 = tt.broadcast %122 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8804733Z       %124 = arith.cmpi sge, %116, %cst_11 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8804911Z       %125 = arith.cmpi slt, %116, %cst_12 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8805081Z       %126 = arith.andi %124, %125 : tensor<1x128xi1, #mma>
2026-02-21T09:50:29.8805258Z       %127 = tt.broadcast %126 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8805445Z       %128 = arith.andi %123, %127 : tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8805607Z       tt.store %119, %106, %128 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.8805771Z       %129 = arith.addi %arg3, %c9728_i32 : i32
2026-02-21T09:50:29.8805903Z       %130 = arith.divsi %129, %c512_i32 : i32
2026-02-21T09:50:29.8806034Z       %131 = arith.muli %130, %c8_i32 : i32
2026-02-21T09:50:29.8806163Z       %132 = arith.subi %c256_i32, %131 : i32
2026-02-21T09:50:29.8806313Z       %133 = arith.minsi %132, %c8_i32 : i32
2026-02-21T09:50:29.8806444Z       %134 = arith.remsi %129, %c512_i32 : i32
2026-02-21T09:50:29.8806586Z       %135 = arith.remsi %134, %133 : i32
2026-02-21T09:50:29.8806710Z       %136 = arith.addi %131, %135 : i32
2026-02-21T09:50:29.8806831Z       %137 = arith.divsi %134, %133 : i32
2026-02-21T09:50:29.8806959Z       %138 = arith.muli %136, %c64_i32 : i32
2026-02-21T09:50:29.8807140Z       %139 = tt.splat %138 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.8807371Z       %140 = arith.addi %139, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.8807557Z       %141 = arith.muli %137, %c128_i32 : i32
2026-02-21T09:50:29.8807731Z       %142 = tt.splat %141 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.8807962Z       %143 = arith.addi %142, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.8808247Z       %144 = tt.expand_dims %140 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.8808511Z       %145 = arith.muli %144, %cst_1 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.8808716Z       %146 = tt.broadcast %145 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8809002Z       %147 = tt.expand_dims %143 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked>
2026-02-21T09:50:29.8809289Z       %148 = tt.broadcast %147 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8809513Z       %149 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.8809711Z       %150 = arith.addi %146, %46 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8809918Z       %151 = tt.addptr %7, %150 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8810152Z       %152 = tt.load %151 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8810320Z       %153 = arith.addi %52, %148 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8810523Z       %154 = tt.addptr %8, %153 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8810730Z       %155 = tt.load %154 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8811036Z       %156 = ttg.memdesc_index %149[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8811399Z       ttg.local_store %152, %156 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8811646Z       %157 = arith.addi %146, %60 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8811846Z       %158 = tt.addptr %7, %157 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8812058Z       %159 = tt.load %158 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8812225Z       %160 = arith.addi %66, %148 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8812422Z       %161 = tt.addptr %8, %160 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8812627Z       %162 = tt.load %161 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8812903Z       %163 = ttg.memdesc_index %149[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8813262Z       ttg.local_store %159, %163 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8813912Z       %164:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %156, %arg8 = %163, %arg9 = %155, %arg10 = %162) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:50:29.8814437Z         %408 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:29.8814621Z         %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8814920Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8815101Z         %411 = arith.muli %408, %c2_i32 : i32
2026-02-21T09:50:29.8815284Z         %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8815511Z         %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8815796Z         %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.8816084Z         %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8816285Z         %416 = arith.addi %146, %415 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8816495Z         %417 = tt.addptr %7, %416 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8816707Z         %418 = tt.load %417 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8817018Z         %419 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8817462Z         %420 = arith.extf %419 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8817848Z         %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8818106Z         %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8818303Z         %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8822690Z         %424 = arith.addi %423, %148 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8822907Z         %425 = tt.addptr %8, %424 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8823110Z         %426 = tt.load %425 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8823305Z         %427 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8823468Z         %428 = arith.shrsi %427, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8823722Z         %429 = ttg.convert_layout %428 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8823979Z         %430 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8824228Z         %431 = ttg.convert_layout %430 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8824578Z         %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8824934Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8825242Z         %434 = tt.broadcast %432 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8825503Z         %435 = arith.select %13, %434, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8825754Z         %436 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8826005Z         %437 = arith.select %15, %436, %435 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8826246Z         %438 = tt.reshape %437 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8826509Z         %439 = arith.sitofp %438 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8826819Z         %440 = ttg.convert_layout %439 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8827310Z         %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8827669Z         %442 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:29.8827803Z         %443 = arith.cmpi slt, %442, %c2_i32 : i32
2026-02-21T09:50:29.8827945Z         %444 = arith.select %443, %442, %c0_i32 : i32
2026-02-21T09:50:29.8828217Z         %445 = ttg.memdesc_index %149[%444] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8828575Z         ttg.local_store %418, %445 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8829062Z         scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8829457Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:29.8829739Z       %165 = ttg.local_load %164#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8830174Z       %166 = arith.extf %165 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8830476Z       %167 = arith.shli %164#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8830647Z       %168 = arith.shrsi %167, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8830918Z       %169 = ttg.convert_layout %168 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8831169Z       %170 = arith.shrsi %164#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8831418Z       %171 = ttg.convert_layout %170 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8831771Z       %172 = tt.expand_dims %169 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8832125Z       %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8832426Z       %174 = tt.broadcast %172 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8832680Z       %175 = arith.select %13, %174, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8832937Z       %176 = tt.broadcast %173 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8833180Z       %177 = arith.select %15, %176, %175 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8833427Z       %178 = tt.reshape %177 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8833663Z       %179 = arith.sitofp %178 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8833964Z       %180 = ttg.convert_layout %179 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8834435Z       %181 = tt.dot %166, %180, %164#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8834953Z       %182 = ttg.local_load %164#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8835378Z       %183 = arith.extf %182 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8835701Z       %184 = arith.shli %164#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8835869Z       %185 = arith.shrsi %184, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8836125Z       %186 = ttg.convert_layout %185 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8836385Z       %187 = arith.shrsi %164#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8836632Z       %188 = ttg.convert_layout %187 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8836980Z       %189 = tt.expand_dims %186 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8837331Z       %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8837632Z       %191 = tt.broadcast %189 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8837890Z       %192 = arith.select %13, %191, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8838139Z       %193 = tt.broadcast %190 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8838390Z       %194 = arith.select %15, %193, %192 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8838629Z       %195 = tt.reshape %194 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8838868Z       %196 = arith.sitofp %195 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8839205Z       %197 = ttg.convert_layout %196 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8839671Z       %198 = tt.dot %183, %197, %181, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8840078Z       ttg.local_dealloc %149 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.8840303Z       %199 = arith.truncf %198 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T09:50:29.8840480Z       %200 = arith.extsi %138 : i32 to i64
2026-02-21T09:50:29.8840610Z       %201 = arith.extsi %141 : i32 to i64
2026-02-21T09:50:29.8840776Z       %202 = tt.splat %200 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8840995Z       %203 = arith.addi %202, %17 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8841262Z       %204 = tt.expand_dims %203 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8841508Z       %205 = arith.muli %204, %cst_8 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8841698Z       %206 = tt.broadcast %205 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8841909Z       %207 = tt.splat %201 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8842129Z       %208 = arith.addi %207, %18 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8842399Z       %209 = tt.expand_dims %208 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8842727Z       %210 = tt.broadcast %209 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8842921Z       %211 = arith.addi %206, %210 : tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8843141Z       %212 = tt.addptr %16, %211 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8843346Z       %213 = arith.cmpi sge, %204, %cst_9 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8843543Z       %214 = arith.cmpi slt, %204, %cst_10 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8843705Z       %215 = arith.andi %213, %214 : tensor<64x1xi1, #mma>
2026-02-21T09:50:29.8843888Z       %216 = tt.broadcast %215 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8844083Z       %217 = arith.cmpi sge, %209, %cst_11 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8844254Z       %218 = arith.cmpi slt, %209, %cst_12 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8844422Z       %219 = arith.andi %217, %218 : tensor<1x128xi1, #mma>
2026-02-21T09:50:29.8844599Z       %220 = tt.broadcast %219 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8844785Z       %221 = arith.andi %216, %220 : tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8844940Z       tt.store %212, %199, %221 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.8845092Z       %222 = arith.addi %arg3, %c19456_i32 : i32
2026-02-21T09:50:29.8845219Z       %223 = arith.divsi %222, %c512_i32 : i32
2026-02-21T09:50:29.8845340Z       %224 = arith.muli %223, %c8_i32 : i32
2026-02-21T09:50:29.8845461Z       %225 = arith.subi %c256_i32, %224 : i32
2026-02-21T09:50:29.8845581Z       %226 = arith.minsi %225, %c8_i32 : i32
2026-02-21T09:50:29.8845702Z       %227 = arith.remsi %222, %c512_i32 : i32
2026-02-21T09:50:29.8845817Z       %228 = arith.remsi %227, %226 : i32
2026-02-21T09:50:29.8845933Z       %229 = arith.addi %224, %228 : i32
2026-02-21T09:50:29.8846045Z       %230 = arith.divsi %227, %226 : i32
2026-02-21T09:50:29.8846170Z       %231 = arith.muli %229, %c64_i32 : i32
2026-02-21T09:50:29.8846340Z       %232 = tt.splat %231 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.8846566Z       %233 = arith.addi %232, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.8846741Z       %234 = arith.muli %230, %c128_i32 : i32
2026-02-21T09:50:29.8846928Z       %235 = tt.splat %234 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.8847149Z       %236 = arith.addi %235, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.8847423Z       %237 = tt.expand_dims %233 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.8847692Z       %238 = arith.muli %237, %cst_1 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.8847888Z       %239 = tt.broadcast %238 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8848167Z       %240 = tt.expand_dims %236 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked>
2026-02-21T09:50:29.8848443Z       %241 = tt.broadcast %240 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8848667Z       %242 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.8848854Z       %243 = arith.addi %239, %46 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8849052Z       %244 = tt.addptr %7, %243 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8849254Z       %245 = tt.load %244 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8849412Z       %246 = arith.addi %52, %241 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8849602Z       %247 = tt.addptr %8, %246 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8849798Z       %248 = tt.load %247 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8850075Z       %249 = ttg.memdesc_index %242[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8850434Z       ttg.local_store %245, %249 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8850691Z       %250 = arith.addi %239, %60 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8850884Z       %251 = tt.addptr %7, %250 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8851103Z       %252 = tt.load %251 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8851262Z       %253 = arith.addi %66, %241 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8851454Z       %254 = tt.addptr %8, %253 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8851651Z       %255 = tt.load %254 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8851921Z       %256 = ttg.memdesc_index %242[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8852274Z       ttg.local_store %252, %256 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8852902Z       %257:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %249, %arg8 = %256, %arg9 = %248, %arg10 = %255) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:50:29.8853422Z         %408 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:29.8853594Z         %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8853813Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8853985Z         %411 = arith.muli %408, %c2_i32 : i32
2026-02-21T09:50:29.8854154Z         %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8854372Z         %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8854664Z         %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.8854941Z         %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8855132Z         %416 = arith.addi %239, %415 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8855352Z         %417 = tt.addptr %7, %416 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8855553Z         %418 = tt.load %417 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8855848Z         %419 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8856277Z         %420 = arith.extf %419 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8856654Z         %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8856898Z         %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8857090Z         %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8857280Z         %424 = arith.addi %423, %241 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8857473Z         %425 = tt.addptr %8, %424 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8857668Z         %426 = tt.load %425 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8857827Z         %427 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8857982Z         %428 = arith.shrsi %427, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8858240Z         %429 = ttg.convert_layout %428 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8858487Z         %430 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8858733Z         %431 = ttg.convert_layout %430 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8859087Z         %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8859431Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8859721Z         %434 = tt.broadcast %432 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8859970Z         %435 = arith.select %13, %434, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8860211Z         %436 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8860456Z         %437 = arith.select %15, %436, %435 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8860692Z         %438 = tt.reshape %437 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8860919Z         %439 = arith.sitofp %438 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8861218Z         %440 = ttg.convert_layout %439 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8861685Z         %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8862031Z         %442 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:29.8862158Z         %443 = arith.cmpi slt, %442, %c2_i32 : i32
2026-02-21T09:50:29.8862293Z         %444 = arith.select %443, %442, %c0_i32 : i32
2026-02-21T09:50:29.8862572Z         %445 = ttg.memdesc_index %242[%444] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8862923Z         ttg.local_store %418, %445 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8863415Z         scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8863801Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:29.8864072Z       %258 = ttg.local_load %257#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8864499Z       %259 = arith.extf %258 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8864792Z       %260 = arith.shli %257#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8864955Z       %261 = arith.shrsi %260, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8865198Z       %262 = ttg.convert_layout %261 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8865441Z       %263 = arith.shrsi %257#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8865683Z       %264 = ttg.convert_layout %263 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8866018Z       %265 = tt.expand_dims %262 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8866377Z       %266 = tt.expand_dims %264 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8866669Z       %267 = tt.broadcast %265 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8866928Z       %268 = arith.select %13, %267, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8867173Z       %269 = tt.broadcast %266 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8867412Z       %270 = arith.select %15, %269, %268 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8867648Z       %271 = tt.reshape %270 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8867873Z       %272 = arith.sitofp %271 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8868166Z       %273 = ttg.convert_layout %272 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8868632Z       %274 = tt.dot %259, %273, %257#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8869121Z       %275 = ttg.local_load %257#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8869544Z       %276 = arith.extf %275 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8869841Z       %277 = arith.shli %257#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8870000Z       %278 = arith.shrsi %277, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8870245Z       %279 = ttg.convert_layout %278 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8870492Z       %280 = arith.shrsi %257#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8870750Z       %281 = ttg.convert_layout %280 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8871087Z       %282 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8871440Z       %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8871727Z       %284 = tt.broadcast %282 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8871973Z       %285 = arith.select %13, %284, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8872214Z       %286 = tt.broadcast %283 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8872454Z       %287 = arith.select %15, %286, %285 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8872687Z       %288 = tt.reshape %287 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8872911Z       %289 = arith.sitofp %288 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8873207Z       %290 = ttg.convert_layout %289 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8873663Z       %291 = tt.dot %276, %290, %274, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8874043Z       ttg.local_dealloc %242 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.8874259Z       %292 = arith.truncf %291 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T09:50:29.8874427Z       %293 = arith.extsi %231 : i32 to i64
2026-02-21T09:50:29.8874559Z       %294 = arith.extsi %234 : i32 to i64
2026-02-21T09:50:29.8874718Z       %295 = tt.splat %293 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8874942Z       %296 = arith.addi %295, %17 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8875202Z       %297 = tt.expand_dims %296 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8875440Z       %298 = arith.muli %297, %cst_8 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8875619Z       %299 = tt.broadcast %298 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8875824Z       %300 = tt.splat %294 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8876036Z       %301 = arith.addi %300, %18 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8876302Z       %302 = tt.expand_dims %301 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8876568Z       %303 = tt.broadcast %302 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8876751Z       %304 = arith.addi %299, %303 : tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8876940Z       %305 = tt.addptr %16, %304 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8877137Z       %306 = arith.cmpi sge, %297, %cst_9 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8877301Z       %307 = arith.cmpi slt, %297, %cst_10 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8877457Z       %308 = arith.andi %306, %307 : tensor<64x1xi1, #mma>
2026-02-21T09:50:29.8877627Z       %309 = tt.broadcast %308 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8877811Z       %310 = arith.cmpi sge, %302, %cst_11 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8877978Z       %311 = arith.cmpi slt, %302, %cst_12 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8878134Z       %312 = arith.andi %310, %311 : tensor<1x128xi1, #mma>
2026-02-21T09:50:29.8878324Z       %313 = tt.broadcast %312 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8878499Z       %314 = arith.andi %309, %313 : tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8878659Z       tt.store %305, %292, %314 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.8878820Z       %315 = arith.addi %arg3, %c29184_i32 : i32
2026-02-21T09:50:29.8878944Z       %316 = arith.divsi %315, %c512_i32 : i32
2026-02-21T09:50:29.8879065Z       %317 = arith.muli %316, %c8_i32 : i32
2026-02-21T09:50:29.8879182Z       %318 = arith.subi %c256_i32, %317 : i32
2026-02-21T09:50:29.8879301Z       %319 = arith.minsi %318, %c8_i32 : i32
2026-02-21T09:50:29.8879418Z       %320 = arith.remsi %315, %c512_i32 : i32
2026-02-21T09:50:29.8879536Z       %321 = arith.remsi %320, %319 : i32
2026-02-21T09:50:29.8879648Z       %322 = arith.addi %317, %321 : i32
2026-02-21T09:50:29.8879761Z       %323 = arith.divsi %320, %319 : i32
2026-02-21T09:50:29.8879878Z       %324 = arith.muli %322, %c64_i32 : i32
2026-02-21T09:50:29.8880045Z       %325 = tt.splat %324 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.8880268Z       %326 = arith.addi %325, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.8880441Z       %327 = arith.muli %323, %c128_i32 : i32
2026-02-21T09:50:29.8880610Z       %328 = tt.splat %327 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.8880828Z       %329 = arith.addi %328, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.8881100Z       %330 = tt.expand_dims %326 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.8881353Z       %331 = arith.muli %330, %cst_1 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.8881545Z       %332 = tt.broadcast %331 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8881841Z       %333 = tt.expand_dims %329 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked>
2026-02-21T09:50:29.8882115Z       %334 = tt.broadcast %333 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8882350Z       %335 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.8882539Z       %336 = arith.addi %332, %46 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8882783Z       %337 = tt.addptr %7, %336 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8882992Z       %338 = tt.load %337 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8883153Z       %339 = arith.addi %52, %334 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8883349Z       %340 = tt.addptr %8, %339 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8883547Z       %341 = tt.load %340 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8883825Z       %342 = ttg.memdesc_index %335[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8884183Z       ttg.local_store %338, %342 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8884420Z       %343 = arith.addi %332, %60 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8884618Z       %344 = tt.addptr %7, %343 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8884821Z       %345 = tt.load %344 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8884976Z       %346 = arith.addi %66, %334 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8885168Z       %347 = tt.addptr %8, %346 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8885360Z       %348 = tt.load %347 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8885660Z       %349 = ttg.memdesc_index %335[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8886020Z       ttg.local_store %345, %349 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8886642Z       %350:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %342, %arg8 = %349, %arg9 = %341, %arg10 = %348) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:50:29.8887177Z         %408 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:29.8887350Z         %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8887569Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8887742Z         %411 = arith.muli %408, %c2_i32 : i32
2026-02-21T09:50:29.8887912Z         %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8888133Z         %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8888408Z         %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.8888688Z         %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8888892Z         %416 = arith.addi %332, %415 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8889097Z         %417 = tt.addptr %7, %416 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8889316Z         %418 = tt.load %417 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8889643Z         %419 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8890077Z         %420 = arith.extf %419 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8890484Z         %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8890735Z         %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8890936Z         %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8891136Z         %424 = arith.addi %423, %334 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8891336Z         %425 = tt.addptr %8, %424 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8891546Z         %426 = tt.load %425 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8891711Z         %427 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8891881Z         %428 = arith.shrsi %427, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8892132Z         %429 = ttg.convert_layout %428 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8892392Z         %430 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8892647Z         %431 = ttg.convert_layout %430 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8892988Z         %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8893343Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8893648Z         %434 = tt.broadcast %432 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8893918Z         %435 = arith.select %13, %434, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8894175Z         %436 = tt.broadcast %433 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8894421Z         %437 = arith.select %15, %436, %435 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8894682Z         %438 = tt.reshape %437 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8894920Z         %439 = arith.sitofp %438 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8895222Z         %440 = ttg.convert_layout %439 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8895704Z         %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8896056Z         %442 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:29.8896195Z         %443 = arith.cmpi slt, %442, %c2_i32 : i32
2026-02-21T09:50:29.8896339Z         %444 = arith.select %443, %442, %c0_i32 : i32
2026-02-21T09:50:29.8896609Z         %445 = ttg.memdesc_index %335[%444] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8896971Z         ttg.local_store %418, %445 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8897454Z         scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8897864Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:29.8898149Z       %351 = ttg.local_load %350#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8898594Z       %352 = arith.extf %351 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8898902Z       %353 = arith.shli %350#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8899075Z       %354 = arith.shrsi %353, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8899323Z       %355 = ttg.convert_layout %354 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8899579Z       %356 = arith.shrsi %350#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8899826Z       %357 = ttg.convert_layout %356 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8900173Z       %358 = tt.expand_dims %355 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8900527Z       %359 = tt.expand_dims %357 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8900822Z       %360 = tt.broadcast %358 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8901078Z       %361 = arith.select %13, %360, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8901329Z       %362 = tt.broadcast %359 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8901578Z       %363 = arith.select %15, %362, %361 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8901823Z       %364 = tt.reshape %363 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8902071Z       %365 = arith.sitofp %364 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8902378Z       %366 = ttg.convert_layout %365 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8902854Z       %367 = tt.dot %352, %366, %350#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8903365Z       %368 = ttg.local_load %350#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8903798Z       %369 = arith.extf %368 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8904098Z       %370 = arith.shli %350#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8904271Z       %371 = arith.shrsi %370, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8904523Z       %372 = ttg.convert_layout %371 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8904776Z       %373 = arith.shrsi %350#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8905029Z       %374 = ttg.convert_layout %373 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8905369Z       %375 = tt.expand_dims %372 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8905721Z       %376 = tt.expand_dims %374 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8906021Z       %377 = tt.broadcast %375 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8906286Z       %378 = arith.select %13, %377, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8906539Z       %379 = tt.broadcast %376 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8906805Z       %380 = arith.select %15, %379, %378 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8907046Z       %381 = tt.reshape %380 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8907280Z       %382 = arith.sitofp %381 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8907578Z       %383 = ttg.convert_layout %382 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8908048Z       %384 = tt.dot %369, %383, %367, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8908439Z       ttg.local_dealloc %335 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.8908656Z       %385 = arith.truncf %384 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T09:50:29.8908839Z       %386 = arith.extsi %324 : i32 to i64
2026-02-21T09:50:29.8908965Z       %387 = arith.extsi %327 : i32 to i64
2026-02-21T09:50:29.8909136Z       %388 = tt.splat %386 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8909350Z       %389 = arith.addi %388, %17 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8909622Z       %390 = tt.expand_dims %389 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8909869Z       %391 = arith.muli %390, %cst_8 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8910051Z       %392 = tt.broadcast %391 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8910269Z       %393 = tt.splat %387 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8910503Z       %394 = arith.addi %393, %18 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8910780Z       %395 = tt.expand_dims %394 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8911063Z       %396 = tt.broadcast %395 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8911249Z       %397 = arith.addi %392, %396 : tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8911447Z       %398 = tt.addptr %16, %397 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8911648Z       %399 = arith.cmpi sge, %390, %cst_9 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8911824Z       %400 = arith.cmpi slt, %390, %cst_10 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8911991Z       %401 = arith.andi %399, %400 : tensor<64x1xi1, #mma>
2026-02-21T09:50:29.8912166Z       %402 = tt.broadcast %401 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8912366Z       %403 = arith.cmpi sge, %395, %cst_11 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8912537Z       %404 = arith.cmpi slt, %395, %cst_12 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8912707Z       %405 = arith.andi %403, %404 : tensor<1x128xi1, #mma>
2026-02-21T09:50:29.8912885Z       %406 = tt.broadcast %405 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8913071Z       %407 = arith.andi %402, %406 : tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8913239Z       tt.store %398, %385, %407 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.8913388Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:50:29.8913529Z     scf.for %arg3 = %24 to %c16384_i32 step %c9728_i32  : i32 {
2026-02-21T09:50:29.8913679Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:50:29.8913814Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:50:29.8913937Z       %27 = arith.subi %c256_i32, %26 : i32
2026-02-21T09:50:29.8914079Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:50:29.8914212Z       %29 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:50:29.8914335Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:50:29.8914481Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:50:29.8914598Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:50:29.8914723Z       %33 = arith.muli %31, %c64_i32 : i32
2026-02-21T09:50:29.8914896Z       %34 = tt.splat %33 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.8915126Z       %35 = arith.addi %34, %1 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:29.8915305Z       %36 = arith.muli %32, %c128_i32 : i32
2026-02-21T09:50:29.8915472Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.8915696Z       %38 = arith.addi %37, %3 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:50:29.8915974Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.8916231Z       %40 = arith.muli %39, %cst_1 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:29.8916426Z       %41 = tt.broadcast %40 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8916707Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi32, #blocked>
2026-02-21T09:50:29.8916986Z       %43 = tt.broadcast %42 : tensor<1x128xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8917203Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.8917480Z       %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.8917750Z       %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8917946Z       %47 = arith.addi %41, %46 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8918169Z       %48 = tt.addptr %7, %47 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8918374Z       %49 = tt.load %48 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8918619Z       %50 = tt.expand_dims %5 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8918876Z       %51 = arith.muli %50, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8919067Z       %52 = tt.broadcast %51 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8919261Z       %53 = arith.addi %52, %43 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8919455Z       %54 = tt.addptr %8, %53 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8919657Z       %55 = tt.load %54 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8919942Z       %56 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8920305Z       ttg.local_store %49, %56 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8920583Z       %57 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8920811Z       %58 = arith.addi %6, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8921091Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.8921363Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8921558Z       %61 = arith.addi %41, %60 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8921758Z       %62 = tt.addptr %7, %61 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8921980Z       %63 = tt.load %62 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8922223Z       %64 = tt.expand_dims %57 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8922507Z       %65 = arith.muli %64, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8922730Z       %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8922920Z       %67 = arith.addi %66, %43 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8923113Z       %68 = tt.addptr %8, %67 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8923310Z       %69 = tt.load %68 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8923583Z       %70 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8923940Z       ttg.local_store %63, %70 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8924562Z       %71:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %56, %arg8 = %70, %arg9 = %55, %arg10 = %69) -> (tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:50:29.8925087Z         %129 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:29.8925268Z         %130 = tt.splat %129 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8925493Z         %131 = arith.addi %130, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:29.8925678Z         %132 = arith.muli %129, %c2_i32 : i32
2026-02-21T09:50:29.8925858Z         %133 = tt.splat %132 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8926106Z         %134 = arith.addi %133, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:29.8926392Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:29.8926675Z         %136 = tt.broadcast %135 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8926895Z         %137 = arith.addi %41, %136 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8927104Z         %138 = tt.addptr %7, %137 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:29.8927313Z         %139 = tt.load %138 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:29.8927620Z         %140 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8928058Z         %141 = arith.extf %140 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8928444Z         %142 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8928699Z         %143 = arith.muli %142, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:50:29.8928895Z         %144 = tt.broadcast %143 : tensor<2x1xi32, #blocked> -> tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8929094Z         %145 = arith.addi %144, %43 : tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8929293Z         %146 = tt.addptr %8, %145 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi32, #blocked>
2026-02-21T09:50:29.8929502Z         %147 = tt.load %146 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:50:29.8929671Z         %148 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8929835Z         %149 = arith.shrsi %148, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8930111Z         %150 = ttg.convert_layout %149 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8930365Z         %151 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8930635Z         %152 = ttg.convert_layout %151 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8930984Z         %153 = tt.expand_dims %150 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8931331Z         %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8931631Z         %155 = tt.broadcast %153 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8931886Z         %156 = arith.select %13, %155, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8932140Z         %157 = tt.broadcast %154 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8932391Z         %158 = arith.select %15, %157, %156 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8932635Z         %159 = tt.reshape %158 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8932873Z         %160 = arith.sitofp %159 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8933180Z         %161 = ttg.convert_layout %160 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8933650Z         %162 = tt.dot %141, %161, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8934006Z         %163 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:29.8934140Z         %164 = arith.cmpi slt, %163, %c2_i32 : i32
2026-02-21T09:50:29.8934301Z         %165 = arith.select %164, %163, %c0_i32 : i32
2026-02-21T09:50:29.8934575Z         %166 = ttg.memdesc_index %44[%165] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8934945Z         ttg.local_store %139, %166 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:29.8935440Z         scf.yield %162, %165, %arg8, %166, %arg10, %147 : tensor<64x128xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8935832Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:29.8936113Z       %72 = ttg.local_load %71#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8936543Z       %73 = arith.extf %72 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8936841Z       %74 = arith.shli %71#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8937011Z       %75 = arith.shrsi %74, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8937257Z       %76 = ttg.convert_layout %75 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8937507Z       %77 = arith.shrsi %71#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8937753Z       %78 = ttg.convert_layout %77 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8938087Z       %79 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8938447Z       %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8938546Z       %81 = tt.broadcast %79 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8938678Z       %82 = arith.select %13, %81, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8938774Z       %83 = tt.broadcast %80 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8938876Z       %84 = arith.select %15, %83, %82 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8938971Z       %85 = tt.reshape %84 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8939065Z       %86 = arith.sitofp %85 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8939227Z       %87 = ttg.convert_layout %86 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8939495Z       %88 = tt.dot %73, %87, %71#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8939686Z       %89 = ttg.local_load %71#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8939877Z       %90 = arith.extf %89 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8939947Z       %91 = arith.shli %71#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8940009Z       %92 = arith.shrsi %91, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8940153Z       %93 = ttg.convert_layout %92 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8940241Z       %94 = arith.shrsi %71#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:50:29.8940380Z       %95 = ttg.convert_layout %94 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:29.8940536Z       %96 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8940703Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:50:29.8940798Z       %98 = tt.broadcast %96 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8940904Z       %99 = arith.select %13, %98, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8941007Z       %100 = tt.broadcast %97 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8941113Z       %101 = arith.select %15, %100, %99 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:50:29.8941206Z       %102 = tt.reshape %101 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:50:29.8941305Z       %103 = arith.sitofp %102 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:50:29.8941468Z       %104 = ttg.convert_layout %103 : tensor<4x128xf32, #blocked3> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:29.8941726Z       %105 = tt.dot %90, %104, %88, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x128xf32, #mma>
2026-02-21T09:50:29.8941816Z       ttg.local_dealloc %44 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:29.8941905Z       %106 = arith.truncf %105 : tensor<64x128xf32, #mma> to tensor<64x128xbf16, #mma>
2026-02-21T09:50:29.8941970Z       %107 = arith.extsi %33 : i32 to i64
2026-02-21T09:50:29.8942023Z       %108 = arith.extsi %36 : i32 to i64
2026-02-21T09:50:29.8942112Z       %109 = tt.splat %107 : i64 -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8942213Z       %110 = arith.addi %109, %17 : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:29.8942354Z       %111 = tt.expand_dims %110 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8942420Z       %112 = arith.muli %111, %cst_8 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8942504Z       %113 = tt.broadcast %112 : tensor<64x1xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8942592Z       %114 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8942684Z       %115 = arith.addi %114, %18 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:29.8942827Z       %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8942911Z       %117 = tt.broadcast %116 : tensor<1x128xi64, #mma> -> tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8942979Z       %118 = arith.addi %113, %117 : tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8943078Z       %119 = tt.addptr %16, %118 : tensor<64x128x!tt.ptr<bf16>, #mma>, tensor<64x128xi64, #mma>
2026-02-21T09:50:29.8943147Z       %120 = arith.cmpi sge, %111, %cst_9 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8943221Z       %121 = arith.cmpi slt, %111, %cst_10 : tensor<64x1xi64, #mma>
2026-02-21T09:50:29.8943280Z       %122 = arith.andi %120, %121 : tensor<64x1xi1, #mma>
2026-02-21T09:50:29.8943362Z       %123 = tt.broadcast %122 : tensor<64x1xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8943437Z       %124 = arith.cmpi sge, %116, %cst_11 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8943504Z       %125 = arith.cmpi slt, %116, %cst_12 : tensor<1x128xi64, #mma>
2026-02-21T09:50:29.8943564Z       %126 = arith.andi %124, %125 : tensor<1x128xi1, #mma>
2026-02-21T09:50:29.8943667Z       %127 = tt.broadcast %126 : tensor<1x128xi1, #mma> -> tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8943727Z       %128 = arith.andi %123, %127 : tensor<64x128xi1, #mma>
2026-02-21T09:50:29.8943797Z       tt.store %119, %106, %128 : tensor<64x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:29.8943856Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:50:29.8943900Z     tt.return
2026-02-21T09:50:29.8943935Z   }
2026-02-21T09:50:29.8943976Z }
2026-02-21T09:50:29.8943980Z 
2026-02-21T09:50:29.8944021Z {-#
2026-02-21T09:50:29.8944066Z   external_resources: {
2026-02-21T09:50:29.8944107Z     mlir_reproducer: {
2026-02-21T09:50:29.8945051Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:50:29.8945098Z       disable_threading: false,
2026-02-21T09:50:29.8945138Z       verify_each: true
2026-02-21T09:50:29.8945178Z     }
2026-02-21T09:50:29.8945211Z   }
2026-02-21T09:50:29.8945244Z #-}
2026-02-21T09:50:29.8945494Z /tmp/torchinductor_root/kf/ckfbghpgt3j65na45xytph3niqm4zjzp5oea4cwbjjsx3nxiqfac.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:50:29.8945946Z /tmp/torchinductor_root/kf/ckfbghpgt3j65na45xytph3niqm4zjzp5oea4cwbjjsx3nxiqfac.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:50:29.8946063Z [360s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:50:29.8946702Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 128], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:50:29.8946778Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:50:29.8946861Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:50:32.8973016Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:50:32.8979357Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:50:32.8980193Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:32.8980919Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:32.8981631Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:32.8982300Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:50:32.8982905Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:50:32.8983336Z #smem = #ttg.shared_memory
2026-02-21T09:50:32.8984400Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:50:32.8985271Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:50:32.8986216Z     %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma>
2026-02-21T09:50:32.8986547Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:50:32.8986865Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:50:32.8987198Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma>
2026-02-21T09:50:32.8987489Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:50:32.8987833Z     %cst_3 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.8988313Z     %cst_4 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.8988775Z     %cst_5 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.8989170Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.8989498Z     %cst_7 = arith.constant dense<1024> : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:32.8989782Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:50:32.8989994Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:50:32.8990208Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:50:32.8990431Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T09:50:32.8990649Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:50:32.8990852Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:50:32.8991108Z     %cst_8 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.8991373Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:50:32.8991568Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:50:32.8991976Z     %cst_9 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.8992323Z     %0 = tt.get_program_id x : i32
2026-02-21T09:50:32.8992529Z     %1 = arith.muli %0, %c4_i32 : i32
2026-02-21T09:50:32.8992837Z     %2 = arith.addi %1, %c4_i32 : i32
2026-02-21T09:50:32.8993045Z     %3 = arith.minsi %2, %c32768_i32 : i32
2026-02-21T09:50:32.8993421Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:32.8993921Z     %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:32.8994307Z     %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:32.8994659Z     %7 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:32.8995006Z     %8 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.8995363Z     %9 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.8995692Z     %10 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.8995962Z     %11 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.8996327Z     %12 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:50:32.8996887Z     %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:50:32.8997436Z     %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:50:32.8997770Z     %15 = arith.cmpi eq, %14, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:50:32.8998062Z     %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:50:32.8998328Z     %17 = arith.cmpi eq, %14, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:50:32.8998578Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:50:32.8998878Z     %19 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:32.8999083Z     %20 = arith.subi %3, %1 : i32
2026-02-21T09:50:32.8999234Z     %21 = arith.remsi %20, %c3_i32 : i32
2026-02-21T09:50:32.8999387Z     %22 = arith.subi %20, %21 : i32
2026-02-21T09:50:32.8999534Z     %23 = arith.addi %1, %22 : i32
2026-02-21T09:50:32.8999732Z     scf.for %arg3 = %1 to %23 step %c3_i32  : i32 {
2026-02-21T09:50:32.8999914Z       %24 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:50:32.9000078Z       %25 = arith.muli %24, %c2_i32 : i32
2026-02-21T09:50:32.9000237Z       %26 = arith.subi %c256_i32, %25 : i32
2026-02-21T09:50:32.9000397Z       %27 = arith.minsi %26, %c2_i32 : i32
2026-02-21T09:50:32.9000564Z       %28 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:50:32.9000722Z       %29 = arith.remsi %28, %27 : i32
2026-02-21T09:50:32.9000872Z       %30 = arith.addi %25, %29 : i32
2026-02-21T09:50:32.9001014Z       %31 = arith.divsi %28, %27 : i32
2026-02-21T09:50:32.9001166Z       %32 = arith.muli %30, %c64_i32 : i32
2026-02-21T09:50:32.9001393Z       %33 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:32.9001676Z       %34 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:32.9001959Z       %35 = arith.addi %33, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:32.9002240Z       %36 = arith.addi %34, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:32.9002465Z       %37 = arith.muli %31, %c64_i32 : i32
2026-02-21T09:50:32.9002861Z       %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:32.9003143Z       %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:32.9003447Z       %40 = arith.addi %38, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:32.9003732Z       %41 = arith.addi %39, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:32.9004129Z       %42 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:32.9004393Z       %43 = arith.muli %42, %cst_7 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:32.9004598Z       %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9004890Z       %45 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1>
2026-02-21T09:50:32.9005187Z       %46 = tt.broadcast %45 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9005414Z       %47 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:32.9005702Z       %48 = tt.expand_dims %9 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:32.9005992Z       %49 = tt.broadcast %48 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9006195Z       %50 = arith.addi %44, %49 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9006407Z       %51 = tt.addptr %10, %50 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9006625Z       %52 = tt.load %51 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9006926Z       %53 = ttg.memdesc_index %47[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9007332Z       ttg.local_store %52, %53 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9007617Z       %54 = arith.addi %9, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.9007915Z       %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:32.9008230Z       %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9008430Z       %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9008638Z       %58 = tt.addptr %10, %57 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9008851Z       %59 = tt.load %58 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9009148Z       %60 = ttg.memdesc_index %47[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9009525Z       ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9010072Z       %61:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %53, %arg8 = %60) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:50:32.9010578Z         %276 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9010822Z         %277 = arith.addi %276, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9011013Z         %278 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:32.9011145Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:50:32.9011327Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.9011583Z         %281 = arith.addi %280, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.9011875Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:32.9012195Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9012407Z         %284 = arith.addi %44, %283 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9012622Z         %285 = tt.addptr %10, %284 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9012843Z         %286 = tt.load %285 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9013166Z         %287 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9013632Z         %288 = arith.extf %287 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9014027Z         %289 = tt.expand_dims %277 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9014276Z         %290 = arith.muli %289, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9014470Z         %291 = tt.broadcast %290 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9014659Z         %292 = arith.addi %291, %46 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9014856Z         %293 = tt.addptr %11, %292 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9015057Z         %294 = tt.load %293 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9015298Z         %295 = ttg.convert_layout %294 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9015580Z         %296 = arith.shli %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9015858Z         %297 = arith.shrsi %296, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9016100Z         %298 = arith.shrsi %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9016388Z         %299 = tt.expand_dims %297 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9016762Z         %300 = tt.expand_dims %298 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9017047Z         %301 = tt.broadcast %299 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9017283Z         %302 = arith.select %16, %301, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9017520Z         %303 = tt.broadcast %300 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9017756Z         %304 = arith.select %18, %303, %302 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9017984Z         %305 = tt.reshape %304 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9018208Z         %306 = arith.sitofp %305 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9018507Z         %307 = ttg.convert_layout %306 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9018983Z         %308 = tt.dot %288, %307, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9019334Z         %309 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:32.9019463Z         %310 = arith.cmpi slt, %309, %c2_i32 : i32
2026-02-21T09:50:32.9019615Z         %311 = arith.select %310, %309, %c0_i32 : i32
2026-02-21T09:50:32.9019878Z         %312 = ttg.memdesc_index %47[%311] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9020243Z         ttg.local_store %286, %312 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9020634Z         scf.yield %308, %311, %arg8, %312 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9020933Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:32.9021107Z       %62 = arith.addi %8, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9021430Z       %63 = ttg.local_load %61#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9021851Z       %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9022224Z       %65 = tt.expand_dims %62 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9022466Z       %66 = arith.muli %65, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9030068Z       %67 = tt.broadcast %66 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9030286Z       %68 = arith.addi %67, %46 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9030482Z       %69 = tt.addptr %11, %68 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9030677Z       %70 = tt.load %69 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9030914Z       %71 = ttg.convert_layout %70 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9031242Z       %72 = arith.shli %71, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9031467Z       %73 = arith.shrsi %72, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9031698Z       %74 = arith.shrsi %71, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9031997Z       %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9032322Z       %76 = tt.expand_dims %74 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9032597Z       %77 = tt.broadcast %75 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9032825Z       %78 = arith.select %16, %77, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9033054Z       %79 = tt.broadcast %76 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9033275Z       %80 = arith.select %18, %79, %78 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9033494Z       %81 = tt.reshape %80 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9033708Z       %82 = arith.sitofp %81 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9033995Z       %83 = ttg.convert_layout %82 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9034452Z       %84 = tt.dot %64, %83, %61#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9034836Z       %85 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9051318Z       %86 = ttg.local_load %61#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9051757Z       %87 = arith.extf %86 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9052162Z       %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9052412Z       %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9052600Z       %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9052788Z       %91 = arith.addi %90, %46 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9052980Z       %92 = tt.addptr %11, %91 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9053173Z       %93 = tt.load %92 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9053411Z       %94 = ttg.convert_layout %93 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9053684Z       %95 = arith.shli %94, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9053910Z       %96 = arith.shrsi %95, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9054138Z       %97 = arith.shrsi %94, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9054416Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9054746Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9055021Z       %100 = tt.broadcast %98 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9055255Z       %101 = arith.select %16, %100, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9055511Z       %102 = tt.broadcast %99 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9055742Z       %103 = arith.select %18, %102, %101 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9055988Z       %104 = tt.reshape %103 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9056206Z       %105 = arith.sitofp %104 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9056495Z       %106 = ttg.convert_layout %105 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9056949Z       %107 = tt.dot %87, %106, %84, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9057330Z       ttg.local_dealloc %47 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:32.9057545Z       %108 = arith.truncf %107 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:50:32.9057809Z       %109 = tt.expand_dims %36 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:50:32.9058045Z       %110 = arith.muli %109, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:50:32.9058274Z       %111 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:50:32.9058531Z       %112 = tt.broadcast %110 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9058732Z       %113 = tt.broadcast %111 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9058911Z       %114 = arith.addi %112, %113 : tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9059094Z       %115 = tt.addptr %19, %114 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9059311Z       tt.store %115, %108 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:32.9059451Z       %116 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:50:32.9059594Z       %117 = arith.divsi %116, %c256_i32 : i32
2026-02-21T09:50:32.9059713Z       %118 = arith.muli %117, %c2_i32 : i32
2026-02-21T09:50:32.9059834Z       %119 = arith.subi %c256_i32, %118 : i32
2026-02-21T09:50:32.9059953Z       %120 = arith.minsi %119, %c2_i32 : i32
2026-02-21T09:50:32.9060071Z       %121 = arith.remsi %116, %c256_i32 : i32
2026-02-21T09:50:32.9060191Z       %122 = arith.remsi %121, %120 : i32
2026-02-21T09:50:32.9060303Z       %123 = arith.addi %118, %122 : i32
2026-02-21T09:50:32.9060421Z       %124 = arith.divsi %121, %120 : i32
2026-02-21T09:50:32.9060536Z       %125 = arith.muli %123, %c64_i32 : i32
2026-02-21T09:50:32.9060711Z       %126 = tt.splat %125 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:32.9060926Z       %127 = tt.splat %125 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:32.9061141Z       %128 = arith.addi %126, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:32.9061358Z       %129 = arith.addi %127, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:32.9061521Z       %130 = arith.muli %124, %c64_i32 : i32
2026-02-21T09:50:32.9061689Z       %131 = tt.splat %130 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:32.9061898Z       %132 = tt.splat %130 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:32.9062112Z       %133 = arith.addi %131, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:32.9062324Z       %134 = arith.addi %132, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:32.9062593Z       %135 = tt.expand_dims %128 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:32.9062866Z       %136 = arith.muli %135, %cst_7 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:32.9063060Z       %137 = tt.broadcast %136 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9063344Z       %138 = tt.expand_dims %133 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1>
2026-02-21T09:50:32.9063642Z       %139 = tt.broadcast %138 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9063863Z       %140 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:32.9064053Z       %141 = arith.addi %137, %49 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9064253Z       %142 = tt.addptr %10, %141 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9064461Z       %143 = tt.load %142 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9064749Z       %144 = ttg.memdesc_index %140[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9065105Z       ttg.local_store %143, %144 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9065346Z       %145 = arith.addi %137, %56 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9065544Z       %146 = tt.addptr %10, %145 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9065753Z       %147 = tt.load %146 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9066033Z       %148 = ttg.memdesc_index %140[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9066384Z       ttg.local_store %147, %148 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9066934Z       %149:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %144, %arg8 = %148) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:50:32.9067419Z         %276 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9067649Z         %277 = arith.addi %276, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9067829Z         %278 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:32.9067953Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:50:32.9068124Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.9068347Z         %281 = arith.addi %280, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.9068622Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:32.9068904Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9069099Z         %284 = arith.addi %137, %283 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9069305Z         %285 = tt.addptr %10, %284 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9069513Z         %286 = tt.load %285 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9069812Z         %287 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9070246Z         %288 = arith.extf %287 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9070627Z         %289 = tt.expand_dims %277 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9070899Z         %290 = arith.muli %289, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9071094Z         %291 = tt.broadcast %290 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9071288Z         %292 = arith.addi %291, %139 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9071484Z         %293 = tt.addptr %11, %292 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9071701Z         %294 = tt.load %293 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9071947Z         %295 = ttg.convert_layout %294 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9072229Z         %296 = arith.shli %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9072462Z         %297 = arith.shrsi %296, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9072703Z         %298 = arith.shrsi %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9072992Z         %299 = tt.expand_dims %297 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9073331Z         %300 = tt.expand_dims %298 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9073616Z         %301 = tt.broadcast %299 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9073851Z         %302 = arith.select %16, %301, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9074088Z         %303 = tt.broadcast %300 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9074317Z         %304 = arith.select %18, %303, %302 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9074547Z         %305 = tt.reshape %304 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9074794Z         %306 = arith.sitofp %305 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9075087Z         %307 = ttg.convert_layout %306 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9075573Z         %308 = tt.dot %288, %307, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9075921Z         %309 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:32.9076048Z         %310 = arith.cmpi slt, %309, %c2_i32 : i32
2026-02-21T09:50:32.9076184Z         %311 = arith.select %310, %309, %c0_i32 : i32
2026-02-21T09:50:32.9076447Z         %312 = ttg.memdesc_index %140[%311] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9076804Z         ttg.local_store %286, %312 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9077194Z         scf.yield %308, %311, %arg8, %312 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9077497Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:32.9077775Z       %150 = ttg.local_load %149#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9078198Z       %151 = arith.extf %150 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9078496Z       %152 = arith.addi %67, %139 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9078695Z       %153 = tt.addptr %11, %152 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9078911Z       %154 = tt.load %153 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9079155Z       %155 = ttg.convert_layout %154 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9079432Z       %156 = arith.shli %155, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9079682Z       %157 = arith.shrsi %156, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9079918Z       %158 = arith.shrsi %155, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9080200Z       %159 = tt.expand_dims %157 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9080533Z       %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9080814Z       %161 = tt.broadcast %159 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9081051Z       %162 = arith.select %16, %161, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9081292Z       %163 = tt.broadcast %160 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9081521Z       %164 = arith.select %18, %163, %162 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9081749Z       %165 = tt.reshape %164 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9081968Z       %166 = arith.sitofp %165 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9082262Z       %167 = ttg.convert_layout %166 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9082790Z       %168 = tt.dot %151, %167, %149#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9083616Z       %169 = ttg.local_load %149#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9084162Z       %170 = arith.extf %169 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9084464Z       %171 = arith.addi %90, %139 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9084662Z       %172 = tt.addptr %11, %171 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9084864Z       %173 = tt.load %172 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9085106Z       %174 = ttg.convert_layout %173 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9085391Z       %175 = arith.shli %174, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9085628Z       %176 = arith.shrsi %175, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9085864Z       %177 = arith.shrsi %174, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9086154Z       %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9086485Z       %179 = tt.expand_dims %177 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9086768Z       %180 = tt.broadcast %178 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9087008Z       %181 = arith.select %16, %180, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9087245Z       %182 = tt.broadcast %179 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9087507Z       %183 = arith.select %18, %182, %181 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9087732Z       %184 = tt.reshape %183 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9087955Z       %185 = arith.sitofp %184 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9088273Z       %186 = ttg.convert_layout %185 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9088729Z       %187 = tt.dot %170, %186, %168, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9089113Z       ttg.local_dealloc %140 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:32.9089327Z       %188 = arith.truncf %187 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:50:32.9089592Z       %189 = tt.expand_dims %129 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:50:32.9089832Z       %190 = arith.muli %189, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:50:32.9090056Z       %191 = tt.expand_dims %134 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:50:32.9090314Z       %192 = tt.broadcast %190 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9090512Z       %193 = tt.broadcast %191 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9090690Z       %194 = arith.addi %192, %193 : tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9090876Z       %195 = tt.addptr %19, %194 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9091067Z       tt.store %195, %188 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:32.9091208Z       %196 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:50:32.9091355Z       %197 = arith.divsi %196, %c256_i32 : i32
2026-02-21T09:50:32.9091476Z       %198 = arith.muli %197, %c2_i32 : i32
2026-02-21T09:50:32.9091594Z       %199 = arith.subi %c256_i32, %198 : i32
2026-02-21T09:50:32.9091732Z       %200 = arith.minsi %199, %c2_i32 : i32
2026-02-21T09:50:32.9091851Z       %201 = arith.remsi %196, %c256_i32 : i32
2026-02-21T09:50:32.9091967Z       %202 = arith.remsi %201, %200 : i32
2026-02-21T09:50:32.9092083Z       %203 = arith.addi %198, %202 : i32
2026-02-21T09:50:32.9092196Z       %204 = arith.divsi %201, %200 : i32
2026-02-21T09:50:32.9092316Z       %205 = arith.muli %203, %c64_i32 : i32
2026-02-21T09:50:32.9092485Z       %206 = tt.splat %205 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:32.9092700Z       %207 = tt.splat %205 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:32.9092912Z       %208 = arith.addi %206, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:32.9093126Z       %209 = arith.addi %207, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:32.9093290Z       %210 = arith.muli %204, %c64_i32 : i32
2026-02-21T09:50:32.9093460Z       %211 = tt.splat %210 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:32.9093671Z       %212 = tt.splat %210 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:32.9093884Z       %213 = arith.addi %211, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:32.9094097Z       %214 = arith.addi %212, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:32.9094363Z       %215 = tt.expand_dims %208 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:32.9094614Z       %216 = arith.muli %215, %cst_7 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:32.9094810Z       %217 = tt.broadcast %216 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9095105Z       %218 = tt.expand_dims %213 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1>
2026-02-21T09:50:32.9095385Z       %219 = tt.broadcast %218 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9095617Z       %220 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:32.9095803Z       %221 = arith.addi %217, %49 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9096000Z       %222 = tt.addptr %10, %221 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9096205Z       %223 = tt.load %222 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9096485Z       %224 = ttg.memdesc_index %220[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9096841Z       ttg.local_store %223, %224 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9097079Z       %225 = arith.addi %217, %56 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9097278Z       %226 = tt.addptr %10, %225 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9097480Z       %227 = tt.load %226 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9097759Z       %228 = ttg.memdesc_index %220[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9098110Z       ttg.local_store %227, %228 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9098642Z       %229:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %224, %arg8 = %228) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:50:32.9099111Z         %276 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9099358Z         %277 = arith.addi %276, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9099536Z         %278 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:32.9099659Z         %279 = arith.muli %278, %c2_i32 : i32
2026-02-21T09:50:32.9099828Z         %280 = tt.splat %279 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.9100048Z         %281 = arith.addi %280, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.9100321Z         %282 = tt.expand_dims %281 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:32.9100596Z         %283 = tt.broadcast %282 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9100789Z         %284 = arith.addi %217, %283 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9100991Z         %285 = tt.addptr %10, %284 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9101200Z         %286 = tt.load %285 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9101494Z         %287 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9101923Z         %288 = arith.extf %287 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9102300Z         %289 = tt.expand_dims %277 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9102550Z         %290 = arith.muli %289, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9102745Z         %291 = tt.broadcast %290 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9102955Z         %292 = arith.addi %291, %219 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9103152Z         %293 = tt.addptr %11, %292 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9103350Z         %294 = tt.load %293 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9103641Z         %295 = ttg.convert_layout %294 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9103917Z         %296 = arith.shli %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9104151Z         %297 = arith.shrsi %296, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9104382Z         %298 = arith.shrsi %295, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9104669Z         %299 = tt.expand_dims %297 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9105004Z         %300 = tt.expand_dims %298 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9105284Z         %301 = tt.broadcast %299 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9105521Z         %302 = arith.select %16, %301, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9105753Z         %303 = tt.broadcast %300 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9105983Z         %304 = arith.select %18, %303, %302 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9106212Z         %305 = tt.reshape %304 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9106430Z         %306 = arith.sitofp %305 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9106744Z         %307 = ttg.convert_layout %306 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9107206Z         %308 = tt.dot %288, %307, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9107568Z         %309 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:32.9107697Z         %310 = arith.cmpi slt, %309, %c2_i32 : i32
2026-02-21T09:50:32.9107829Z         %311 = arith.select %310, %309, %c0_i32 : i32
2026-02-21T09:50:32.9108091Z         %312 = ttg.memdesc_index %220[%311] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9108442Z         ttg.local_store %286, %312 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9108829Z         scf.yield %308, %311, %arg8, %312 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9109128Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:32.9109400Z       %230 = ttg.local_load %229#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9109828Z       %231 = arith.extf %230 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9110123Z       %232 = arith.addi %67, %219 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9110320Z       %233 = tt.addptr %11, %232 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9110521Z       %234 = tt.load %233 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9110775Z       %235 = ttg.convert_layout %234 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9111053Z       %236 = arith.shli %235, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9111288Z       %237 = arith.shrsi %236, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9111541Z       %238 = arith.shrsi %235, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9111830Z       %239 = tt.expand_dims %237 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9112157Z       %240 = tt.expand_dims %238 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9112438Z       %241 = tt.broadcast %239 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9112677Z       %242 = arith.select %16, %241, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9112910Z       %243 = tt.broadcast %240 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9113140Z       %244 = arith.select %18, %243, %242 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9113363Z       %245 = tt.reshape %244 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9113582Z       %246 = arith.sitofp %245 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9113871Z       %247 = ttg.convert_layout %246 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9114330Z       %248 = tt.dot %231, %247, %229#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9114831Z       %249 = ttg.local_load %229#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9115256Z       %250 = arith.extf %249 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9115566Z       %251 = arith.addi %90, %219 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9115763Z       %252 = tt.addptr %11, %251 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9115958Z       %253 = tt.load %252 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9116195Z       %254 = ttg.convert_layout %253 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9116472Z       %255 = arith.shli %254, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9116705Z       %256 = arith.shrsi %255, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9116935Z       %257 = arith.shrsi %254, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9117219Z       %258 = tt.expand_dims %256 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9117548Z       %259 = tt.expand_dims %257 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9117826Z       %260 = tt.broadcast %258 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9118058Z       %261 = arith.select %16, %260, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9118291Z       %262 = tt.broadcast %259 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9118516Z       %263 = arith.select %18, %262, %261 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9118770Z       %264 = tt.reshape %263 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9118989Z       %265 = arith.sitofp %264 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9119282Z       %266 = ttg.convert_layout %265 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9119753Z       %267 = tt.dot %250, %266, %248, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9120132Z       ttg.local_dealloc %220 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:32.9120342Z       %268 = arith.truncf %267 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:50:32.9120602Z       %269 = tt.expand_dims %209 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:50:32.9120834Z       %270 = arith.muli %269, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:50:32.9121060Z       %271 = tt.expand_dims %214 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:50:32.9121314Z       %272 = tt.broadcast %270 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9121513Z       %273 = tt.broadcast %271 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9121689Z       %274 = arith.addi %272, %273 : tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9121871Z       %275 = tt.addptr %19, %274 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9122063Z       tt.store %275, %268 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:32.9122200Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:50:32.9122321Z     scf.for %arg3 = %23 to %3 step %c1_i32  : i32 {
2026-02-21T09:50:32.9122455Z       %24 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:50:32.9122686Z       %25 = arith.muli %24, %c2_i32 : i32
2026-02-21T09:50:32.9122810Z       %26 = arith.subi %c256_i32, %25 : i32
2026-02-21T09:50:32.9122925Z       %27 = arith.minsi %26, %c2_i32 : i32
2026-02-21T09:50:32.9123070Z       %28 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:50:32.9123186Z       %29 = arith.remsi %28, %27 : i32
2026-02-21T09:50:32.9123298Z       %30 = arith.addi %25, %29 : i32
2026-02-21T09:50:32.9123406Z       %31 = arith.divsi %28, %27 : i32
2026-02-21T09:50:32.9123518Z       %32 = arith.muli %30, %c64_i32 : i32
2026-02-21T09:50:32.9123683Z       %33 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:32.9123892Z       %34 = tt.splat %32 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:32.9124101Z       %35 = arith.addi %33, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:32.9124305Z       %36 = arith.addi %34, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:32.9124466Z       %37 = arith.muli %31, %c64_i32 : i32
2026-02-21T09:50:32.9124626Z       %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:32.9124833Z       %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:32.9125038Z       %40 = arith.addi %38, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:32.9125241Z       %41 = arith.addi %39, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:32.9125504Z       %42 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:50:32.9125746Z       %43 = arith.muli %42, %cst_7 : tensor<64x1xi32, #blocked2>
2026-02-21T09:50:32.9125933Z       %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9126223Z       %45 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x64xi32, #blocked1>
2026-02-21T09:50:32.9126490Z       %46 = tt.broadcast %45 : tensor<1x64xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9126705Z       %47 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:32.9126965Z       %48 = tt.expand_dims %9 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:32.9127368Z       %49 = tt.broadcast %48 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9127555Z       %50 = arith.addi %44, %49 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9127745Z       %51 = tt.addptr %10, %50 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9127943Z       %52 = tt.load %51 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9128216Z       %53 = ttg.memdesc_index %47[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9128565Z       ttg.local_store %52, %53 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9128832Z       %54 = arith.addi %9, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.9129101Z       %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:32.9129368Z       %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9129552Z       %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9129745Z       %58 = tt.addptr %10, %57 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9129945Z       %59 = tt.load %58 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9130232Z       %60 = ttg.memdesc_index %47[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9130575Z       ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9131093Z       %61:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %53, %arg8 = %60) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:50:32.9131554Z         %116 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9131782Z         %117 = arith.addi %116, %8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9131957Z         %118 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:32.9132081Z         %119 = arith.muli %118, %c2_i32 : i32
2026-02-21T09:50:32.9132251Z         %120 = tt.splat %119 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.9132470Z         %121 = arith.addi %120, %9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:32.9132744Z         %122 = tt.expand_dims %121 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:50:32.9133017Z         %123 = tt.broadcast %122 : tensor<1x4xi32, #blocked2> -> tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9133213Z         %124 = arith.addi %44, %123 : tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9133412Z         %125 = tt.addptr %10, %124 : tensor<64x4x!tt.ptr<bf16>, #blocked2>, tensor<64x4xi32, #blocked2>
2026-02-21T09:50:32.9133618Z         %126 = tt.load %125 : tensor<64x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:50:32.9133916Z         %127 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9134363Z         %128 = arith.extf %127 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9134743Z         %129 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9135006Z         %130 = arith.muli %129, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9135196Z         %131 = tt.broadcast %130 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9135387Z         %132 = arith.addi %131, %46 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9135579Z         %133 = tt.addptr %11, %132 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9135780Z         %134 = tt.load %133 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9136023Z         %135 = ttg.convert_layout %134 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9136302Z         %136 = arith.shli %135, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9136541Z         %137 = arith.shrsi %136, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9136776Z         %138 = arith.shrsi %135, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9137064Z         %139 = tt.expand_dims %137 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9137396Z         %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9137678Z         %141 = tt.broadcast %139 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9137914Z         %142 = arith.select %16, %141, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9138160Z         %143 = tt.broadcast %140 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9138389Z         %144 = arith.select %18, %143, %142 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9138634Z         %145 = tt.reshape %144 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9138854Z         %146 = arith.sitofp %145 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9139148Z         %147 = ttg.convert_layout %146 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9139609Z         %148 = tt.dot %128, %147, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9139955Z         %149 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:32.9140085Z         %150 = arith.cmpi slt, %149, %c2_i32 : i32
2026-02-21T09:50:32.9140217Z         %151 = arith.select %150, %149, %c0_i32 : i32
2026-02-21T09:50:32.9140480Z         %152 = ttg.memdesc_index %47[%151] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9140831Z         ttg.local_store %126, %152 : tensor<64x4xbf16, #blocked2> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9141226Z         scf.yield %148, %151, %arg8, %152 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:32.9141533Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:32.9141706Z       %62 = arith.addi %8, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9142029Z       %63 = ttg.local_load %61#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9142475Z       %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9142848Z       %65 = tt.expand_dims %62 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9143106Z       %66 = arith.muli %65, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9143289Z       %67 = tt.broadcast %66 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9143476Z       %68 = arith.addi %67, %46 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9143665Z       %69 = tt.addptr %11, %68 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9143853Z       %70 = tt.load %69 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9144088Z       %71 = ttg.convert_layout %70 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9144359Z       %72 = arith.shli %71, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9144586Z       %73 = arith.shrsi %72, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9144812Z       %74 = arith.shrsi %71, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9145090Z       %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9145414Z       %76 = tt.expand_dims %74 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9145684Z       %77 = tt.broadcast %75 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9145930Z       %78 = arith.select %16, %77, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9146157Z       %79 = tt.broadcast %76 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9146374Z       %80 = arith.select %18, %79, %78 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9146607Z       %81 = tt.reshape %80 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9146816Z       %82 = arith.sitofp %81 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9147096Z       %83 = ttg.convert_layout %82 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9147543Z       %84 = tt.dot %64, %83, %61#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9147922Z       %85 = arith.addi %8, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:32.9148245Z       %86 = ttg.local_load %61#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9148661Z       %87 = arith.extf %86 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9149034Z       %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9149278Z       %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:50:32.9149460Z       %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9149646Z       %91 = arith.addi %90, %46 : tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9149831Z       %92 = tt.addptr %11, %91 : tensor<2x64x!tt.ptr<i8>, #blocked1>, tensor<2x64xi32, #blocked1>
2026-02-21T09:50:32.9150037Z       %93 = tt.load %92 : tensor<2x64x!tt.ptr<i8>, #blocked1>
2026-02-21T09:50:32.9150268Z       %94 = ttg.convert_layout %93 : tensor<2x64xi8, #blocked1> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9150537Z       %95 = arith.shli %94, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9150775Z       %96 = arith.shrsi %95, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9150999Z       %97 = arith.shrsi %94, %cst_9 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:32.9151276Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9151603Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:32.9151876Z       %100 = tt.broadcast %98 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9152112Z       %101 = arith.select %16, %100, %cst_8 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9152345Z       %102 = tt.broadcast %99 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9152573Z       %103 = arith.select %18, %102, %101 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:32.9152807Z       %104 = tt.reshape %103 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:32.9153025Z       %105 = arith.sitofp %104 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:32.9153320Z       %106 = ttg.convert_layout %105 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:32.9153792Z       %107 = tt.dot %87, %106, %84, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:32.9154168Z       ttg.local_dealloc %47 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:32.9154400Z       %108 = arith.truncf %107 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:50:32.9154662Z       %109 = tt.expand_dims %36 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:50:32.9154905Z       %110 = arith.muli %109, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:50:32.9155141Z       %111 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:50:32.9155400Z       %112 = tt.broadcast %110 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9155608Z       %113 = tt.broadcast %111 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9155789Z       %114 = arith.addi %112, %113 : tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9155984Z       %115 = tt.addptr %19, %114 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:50:32.9156183Z       tt.store %115, %108 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:32.9156324Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:50:32.9156435Z     tt.return
2026-02-21T09:50:32.9156518Z   }
2026-02-21T09:50:32.9156603Z }
2026-02-21T09:50:32.9156648Z 
2026-02-21T09:50:32.9156682Z {-#
2026-02-21T09:50:32.9156768Z   external_resources: {
2026-02-21T09:50:32.9156871Z     mlir_reproducer: {
2026-02-21T09:50:32.9157892Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:50:32.9158886Z       disable_threading: false,
2026-02-21T09:50:32.9159002Z       verify_each: true
2026-02-21T09:50:32.9159095Z     }
2026-02-21T09:50:32.9159176Z   }
2026-02-21T09:50:32.9159263Z #-}
2026-02-21T09:50:32.9159546Z /tmp/torchinductor_root/v4/cv4k7rhb7nmh5ams6lqfgfcicgqzja7tlcrle4voteobtcjndy3s.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:50:32.9160252Z /tmp/torchinductor_root/v4/cv4k7rhb7nmh5ams6lqfgfcicgqzja7tlcrle4voteobtcjndy3s.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:50:32.9160798Z [363s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:50:32.9161571Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:50:32.9162269Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:50:32.9162438Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:50:33.8392427Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:50:33.8400948Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:50:33.8401308Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:33.8401722Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:33.8402027Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:50:33.8402311Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:50:33.8402593Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:50:33.8402780Z #smem = #ttg.shared_memory
2026-02-21T09:50:33.8403010Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:50:33.8403481Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:50:33.8403873Z     %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma>
2026-02-21T09:50:33.8404051Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:50:33.8404231Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:50:33.8404413Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8404577Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:50:33.8404767Z     %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8405020Z     %cst_4 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8405273Z     %cst_5 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8405590Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8405771Z     %cst_7 = arith.constant dense<0> : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8405949Z     %cst_8 = arith.constant dense<512> : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8406121Z     %cst_9 = arith.constant dense<0> : tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8406353Z     %cst_10 = arith.constant dense<8192> : tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8406531Z     %cst_11 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1>
2026-02-21T09:50:33.8406687Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:50:33.8406804Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:50:33.8406925Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:50:33.8407054Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T09:50:33.8407176Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:50:33.8407294Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:50:33.8407437Z     %cst_12 = arith.constant dense<0> : tensor<2x64xi8, #blocked2>
2026-02-21T09:50:33.8407617Z     %cst_13 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8407766Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:50:33.8407891Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:50:33.8408002Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:50:33.8408187Z     %cst_14 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8408379Z     %0 = tt.get_program_id x : i32
2026-02-21T09:50:33.8408493Z     %1 = arith.muli %0, %c4_i32 : i32
2026-02-21T09:50:33.8408611Z     %2 = arith.addi %1, %c4_i32 : i32
2026-02-21T09:50:33.8408728Z     %3 = arith.minsi %2, %c32768_i32 : i32
2026-02-21T09:50:33.8408932Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:33.8409202Z     %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:33.8409490Z     %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:33.8409757Z     %7 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:33.8410037Z     %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8410284Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8410483Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8410716Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8411055Z     %12 = arith.extsi %11 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8411420Z     %13 = arith.extsi %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:33.8411770Z     %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:50:33.8412183Z     %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:50:33.8412584Z     %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:50:33.8412834Z     %17 = arith.cmpi eq, %16, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:50:33.8413034Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:50:33.8413235Z     %19 = arith.cmpi eq, %16, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:50:33.8413445Z     %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:50:33.8413653Z     %21 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:33.8413815Z     %22 = arith.subi %3, %1 : i32
2026-02-21T09:50:33.8413933Z     %23 = arith.remsi %22, %c3_i32 : i32
2026-02-21T09:50:33.8414063Z     %24 = arith.subi %22, %23 : i32
2026-02-21T09:50:33.8414180Z     %25 = arith.addi %1, %24 : i32
2026-02-21T09:50:33.8414308Z     scf.for %arg3 = %1 to %25 step %c3_i32  : i32 {
2026-02-21T09:50:33.8414447Z       %26 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:50:33.8414575Z       %27 = arith.muli %26, %c2_i32 : i32
2026-02-21T09:50:33.8414696Z       %28 = arith.subi %c128_i32, %27 : i32
2026-02-21T09:50:33.8414817Z       %29 = arith.minsi %28, %c2_i32 : i32
2026-02-21T09:50:33.8414938Z       %30 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:50:33.8415063Z       %31 = arith.remsi %30, %29 : i32
2026-02-21T09:50:33.8415176Z       %32 = arith.addi %27, %31 : i32
2026-02-21T09:50:33.8415294Z       %33 = arith.divsi %30, %29 : i32
2026-02-21T09:50:33.8415412Z       %34 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:50:33.8415570Z       %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:33.8415783Z       %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:33.8415948Z       %37 = arith.muli %33, %c64_i32 : i32
2026-02-21T09:50:33.8416119Z       %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:33.8416328Z       %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:33.8416544Z       %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:33.8416756Z       %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:33.8417041Z       %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:50:33.8417295Z       %43 = arith.muli %42, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:50:33.8417506Z       %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8417682Z       %45 = arith.extsi %34 : i32 to i64
2026-02-21T09:50:33.8417853Z       %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:33.8418071Z       %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:33.8418350Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8418624Z       %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8418824Z       %50 = arith.cmpi sge, %48, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8419001Z       %51 = arith.cmpi slt, %48, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8419166Z       %52 = arith.andi %50, %51 : tensor<1x64xi1, #blocked2>
2026-02-21T09:50:33.8419349Z       %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8419586Z       %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:33.8419853Z       %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:50:33.8420116Z       %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8420305Z       %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8420497Z       %58 = tt.addptr %9, %57 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8420694Z       %59 = tt.load %58 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8421001Z       %60 = ttg.memdesc_index %54[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8421348Z       ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8421621Z       %61 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8421910Z       %62 = tt.expand_dims %61 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:50:33.8422173Z       %63 = tt.broadcast %62 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8422358Z       %64 = arith.addi %44, %63 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8422546Z       %65 = tt.addptr %9, %64 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8422743Z       %66 = tt.load %65 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8423016Z       %67 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8423360Z       ttg.local_store %66, %67 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8423871Z       %68:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %60, %arg8 = %67) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:50:33.8424285Z         %307 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:33.8424408Z         %308 = arith.muli %307, %c2_i32 : i32
2026-02-21T09:50:33.8424579Z         %309 = tt.splat %308 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8424803Z         %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8425094Z         %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:50:33.8425410Z         %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8425602Z         %313 = arith.addi %44, %312 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8425804Z         %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8426010Z         %315 = tt.load %314 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8426313Z         %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8426748Z         %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8427036Z         %318 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:50:33.8427212Z         %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8427437Z         %320 = arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8427716Z         %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8427967Z         %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8428158Z         %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8428351Z         %324 = arith.addi %323, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8428545Z         %325 = tt.addptr %10, %324 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8428755Z         %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8428945Z         %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8429110Z         %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2>
2026-02-21T09:50:33.8429302Z         %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8429504Z         %330 = arith.andi %329, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8429671Z         %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8429926Z         %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8430208Z         %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8430450Z         %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8430691Z         %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8430980Z         %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8431314Z         %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8431597Z         %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8431834Z         %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8432068Z         %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8432300Z         %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8432542Z         %342 = tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8432764Z         %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8433056Z         %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8433543Z         %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8433892Z         %346 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:33.8434023Z         %347 = arith.cmpi slt, %346, %c2_i32 : i32
2026-02-21T09:50:33.8434154Z         %348 = arith.select %347, %346, %c0_i32 : i32
2026-02-21T09:50:33.8434417Z         %349 = ttg.memdesc_index %54[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8434764Z         ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8435150Z         scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8435454Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:33.8435727Z       %69 = ttg.local_load %68#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8436150Z       %70 = arith.extf %69 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8436479Z       %71 = arith.addi %12, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8436771Z       %72 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8437012Z       %73 = arith.muli %72, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8437197Z       %74 = tt.broadcast %73 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8437386Z       %75 = arith.addi %74, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8437588Z       %76 = tt.addptr %10, %75 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8437790Z       %77 = arith.cmpi sge, %72, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8437957Z       %78 = arith.cmpi slt, %72, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8438113Z       %79 = arith.andi %77, %78 : tensor<2x1xi1, #blocked2>
2026-02-21T09:50:33.8438294Z       %80 = tt.broadcast %79 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8438478Z       %81 = arith.andi %80, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8438643Z       %82 = tt.load %76, %81, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8438892Z       %83 = ttg.convert_layout %82 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8439164Z       %84 = arith.shli %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8439396Z       %85 = arith.shrsi %84, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8439623Z       %86 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8439903Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8440231Z       %88 = tt.expand_dims %86 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8440518Z       %89 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8440748Z       %90 = arith.select %18, %89, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8440992Z       %91 = tt.broadcast %88 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8441212Z       %92 = arith.select %20, %91, %90 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8441429Z       %93 = tt.reshape %92 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8441637Z       %94 = arith.sitofp %93 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8441918Z       %95 = ttg.convert_layout %94 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8442367Z       %96 = tt.dot %70, %95, %68#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8442902Z       %97 = ttg.local_load %68#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8443322Z       %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8443643Z       %99 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8443920Z       %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8444168Z       %101 = arith.muli %100, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8444359Z       %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8444552Z       %103 = arith.addi %102, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8444769Z       %104 = tt.addptr %10, %103 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8444977Z       %105 = arith.cmpi sge, %100, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8445147Z       %106 = arith.cmpi slt, %100, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8445327Z       %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2>
2026-02-21T09:50:33.8445513Z       %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8445699Z       %109 = arith.andi %108, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8445866Z       %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8446118Z       %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8446399Z       %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8446637Z       %113 = arith.shrsi %112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8446871Z       %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8447156Z       %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8447491Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8447769Z       %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8448005Z       %118 = arith.select %18, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8448237Z       %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8448485Z       %120 = arith.select %20, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8448708Z       %121 = tt.reshape %120 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8448947Z       %122 = arith.sitofp %121 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8449240Z       %123 = ttg.convert_layout %122 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8449695Z       %124 = tt.dot %98, %123, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8450072Z       ttg.local_dealloc %54 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:33.8450282Z       %125 = arith.truncf %124 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:50:33.8450544Z       %126 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:50:33.8450784Z       %127 = arith.muli %126, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:50:33.8451047Z       %128 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:50:33.8451303Z       %129 = tt.broadcast %127 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8451504Z       %130 = tt.broadcast %128 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8451678Z       %131 = arith.addi %129, %130 : tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8451866Z       %132 = tt.addptr %21, %131 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8452057Z       tt.store %132, %125 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:33.8452201Z       %133 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:50:33.8452324Z       %134 = arith.divsi %133, %c512_i32 : i32
2026-02-21T09:50:33.8452462Z       %135 = arith.muli %134, %c2_i32 : i32
2026-02-21T09:50:33.8452582Z       %136 = arith.subi %c128_i32, %135 : i32
2026-02-21T09:50:33.8452700Z       %137 = arith.minsi %136, %c2_i32 : i32
2026-02-21T09:50:33.8452818Z       %138 = arith.remsi %133, %c512_i32 : i32
2026-02-21T09:50:33.8452947Z       %139 = arith.remsi %138, %137 : i32
2026-02-21T09:50:33.8453063Z       %140 = arith.addi %135, %139 : i32
2026-02-21T09:50:33.8453174Z       %141 = arith.divsi %138, %137 : i32
2026-02-21T09:50:33.8453292Z       %142 = arith.muli %140, %c64_i32 : i32
2026-02-21T09:50:33.8453449Z       %143 = tt.splat %142 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:33.8453657Z       %144 = arith.addi %143, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:33.8453824Z       %145 = arith.muli %141, %c64_i32 : i32
2026-02-21T09:50:33.8453990Z       %146 = tt.splat %145 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:33.8454203Z       %147 = tt.splat %145 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:33.8454415Z       %148 = arith.addi %146, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:33.8454630Z       %149 = arith.addi %147, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:33.8454899Z       %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:50:33.8455148Z       %151 = arith.muli %150, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:50:33.8455342Z       %152 = tt.broadcast %151 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8455514Z       %153 = arith.extsi %142 : i32 to i64
2026-02-21T09:50:33.8455680Z       %154 = tt.splat %153 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:33.8455925Z       %155 = arith.addi %154, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:33.8456198Z       %156 = tt.expand_dims %155 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8456494Z       %157 = tt.broadcast %156 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8456694Z       %158 = arith.cmpi sge, %156, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8456870Z       %159 = arith.cmpi slt, %156, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8457036Z       %160 = arith.andi %158, %159 : tensor<1x64xi1, #blocked2>
2026-02-21T09:50:33.8457225Z       %161 = tt.broadcast %160 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8457441Z       %162 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:33.8457626Z       %163 = arith.addi %152, %56 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8457828Z       %164 = tt.addptr %9, %163 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8458032Z       %165 = tt.load %164 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8458314Z       %166 = ttg.memdesc_index %162[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8458671Z       ttg.local_store %165, %166 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8458905Z       %167 = arith.addi %152, %63 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8459101Z       %168 = tt.addptr %9, %167 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8459300Z       %169 = tt.load %168 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8459576Z       %170 = ttg.memdesc_index %162[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8459944Z       ttg.local_store %169, %170 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8460457Z       %171:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %166, %arg8 = %170) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:50:33.8460888Z         %307 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:33.8461014Z         %308 = arith.muli %307, %c2_i32 : i32
2026-02-21T09:50:33.8461186Z         %309 = tt.splat %308 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8461412Z         %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8461689Z         %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:50:33.8461968Z         %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8462164Z         %313 = arith.addi %152, %312 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8462361Z         %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8462568Z         %315 = tt.load %314 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8462863Z         %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8463296Z         %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8463578Z         %318 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:50:33.8463772Z         %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8463996Z         %320 = arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8464284Z         %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8464534Z         %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8464726Z         %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8464918Z         %324 = arith.addi %323, %157 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8465113Z         %325 = tt.addptr %10, %324 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8465316Z         %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8465487Z         %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8465649Z         %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2>
2026-02-21T09:50:33.8465844Z         %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8466038Z         %330 = arith.andi %329, %161 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8466204Z         %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8466462Z         %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8466741Z         %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8466982Z         %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8467221Z         %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8467526Z         %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8467859Z         %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8468141Z         %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8468392Z         %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8468627Z         %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8476411Z         %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8476668Z         %342 = tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8476890Z         %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8477188Z         %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8477660Z         %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8478011Z         %346 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:33.8478139Z         %347 = arith.cmpi slt, %346, %c2_i32 : i32
2026-02-21T09:50:33.8478273Z         %348 = arith.select %347, %346, %c0_i32 : i32
2026-02-21T09:50:33.8478538Z         %349 = ttg.memdesc_index %162[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8478890Z         ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8479321Z         scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8479634Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:33.8479909Z       %172 = ttg.local_load %171#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8480337Z       %173 = arith.extf %172 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8480634Z       %174 = arith.addi %74, %157 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8480834Z       %175 = tt.addptr %10, %174 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8481033Z       %176 = arith.andi %80, %161 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8481204Z       %177 = tt.load %175, %176, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8481466Z       %178 = ttg.convert_layout %177 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8481748Z       %179 = arith.shli %178, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8481984Z       %180 = arith.shrsi %179, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8482216Z       %181 = arith.shrsi %178, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8482505Z       %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8482887Z       %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8483167Z       %184 = tt.broadcast %182 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8483428Z       %185 = arith.select %18, %184, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8483664Z       %186 = tt.broadcast %183 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8483892Z       %187 = arith.select %20, %186, %185 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8484138Z       %188 = tt.reshape %187 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8484355Z       %189 = arith.sitofp %188 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8484643Z       %190 = ttg.convert_layout %189 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8485106Z       %191 = tt.dot %173, %190, %171#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8485597Z       %192 = ttg.local_load %171#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8486020Z       %193 = arith.extf %192 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8486314Z       %194 = arith.addi %102, %157 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8486512Z       %195 = tt.addptr %10, %194 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8486712Z       %196 = arith.andi %108, %161 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8486877Z       %197 = tt.load %195, %196, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8487145Z       %198 = ttg.convert_layout %197 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8487421Z       %199 = arith.shli %198, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8487803Z       %200 = arith.shrsi %199, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8488036Z       %201 = arith.shrsi %198, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8488317Z       %202 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8488646Z       %203 = tt.expand_dims %201 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8488923Z       %204 = tt.broadcast %202 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8489157Z       %205 = arith.select %18, %204, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8489394Z       %206 = tt.broadcast %203 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8489619Z       %207 = arith.select %20, %206, %205 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8489846Z       %208 = tt.reshape %207 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8490062Z       %209 = arith.sitofp %208 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8490350Z       %210 = ttg.convert_layout %209 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8490810Z       %211 = tt.dot %193, %210, %191, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8491190Z       ttg.local_dealloc %162 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:33.8491419Z       %212 = arith.truncf %211 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:50:33.8491676Z       %213 = tt.expand_dims %149 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:50:33.8491915Z       %214 = arith.muli %213, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:50:33.8492156Z       %215 = tt.expand_dims %144 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:50:33.8492409Z       %216 = tt.broadcast %214 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8492608Z       %217 = tt.broadcast %215 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8492784Z       %218 = arith.addi %216, %217 : tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8492969Z       %219 = tt.addptr %21, %218 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8493163Z       tt.store %219, %212 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:33.8493304Z       %220 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:50:33.8493429Z       %221 = arith.divsi %220, %c512_i32 : i32
2026-02-21T09:50:33.8493549Z       %222 = arith.muli %221, %c2_i32 : i32
2026-02-21T09:50:33.8493669Z       %223 = arith.subi %c128_i32, %222 : i32
2026-02-21T09:50:33.8493785Z       %224 = arith.minsi %223, %c2_i32 : i32
2026-02-21T09:50:33.8493904Z       %225 = arith.remsi %220, %c512_i32 : i32
2026-02-21T09:50:33.8494019Z       %226 = arith.remsi %225, %224 : i32
2026-02-21T09:50:33.8494132Z       %227 = arith.addi %222, %226 : i32
2026-02-21T09:50:33.8494245Z       %228 = arith.divsi %225, %224 : i32
2026-02-21T09:50:33.8494358Z       %229 = arith.muli %227, %c64_i32 : i32
2026-02-21T09:50:33.8494518Z       %230 = tt.splat %229 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:33.8494724Z       %231 = arith.addi %230, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:33.8494904Z       %232 = arith.muli %228, %c64_i32 : i32
2026-02-21T09:50:33.8495072Z       %233 = tt.splat %232 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:33.8495300Z       %234 = tt.splat %232 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:33.8495513Z       %235 = arith.addi %233, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:33.8495725Z       %236 = arith.addi %234, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:33.8495994Z       %237 = tt.expand_dims %235 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:50:33.8496251Z       %238 = arith.muli %237, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:50:33.8496445Z       %239 = tt.broadcast %238 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8496620Z       %240 = arith.extsi %229 : i32 to i64
2026-02-21T09:50:33.8496787Z       %241 = tt.splat %240 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:33.8497006Z       %242 = arith.addi %241, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:33.8497282Z       %243 = tt.expand_dims %242 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8497562Z       %244 = tt.broadcast %243 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8497766Z       %245 = arith.cmpi sge, %243, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8497939Z       %246 = arith.cmpi slt, %243, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8498107Z       %247 = arith.andi %245, %246 : tensor<1x64xi1, #blocked2>
2026-02-21T09:50:33.8498292Z       %248 = tt.broadcast %247 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8498506Z       %249 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:33.8498707Z       %250 = arith.addi %239, %56 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8498902Z       %251 = tt.addptr %9, %250 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8499108Z       %252 = tt.load %251 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8499399Z       %253 = ttg.memdesc_index %249[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8499754Z       ttg.local_store %252, %253 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8499992Z       %254 = arith.addi %239, %63 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8500183Z       %255 = tt.addptr %9, %254 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8500384Z       %256 = tt.load %255 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8500659Z       %257 = ttg.memdesc_index %249[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8501008Z       ttg.local_store %256, %257 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8501525Z       %258:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %253, %arg8 = %257) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:50:33.8501943Z         %307 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:33.8502068Z         %308 = arith.muli %307, %c2_i32 : i32
2026-02-21T09:50:33.8502238Z         %309 = tt.splat %308 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8502464Z         %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8502756Z         %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:50:33.8503046Z         %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8503240Z         %313 = arith.addi %239, %312 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8503437Z         %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8503642Z         %315 = tt.load %314 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8503938Z         %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8504363Z         %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8504642Z         %318 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:50:33.8504812Z         %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8505041Z         %320 = arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8505321Z         %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8505567Z         %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8505759Z         %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8505949Z         %324 = arith.addi %323, %244 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8506146Z         %325 = tt.addptr %10, %324 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8506353Z         %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8506545Z         %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8506709Z         %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2>
2026-02-21T09:50:33.8506895Z         %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8507096Z         %330 = arith.andi %329, %248 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8507264Z         %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8507516Z         %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8507795Z         %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8508030Z         %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8508267Z         %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8508555Z         %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8508886Z         %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8509168Z         %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8509403Z         %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8509640Z         %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8509870Z         %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8510094Z         %342 = tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8510329Z         %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8510619Z         %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8511097Z         %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8511443Z         %346 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:33.8511569Z         %347 = arith.cmpi slt, %346, %c2_i32 : i32
2026-02-21T09:50:33.8511703Z         %348 = arith.select %347, %346, %c0_i32 : i32
2026-02-21T09:50:33.8511964Z         %349 = ttg.memdesc_index %249[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8512316Z         ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8512704Z         scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8512999Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:33.8513275Z       %259 = ttg.local_load %258#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8513698Z       %260 = arith.extf %259 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8513990Z       %261 = arith.addi %74, %244 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8514187Z       %262 = tt.addptr %10, %261 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8514400Z       %263 = arith.andi %80, %248 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8514567Z       %264 = tt.load %262, %263, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8514822Z       %265 = ttg.convert_layout %264 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8515110Z       %266 = arith.shli %265, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8515342Z       %267 = arith.shrsi %266, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8515574Z       %268 = arith.shrsi %265, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8515858Z       %269 = tt.expand_dims %267 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8516188Z       %270 = tt.expand_dims %268 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8516463Z       %271 = tt.broadcast %269 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8516698Z       %272 = arith.select %18, %271, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8516930Z       %273 = tt.broadcast %270 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8517160Z       %274 = arith.select %20, %273, %272 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8517386Z       %275 = tt.reshape %274 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8517604Z       %276 = arith.sitofp %275 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8517893Z       %277 = ttg.convert_layout %276 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8518363Z       %278 = tt.dot %260, %277, %258#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8518865Z       %279 = ttg.local_load %258#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8519287Z       %280 = arith.extf %279 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8519578Z       %281 = arith.addi %102, %244 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8519774Z       %282 = tt.addptr %10, %281 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8519975Z       %283 = arith.andi %108, %248 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8520141Z       %284 = tt.load %282, %283, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8520398Z       %285 = ttg.convert_layout %284 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8520676Z       %286 = arith.shli %285, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8520910Z       %287 = arith.shrsi %286, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8521144Z       %288 = arith.shrsi %285, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8521426Z       %289 = tt.expand_dims %287 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8521758Z       %290 = tt.expand_dims %288 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8522033Z       %291 = tt.broadcast %289 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8522285Z       %292 = arith.select %18, %291, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8522518Z       %293 = tt.broadcast %290 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8522779Z       %294 = arith.select %20, %293, %292 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8523022Z       %295 = tt.reshape %294 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8523235Z       %296 = arith.sitofp %295 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8523526Z       %297 = ttg.convert_layout %296 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8523978Z       %298 = tt.dot %280, %297, %278, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8524354Z       ttg.local_dealloc %249 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:33.8524562Z       %299 = arith.truncf %298 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:50:33.8524822Z       %300 = tt.expand_dims %236 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:50:33.8525052Z       %301 = arith.muli %300, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:50:33.8525276Z       %302 = tt.expand_dims %231 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:50:33.8525525Z       %303 = tt.broadcast %301 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8525723Z       %304 = tt.broadcast %302 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8525896Z       %305 = arith.addi %303, %304 : tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8526096Z       %306 = tt.addptr %21, %305 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8526287Z       tt.store %306, %299 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:33.8526422Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:50:33.8526561Z     scf.for %arg3 = %25 to %3 step %c1_i32  : i32 {
2026-02-21T09:50:33.8526690Z       %26 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:50:33.8526808Z       %27 = arith.muli %26, %c2_i32 : i32
2026-02-21T09:50:33.8526923Z       %28 = arith.subi %c128_i32, %27 : i32
2026-02-21T09:50:33.8527037Z       %29 = arith.minsi %28, %c2_i32 : i32
2026-02-21T09:50:33.8527154Z       %30 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:50:33.8527269Z       %31 = arith.remsi %30, %29 : i32
2026-02-21T09:50:33.8527379Z       %32 = arith.addi %27, %31 : i32
2026-02-21T09:50:33.8527485Z       %33 = arith.divsi %30, %29 : i32
2026-02-21T09:50:33.8527594Z       %34 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:50:33.8527748Z       %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:33.8527951Z       %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:50:33.8528113Z       %37 = arith.muli %33, %c64_i32 : i32
2026-02-21T09:50:33.8528276Z       %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:33.8528484Z       %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:33.8528689Z       %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:50:33.8528892Z       %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:50:33.8529153Z       %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:50:33.8529399Z       %43 = arith.muli %42, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:50:33.8529587Z       %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8529773Z       %45 = arith.extsi %34 : i32 to i64
2026-02-21T09:50:33.8529932Z       %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:33.8530144Z       %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:50:33.8530430Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8530699Z       %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8530889Z       %50 = arith.cmpi sge, %48, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8531055Z       %51 = arith.cmpi slt, %48, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:50:33.8531211Z       %52 = arith.andi %50, %51 : tensor<1x64xi1, #blocked2>
2026-02-21T09:50:33.8531386Z       %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8531595Z       %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:33.8531854Z       %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:50:33.8532117Z       %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8532301Z       %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8532489Z       %58 = tt.addptr %9, %57 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8532684Z       %59 = tt.load %58 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8532956Z       %60 = ttg.memdesc_index %54[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8533318Z       ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8533581Z       %61 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8533872Z       %62 = tt.expand_dims %61 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:50:33.8534139Z       %63 = tt.broadcast %62 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8534320Z       %64 = arith.addi %44, %63 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8534509Z       %65 = tt.addptr %9, %64 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8534700Z       %66 = tt.load %65 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8534971Z       %67 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8535315Z       ttg.local_store %66, %67 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8535817Z       %68:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %60, %arg8 = %67) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:50:33.8536231Z         %133 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:50:33.8536352Z         %134 = arith.muli %133, %c2_i32 : i32
2026-02-21T09:50:33.8536522Z         %135 = tt.splat %134 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8536743Z         %136 = arith.addi %135, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:50:33.8537013Z         %137 = tt.expand_dims %136 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:50:33.8537286Z         %138 = tt.broadcast %137 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8537491Z         %139 = arith.addi %44, %138 : tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8537687Z         %140 = tt.addptr %9, %139 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:50:33.8537888Z         %141 = tt.load %140 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:50:33.8538192Z         %142 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8538622Z         %143 = arith.extf %142 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8538898Z         %144 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:50:33.8539068Z         %145 = tt.splat %144 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8539294Z         %146 = arith.addi %145, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8539568Z         %147 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8539815Z         %148 = arith.muli %147, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8540004Z         %149 = tt.broadcast %148 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8540195Z         %150 = arith.addi %149, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8540388Z         %151 = tt.addptr %10, %150 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8540594Z         %152 = arith.cmpi sge, %147, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8540762Z         %153 = arith.cmpi slt, %147, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8540921Z         %154 = arith.andi %152, %153 : tensor<2x1xi1, #blocked2>
2026-02-21T09:50:33.8541122Z         %155 = tt.broadcast %154 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8541307Z         %156 = arith.andi %155, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8541484Z         %157 = tt.load %151, %156, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8541736Z         %158 = ttg.convert_layout %157 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8542012Z         %159 = arith.shli %158, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8542247Z         %160 = arith.shrsi %159, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8542481Z         %161 = arith.shrsi %158, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8542766Z         %162 = tt.expand_dims %160 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8543096Z         %163 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8543374Z         %164 = tt.broadcast %162 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8543610Z         %165 = arith.select %18, %164, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8543846Z         %166 = tt.broadcast %163 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8544072Z         %167 = arith.select %20, %166, %165 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8544297Z         %168 = tt.reshape %167 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8544513Z         %169 = arith.sitofp %168 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8544819Z         %170 = ttg.convert_layout %169 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8545276Z         %171 = tt.dot %143, %170, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8545629Z         %172 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:50:33.8545754Z         %173 = arith.cmpi slt, %172, %c2_i32 : i32
2026-02-21T09:50:33.8545883Z         %174 = arith.select %173, %172, %c0_i32 : i32
2026-02-21T09:50:33.8546144Z         %175 = ttg.memdesc_index %54[%174] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8546490Z         ttg.local_store %141, %175 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8546875Z         scf.yield %171, %174, %arg8, %175 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:50:33.8547166Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:50:33.8547433Z       %69 = ttg.local_load %68#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8547849Z       %70 = arith.extf %69 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8548169Z       %71 = arith.addi %12, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8548438Z       %72 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8548676Z       %73 = arith.muli %72, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8548876Z       %74 = tt.broadcast %73 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8549057Z       %75 = arith.addi %74, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8549242Z       %76 = tt.addptr %10, %75 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8549453Z       %77 = arith.cmpi sge, %72, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8549620Z       %78 = arith.cmpi slt, %72, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8549775Z       %79 = arith.andi %77, %78 : tensor<2x1xi1, #blocked2>
2026-02-21T09:50:33.8549950Z       %80 = tt.broadcast %79 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8550130Z       %81 = arith.andi %80, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8550285Z       %82 = tt.load %76, %81, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8550530Z       %83 = ttg.convert_layout %82 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8550803Z       %84 = arith.shli %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8551029Z       %85 = arith.shrsi %84, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8551256Z       %86 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8551532Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8551854Z       %88 = tt.expand_dims %86 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8552121Z       %89 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8552348Z       %90 = arith.select %18, %89, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8552574Z       %91 = tt.broadcast %88 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8552807Z       %92 = arith.select %20, %91, %90 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8553021Z       %93 = tt.reshape %92 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8553227Z       %94 = arith.sitofp %93 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8553518Z       %95 = ttg.convert_layout %94 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8553966Z       %96 = tt.dot %70, %95, %68#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8554443Z       %97 = ttg.local_load %68#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8554857Z       %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8555179Z       %99 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:50:33.8555454Z       %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8555698Z       %101 = arith.muli %100, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8555883Z       %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8556071Z       %103 = arith.addi %102, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8556263Z       %104 = tt.addptr %10, %103 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:50:33.8556481Z       %105 = arith.cmpi sge, %100, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8556652Z       %106 = arith.cmpi slt, %100, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:50:33.8556811Z       %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2>
2026-02-21T09:50:33.8557008Z       %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8557192Z       %109 = arith.andi %108, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:50:33.8557355Z       %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:50:33.8557606Z       %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8557879Z       %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8558110Z       %113 = arith.shrsi %112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8558342Z       %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:50:33.8558624Z       %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8558954Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:50:33.8559228Z       %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8559460Z       %118 = arith.select %18, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8559690Z       %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8559913Z       %120 = arith.select %20, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:50:33.8560134Z       %121 = tt.reshape %120 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:50:33.8560363Z       %122 = arith.sitofp %121 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:50:33.8560649Z       %123 = ttg.convert_layout %122 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:50:33.8561102Z       %124 = tt.dot %98, %123, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:50:33.8561486Z       ttg.local_dealloc %54 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:50:33.8561692Z       %125 = arith.truncf %124 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:50:33.8561947Z       %126 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:50:33.8562179Z       %127 = arith.muli %126, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:50:33.8562405Z       %128 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:50:33.8562700Z       %129 = tt.broadcast %127 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8562901Z       %130 = tt.broadcast %128 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8563075Z       %131 = arith.addi %129, %130 : tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8563256Z       %132 = tt.addptr %21, %131 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:50:33.8563446Z       tt.store %132, %125 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:50:33.8563579Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:50:33.8563680Z     tt.return
2026-02-21T09:50:33.8563757Z   }
2026-02-21T09:50:33.8563830Z }
2026-02-21T09:50:33.8563872Z 
2026-02-21T09:50:33.8563902Z {-#
2026-02-21T09:50:33.8563980Z   external_resources: {
2026-02-21T09:50:33.8564075Z     mlir_reproducer: {
2026-02-21T09:50:33.8565097Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:50:33.8566112Z       disable_threading: false,
2026-02-21T09:50:33.8566213Z       verify_each: true
2026-02-21T09:50:33.8566300Z     }
2026-02-21T09:50:33.8566369Z   }
2026-02-21T09:50:33.8566433Z #-}
2026-02-21T09:50:33.8566708Z /tmp/torchinductor_root/tt/cttppwz2klxjgyaagf3tfl4fnvqsgrdnel4y4bu34rosohpmvfbm.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:50:33.8567385Z /tmp/torchinductor_root/tt/cttppwz2klxjgyaagf3tfl4fnvqsgrdnel4y4bu34rosohpmvfbm.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:50:33.8567931Z [364s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:50:33.8568696Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:50:33.8569388Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:50:33.8569570Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:50:36.1365615Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 98/98 8.1 configs/s
2026-02-21T09:50:45.2605455Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 179/179 15.4 configs/s
2026-02-21T09:50:48.5258592Z [379s] Generation 3 complete: 
2026-02-21T09:50:48.5259027Z error=15
2026-02-21T09:50:48.5259238Z ok=88
2026-02-21T09:50:48.5259445Z min=1.1053
2026-02-21T09:50:48.5259646Z mid=1.6796
2026-02-21T09:50:48.5259850Z max=103.7485
2026-02-21T09:50:48.5260095Z best={'block_sizes': [8, 128, 128],
2026-02-21T09:50:48.5260482Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T09:50:48.5260847Z  'l2_groupings': [8],
2026-02-21T09:50:48.5261121Z  'load_eviction_policies': ['', ''],
2026-02-21T09:50:48.5261430Z  'loop_orders': [[0, 1]],
2026-02-21T09:50:48.5261709Z  'matrix_instr_nonkdim': 0,
2026-02-21T09:50:48.5261984Z  'num_sm_multiplier': 32,
2026-02-21T09:50:48.5262292Z  'num_stages': 4,
2026-02-21T09:50:48.5262543Z  'num_warps': 2,
2026-02-21T09:50:48.5262814Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:50:48.5263174Z  'range_flattens': [False, True],
2026-02-21T09:50:48.5263861Z  'range_multi_buffers': [True, None],
2026-02-21T09:50:48.5264124Z  'range_num_stages': [2, 3],
2026-02-21T09:50:48.5264366Z  'range_unroll_factors': [4, 0],
2026-02-21T09:50:48.5264621Z  'range_warp_specializes': [],
2026-02-21T09:50:48.5264848Z  'waves_per_eu': 2}
2026-02-21T09:50:48.5332196Z [379s] Fitting surrogate: 425 points, 425 targets
2026-02-21T09:50:50.2527257Z [381s] Generation 4 starting: 77 neighbors, 4 active search path(s)
2026-02-21T09:51:04.1782796Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 1.4 configs/s
2026-02-21T09:51:04.4222577Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:51:04.4230965Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:51:04.4231336Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:51:04.4231642Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}>
2026-02-21T09:51:04.4231935Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:51:04.4232193Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:04.4232429Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:04.4232615Z #smem = #ttg.shared_memory
2026-02-21T09:51:04.4232849Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:51:04.4233445Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:51:04.4233854Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:51:04.4234024Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:51:04.4234149Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:51:04.4234296Z     %cst_0 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4234459Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:51:04.4234585Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:51:04.4234703Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T09:51:04.4234825Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:51:04.4234939Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:51:04.4235173Z     %cst_1 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4235471Z     %cst_2 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:04.4235737Z     %cst_3 = arith.constant dense<0> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4236059Z     %cst_4 = arith.constant dense<8192> : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4236314Z     %cst_5 = arith.constant dense<0> : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4236538Z     %cst_6 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:51:04.4236752Z     %cst_7 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4236974Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:04.4237151Z     %cst_9 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:04.4237371Z     %cst_10 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4237631Z     %cst_11 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4237963Z     %cst_12 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4238184Z     %cst_13 = arith.constant dense<0> : tensor<1x128xi64, #mma>
2026-02-21T09:51:04.4238356Z     %cst_14 = arith.constant dense<8192> : tensor<1x128xi64, #mma>
2026-02-21T09:51:04.4238532Z     %cst_15 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:51:04.4238703Z     %cst_16 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:51:04.4238874Z     %cst_17 = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:51:04.4239029Z     %0 = tt.get_program_id x : i32
2026-02-21T09:51:04.4239144Z     %1 = arith.remsi %0, %c64_i32 : i32
2026-02-21T09:51:04.4239284Z     %2 = arith.divsi %0, %c64_i32 : i32
2026-02-21T09:51:04.4239399Z     %3 = arith.muli %1, %c128_i32 : i32
2026-02-21T09:51:04.4239516Z     %4 = arith.muli %2, %c128_i32 : i32
2026-02-21T09:51:04.4239721Z     %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:04.4240000Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:04.4240331Z     %7 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4240644Z     %8 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:04.4240894Z     %9 = tt.splat %4 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:04.4241115Z     %10 = arith.addi %9, %5 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:04.4241389Z     %11 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:04.4241693Z     %12 = tt.expand_dims %10 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:51:04.4241948Z     %13 = arith.muli %12, %cst_6 : tensor<128x1xi32, #blocked1>
2026-02-21T09:51:04.4242147Z     %14 = tt.broadcast %13 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4242365Z     %15 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:04.4242534Z     %16 = arith.extsi %3 : i32 to i64
2026-02-21T09:51:04.4242808Z     %17 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4243123Z     %18 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4243561Z     %19 = arith.extsi %18 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4243964Z     %20 = tt.splat %16 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4244392Z     %21 = arith.extsi %7 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4244800Z     %22 = arith.addi %20, %21 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4245188Z     %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4245623Z     %24 = tt.broadcast %23 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4245937Z     %25 = arith.cmpi sge, %23, %cst_5 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4246207Z     %26 = arith.cmpi slt, %23, %cst_4 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4246448Z     %27 = arith.andi %25, %26 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4246747Z     %28 = tt.broadcast %27 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4247112Z     %29 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:51:04.4247544Z     %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:51:04.4247939Z     %31 = tt.expand_dims %30 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:04.4248197Z     %32 = arith.cmpi eq, %31, %cst_8 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:04.4248395Z     %33 = tt.broadcast %32 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:51:04.4248594Z     %34 = arith.cmpi eq, %31, %cst_9 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:04.4248790Z     %35 = tt.broadcast %34 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:51:04.4249054Z     %36 = scf.for %arg3 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg4 = %cst) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:51:04.4249276Z       %97 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:51:04.4249448Z       %98 = tt.splat %97 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:04.4249693Z       %99 = arith.addi %98, %11 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:04.4249966Z       %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:04.4250249Z       %101 = tt.broadcast %100 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4250453Z       %102 = arith.addi %14, %101 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4250659Z       %103 = tt.addptr %15, %102 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4250875Z       %104 = tt.load %103 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:04.4251100Z       %105 = ttg.local_alloc %104 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:04.4251441Z       %106 = ttg.local_load %105 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4251860Z       %107 = arith.extf %106 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4252145Z       %108 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:51:04.4252362Z       %109 = tt.splat %108 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4252679Z       %110 = arith.addi %109, %19 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4253065Z       %111 = tt.expand_dims %110 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4253426Z       %112 = arith.muli %111, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4253739Z       %113 = tt.broadcast %112 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4254049Z       %114 = arith.addi %113, %24 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4254390Z       %115 = tt.addptr %17, %114 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4254712Z       %116 = arith.cmpi sge, %111, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4254962Z       %117 = arith.cmpi slt, %111, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4255205Z       %118 = arith.andi %116, %117 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4255509Z       %119 = tt.broadcast %118 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4255847Z       %120 = arith.andi %119, %28 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4256091Z       %121 = tt.load %115, %120, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4256345Z       %122 = arith.shli %121, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4256584Z       %123 = arith.shrsi %122, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4256822Z       %124 = arith.shrsi %121, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4257118Z       %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:04.4257457Z       %126 = tt.expand_dims %124 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:04.4257768Z       %127 = tt.broadcast %125 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4258017Z       %128 = arith.select %33, %127, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4258257Z       %129 = tt.broadcast %126 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4258501Z       %130 = arith.select %35, %129, %128 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4258737Z       %131 = tt.reshape %130 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:04.4258967Z       %132 = arith.sitofp %131 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:04.4259233Z       %133 = ttg.local_alloc %132 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:04.4259560Z       %134 = ttg.local_load %133 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4260048Z       %135 = tt.dot %107, %134, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:04.4260410Z       %136 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:51:04.4260560Z       %137 = arith.muli %136, %c2_i32 : i32
2026-02-21T09:51:04.4260735Z       %138 = tt.splat %137 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:04.4260959Z       %139 = arith.addi %138, %11 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:04.4261241Z       %140 = tt.expand_dims %139 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:04.4261523Z       %141 = tt.broadcast %140 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4261725Z       %142 = arith.addi %14, %141 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4261937Z       %143 = tt.addptr %15, %142 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4262147Z       %144 = tt.load %143 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:04.4262394Z       %145 = ttg.local_alloc %144 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:04.4262724Z       %146 = ttg.local_load %145 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4263154Z       %147 = arith.extf %146 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4263438Z       %148 = arith.extsi %136 : i32 to i64
2026-02-21T09:51:04.4263644Z       %149 = tt.splat %148 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4263957Z       %150 = arith.addi %149, %19 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4264342Z       %151 = tt.expand_dims %150 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4264693Z       %152 = arith.muli %151, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4265002Z       %153 = tt.broadcast %152 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4265308Z       %154 = arith.addi %153, %24 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4265619Z       %155 = tt.addptr %17, %154 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4265961Z       %156 = arith.cmpi sge, %151, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4266203Z       %157 = arith.cmpi slt, %151, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4266439Z       %158 = arith.andi %156, %157 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4266741Z       %159 = tt.broadcast %158 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4267039Z       %160 = arith.andi %159, %28 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4267283Z       %161 = tt.load %155, %160, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4267531Z       %162 = arith.shli %161, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4267766Z       %163 = arith.shrsi %162, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4268005Z       %164 = arith.shrsi %161, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4268297Z       %165 = tt.expand_dims %163 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:04.4268652Z       %166 = tt.expand_dims %164 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:04.4268939Z       %167 = tt.broadcast %165 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4269180Z       %168 = arith.select %33, %167, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4269422Z       %169 = tt.broadcast %166 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4269656Z       %170 = arith.select %35, %169, %168 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4269891Z       %171 = tt.reshape %170 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:04.4270119Z       %172 = arith.sitofp %171 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:04.4270391Z       %173 = ttg.local_alloc %172 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:04.4270722Z       %174 = ttg.local_load %173 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4271190Z       %175 = tt.dot %147, %174, %135, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:04.4271535Z       %176 = arith.addi %arg3, %c4_i32 : i32
2026-02-21T09:51:04.4271659Z       %177 = arith.muli %176, %c2_i32 : i32
2026-02-21T09:51:04.4271841Z       %178 = tt.splat %177 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:04.4272065Z       %179 = arith.addi %178, %11 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:04.4272337Z       %180 = tt.expand_dims %179 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:04.4272614Z       %181 = tt.broadcast %180 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4272809Z       %182 = arith.addi %14, %181 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4273009Z       %183 = tt.addptr %15, %182 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4273215Z       %184 = tt.load %183 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:04.4273434Z       %185 = ttg.local_alloc %184 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:04.4273783Z       %186 = ttg.local_load %185 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4274189Z       %187 = arith.extf %186 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4274468Z       %188 = arith.extsi %176 : i32 to i64
2026-02-21T09:51:04.4274677Z       %189 = tt.splat %188 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4274972Z       %190 = arith.addi %189, %19 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4275358Z       %191 = tt.expand_dims %190 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4275714Z       %192 = arith.muli %191, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4276019Z       %193 = tt.broadcast %192 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4276326Z       %194 = arith.addi %193, %24 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4276638Z       %195 = tt.addptr %17, %194 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4276970Z       %196 = arith.cmpi sge, %191, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4277215Z       %197 = arith.cmpi slt, %191, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4277445Z       %198 = arith.andi %196, %197 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4277746Z       %199 = tt.broadcast %198 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4278048Z       %200 = arith.andi %199, %28 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4278305Z       %201 = tt.load %195, %200, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4278551Z       %202 = arith.shli %201, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4278782Z       %203 = arith.shrsi %202, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4279019Z       %204 = arith.shrsi %201, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4279307Z       %205 = tt.expand_dims %203 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:04.4279642Z       %206 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:04.4279946Z       %207 = tt.broadcast %205 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4280184Z       %208 = arith.select %33, %207, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4280425Z       %209 = tt.broadcast %206 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4280660Z       %210 = arith.select %35, %209, %208 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4280889Z       %211 = tt.reshape %210 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:04.4281116Z       %212 = arith.sitofp %211 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:04.4281369Z       %213 = ttg.local_alloc %212 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:04.4281697Z       %214 = ttg.local_load %213 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4282183Z       %215 = tt.dot %187, %214, %175, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:04.4282535Z       scf.yield %215 : tensor<128x128xf32, #mma>
2026-02-21T09:51:04.4282704Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:51:04.4282867Z     %37 = arith.addi %11, %cst_2 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:04.4283141Z     %38 = tt.expand_dims %37 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:04.4283411Z     %39 = tt.broadcast %38 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4283600Z     %40 = arith.addi %14, %39 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4283796Z     %41 = tt.addptr %15, %40 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:04.4283996Z     %42 = tt.load %41 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:04.4284212Z     %43 = ttg.local_alloc %42 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:04.4284536Z     %44 = ttg.local_load %43 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4286676Z     %45 = arith.extf %44 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4287043Z     %46 = arith.addi %19, %cst_1 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:04.4287425Z     %47 = tt.expand_dims %46 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4287784Z     %48 = arith.muli %47, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4288086Z     %49 = tt.broadcast %48 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4288408Z     %50 = arith.addi %49, %24 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4288711Z     %51 = tt.addptr %17, %50 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4289043Z     %52 = arith.cmpi sge, %47, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4289284Z     %53 = arith.cmpi slt, %47, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4289510Z     %54 = arith.andi %52, %53 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4289820Z     %55 = tt.broadcast %54 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4290112Z     %56 = arith.andi %55, %28 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4290341Z     %57 = tt.load %51, %56, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4290577Z     %58 = arith.shli %57, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4290805Z     %59 = arith.shrsi %58, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4291031Z     %60 = arith.shrsi %57, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:04.4291311Z     %61 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:04.4291661Z     %62 = tt.expand_dims %60 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:04.4291939Z     %63 = tt.broadcast %61 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4292175Z     %64 = arith.select %33, %63, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4292406Z     %65 = tt.broadcast %62 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4292631Z     %66 = arith.select %35, %65, %64 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:04.4292855Z     %67 = tt.reshape %66 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:04.4293073Z     %68 = arith.sitofp %67 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:04.4293318Z     %69 = ttg.local_alloc %68 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:04.4293635Z     %70 = ttg.local_load %69 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:04.4294098Z     %71 = tt.dot %45, %70, %36, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:04.4294485Z     %72 = arith.truncf %71 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:51:04.4294724Z     %73 = arith.extsi %4 : i32 to i64
2026-02-21T09:51:04.4294884Z     %74 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:04.4295086Z     %75 = tt.splat %73 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:04.4295358Z     %76 = arith.extsi %6 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:04.4295691Z     %77 = arith.extsi %8 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:04.4295957Z     %78 = arith.addi %75, %76 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:04.4296235Z     %79 = tt.expand_dims %78 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:51:04.4296468Z     %80 = arith.muli %79, %cst_15 : tensor<128x1xi64, #mma>
2026-02-21T09:51:04.4296648Z     %81 = tt.broadcast %80 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:04.4296851Z     %82 = tt.splat %16 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:04.4297050Z     %83 = arith.addi %82, %77 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:04.4297307Z     %84 = tt.expand_dims %83 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:51:04.4297575Z     %85 = tt.broadcast %84 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:04.4297754Z     %86 = arith.addi %81, %85 : tensor<128x128xi64, #mma>
2026-02-21T09:51:04.4297940Z     %87 = tt.addptr %74, %86 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi64, #mma>
2026-02-21T09:51:04.4298140Z     %88 = arith.cmpi sge, %79, %cst_16 : tensor<128x1xi64, #mma>
2026-02-21T09:51:04.4298303Z     %89 = arith.cmpi slt, %79, %cst_17 : tensor<128x1xi64, #mma>
2026-02-21T09:51:04.4298454Z     %90 = arith.andi %88, %89 : tensor<128x1xi1, #mma>
2026-02-21T09:51:04.4298624Z     %91 = tt.broadcast %90 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:04.4298801Z     %92 = arith.cmpi sge, %84, %cst_13 : tensor<1x128xi64, #mma>
2026-02-21T09:51:04.4298960Z     %93 = arith.cmpi slt, %84, %cst_14 : tensor<1x128xi64, #mma>
2026-02-21T09:51:04.4299112Z     %94 = arith.andi %92, %93 : tensor<1x128xi1, #mma>
2026-02-21T09:51:04.4299276Z     %95 = tt.broadcast %94 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:04.4299451Z     %96 = arith.andi %91, %95 : tensor<128x128xi1, #mma>
2026-02-21T09:51:04.4299619Z     tt.store %87, %72, %96 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:04.4299756Z     tt.return
2026-02-21T09:51:04.4299835Z   }
2026-02-21T09:51:04.4299912Z }
2026-02-21T09:51:04.4299956Z 
2026-02-21T09:51:04.4299989Z {-#
2026-02-21T09:51:04.4300072Z   external_resources: {
2026-02-21T09:51:04.4300173Z     mlir_reproducer: {
2026-02-21T09:51:04.4301160Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:51:04.4302149Z       disable_threading: false,
2026-02-21T09:51:04.4302258Z       verify_each: true
2026-02-21T09:51:04.4302348Z     }
2026-02-21T09:51:04.4302425Z   }
2026-02-21T09:51:04.4302494Z #-}
2026-02-21T09:51:04.4302775Z /tmp/torchinductor_root/it/cit3ulfy5jg54neqzjzy7mosayhvrimhhw6wmkdny764b2zi2n24.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:51:04.4303468Z /tmp/torchinductor_root/it/cit3ulfy5jg54neqzjzy7mosayhvrimhhw6wmkdny764b2zi2n24.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:51:04.4304018Z [395s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:51:04.4304745Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:51:04.4305416Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:51:04.4305583Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:51:05.5926167Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:51:05.5943881Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:51:05.5944455Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:51:05.5944823Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}>
2026-02-21T09:51:05.5945170Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:51:05.5945467Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:05.5945751Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:05.5945965Z #smem = #ttg.shared_memory
2026-02-21T09:51:05.5946246Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:51:05.5946803Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:51:05.5947359Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:51:05.5947552Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:51:05.5947689Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:51:05.5947827Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:51:05.5948004Z     %cst_0 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.5948184Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:51:05.5948320Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:51:05.5948453Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:51:05.5948590Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:51:05.5948725Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:51:05.5948856Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T09:51:05.5948986Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:51:05.5949115Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:51:05.5949382Z     %cst_1 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5949727Z     %cst_2 = arith.constant dense<1020> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.5950030Z     %cst_3 = arith.constant dense<0> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5950330Z     %cst_4 = arith.constant dense<8192> : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5950684Z     %cst_5 = arith.constant dense<0> : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5950940Z     %cst_6 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:51:05.5951176Z     %cst_7 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5951411Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:05.5951596Z     %cst_9 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:05.5951834Z     %cst_10 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5952110Z     %cst_11 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5952468Z     %cst_12 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5952702Z     %cst_13 = arith.constant dense<0> : tensor<1x128xi64, #mma>
2026-02-21T09:51:05.5952886Z     %cst_14 = arith.constant dense<8192> : tensor<1x128xi64, #mma>
2026-02-21T09:51:05.5953071Z     %cst_15 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.5953255Z     %cst_16 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.5953440Z     %cst_17 = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.5953602Z     %0 = tt.get_program_id x : i32
2026-02-21T09:51:05.5953726Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:51:05.5953874Z     %2 = arith.minsi %1, %c8192_i32 : i32
2026-02-21T09:51:05.5954096Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:05.5954402Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:05.5954746Z     %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5955089Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:05.5955382Z     %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.5955648Z     %8 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.5955913Z     %9 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5956273Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5956746Z     %11 = arith.extsi %10 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5957305Z     %12 = arith.extsi %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5957774Z     %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:51:05.5958230Z     %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:51:05.5958670Z     %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:05.5958948Z     %16 = arith.cmpi eq, %15, %cst_8 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:05.5959163Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:51:05.5959383Z     %18 = arith.cmpi eq, %15, %cst_9 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:05.5959604Z     %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:51:05.5959834Z     %20 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:05.5960135Z     %21 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:05.5960506Z     %22 = arith.extsi %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:05.5960760Z     %23 = arith.subi %2, %0 : i32
2026-02-21T09:51:05.5960883Z     %24 = arith.remsi %23, %c3_i32 : i32
2026-02-21T09:51:05.5961030Z     %25 = arith.subi %23, %24 : i32
2026-02-21T09:51:05.5961142Z     %26 = arith.addi %0, %25 : i32
2026-02-21T09:51:05.5961304Z     %27 = arith.addi %7, %cst_2 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.5961623Z     %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.5961892Z     %29 = tt.broadcast %28 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.5962154Z     %30 = arith.addi %11, %cst_1 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5962551Z     %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5962973Z     %32 = arith.muli %31, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5963276Z     %33 = tt.broadcast %32 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5963580Z     %34 = arith.cmpi sge, %31, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5963819Z     %35 = arith.cmpi slt, %31, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5964041Z     %36 = arith.andi %34, %35 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5964332Z     %37 = tt.broadcast %36 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5964590Z     scf.for %arg3 = %0 to %26 step %c3_i32  : i32 {
2026-02-21T09:51:05.5964729Z       %38 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:51:05.5964875Z       %39 = arith.muli %38, %c2_i32 : i32
2026-02-21T09:51:05.5964991Z       %40 = arith.subi %c64_i32, %39 : i32
2026-02-21T09:51:05.5965113Z       %41 = arith.minsi %40, %c2_i32 : i32
2026-02-21T09:51:05.5965234Z       %42 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:51:05.5965356Z       %43 = arith.remsi %42, %41 : i32
2026-02-21T09:51:05.5965467Z       %44 = arith.addi %39, %43 : i32
2026-02-21T09:51:05.5965579Z       %45 = arith.divsi %42, %41 : i32
2026-02-21T09:51:05.5965693Z       %46 = arith.muli %44, %c128_i32 : i32
2026-02-21T09:51:05.5965808Z       %47 = arith.muli %45, %c128_i32 : i32
2026-02-21T09:51:05.5965975Z       %48 = tt.splat %47 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:05.5966194Z       %49 = arith.addi %48, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:05.5966475Z       %50 = tt.expand_dims %49 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:51:05.5966729Z       %51 = arith.muli %50, %cst_6 : tensor<128x1xi32, #blocked1>
2026-02-21T09:51:05.5966922Z       %52 = tt.broadcast %51 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.5967096Z       %53 = arith.extsi %46 : i32 to i64
2026-02-21T09:51:05.5967298Z       %54 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5967613Z       %55 = arith.addi %54, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5968000Z       %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5968432Z       %57 = tt.broadcast %56 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5968741Z       %58 = arith.cmpi sge, %56, %cst_5 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5969003Z       %59 = arith.cmpi slt, %56, %cst_4 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5969234Z       %60 = arith.andi %58, %59 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5969542Z       %61 = tt.broadcast %60 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5969887Z       %62 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:51:05.5970104Z         %253 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:51:05.5970281Z         %254 = tt.splat %253 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.5970532Z         %255 = arith.addi %254, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.5970804Z         %256 = tt.expand_dims %255 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.5971084Z         %257 = tt.broadcast %256 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.5971279Z         %258 = arith.addi %52, %257 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.5971486Z         %259 = tt.addptr %8, %258 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.5971696Z         %260 = tt.load %259 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.5971930Z         %261 = ttg.local_alloc %260 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.5972266Z         %262 = ttg.local_load %261 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.5972704Z         %263 = arith.extf %262 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.5972992Z         %264 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:05.5973205Z         %265 = tt.splat %264 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5973504Z         %266 = arith.addi %265, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5973894Z         %267 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5974245Z         %268 = arith.muli %267, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5974561Z         %269 = tt.broadcast %268 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5974870Z         %270 = arith.addi %269, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5975185Z         %271 = tt.addptr %9, %270 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5975504Z         %272 = arith.cmpi sge, %267, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5975774Z         %273 = arith.cmpi slt, %267, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5976009Z         %274 = arith.andi %272, %273 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5976312Z         %275 = tt.broadcast %274 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5976612Z         %276 = arith.andi %275, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5976858Z         %277 = tt.load %271, %276, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5977124Z         %278 = arith.shli %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5977359Z         %279 = arith.shrsi %278, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5977603Z         %280 = arith.shrsi %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5977891Z         %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.5978230Z         %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.5978539Z         %283 = tt.broadcast %281 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.5978783Z         %284 = arith.select %17, %283, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.5979025Z         %285 = tt.broadcast %282 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.5979257Z         %286 = arith.select %19, %285, %284 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.5979490Z         %287 = tt.reshape %286 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.5979718Z         %288 = arith.sitofp %287 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.5979976Z         %289 = ttg.local_alloc %288 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.5980305Z         %290 = ttg.local_load %289 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.5980799Z         %291 = tt.dot %263, %290, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.5981148Z         %292 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:51:05.5981276Z         %293 = arith.muli %292, %c2_i32 : i32
2026-02-21T09:51:05.5981446Z         %294 = tt.splat %293 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.5981670Z         %295 = arith.addi %294, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.5981946Z         %296 = tt.expand_dims %295 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.5982224Z         %297 = tt.broadcast %296 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.5982429Z         %298 = arith.addi %52, %297 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.5982632Z         %299 = tt.addptr %8, %298 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.5982841Z         %300 = tt.load %299 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.5983064Z         %301 = ttg.local_alloc %300 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.5983399Z         %302 = ttg.local_load %301 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.5983819Z         %303 = arith.extf %302 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.5984102Z         %304 = arith.extsi %292 : i32 to i64
2026-02-21T09:51:05.5984312Z         %305 = tt.splat %304 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5984609Z         %306 = arith.addi %305, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.5985009Z         %307 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5985365Z         %308 = arith.muli %307, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5985678Z         %309 = tt.broadcast %308 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5985984Z         %310 = arith.addi %309, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5992878Z         %311 = tt.addptr %9, %310 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5993209Z         %312 = arith.cmpi sge, %307, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5993462Z         %313 = arith.cmpi slt, %307, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5993697Z         %314 = arith.andi %312, %313 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5993996Z         %315 = tt.broadcast %314 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5994296Z         %316 = arith.andi %315, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5994539Z         %317 = tt.load %311, %316, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5994787Z         %318 = arith.shli %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5995046Z         %319 = arith.shrsi %318, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5995279Z         %320 = arith.shrsi %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.5995575Z         %321 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.5995912Z         %322 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.5996201Z         %323 = tt.broadcast %321 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.5996449Z         %324 = arith.select %17, %323, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.5996690Z         %325 = tt.broadcast %322 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.5996929Z         %326 = arith.select %19, %325, %324 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.5997159Z         %327 = tt.reshape %326 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.5997388Z         %328 = arith.sitofp %327 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.5997644Z         %329 = ttg.local_alloc %328 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.5997987Z         %330 = ttg.local_load %329 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.5998461Z         %331 = tt.dot %303, %330, %291, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.5998809Z         %332 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:05.5998935Z         %333 = arith.muli %332, %c2_i32 : i32
2026-02-21T09:51:05.5999109Z         %334 = tt.splat %333 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.5999347Z         %335 = arith.addi %334, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.5999625Z         %336 = tt.expand_dims %335 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.5999904Z         %337 = tt.broadcast %336 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6000100Z         %338 = arith.addi %52, %337 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6000304Z         %339 = tt.addptr %8, %338 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6000515Z         %340 = tt.load %339 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6000755Z         %341 = ttg.local_alloc %340 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6001085Z         %342 = ttg.local_load %341 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6001492Z         %343 = arith.extf %342 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6001776Z         %344 = arith.extsi %332 : i32 to i64
2026-02-21T09:51:05.6001985Z         %345 = tt.splat %344 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6002284Z         %346 = arith.addi %345, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6002717Z         %347 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6003094Z         %348 = arith.muli %347, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6003409Z         %349 = tt.broadcast %348 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6003720Z         %350 = arith.addi %349, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6004038Z         %351 = tt.addptr %9, %350 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6004361Z         %352 = arith.cmpi sge, %347, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6004605Z         %353 = arith.cmpi slt, %347, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6004847Z         %354 = arith.andi %352, %353 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6005152Z         %355 = tt.broadcast %354 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6005456Z         %356 = arith.andi %355, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6005703Z         %357 = tt.load %351, %356, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6005964Z         %358 = arith.shli %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6006199Z         %359 = arith.shrsi %358, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6006438Z         %360 = arith.shrsi %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6006728Z         %361 = tt.expand_dims %359 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6007066Z         %362 = tt.expand_dims %360 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6007370Z         %363 = tt.broadcast %361 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6007612Z         %364 = arith.select %17, %363, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6007852Z         %365 = tt.broadcast %362 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6008087Z         %366 = arith.select %19, %365, %364 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6008322Z         %367 = tt.reshape %366 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6008546Z         %368 = arith.sitofp %367 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6008818Z         %369 = ttg.local_alloc %368 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6009146Z         %370 = ttg.local_load %369 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6009623Z         %371 = tt.dot %343, %370, %331, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6009977Z         scf.yield %371 : tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6010113Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:51:05.6010255Z       %63 = arith.addi %52, %29 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6010454Z       %64 = tt.addptr %8, %63 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6010655Z       %65 = tt.load %64 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6010894Z       %66 = ttg.local_alloc %65 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6011215Z       %67 = ttg.local_load %66 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6011621Z       %68 = arith.extf %67 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6011951Z       %69 = arith.addi %33, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6012256Z       %70 = tt.addptr %9, %69 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6012558Z       %71 = arith.andi %37, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6012794Z       %72 = tt.load %70, %71, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6013032Z       %73 = arith.shli %72, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6013266Z       %74 = arith.shrsi %73, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6013494Z       %75 = arith.shrsi %72, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6013776Z       %76 = tt.expand_dims %74 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6014121Z       %77 = tt.expand_dims %75 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6014399Z       %78 = tt.broadcast %76 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6014635Z       %79 = arith.select %17, %78, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6014869Z       %80 = tt.broadcast %77 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6015097Z       %81 = arith.select %19, %80, %79 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6015341Z       %82 = tt.reshape %81 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6015559Z       %83 = arith.sitofp %82 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6015808Z       %84 = ttg.local_alloc %83 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6016128Z       %85 = ttg.local_load %84 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6016604Z       %86 = tt.dot %68, %85, %62, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6016987Z       %87 = arith.truncf %86 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:51:05.6017159Z       %88 = arith.extsi %47 : i32 to i64
2026-02-21T09:51:05.6017324Z       %89 = tt.splat %88 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:05.6017533Z       %90 = arith.addi %89, %21 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:05.6017799Z       %91 = tt.expand_dims %90 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6018040Z       %92 = arith.muli %91, %cst_15 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6018218Z       %93 = tt.broadcast %92 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6018423Z       %94 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:05.6018627Z       %95 = arith.addi %94, %22 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:05.6018905Z       %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6019168Z       %97 = tt.broadcast %96 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6019346Z       %98 = arith.addi %93, %97 : tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6019534Z       %99 = tt.addptr %20, %98 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6019734Z       %100 = arith.cmpi sge, %91, %cst_16 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6019902Z       %101 = arith.cmpi slt, %91, %cst_17 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6020059Z       %102 = arith.andi %100, %101 : tensor<128x1xi1, #mma>
2026-02-21T09:51:05.6020230Z       %103 = tt.broadcast %102 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6020414Z       %104 = arith.cmpi sge, %96, %cst_13 : tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6020576Z       %105 = arith.cmpi slt, %96, %cst_14 : tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6020730Z       %106 = arith.andi %104, %105 : tensor<1x128xi1, #mma>
2026-02-21T09:51:05.6020900Z       %107 = tt.broadcast %106 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6021078Z       %108 = arith.andi %103, %107 : tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6021239Z       tt.store %99, %87, %108 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:05.6021382Z       %109 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:51:05.6021520Z       %110 = arith.divsi %109, %c256_i32 : i32
2026-02-21T09:51:05.6021639Z       %111 = arith.muli %110, %c2_i32 : i32
2026-02-21T09:51:05.6021755Z       %112 = arith.subi %c64_i32, %111 : i32
2026-02-21T09:51:05.6021870Z       %113 = arith.minsi %112, %c2_i32 : i32
2026-02-21T09:51:05.6021988Z       %114 = arith.remsi %109, %c256_i32 : i32
2026-02-21T09:51:05.6022102Z       %115 = arith.remsi %114, %113 : i32
2026-02-21T09:51:05.6022216Z       %116 = arith.addi %111, %115 : i32
2026-02-21T09:51:05.6022328Z       %117 = arith.divsi %114, %113 : i32
2026-02-21T09:51:05.6022444Z       %118 = arith.muli %116, %c128_i32 : i32
2026-02-21T09:51:05.6022580Z       %119 = arith.muli %117, %c128_i32 : i32
2026-02-21T09:51:05.6022749Z       %120 = tt.splat %119 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:05.6022975Z       %121 = arith.addi %120, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:05.6023257Z       %122 = tt.expand_dims %121 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:51:05.6023514Z       %123 = arith.muli %122, %cst_6 : tensor<128x1xi32, #blocked1>
2026-02-21T09:51:05.6023709Z       %124 = tt.broadcast %123 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6023882Z       %125 = arith.extsi %118 : i32 to i64
2026-02-21T09:51:05.6024103Z       %126 = tt.splat %125 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6024400Z       %127 = arith.addi %126, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6024800Z       %128 = tt.expand_dims %127 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6025239Z       %129 = tt.broadcast %128 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6025560Z       %130 = arith.cmpi sge, %128, %cst_5 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6025812Z       %131 = arith.cmpi slt, %128, %cst_4 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6026053Z       %132 = arith.andi %130, %131 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6026369Z       %133 = tt.broadcast %132 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6026713Z       %134 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:51:05.6026926Z         %253 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:51:05.6027099Z         %254 = tt.splat %253 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6027325Z         %255 = arith.addi %254, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6027600Z         %256 = tt.expand_dims %255 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.6027881Z         %257 = tt.broadcast %256 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6028076Z         %258 = arith.addi %124, %257 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6028282Z         %259 = tt.addptr %8, %258 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6028492Z         %260 = tt.load %259 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6028715Z         %261 = ttg.local_alloc %260 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6029050Z         %262 = ttg.local_load %261 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6029471Z         %263 = arith.extf %262 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6029754Z         %264 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:05.6029965Z         %265 = tt.splat %264 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6030258Z         %266 = arith.addi %265, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6030662Z         %267 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6031015Z         %268 = arith.muli %267, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6031325Z         %269 = tt.broadcast %268 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6031634Z         %270 = arith.addi %269, %129 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6031962Z         %271 = tt.addptr %9, %270 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6032280Z         %272 = arith.cmpi sge, %267, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6032530Z         %273 = arith.cmpi slt, %267, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6032763Z         %274 = arith.andi %272, %273 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6033062Z         %275 = tt.broadcast %274 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6033362Z         %276 = arith.andi %275, %133 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6033608Z         %277 = tt.load %271, %276, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6033857Z         %278 = arith.shli %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6034111Z         %279 = arith.shrsi %278, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6034347Z         %280 = arith.shrsi %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6034636Z         %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6034977Z         %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6035267Z         %283 = tt.broadcast %281 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6035509Z         %284 = arith.select %17, %283, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6035752Z         %285 = tt.broadcast %282 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6035990Z         %286 = arith.select %19, %285, %284 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6036220Z         %287 = tt.reshape %286 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6036448Z         %288 = arith.sitofp %287 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6036702Z         %289 = ttg.local_alloc %288 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6037051Z         %290 = ttg.local_load %289 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6037532Z         %291 = tt.dot %263, %290, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6037882Z         %292 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:51:05.6038012Z         %293 = arith.muli %292, %c2_i32 : i32
2026-02-21T09:51:05.6038186Z         %294 = tt.splat %293 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6038434Z         %295 = arith.addi %294, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6038718Z         %296 = tt.expand_dims %295 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.6038997Z         %297 = tt.broadcast %296 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6039203Z         %298 = arith.addi %124, %297 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6039408Z         %299 = tt.addptr %8, %298 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6039629Z         %300 = tt.load %299 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6039892Z         %301 = ttg.local_alloc %300 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6040229Z         %302 = ttg.local_load %301 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6040643Z         %303 = arith.extf %302 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6040932Z         %304 = arith.extsi %292 : i32 to i64
2026-02-21T09:51:05.6041150Z         %305 = tt.splat %304 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6041453Z         %306 = arith.addi %305, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6041851Z         %307 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6042231Z         %308 = arith.muli %307, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6042550Z         %309 = tt.broadcast %308 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6042900Z         %310 = arith.addi %309, %129 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6043222Z         %311 = tt.addptr %9, %310 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6043545Z         %312 = arith.cmpi sge, %307, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6043800Z         %313 = arith.cmpi slt, %307, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6044043Z         %314 = arith.andi %312, %313 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6044350Z         %315 = tt.broadcast %314 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6044660Z         %316 = arith.andi %315, %133 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6044909Z         %317 = tt.load %311, %316, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6045183Z         %318 = arith.shli %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6045425Z         %319 = arith.shrsi %318, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6045663Z         %320 = arith.shrsi %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6045959Z         %321 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6046305Z         %322 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6046618Z         %323 = tt.broadcast %321 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6046871Z         %324 = arith.select %17, %323, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6047113Z         %325 = tt.broadcast %322 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6047359Z         %326 = arith.select %19, %325, %324 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6047600Z         %327 = tt.reshape %326 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6047830Z         %328 = arith.sitofp %327 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6048114Z         %329 = ttg.local_alloc %328 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6048447Z         %330 = ttg.local_load %329 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6048930Z         %331 = tt.dot %303, %330, %291, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6049287Z         %332 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:05.6049416Z         %333 = arith.muli %332, %c2_i32 : i32
2026-02-21T09:51:05.6049594Z         %334 = tt.splat %333 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6049819Z         %335 = arith.addi %334, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6050102Z         %336 = tt.expand_dims %335 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.6050403Z         %337 = tt.broadcast %336 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6050603Z         %338 = arith.addi %124, %337 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6050810Z         %339 = tt.addptr %8, %338 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6051020Z         %340 = tt.load %339 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6051251Z         %341 = ttg.local_alloc %340 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6051588Z         %342 = ttg.local_load %341 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6052000Z         %343 = arith.extf %342 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6052293Z         %344 = arith.extsi %332 : i32 to i64
2026-02-21T09:51:05.6052506Z         %345 = tt.splat %344 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6052816Z         %346 = arith.addi %345, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6053211Z         %347 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6053585Z         %348 = arith.muli %347, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6053905Z         %349 = tt.broadcast %348 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6054226Z         %350 = arith.addi %349, %129 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6054542Z         %351 = tt.addptr %9, %350 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6054882Z         %352 = arith.cmpi sge, %347, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6055129Z         %353 = arith.cmpi slt, %347, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6055373Z         %354 = arith.andi %352, %353 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6055685Z         %355 = tt.broadcast %354 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6055990Z         %356 = arith.andi %355, %133 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6056260Z         %357 = tt.load %351, %356, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6056512Z         %358 = arith.shli %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6056756Z         %359 = arith.shrsi %358, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6057002Z         %360 = arith.shrsi %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6057344Z         %361 = tt.expand_dims %359 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6057692Z         %362 = tt.expand_dims %360 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6057988Z         %363 = tt.broadcast %361 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6058235Z         %364 = arith.select %17, %363, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6058500Z         %365 = tt.broadcast %362 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6058744Z         %366 = arith.select %19, %365, %364 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6058982Z         %367 = tt.reshape %366 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6059216Z         %368 = arith.sitofp %367 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6059475Z         %369 = ttg.local_alloc %368 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6059808Z         %370 = ttg.local_load %369 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6060284Z         %371 = tt.dot %343, %370, %331, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6060641Z         scf.yield %371 : tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6060785Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:51:05.6060933Z       %135 = arith.addi %124, %29 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6061145Z       %136 = tt.addptr %8, %135 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6061372Z       %137 = tt.load %136 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6061604Z       %138 = ttg.local_alloc %137 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6061949Z       %139 = ttg.local_load %138 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6062367Z       %140 = arith.extf %139 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6062710Z       %141 = arith.addi %33, %129 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6063046Z       %142 = tt.addptr %9, %141 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6063360Z       %143 = arith.andi %37, %133 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6063616Z       %144 = tt.load %142, %143, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6063865Z       %145 = arith.shli %144, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6064107Z       %146 = arith.shrsi %145, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6064368Z       %147 = arith.shrsi %144, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6064662Z       %148 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6065007Z       %149 = tt.expand_dims %147 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6065294Z       %150 = tt.broadcast %148 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6065541Z       %151 = arith.select %17, %150, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6065792Z       %152 = tt.broadcast %149 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6066030Z       %153 = arith.select %19, %152, %151 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6066269Z       %154 = tt.reshape %153 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6066515Z       %155 = arith.sitofp %154 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6066779Z       %156 = ttg.local_alloc %155 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6067113Z       %157 = ttg.local_load %156 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6067583Z       %158 = tt.dot %140, %157, %134, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6067982Z       %159 = arith.truncf %158 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:51:05.6068161Z       %160 = arith.extsi %119 : i32 to i64
2026-02-21T09:51:05.6068334Z       %161 = tt.splat %160 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:05.6068557Z       %162 = arith.addi %161, %21 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:05.6068826Z       %163 = tt.expand_dims %162 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6069076Z       %164 = arith.muli %163, %cst_15 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6069263Z       %165 = tt.broadcast %164 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6069481Z       %166 = tt.splat %125 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:05.6069712Z       %167 = arith.addi %166, %22 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:05.6069980Z       %168 = tt.expand_dims %167 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6070251Z       %169 = tt.broadcast %168 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6070439Z       %170 = arith.addi %165, %169 : tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6070640Z       %171 = tt.addptr %20, %170 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6070877Z       %172 = arith.cmpi sge, %163, %cst_16 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6071048Z       %173 = arith.cmpi slt, %163, %cst_17 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6071214Z       %174 = arith.andi %172, %173 : tensor<128x1xi1, #mma>
2026-02-21T09:51:05.6071394Z       %175 = tt.broadcast %174 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6071589Z       %176 = arith.cmpi sge, %168, %cst_13 : tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6071759Z       %177 = arith.cmpi slt, %168, %cst_14 : tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6071925Z       %178 = arith.andi %176, %177 : tensor<1x128xi1, #mma>
2026-02-21T09:51:05.6072105Z       %179 = tt.broadcast %178 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6072305Z       %180 = arith.andi %175, %179 : tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6072477Z       tt.store %171, %159, %180 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:05.6072629Z       %181 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:51:05.6072762Z       %182 = arith.divsi %181, %c256_i32 : i32
2026-02-21T09:51:05.6072888Z       %183 = arith.muli %182, %c2_i32 : i32
2026-02-21T09:51:05.6073013Z       %184 = arith.subi %c64_i32, %183 : i32
2026-02-21T09:51:05.6073138Z       %185 = arith.minsi %184, %c2_i32 : i32
2026-02-21T09:51:05.6073262Z       %186 = arith.remsi %181, %c256_i32 : i32
2026-02-21T09:51:05.6073389Z       %187 = arith.remsi %186, %185 : i32
2026-02-21T09:51:05.6073508Z       %188 = arith.addi %183, %187 : i32
2026-02-21T09:51:05.6073630Z       %189 = arith.divsi %186, %185 : i32
2026-02-21T09:51:05.6073750Z       %190 = arith.muli %188, %c128_i32 : i32
2026-02-21T09:51:05.6073875Z       %191 = arith.muli %189, %c128_i32 : i32
2026-02-21T09:51:05.6074053Z       %192 = tt.splat %191 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:05.6074307Z       %193 = arith.addi %192, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:05.6074600Z       %194 = tt.expand_dims %193 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:51:05.6074858Z       %195 = arith.muli %194, %cst_6 : tensor<128x1xi32, #blocked1>
2026-02-21T09:51:05.6075062Z       %196 = tt.broadcast %195 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6075241Z       %197 = arith.extsi %190 : i32 to i64
2026-02-21T09:51:05.6075454Z       %198 = tt.splat %197 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6075763Z       %199 = arith.addi %198, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6076163Z       %200 = tt.expand_dims %199 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6076608Z       %201 = tt.broadcast %200 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6076938Z       %202 = arith.cmpi sge, %200, %cst_5 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6077205Z       %203 = arith.cmpi slt, %200, %cst_4 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6077447Z       %204 = arith.andi %202, %203 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6077752Z       %205 = tt.broadcast %204 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6078096Z       %206 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:51:05.6078314Z         %253 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:51:05.6078487Z         %254 = tt.splat %253 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6078732Z         %255 = arith.addi %254, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6079007Z         %256 = tt.expand_dims %255 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.6079288Z         %257 = tt.broadcast %256 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6079487Z         %258 = arith.addi %196, %257 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6079691Z         %259 = tt.addptr %8, %258 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6079902Z         %260 = tt.load %259 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6080142Z         %261 = ttg.local_alloc %260 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6080475Z         %262 = ttg.local_load %261 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6080888Z         %263 = arith.extf %262 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6081172Z         %264 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:05.6081384Z         %265 = tt.splat %264 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6081680Z         %266 = arith.addi %265, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6082089Z         %267 = tt.expand_dims %266 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6082444Z         %268 = arith.muli %267, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6082818Z         %269 = tt.broadcast %268 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6083130Z         %270 = arith.addi %269, %201 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6083447Z         %271 = tt.addptr %9, %270 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6083766Z         %272 = arith.cmpi sge, %267, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6084014Z         %273 = arith.cmpi slt, %267, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6084252Z         %274 = arith.andi %272, %273 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6084555Z         %275 = tt.broadcast %274 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6084859Z         %276 = arith.andi %275, %205 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6085105Z         %277 = tt.load %271, %276, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6085377Z         %278 = arith.shli %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6085610Z         %279 = arith.shrsi %278, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6085848Z         %280 = arith.shrsi %277, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6086141Z         %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6086479Z         %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6086792Z         %283 = tt.broadcast %281 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6087036Z         %284 = arith.select %17, %283, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6087276Z         %285 = tt.broadcast %282 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6087520Z         %286 = arith.select %19, %285, %284 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6087754Z         %287 = tt.reshape %286 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6087982Z         %288 = arith.sitofp %287 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6088256Z         %289 = ttg.local_alloc %288 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6088587Z         %290 = ttg.local_load %289 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6089066Z         %291 = tt.dot %263, %290, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6089416Z         %292 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:51:05.6089542Z         %293 = arith.muli %292, %c2_i32 : i32
2026-02-21T09:51:05.6089716Z         %294 = tt.splat %293 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6089939Z         %295 = arith.addi %294, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6090240Z         %296 = tt.expand_dims %295 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.6090517Z         %297 = tt.broadcast %296 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6090717Z         %298 = arith.addi %196, %297 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6090921Z         %299 = tt.addptr %8, %298 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6091129Z         %300 = tt.load %299 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6091354Z         %301 = ttg.local_alloc %300 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6091683Z         %302 = ttg.local_load %301 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6092101Z         %303 = arith.extf %302 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6092386Z         %304 = arith.extsi %292 : i32 to i64
2026-02-21T09:51:05.6092595Z         %305 = tt.splat %304 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6092897Z         %306 = arith.addi %305, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6093303Z         %307 = tt.expand_dims %306 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6093658Z         %308 = arith.muli %307, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6093971Z         %309 = tt.broadcast %308 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6094283Z         %310 = arith.addi %309, %201 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6094600Z         %311 = tt.addptr %9, %310 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6094936Z         %312 = arith.cmpi sge, %307, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6095181Z         %313 = arith.cmpi slt, %307, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6095418Z         %314 = arith.andi %312, %313 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6095718Z         %315 = tt.broadcast %314 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6096021Z         %316 = arith.andi %315, %205 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6096286Z         %317 = tt.load %311, %316, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6096532Z         %318 = arith.shli %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6096772Z         %319 = arith.shrsi %318, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6097007Z         %320 = arith.shrsi %317, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6097301Z         %321 = tt.expand_dims %319 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6097644Z         %322 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6097934Z         %323 = tt.broadcast %321 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6098199Z         %324 = arith.select %17, %323, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6098440Z         %325 = tt.broadcast %322 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6098675Z         %326 = arith.select %19, %325, %324 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6098910Z         %327 = tt.reshape %326 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6099137Z         %328 = arith.sitofp %327 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6099397Z         %329 = ttg.local_alloc %328 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6099728Z         %330 = ttg.local_load %329 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6100207Z         %331 = tt.dot %303, %330, %291, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6100557Z         %332 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:05.6100680Z         %333 = arith.muli %332, %c2_i32 : i32
2026-02-21T09:51:05.6100855Z         %334 = tt.splat %333 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6101080Z         %335 = arith.addi %334, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6101385Z         %336 = tt.expand_dims %335 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.6101665Z         %337 = tt.broadcast %336 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6101862Z         %338 = arith.addi %196, %337 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6102069Z         %339 = tt.addptr %8, %338 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6102277Z         %340 = tt.load %339 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6102519Z         %341 = ttg.local_alloc %340 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6102855Z         %342 = ttg.local_load %341 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6103266Z         %343 = arith.extf %342 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6103553Z         %344 = arith.extsi %332 : i32 to i64
2026-02-21T09:51:05.6103766Z         %345 = tt.splat %344 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6104081Z         %346 = arith.addi %345, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6104474Z         %347 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6104839Z         %348 = arith.muli %347, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6105149Z         %349 = tt.broadcast %348 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6105462Z         %350 = arith.addi %349, %201 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6105776Z         %351 = tt.addptr %9, %350 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6106100Z         %352 = arith.cmpi sge, %347, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6106365Z         %353 = arith.cmpi slt, %347, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6106602Z         %354 = arith.andi %352, %353 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6106905Z         %355 = tt.broadcast %354 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6107208Z         %356 = arith.andi %355, %205 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6107456Z         %357 = tt.load %351, %356, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6107706Z         %358 = arith.shli %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6107941Z         %359 = arith.shrsi %358, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6108180Z         %360 = arith.shrsi %357, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6108469Z         %361 = tt.expand_dims %359 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6108810Z         %362 = tt.expand_dims %360 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6109119Z         %363 = tt.broadcast %361 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6109360Z         %364 = arith.select %17, %363, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6109603Z         %365 = tt.broadcast %362 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6109840Z         %366 = arith.select %19, %365, %364 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6110072Z         %367 = tt.reshape %366 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6110302Z         %368 = arith.sitofp %367 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6110575Z         %369 = ttg.local_alloc %368 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6110905Z         %370 = ttg.local_load %369 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6111379Z         %371 = tt.dot %343, %370, %331, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6111730Z         scf.yield %371 : tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6111867Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:51:05.6112027Z       %207 = arith.addi %196, %29 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6112234Z       %208 = tt.addptr %8, %207 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6112445Z       %209 = tt.load %208 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6112669Z       %210 = ttg.local_alloc %209 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6113002Z       %211 = ttg.local_load %210 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6113413Z       %212 = arith.extf %211 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6113750Z       %213 = arith.addi %33, %201 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6114079Z       %214 = tt.addptr %9, %213 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6114388Z       %215 = arith.andi %37, %205 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6114635Z       %216 = tt.load %214, %215, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6114880Z       %217 = arith.shli %216, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6115120Z       %218 = arith.shrsi %217, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6115358Z       %219 = arith.shrsi %216, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6115647Z       %220 = tt.expand_dims %218 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6115986Z       %221 = tt.expand_dims %219 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6116274Z       %222 = tt.broadcast %220 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6116516Z       %223 = arith.select %17, %222, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6116758Z       %224 = tt.broadcast %221 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6116995Z       %225 = arith.select %19, %224, %223 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6117247Z       %226 = tt.reshape %225 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6117477Z       %227 = arith.sitofp %226 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6117733Z       %228 = ttg.local_alloc %227 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6118063Z       %229 = ttg.local_load %228 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6118529Z       %230 = tt.dot %212, %229, %206, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6118939Z       %231 = arith.truncf %230 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:51:05.6119118Z       %232 = arith.extsi %191 : i32 to i64
2026-02-21T09:51:05.6119283Z       %233 = tt.splat %232 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:05.6119497Z       %234 = arith.addi %233, %21 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:05.6119764Z       %235 = tt.expand_dims %234 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6120027Z       %236 = arith.muli %235, %cst_15 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6120213Z       %237 = tt.broadcast %236 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6120423Z       %238 = tt.splat %197 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:05.6120639Z       %239 = arith.addi %238, %22 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:05.6120903Z       %240 = tt.expand_dims %239 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6121170Z       %241 = tt.broadcast %240 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6121357Z       %242 = arith.addi %237, %241 : tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6121549Z       %243 = tt.addptr %20, %242 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6121755Z       %244 = arith.cmpi sge, %235, %cst_16 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6121924Z       %245 = arith.cmpi slt, %235, %cst_17 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6122102Z       %246 = arith.andi %244, %245 : tensor<128x1xi1, #mma>
2026-02-21T09:51:05.6122280Z       %247 = tt.broadcast %246 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6122470Z       %248 = arith.cmpi sge, %240, %cst_13 : tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6122684Z       %249 = arith.cmpi slt, %240, %cst_14 : tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6122842Z       %250 = arith.andi %248, %249 : tensor<1x128xi1, #mma>
2026-02-21T09:51:05.6123018Z       %251 = tt.broadcast %250 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6123201Z       %252 = arith.andi %247, %251 : tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6123372Z       tt.store %243, %231, %252 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:05.6123521Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:51:05.6123647Z     scf.for %arg3 = %26 to %2 step %c1_i32  : i32 {
2026-02-21T09:51:05.6123787Z       %38 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:51:05.6123910Z       %39 = arith.muli %38, %c2_i32 : i32
2026-02-21T09:51:05.6124031Z       %40 = arith.subi %c64_i32, %39 : i32
2026-02-21T09:51:05.6124148Z       %41 = arith.minsi %40, %c2_i32 : i32
2026-02-21T09:51:05.6124269Z       %42 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:51:05.6124388Z       %43 = arith.remsi %42, %41 : i32
2026-02-21T09:51:05.6124502Z       %44 = arith.addi %39, %43 : i32
2026-02-21T09:51:05.6124616Z       %45 = arith.divsi %42, %41 : i32
2026-02-21T09:51:05.6124748Z       %46 = arith.muli %44, %c128_i32 : i32
2026-02-21T09:51:05.6124867Z       %47 = arith.muli %45, %c128_i32 : i32
2026-02-21T09:51:05.6125035Z       %48 = tt.splat %47 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:05.6125259Z       %49 = arith.addi %48, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:05.6125536Z       %50 = tt.expand_dims %49 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:51:05.6125790Z       %51 = arith.muli %50, %cst_6 : tensor<128x1xi32, #blocked1>
2026-02-21T09:51:05.6125985Z       %52 = tt.broadcast %51 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6126183Z       %53 = arith.extsi %46 : i32 to i64
2026-02-21T09:51:05.6126390Z       %54 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6126688Z       %55 = arith.addi %54, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6127079Z       %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6127529Z       %57 = tt.broadcast %56 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6127842Z       %58 = arith.cmpi sge, %56, %cst_5 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6128091Z       %59 = arith.cmpi slt, %56, %cst_4 : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6128323Z       %60 = arith.andi %58, %59 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6128622Z       %61 = tt.broadcast %60 : tensor<1x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6128959Z       %62 = scf.for %arg4 = %c0_i32 to %c510_i32 step %c6_i32 iter_args(%arg5 = %cst) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:51:05.6129174Z         %109 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:51:05.6129352Z         %110 = tt.splat %109 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6129584Z         %111 = arith.addi %110, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6129883Z         %112 = tt.expand_dims %111 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.6130168Z         %113 = tt.broadcast %112 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6130365Z         %114 = arith.addi %52, %113 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6130573Z         %115 = tt.addptr %8, %114 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6130787Z         %116 = tt.load %115 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6131015Z         %117 = ttg.local_alloc %116 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6131354Z         %118 = ttg.local_load %117 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6131769Z         %119 = arith.extf %118 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6132062Z         %120 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:05.6139556Z         %121 = tt.splat %120 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6139888Z         %122 = arith.addi %121, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6140330Z         %123 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6140691Z         %124 = arith.muli %123, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6141014Z         %125 = tt.broadcast %124 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6141324Z         %126 = arith.addi %125, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6141674Z         %127 = tt.addptr %9, %126 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6141996Z         %128 = arith.cmpi sge, %123, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6142242Z         %129 = arith.cmpi slt, %123, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6142481Z         %130 = arith.andi %128, %129 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6142784Z         %131 = tt.broadcast %130 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6143102Z         %132 = arith.andi %131, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6143350Z         %133 = tt.load %127, %132, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6143600Z         %134 = arith.shli %133, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6143839Z         %135 = arith.shrsi %134, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6144078Z         %136 = arith.shrsi %133, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6144367Z         %137 = tt.expand_dims %135 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6144706Z         %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6144994Z         %139 = tt.broadcast %137 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6145263Z         %140 = arith.select %17, %139, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6145507Z         %141 = tt.broadcast %138 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6145743Z         %142 = arith.select %19, %141, %140 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6145979Z         %143 = tt.reshape %142 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6146206Z         %144 = arith.sitofp %143 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6146462Z         %145 = ttg.local_alloc %144 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6146792Z         %146 = ttg.local_load %145 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6147278Z         %147 = tt.dot %119, %146, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6147632Z         %148 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:51:05.6147757Z         %149 = arith.muli %148, %c2_i32 : i32
2026-02-21T09:51:05.6147929Z         %150 = tt.splat %149 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6148169Z         %151 = arith.addi %150, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6148442Z         %152 = tt.expand_dims %151 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.6148719Z         %153 = tt.broadcast %152 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6148916Z         %154 = arith.addi %52, %153 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6149123Z         %155 = tt.addptr %8, %154 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6149352Z         %156 = tt.load %155 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6149576Z         %157 = ttg.local_alloc %156 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6149907Z         %158 = ttg.local_load %157 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6150314Z         %159 = arith.extf %158 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6150600Z         %160 = arith.extsi %148 : i32 to i64
2026-02-21T09:51:05.6150812Z         %161 = tt.splat %160 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6151124Z         %162 = arith.addi %161, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6151512Z         %163 = tt.expand_dims %162 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6151866Z         %164 = arith.muli %163, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6152176Z         %165 = tt.broadcast %164 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6152485Z         %166 = arith.addi %165, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6152798Z         %167 = tt.addptr %9, %166 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6153134Z         %168 = arith.cmpi sge, %163, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6153378Z         %169 = arith.cmpi slt, %163, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6153615Z         %170 = arith.andi %168, %169 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6153920Z         %171 = tt.broadcast %170 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6154221Z         %172 = arith.andi %171, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6154464Z         %173 = tt.load %167, %172, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6154712Z         %174 = arith.shli %173, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6154947Z         %175 = arith.shrsi %174, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6155184Z         %176 = arith.shrsi %173, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6155476Z         %177 = tt.expand_dims %175 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6155814Z         %178 = tt.expand_dims %176 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6156117Z         %179 = tt.broadcast %177 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6156358Z         %180 = arith.select %17, %179, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6156598Z         %181 = tt.broadcast %178 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6156834Z         %182 = arith.select %19, %181, %180 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6157067Z         %183 = tt.reshape %182 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6157317Z         %184 = arith.sitofp %183 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6157572Z         %185 = ttg.local_alloc %184 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6157902Z         %186 = ttg.local_load %185 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6158376Z         %187 = tt.dot %159, %186, %147, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6158723Z         %188 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:05.6158867Z         %189 = arith.muli %188, %c2_i32 : i32
2026-02-21T09:51:05.6159041Z         %190 = tt.splat %189 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6159263Z         %191 = arith.addi %190, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:05.6159540Z         %192 = tt.expand_dims %191 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:05.6159814Z         %193 = tt.broadcast %192 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6160012Z         %194 = arith.addi %52, %193 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6160211Z         %195 = tt.addptr %8, %194 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6160418Z         %196 = tt.load %195 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6160645Z         %197 = ttg.local_alloc %196 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6160992Z         %198 = ttg.local_load %197 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6161404Z         %199 = arith.extf %198 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6161686Z         %200 = arith.extsi %188 : i32 to i64
2026-02-21T09:51:05.6161899Z         %201 = tt.splat %200 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6162198Z         %202 = arith.addi %201, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:05.6162643Z         %203 = tt.expand_dims %202 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6163004Z         %204 = arith.muli %203, %cst_10 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6163315Z         %205 = tt.broadcast %204 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6163621Z         %206 = arith.addi %205, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6163933Z         %207 = tt.addptr %9, %206 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6164278Z         %208 = arith.cmpi sge, %203, %cst_11 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6164523Z         %209 = arith.cmpi slt, %203, %cst_12 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6164759Z         %210 = arith.andi %208, %209 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6165061Z         %211 = tt.broadcast %210 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6165386Z         %212 = arith.andi %211, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6165630Z         %213 = tt.load %207, %212, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6165875Z         %214 = arith.shli %213, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6166111Z         %215 = arith.shrsi %214, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6166345Z         %216 = arith.shrsi %213, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6166636Z         %217 = tt.expand_dims %215 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6166998Z         %218 = tt.expand_dims %216 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6167284Z         %219 = tt.broadcast %217 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6167526Z         %220 = arith.select %17, %219, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6167766Z         %221 = tt.broadcast %218 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6168003Z         %222 = arith.select %19, %221, %220 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6168236Z         %223 = tt.reshape %222 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6168460Z         %224 = arith.sitofp %223 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6168718Z         %225 = ttg.local_alloc %224 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6169065Z         %226 = ttg.local_load %225 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6169536Z         %227 = tt.dot %199, %226, %187, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6169889Z         scf.yield %227 : tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6170026Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:51:05.6170168Z       %63 = arith.addi %52, %29 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6170363Z       %64 = tt.addptr %8, %63 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:05.6170563Z       %65 = tt.load %64 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:05.6170784Z       %66 = ttg.local_alloc %65 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:05.6171109Z       %67 = ttg.local_load %66 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6171518Z       %68 = arith.extf %67 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6171845Z       %69 = arith.addi %33, %57 : tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6172169Z       %70 = tt.addptr %9, %69 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x128xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6172471Z       %71 = arith.andi %37, %61 : tensor<2x128xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6172703Z       %72 = tt.load %70, %71, %cst_3 : tensor<2x128x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6172943Z       %73 = arith.shli %72, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6173171Z       %74 = arith.shrsi %73, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6173416Z       %75 = arith.shrsi %72, %cst_7 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:05.6173698Z       %76 = tt.expand_dims %74 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6174027Z       %77 = tt.expand_dims %75 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:51:05.6174306Z       %78 = tt.broadcast %76 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6174545Z       %79 = arith.select %17, %78, %cst_0 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6174800Z       %80 = tt.broadcast %77 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6175029Z       %81 = arith.select %19, %80, %79 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:51:05.6175253Z       %82 = tt.reshape %81 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:51:05.6175477Z       %83 = arith.sitofp %82 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:51:05.6175724Z       %84 = ttg.local_alloc %83 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:05.6176042Z       %85 = ttg.local_load %84 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:05.6176503Z       %86 = tt.dot %68, %85, %62, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:05.6176901Z       %87 = arith.truncf %86 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:51:05.6177079Z       %88 = arith.extsi %47 : i32 to i64
2026-02-21T09:51:05.6177242Z       %89 = tt.splat %88 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:05.6177448Z       %90 = arith.addi %89, %21 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:05.6177710Z       %91 = tt.expand_dims %90 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6177947Z       %92 = arith.muli %91, %cst_15 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6178127Z       %93 = tt.broadcast %92 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6178329Z       %94 = tt.splat %53 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:05.6178529Z       %95 = arith.addi %94, %22 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:05.6178791Z       %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6179047Z       %97 = tt.broadcast %96 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6179228Z       %98 = arith.addi %93, %97 : tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6179417Z       %99 = tt.addptr %20, %98 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi64, #mma>
2026-02-21T09:51:05.6179634Z       %100 = arith.cmpi sge, %91, %cst_16 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6179803Z       %101 = arith.cmpi slt, %91, %cst_17 : tensor<128x1xi64, #mma>
2026-02-21T09:51:05.6179959Z       %102 = arith.andi %100, %101 : tensor<128x1xi1, #mma>
2026-02-21T09:51:05.6180134Z       %103 = tt.broadcast %102 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6180318Z       %104 = arith.cmpi sge, %96, %cst_13 : tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6180483Z       %105 = arith.cmpi slt, %96, %cst_14 : tensor<1x128xi64, #mma>
2026-02-21T09:51:05.6180640Z       %106 = arith.andi %104, %105 : tensor<1x128xi1, #mma>
2026-02-21T09:51:05.6180811Z       %107 = tt.broadcast %106 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6181016Z       %108 = arith.andi %103, %107 : tensor<128x128xi1, #mma>
2026-02-21T09:51:05.6181173Z       tt.store %99, %87, %108 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:05.6181317Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:51:05.6181421Z     tt.return
2026-02-21T09:51:05.6181502Z   }
2026-02-21T09:51:05.6181574Z }
2026-02-21T09:51:05.6181622Z 
2026-02-21T09:51:05.6181653Z {-#
2026-02-21T09:51:05.6181734Z   external_resources: {
2026-02-21T09:51:05.6181833Z     mlir_reproducer: {
2026-02-21T09:51:05.6182853Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:51:05.6183842Z       disable_threading: false,
2026-02-21T09:51:05.6183947Z       verify_each: true
2026-02-21T09:51:05.6184037Z     }
2026-02-21T09:51:05.6184107Z   }
2026-02-21T09:51:05.6184176Z #-}
2026-02-21T09:51:05.6184452Z /tmp/torchinductor_root/lb/clbavg5reont7fzmlfj6vv4kl6ka6fhi6slqhyesoznsr4cqh7yc.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:51:05.6185174Z /tmp/torchinductor_root/lb/clbavg5reont7fzmlfj6vv4kl6ka6fhi6slqhyesoznsr4cqh7yc.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:51:05.6185725Z [396s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:51:05.6186502Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T09:51:05.6187206Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:51:05.6187371Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:51:07.4417104Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:51:07.4422245Z #blocked = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:51:07.4422620Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:51:07.4422930Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:51:07.4423603Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:51:07.4423884Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:51:07.4424136Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:51:07.4424369Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:07.4424555Z #smem = #ttg.shared_memory
2026-02-21T09:51:07.4424784Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:51:07.4425402Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:51:07.4425789Z     %cst = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:51:07.4425957Z     %cst_0 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:51:07.4426125Z     %cst_1 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:51:07.4426292Z     %cst_2 = arith.constant dense<8192> : tensor<1x128xi64, #mma>
2026-02-21T09:51:07.4426452Z     %cst_3 = arith.constant dense<0> : tensor<1x128xi64, #mma>
2026-02-21T09:51:07.4430681Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4430858Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4431029Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4431203Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:51:07.4431374Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:51:07.4431555Z     %cst_9 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:51:07.4431713Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:51:07.4431829Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:51:07.4431943Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:51:07.4432128Z     %cst_10 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:07.4432377Z     %cst_11 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:07.4432778Z     %cst_12 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:07.4432988Z     %cst_13 = arith.constant dense<0> : tensor<1x128xi64, #blocked>
2026-02-21T09:51:07.4433163Z     %cst_14 = arith.constant dense<8192> : tensor<1x128xi64, #blocked>
2026-02-21T09:51:07.4433345Z     %cst_15 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:51:07.4433495Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:51:07.4433610Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:51:07.4433725Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:51:07.4433866Z     %cst_16 = arith.constant dense<0> : tensor<2x128xi8, #blocked>
2026-02-21T09:51:07.4434040Z     %cst_17 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4434185Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:51:07.4434299Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:51:07.4434479Z     %cst_18 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4434693Z     %0 = tt.get_program_id x : i32
2026-02-21T09:51:07.4434807Z     %1 = arith.divsi %0, %c512_i32 : i32
2026-02-21T09:51:07.4434922Z     %2 = arith.muli %1, %c8_i32 : i32
2026-02-21T09:51:07.4435035Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:51:07.4435149Z     %4 = arith.minsi %3, %c8_i32 : i32
2026-02-21T09:51:07.4435258Z     %5 = arith.remsi %0, %c512_i32 : i32
2026-02-21T09:51:07.4435375Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:51:07.4435511Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:51:07.4435619Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:51:07.4435725Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:51:07.4435930Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:07.4436209Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:07.4436482Z     %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:51:07.4436752Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:07.4437019Z     %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:07.4437245Z     %15 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:07.4437418Z     %16 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:51:07.4437609Z     %17 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:07.4437913Z     %18 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:51:07.4438165Z     %19 = arith.muli %18, %cst_15 : tensor<128x1xi32, #blocked2>
2026-02-21T09:51:07.4438376Z     %20 = tt.broadcast %19 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:51:07.4438593Z     %21 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:51:07.4438755Z     %22 = arith.extsi %16 : i32 to i64
2026-02-21T09:51:07.4438906Z     %23 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:51:07.4439135Z     %24 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:07.4439449Z     %25 = arith.extsi %24 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:07.4439732Z     %26 = tt.splat %22 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:51:07.4440025Z     %27 = arith.extsi %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:51:07.4440317Z     %28 = arith.addi %26, %27 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:51:07.4440604Z     %29 = tt.expand_dims %28 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked>
2026-02-21T09:51:07.4440878Z     %30 = tt.broadcast %29 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:51:07.4441078Z     %31 = arith.cmpi sge, %29, %cst_13 : tensor<1x128xi64, #blocked>
2026-02-21T09:51:07.4441248Z     %32 = arith.cmpi slt, %29, %cst_14 : tensor<1x128xi64, #blocked>
2026-02-21T09:51:07.4441412Z     %33 = arith.andi %31, %32 : tensor<1x128xi1, #blocked>
2026-02-21T09:51:07.4441591Z     %34 = tt.broadcast %33 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:51:07.4441882Z     %35 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:51:07.4442308Z     %36 = tt.expand_dims %35 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:51:07.4442784Z     %37 = tt.expand_dims %36 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:51:07.4443044Z     %38 = arith.cmpi eq, %37, %cst_8 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:51:07.4443240Z     %39 = tt.broadcast %38 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1>
2026-02-21T09:51:07.4443456Z     %40 = arith.cmpi eq, %37, %cst_7 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:51:07.4443646Z     %41 = tt.broadcast %40 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1>
2026-02-21T09:51:07.4443864Z     %42 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:07.4444132Z     %43 = tt.expand_dims %17 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:51:07.4444401Z     %44 = tt.broadcast %43 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:51:07.4444592Z     %45 = arith.addi %20, %44 : tensor<128x4xi32, #blocked2>
2026-02-21T09:51:07.4444805Z     %46 = tt.addptr %21, %45 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:51:07.4445008Z     %47 = tt.load %46 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:51:07.4445294Z     %48 = ttg.memdesc_index %42[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:51:07.4445651Z     ttg.local_store %47, %48 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:51:07.4445925Z     %49 = arith.addi %17, %cst_10 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:07.4446198Z     %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:51:07.4446483Z     %51 = tt.broadcast %50 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:51:07.4446674Z     %52 = arith.addi %20, %51 : tensor<128x4xi32, #blocked2>
2026-02-21T09:51:07.4446867Z     %53 = tt.addptr %21, %52 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:51:07.4447067Z     %54 = tt.load %53 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:51:07.4447343Z     %55 = ttg.memdesc_index %42[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:51:07.4447696Z     ttg.local_store %54, %55 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:51:07.4448236Z     %56:4 = scf.for %arg3 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg4 = %cst_9, %arg5 = %c1_i32, %arg6 = %48, %arg7 = %55) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:51:07.4448655Z       %140 = arith.addi %arg3, %c4_i32 : i32
2026-02-21T09:51:07.4448782Z       %141 = arith.muli %140, %c2_i32 : i32
2026-02-21T09:51:07.4448953Z       %142 = tt.splat %141 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:07.4449176Z       %143 = arith.addi %142, %17 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:07.4449453Z       %144 = tt.expand_dims %143 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:51:07.4449730Z       %145 = tt.broadcast %144 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:51:07.4449929Z       %146 = arith.addi %20, %145 : tensor<128x4xi32, #blocked2>
2026-02-21T09:51:07.4450133Z       %147 = tt.addptr %21, %146 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:51:07.4450344Z       %148 = tt.load %147 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:51:07.4450650Z       %149 = ttg.local_load %arg6 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:07.4451086Z       %150 = arith.extf %149 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:07.4451371Z       %151 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:51:07.4451557Z       %152 = tt.splat %151 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:07.4451776Z       %153 = arith.addi %152, %25 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:07.4452050Z       %154 = tt.expand_dims %153 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4452291Z       %155 = arith.muli %154, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4452482Z       %156 = tt.broadcast %155 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:51:07.4452674Z       %157 = arith.addi %156, %30 : tensor<2x128xi64, #blocked>
2026-02-21T09:51:07.4452885Z       %158 = tt.addptr %23, %157 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:51:07.4453100Z       %159 = arith.cmpi sge, %154, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4453269Z       %160 = arith.cmpi slt, %154, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4453434Z       %161 = arith.andi %159, %160 : tensor<2x1xi1, #blocked>
2026-02-21T09:51:07.4453617Z       %162 = tt.broadcast %161 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:51:07.4453805Z       %163 = arith.andi %162, %34 : tensor<2x128xi1, #blocked>
2026-02-21T09:51:07.4453975Z       %164 = tt.load %158, %163, %cst_16 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:51:07.4454260Z       %165 = ttg.convert_layout %164 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4454548Z       %166 = arith.shli %165, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4454793Z       %167 = arith.shrsi %166, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4455037Z       %168 = arith.shrsi %165, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4455337Z       %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:51:07.4455679Z       %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:51:07.4455972Z       %171 = tt.broadcast %169 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4456219Z       %172 = arith.select %39, %171, %cst_17 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4456484Z       %173 = tt.broadcast %170 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4456728Z       %174 = arith.select %41, %173, %172 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4456962Z       %175 = tt.reshape %174 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:51:07.4457190Z       %176 = arith.sitofp %175 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:51:07.4457446Z       %177 = ttg.local_alloc %176 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:07.4457774Z       %178 = ttg.local_load %177 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:07.4458258Z       %179 = tt.dot %150, %178, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:07.4458608Z       %180 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T09:51:07.4458738Z       %181 = arith.cmpi slt, %180, %c2_i32 : i32
2026-02-21T09:51:07.4458870Z       %182 = arith.select %181, %180, %c0_i32 : i32
2026-02-21T09:51:07.4459139Z       %183 = ttg.memdesc_index %42[%182] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:51:07.4459517Z       ttg.local_store %148, %183 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:51:07.4459916Z       scf.yield %179, %182, %arg7, %183 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:51:07.4460223Z     } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:51:07.4460501Z     %57 = ttg.local_load %56#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:07.4460948Z     %58 = arith.extf %57 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:07.4461306Z     %59 = arith.addi %25, %cst_11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:07.4461582Z     %60 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4461826Z     %61 = arith.muli %60, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4462013Z     %62 = tt.broadcast %61 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:51:07.4462199Z     %63 = arith.addi %62, %30 : tensor<2x128xi64, #blocked>
2026-02-21T09:51:07.4462423Z     %64 = tt.addptr %23, %63 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:51:07.4462624Z     %65 = arith.cmpi sge, %60, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4462791Z     %66 = arith.cmpi slt, %60, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4462949Z     %67 = arith.andi %65, %66 : tensor<2x1xi1, #blocked>
2026-02-21T09:51:07.4463126Z     %68 = tt.broadcast %67 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:51:07.4463308Z     %69 = arith.andi %68, %34 : tensor<2x128xi1, #blocked>
2026-02-21T09:51:07.4463467Z     %70 = tt.load %64, %69, %cst_16 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:51:07.4463721Z     %71 = ttg.convert_layout %70 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4463999Z     %72 = arith.shli %71, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4464236Z     %73 = arith.shrsi %72, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4464495Z     %74 = arith.shrsi %71, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4464781Z     %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:51:07.4465120Z     %76 = tt.expand_dims %74 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:51:07.4465401Z     %77 = tt.broadcast %75 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4465643Z     %78 = arith.select %39, %77, %cst_17 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4465881Z     %79 = tt.broadcast %76 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4466110Z     %80 = arith.select %41, %79, %78 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4466341Z     %81 = tt.reshape %80 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:51:07.4466558Z     %82 = arith.sitofp %81 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:51:07.4466811Z     %83 = ttg.local_alloc %82 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:07.4467134Z     %84 = ttg.local_load %83 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:07.4467611Z     %85 = tt.dot %58, %84, %56#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:07.4468099Z     %86 = ttg.local_load %56#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:07.4468526Z     %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:07.4468850Z     %88 = arith.addi %25, %cst_12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:07.4469143Z     %89 = tt.expand_dims %88 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4469381Z     %90 = arith.muli %89, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4469564Z     %91 = tt.broadcast %90 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:51:07.4469749Z     %92 = arith.addi %91, %30 : tensor<2x128xi64, #blocked>
2026-02-21T09:51:07.4469937Z     %93 = tt.addptr %23, %92 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:51:07.4470137Z     %94 = arith.cmpi sge, %89, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4470298Z     %95 = arith.cmpi slt, %89, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:51:07.4470471Z     %96 = arith.andi %94, %95 : tensor<2x1xi1, #blocked>
2026-02-21T09:51:07.4470645Z     %97 = tt.broadcast %96 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:51:07.4470826Z     %98 = arith.andi %97, %34 : tensor<2x128xi1, #blocked>
2026-02-21T09:51:07.4470987Z     %99 = tt.load %93, %98, %cst_16 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:51:07.4471239Z     %100 = ttg.convert_layout %99 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4471525Z     %101 = arith.shli %100, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4471767Z     %102 = arith.shrsi %101, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4472013Z     %103 = arith.shrsi %100, %cst_18 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:07.4472331Z     %104 = tt.expand_dims %102 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:51:07.4472674Z     %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:51:07.4472976Z     %106 = tt.broadcast %104 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4473228Z     %107 = arith.select %39, %106, %cst_17 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4473478Z     %108 = tt.broadcast %105 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4473723Z     %109 = arith.select %41, %108, %107 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:51:07.4473961Z     %110 = tt.reshape %109 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:51:07.4474195Z     %111 = arith.sitofp %110 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:51:07.4474456Z     %112 = ttg.local_alloc %111 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:51:07.4474781Z     %113 = ttg.local_load %112 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:07.4475254Z     %114 = tt.dot %87, %113, %85, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:51:07.4475652Z     ttg.local_dealloc %42 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:07.4475875Z     %115 = arith.truncf %114 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:51:07.4476054Z     %116 = arith.extsi %9 : i32 to i64
2026-02-21T09:51:07.4476215Z     %117 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:07.4476428Z     %118 = tt.splat %116 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:07.4476709Z     %119 = arith.extsi %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:07.4477065Z     %120 = arith.extsi %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:07.4477349Z     %121 = arith.addi %118, %119 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:07.4477621Z     %122 = tt.expand_dims %121 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:51:07.4477866Z     %123 = arith.muli %122, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:51:07.4478049Z     %124 = tt.broadcast %123 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:07.4478263Z     %125 = tt.splat %22 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:07.4478493Z     %126 = arith.addi %125, %120 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:07.4478761Z     %127 = tt.expand_dims %126 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:51:07.4479031Z     %128 = tt.broadcast %127 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:51:07.4479217Z     %129 = arith.addi %124, %128 : tensor<128x128xi64, #mma>
2026-02-21T09:51:07.4479417Z     %130 = tt.addptr %117, %129 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi64, #mma>
2026-02-21T09:51:07.4479628Z     %131 = arith.cmpi sge, %122, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:51:07.4479798Z     %132 = arith.cmpi slt, %122, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:51:07.4479960Z     %133 = arith.andi %131, %132 : tensor<128x1xi1, #mma>
2026-02-21T09:51:07.4480136Z     %134 = tt.broadcast %133 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:07.4480330Z     %135 = arith.cmpi sge, %127, %cst_3 : tensor<1x128xi64, #mma>
2026-02-21T09:51:07.4480513Z     %136 = arith.cmpi slt, %127, %cst_2 : tensor<1x128xi64, #mma>
2026-02-21T09:51:07.4480677Z     %137 = arith.andi %135, %136 : tensor<1x128xi1, #mma>
2026-02-21T09:51:07.4480855Z     %138 = tt.broadcast %137 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:51:07.4481036Z     %139 = arith.andi %134, %138 : tensor<128x128xi1, #mma>
2026-02-21T09:51:07.4481205Z     tt.store %130, %115, %139 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:07.4481351Z     tt.return
2026-02-21T09:51:07.4481439Z   }
2026-02-21T09:51:07.4481519Z }
2026-02-21T09:51:07.4481568Z 
2026-02-21T09:51:07.4481602Z {-#
2026-02-21T09:51:07.4481687Z   external_resources: {
2026-02-21T09:51:07.4481793Z     mlir_reproducer: {
2026-02-21T09:51:07.4482842Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:51:07.4483848Z       disable_threading: false,
2026-02-21T09:51:07.4483975Z       verify_each: true
2026-02-21T09:51:07.4484072Z     }
2026-02-21T09:51:07.4484146Z   }
2026-02-21T09:51:07.4484222Z #-}
2026-02-21T09:51:07.4484504Z /tmp/torchinductor_root/5w/c5wwxmx5vabbpkn6yll2cuxzqosrtw6kz5sxmqkbqh3xmwwb4n63.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:51:07.4485207Z /tmp/torchinductor_root/5w/c5wwxmx5vabbpkn6yll2cuxzqosrtw6kz5sxmqkbqh3xmwwb4n63.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:51:07.4485765Z [398s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:51:07.4486518Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:51:07.4489107Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:51:07.4489286Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:51:13.4609838Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 77/77 8.4 configs/s
2026-02-21T09:51:21.4486021Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 193/193 18.2 configs/s
2026-02-21T09:51:24.5592683Z [415s] Generation 4 complete: 
2026-02-21T09:51:24.5593013Z error=4
2026-02-21T09:51:24.5593190Z ok=77
2026-02-21T09:51:24.5593335Z min=1.1042
2026-02-21T09:51:24.5593487Z mid=1.5332
2026-02-21T09:51:24.5593635Z max=67.4286
2026-02-21T09:51:24.5593809Z best={'block_sizes': [8, 128, 256],
2026-02-21T09:51:24.5594077Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:51:24.5594343Z  'l2_groupings': [2],
2026-02-21T09:51:24.5594551Z  'load_eviction_policies': ['', ''],
2026-02-21T09:51:24.5594772Z  'loop_orders': [[0, 1]],
2026-02-21T09:51:24.5594977Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:51:24.5595164Z  'num_stages': 1,
2026-02-21T09:51:24.5595375Z  'num_warps': 4,
2026-02-21T09:51:24.5595543Z  'pid_type': 'flat',
2026-02-21T09:51:24.5595732Z  'range_flattens': [None, None],
2026-02-21T09:51:24.5595949Z  'range_multi_buffers': [None, False],
2026-02-21T09:51:24.5596172Z  'range_num_stages': [0, 3],
2026-02-21T09:51:24.5596373Z  'range_unroll_factors': [0, 0],
2026-02-21T09:51:24.5597112Z  'range_warp_specializes': [],
2026-02-21T09:51:24.5597310Z  'waves_per_eu': 2}
2026-02-21T09:51:24.5717952Z [415s] Fitting surrogate: 506 points, 506 targets
2026-02-21T09:51:25.4006461Z [416s] Generation 5 starting: 80 neighbors, 4 active search path(s)
2026-02-21T09:51:38.9101625Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 1.7 configs/s
2026-02-21T09:51:45.1039603Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:51:45.1049481Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:51:45.1051696Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:51:45.1052150Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:51:45.1052470Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:51:45.1052958Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:51:45.1053368Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:51:45.1053917Z #smem = #ttg.shared_memory
2026-02-21T09:51:45.1054147Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:51:45.1054623Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:51:45.1055017Z     %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma>
2026-02-21T09:51:45.1055199Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:45.1055465Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:45.1055654Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1055814Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:51:45.1056005Z     %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1056256Z     %cst_4 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1056623Z     %cst_5 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1056858Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1057031Z     %cst_7 = arith.constant dense<0> : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1057203Z     %cst_8 = arith.constant dense<512> : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1057375Z     %cst_9 = arith.constant dense<0> : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1057549Z     %cst_10 = arith.constant dense<8192> : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1057726Z     %cst_11 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:45.1057873Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:51:45.1057988Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:51:45.1058104Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:51:45.1058226Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T09:51:45.1058344Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:51:45.1058457Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:51:45.1058597Z     %cst_12 = arith.constant dense<0> : tensor<2x64xi8, #blocked2>
2026-02-21T09:51:45.1058769Z     %cst_13 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1058916Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:51:45.1059129Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:51:45.1059241Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:51:45.1059418Z     %cst_14 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1059604Z     %0 = tt.get_program_id x : i32
2026-02-21T09:51:45.1059716Z     %1 = arith.muli %0, %c4_i32 : i32
2026-02-21T09:51:45.1059824Z     %2 = arith.addi %1, %c4_i32 : i32
2026-02-21T09:51:45.1059941Z     %3 = arith.minsi %2, %c32768_i32 : i32
2026-02-21T09:51:45.1060140Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:45.1060414Z     %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:45.1060677Z     %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:45.1060941Z     %7 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:45.1061200Z     %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1061439Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1061637Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1061866Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1062201Z     %12 = arith.extsi %11 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1062561Z     %13 = arith.extsi %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:45.1062910Z     %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:51:45.1063325Z     %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:51:45.1063745Z     %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:45.1063993Z     %17 = arith.cmpi eq, %16, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:45.1064185Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:51:45.1064403Z     %19 = arith.cmpi eq, %16, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:45.1064587Z     %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:51:45.1064791Z     %21 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:45.1064974Z     %22 = arith.subi %3, %1 : i32
2026-02-21T09:51:45.1065090Z     %23 = arith.remsi %22, %c3_i32 : i32
2026-02-21T09:51:45.1065203Z     %24 = arith.subi %22, %23 : i32
2026-02-21T09:51:45.1065316Z     %25 = arith.addi %1, %24 : i32
2026-02-21T09:51:45.1065438Z     scf.for %arg3 = %1 to %25 step %c3_i32  : i32 {
2026-02-21T09:51:45.1065575Z       %26 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:51:45.1065697Z       %27 = arith.muli %26, %c2_i32 : i32
2026-02-21T09:51:45.1065814Z       %28 = arith.subi %c128_i32, %27 : i32
2026-02-21T09:51:45.1065933Z       %29 = arith.minsi %28, %c2_i32 : i32
2026-02-21T09:51:45.1066049Z       %30 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:51:45.1066173Z       %31 = arith.remsi %30, %29 : i32
2026-02-21T09:51:45.1066301Z       %32 = arith.addi %27, %31 : i32
2026-02-21T09:51:45.1066412Z       %33 = arith.divsi %30, %29 : i32
2026-02-21T09:51:45.1066522Z       %34 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:51:45.1066701Z       %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:45.1066904Z       %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:45.1067067Z       %37 = arith.muli %33, %c64_i32 : i32
2026-02-21T09:51:45.1067233Z       %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:45.1067436Z       %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:45.1067646Z       %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:45.1067873Z       %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:45.1068136Z       %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:45.1068386Z       %43 = arith.muli %42, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:45.1068574Z       %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1068746Z       %45 = arith.extsi %34 : i32 to i64
2026-02-21T09:51:45.1068929Z       %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:45.1069145Z       %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:45.1069414Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1069725Z       %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1069921Z       %50 = arith.cmpi sge, %48, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1070091Z       %51 = arith.cmpi slt, %48, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1070250Z       %52 = arith.andi %50, %51 : tensor<1x64xi1, #blocked2>
2026-02-21T09:51:45.1070453Z       %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1070662Z       %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:45.1070947Z       %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:45.1071210Z       %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1071414Z       %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1071609Z       %58 = tt.addptr %9, %57 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1071822Z       %59 = tt.load %58 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1072102Z       %60 = ttg.memdesc_index %54[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1072495Z       ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1072779Z       %61 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1073073Z       %62 = tt.expand_dims %61 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:45.1073337Z       %63 = tt.broadcast %62 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1073569Z       %64 = arith.addi %44, %63 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1073859Z       %65 = tt.addptr %9, %64 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1074141Z       %66 = tt.load %65 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1074525Z       %67 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1075002Z       ttg.local_store %66, %67 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1075813Z       %68:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %60, %arg8 = %67) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:51:45.1076465Z         %307 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:45.1076651Z         %308 = arith.muli %307, %c2_i32 : i32
2026-02-21T09:51:45.1076914Z         %309 = tt.splat %308 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1077259Z         %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1077687Z         %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:45.1078118Z         %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1078415Z         %313 = arith.addi %44, %312 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1078721Z         %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1079034Z         %315 = tt.load %314 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1079497Z         %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1080107Z         %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1080386Z         %318 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:45.1080633Z         %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1080913Z         %320 = arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1081191Z         %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1081458Z         %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1081648Z         %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1081843Z         %324 = arith.addi %323, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1082039Z         %325 = tt.addptr %10, %324 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1082281Z         %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1082544Z         %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1082798Z         %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:45.1082989Z         %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1083176Z         %330 = arith.andi %329, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1083347Z         %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1083601Z         %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1083896Z         %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1084139Z         %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1084380Z         %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1084676Z         %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1085036Z         %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1085317Z         %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1085556Z         %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1085823Z         %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1086150Z         %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1086383Z         %342 = tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1086602Z         %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1086896Z         %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1087361Z         %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1087710Z         %346 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:51:45.1087863Z         %347 = arith.cmpi slt, %346, %c2_i32 : i32
2026-02-21T09:51:45.1087997Z         %348 = arith.select %347, %346, %c0_i32 : i32
2026-02-21T09:51:45.1088263Z         %349 = ttg.memdesc_index %54[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1088610Z         ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1089002Z         scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1089352Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:45.1089661Z       %69 = ttg.local_load %68#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1090080Z       %70 = arith.extf %69 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1090420Z       %71 = arith.addi %12, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1090693Z       %72 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1090936Z       %73 = arith.muli %72, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1091121Z       %74 = tt.broadcast %73 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1091306Z       %75 = arith.addi %74, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1091499Z       %76 = tt.addptr %10, %75 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1091702Z       %77 = arith.cmpi sge, %72, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1091869Z       %78 = arith.cmpi slt, %72, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1092027Z       %79 = arith.andi %77, %78 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:45.1092209Z       %80 = tt.broadcast %79 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1092393Z       %81 = arith.andi %80, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1092554Z       %82 = tt.load %76, %81, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1092826Z       %83 = ttg.convert_layout %82 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1093097Z       %84 = arith.shli %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1093329Z       %85 = arith.shrsi %84, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1093556Z       %86 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1093835Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1094163Z       %88 = tt.expand_dims %86 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1094432Z       %89 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1094661Z       %90 = arith.select %18, %89, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1094887Z       %91 = tt.broadcast %88 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1095107Z       %92 = arith.select %20, %91, %90 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1095326Z       %93 = tt.reshape %92 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1095535Z       %94 = arith.sitofp %93 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1095833Z       %95 = ttg.convert_layout %94 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1096283Z       %96 = tt.dot %70, %95, %68#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1096767Z       %97 = ttg.local_load %68#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1097186Z       %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1097528Z       %99 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1097803Z       %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1098053Z       %101 = arith.muli %100, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1098259Z       %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1098453Z       %103 = arith.addi %102, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1098647Z       %104 = tt.addptr %10, %103 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1098858Z       %105 = arith.cmpi sge, %100, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1099034Z       %106 = arith.cmpi slt, %100, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1099201Z       %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:45.1099387Z       %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1099574Z       %109 = arith.andi %108, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1099744Z       %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1100004Z       %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1100281Z       %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1100520Z       %113 = arith.shrsi %112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1100777Z       %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1101065Z       %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1101398Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1101676Z       %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1101918Z       %118 = arith.select %18, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1102155Z       %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1102384Z       %120 = arith.select %20, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1102614Z       %121 = tt.reshape %120 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1102831Z       %122 = arith.sitofp %121 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1103127Z       %123 = ttg.convert_layout %122 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1103579Z       %124 = tt.dot %98, %123, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1103974Z       ttg.local_dealloc %54 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:45.1104185Z       %125 = arith.truncf %124 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:45.1104445Z       %126 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:45.1104679Z       %127 = arith.muli %126, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:51:45.1104905Z       %128 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:45.1105182Z       %129 = tt.broadcast %127 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1105382Z       %130 = tt.broadcast %128 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1105558Z       %131 = arith.addi %129, %130 : tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1105745Z       %132 = tt.addptr %21, %131 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1105935Z       tt.store %132, %125 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:45.1106097Z       %133 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:51:45.1106219Z       %134 = arith.divsi %133, %c512_i32 : i32
2026-02-21T09:51:45.1106341Z       %135 = arith.muli %134, %c2_i32 : i32
2026-02-21T09:51:45.1106461Z       %136 = arith.subi %c128_i32, %135 : i32
2026-02-21T09:51:45.1106580Z       %137 = arith.minsi %136, %c2_i32 : i32
2026-02-21T09:51:45.1106700Z       %138 = arith.remsi %133, %c512_i32 : i32
2026-02-21T09:51:45.1106822Z       %139 = arith.remsi %138, %137 : i32
2026-02-21T09:51:45.1106940Z       %140 = arith.addi %135, %139 : i32
2026-02-21T09:51:45.1107052Z       %141 = arith.divsi %138, %137 : i32
2026-02-21T09:51:45.1107168Z       %142 = arith.muli %140, %c64_i32 : i32
2026-02-21T09:51:45.1107327Z       %143 = tt.splat %142 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:45.1107537Z       %144 = arith.addi %143, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:45.1107707Z       %145 = arith.muli %141, %c64_i32 : i32
2026-02-21T09:51:45.1107875Z       %146 = tt.splat %145 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:45.1108168Z       %147 = tt.splat %145 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:45.1108420Z       %148 = arith.addi %146, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:45.1108757Z       %149 = arith.addi %147, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:45.1109049Z       %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:45.1109331Z       %151 = arith.muli %150, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:45.1109582Z       %152 = tt.broadcast %151 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1109776Z       %153 = arith.extsi %142 : i32 to i64
2026-02-21T09:51:45.1117994Z       %154 = tt.splat %153 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:45.1118222Z       %155 = arith.addi %154, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:45.1118506Z       %156 = tt.expand_dims %155 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1118791Z       %157 = tt.broadcast %156 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1118999Z       %158 = arith.cmpi sge, %156, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1119179Z       %159 = arith.cmpi slt, %156, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1119351Z       %160 = arith.andi %158, %159 : tensor<1x64xi1, #blocked2>
2026-02-21T09:51:45.1119580Z       %161 = tt.broadcast %160 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1119799Z       %162 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:45.1119985Z       %163 = arith.addi %152, %56 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1120184Z       %164 = tt.addptr %9, %163 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1120390Z       %165 = tt.load %164 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1120675Z       %166 = ttg.memdesc_index %162[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1121052Z       ttg.local_store %165, %166 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1121288Z       %167 = arith.addi %152, %63 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1121486Z       %168 = tt.addptr %9, %167 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1121690Z       %169 = tt.load %168 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1121982Z       %170 = ttg.memdesc_index %162[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1122337Z       ttg.local_store %169, %170 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1122920Z       %171:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %166, %arg8 = %170) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:51:45.1123342Z         %307 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:45.1123471Z         %308 = arith.muli %307, %c2_i32 : i32
2026-02-21T09:51:45.1123644Z         %309 = tt.splat %308 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1123872Z         %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1124148Z         %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:45.1124429Z         %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1124649Z         %313 = arith.addi %152, %312 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1124846Z         %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1125053Z         %315 = tt.load %314 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1125347Z         %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1125781Z         %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1126062Z         %318 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:45.1126234Z         %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1126456Z         %320 = arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1126731Z         %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1126982Z         %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1127175Z         %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1127367Z         %324 = arith.addi %323, %157 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1127583Z         %325 = tt.addptr %10, %324 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1127791Z         %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1127965Z         %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1128132Z         %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:45.1128319Z         %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1128513Z         %330 = arith.andi %329, %161 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1128681Z         %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1128965Z         %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1129242Z         %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1129480Z         %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1129736Z         %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1130025Z         %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1130360Z         %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1130641Z         %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1130881Z         %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1131118Z         %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1131353Z         %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1131583Z         %342 = tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1131809Z         %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1132104Z         %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1132591Z         %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1132936Z         %346 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:51:45.1133067Z         %347 = arith.cmpi slt, %346, %c2_i32 : i32
2026-02-21T09:51:45.1133202Z         %348 = arith.select %347, %346, %c0_i32 : i32
2026-02-21T09:51:45.1133465Z         %349 = ttg.memdesc_index %162[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1133817Z         ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1134202Z         scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1134538Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:45.1134853Z       %172 = ttg.local_load %171#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1135278Z       %173 = arith.extf %172 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1135591Z       %174 = arith.addi %74, %157 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1135792Z       %175 = tt.addptr %10, %174 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1135992Z       %176 = arith.andi %80, %161 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1136164Z       %177 = tt.load %175, %176, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1136420Z       %178 = ttg.convert_layout %177 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1136705Z       %179 = arith.shli %178, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1136960Z       %180 = arith.shrsi %179, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1137196Z       %181 = arith.shrsi %178, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1137487Z       %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1137840Z       %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1138121Z       %184 = tt.broadcast %182 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1138361Z       %185 = arith.select %18, %184, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1138598Z       %186 = tt.broadcast %183 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1138833Z       %187 = arith.select %20, %186, %185 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1139058Z       %188 = tt.reshape %187 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1139280Z       %189 = arith.sitofp %188 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1139575Z       %190 = ttg.convert_layout %189 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1140041Z       %191 = tt.dot %173, %190, %171#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1140554Z       %192 = ttg.local_load %171#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1140977Z       %193 = arith.extf %192 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1141275Z       %194 = arith.addi %102, %157 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1141476Z       %195 = tt.addptr %10, %194 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1141674Z       %196 = arith.andi %108, %161 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1141846Z       %197 = tt.load %195, %196, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1142102Z       %198 = ttg.convert_layout %197 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1142384Z       %199 = arith.shli %198, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1142621Z       %200 = arith.shrsi %199, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1142855Z       %201 = arith.shrsi %198, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1143144Z       %202 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1143479Z       %203 = tt.expand_dims %201 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1143774Z       %204 = tt.broadcast %202 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1144011Z       %205 = arith.select %18, %204, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1144245Z       %206 = tt.broadcast %203 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1144478Z       %207 = arith.select %20, %206, %205 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1144708Z       %208 = tt.reshape %207 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1144943Z       %209 = arith.sitofp %208 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1145235Z       %210 = ttg.convert_layout %209 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1145712Z       %211 = tt.dot %193, %210, %191, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1146097Z       ttg.local_dealloc %162 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:45.1146312Z       %212 = arith.truncf %211 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:45.1146574Z       %213 = tt.expand_dims %149 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:45.1146813Z       %214 = arith.muli %213, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:51:45.1147040Z       %215 = tt.expand_dims %144 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:45.1147296Z       %216 = tt.broadcast %214 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1147498Z       %217 = tt.broadcast %215 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1147675Z       %218 = arith.addi %216, %217 : tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1147862Z       %219 = tt.addptr %21, %218 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1148054Z       tt.store %219, %212 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:45.1148197Z       %220 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:51:45.1148321Z       %221 = arith.divsi %220, %c512_i32 : i32
2026-02-21T09:51:45.1148457Z       %222 = arith.muli %221, %c2_i32 : i32
2026-02-21T09:51:45.1148579Z       %223 = arith.subi %c128_i32, %222 : i32
2026-02-21T09:51:45.1148698Z       %224 = arith.minsi %223, %c2_i32 : i32
2026-02-21T09:51:45.1148817Z       %225 = arith.remsi %220, %c512_i32 : i32
2026-02-21T09:51:45.1148934Z       %226 = arith.remsi %225, %224 : i32
2026-02-21T09:51:45.1149052Z       %227 = arith.addi %222, %226 : i32
2026-02-21T09:51:45.1149165Z       %228 = arith.divsi %225, %224 : i32
2026-02-21T09:51:45.1149283Z       %229 = arith.muli %227, %c64_i32 : i32
2026-02-21T09:51:45.1149443Z       %230 = tt.splat %229 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:45.1149654Z       %231 = arith.addi %230, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:45.1149821Z       %232 = arith.muli %228, %c64_i32 : i32
2026-02-21T09:51:45.1149987Z       %233 = tt.splat %232 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:45.1150204Z       %234 = tt.splat %232 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:45.1150418Z       %235 = arith.addi %233, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:45.1150633Z       %236 = arith.addi %234, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:45.1150903Z       %237 = tt.expand_dims %235 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:45.1151172Z       %238 = arith.muli %237, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:45.1151371Z       %239 = tt.broadcast %238 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1151544Z       %240 = arith.extsi %229 : i32 to i64
2026-02-21T09:51:45.1151716Z       %241 = tt.splat %240 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:45.1151941Z       %242 = arith.addi %241, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:45.1152218Z       %243 = tt.expand_dims %242 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1152518Z       %244 = tt.broadcast %243 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1152720Z       %245 = arith.cmpi sge, %243, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1152897Z       %246 = arith.cmpi slt, %243, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1153066Z       %247 = arith.andi %245, %246 : tensor<1x64xi1, #blocked2>
2026-02-21T09:51:45.1153269Z       %248 = tt.broadcast %247 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1153485Z       %249 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:45.1153668Z       %250 = arith.addi %239, %56 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1153867Z       %251 = tt.addptr %9, %250 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1154070Z       %252 = tt.load %251 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1154353Z       %253 = ttg.memdesc_index %249[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1154711Z       ttg.local_store %252, %253 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1154949Z       %254 = arith.addi %239, %63 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1155149Z       %255 = tt.addptr %9, %254 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1155349Z       %256 = tt.load %255 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1155628Z       %257 = ttg.memdesc_index %249[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1156001Z       ttg.local_store %256, %257 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1156513Z       %258:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %253, %arg8 = %257) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:51:45.1156933Z         %307 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:45.1157060Z         %308 = arith.muli %307, %c2_i32 : i32
2026-02-21T09:51:45.1157231Z         %309 = tt.splat %308 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1157459Z         %310 = arith.addi %309, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1157734Z         %311 = tt.expand_dims %310 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:45.1158017Z         %312 = tt.broadcast %311 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1158215Z         %313 = arith.addi %239, %312 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1158414Z         %314 = tt.addptr %9, %313 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1158619Z         %315 = tt.load %314 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1158927Z         %316 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1159359Z         %317 = arith.extf %316 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1159641Z         %318 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:45.1159814Z         %319 = tt.splat %318 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1160037Z         %320 = arith.addi %319, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1160330Z         %321 = tt.expand_dims %320 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1160578Z         %322 = arith.muli %321, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1160768Z         %323 = tt.broadcast %322 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1160961Z         %324 = arith.addi %323, %244 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1161173Z         %325 = tt.addptr %10, %324 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1161378Z         %326 = arith.cmpi sge, %321, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1161548Z         %327 = arith.cmpi slt, %321, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1161710Z         %328 = arith.andi %326, %327 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:45.1161895Z         %329 = tt.broadcast %328 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1162083Z         %330 = arith.andi %329, %248 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1162250Z         %331 = tt.load %325, %330, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1162504Z         %332 = ttg.convert_layout %331 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1162835Z         %333 = arith.shli %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1163072Z         %334 = arith.shrsi %333, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1163305Z         %335 = arith.shrsi %332, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1163615Z         %336 = tt.expand_dims %334 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1163946Z         %337 = tt.expand_dims %335 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1164226Z         %338 = tt.broadcast %336 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1164460Z         %339 = arith.select %18, %338, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1164694Z         %340 = tt.broadcast %337 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1164924Z         %341 = arith.select %20, %340, %339 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1165152Z         %342 = tt.reshape %341 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1165371Z         %343 = arith.sitofp %342 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1165665Z         %344 = ttg.convert_layout %343 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1166127Z         %345 = tt.dot %317, %344, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1166469Z         %346 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:51:45.1166615Z         %347 = arith.cmpi slt, %346, %c2_i32 : i32
2026-02-21T09:51:45.1166746Z         %348 = arith.select %347, %346, %c0_i32 : i32
2026-02-21T09:51:45.1167012Z         %349 = ttg.memdesc_index %249[%348] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1167361Z         ttg.local_store %315, %349 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1167745Z         scf.yield %345, %348, %arg8, %349 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1168097Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:45.1168405Z       %259 = ttg.local_load %258#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1168829Z       %260 = arith.extf %259 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1169140Z       %261 = arith.addi %74, %244 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1169336Z       %262 = tt.addptr %10, %261 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1169533Z       %263 = arith.andi %80, %248 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1169698Z       %264 = tt.load %262, %263, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1169951Z       %265 = ttg.convert_layout %264 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1170230Z       %266 = arith.shli %265, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1170461Z       %267 = arith.shrsi %266, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1170695Z       %268 = arith.shrsi %265, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1170980Z       %269 = tt.expand_dims %267 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1171309Z       %270 = tt.expand_dims %268 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1171603Z       %271 = tt.broadcast %269 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1171835Z       %272 = arith.select %18, %271, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1172073Z       %273 = tt.broadcast %270 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1172302Z       %274 = arith.select %20, %273, %272 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1172531Z       %275 = tt.reshape %274 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1172749Z       %276 = arith.sitofp %275 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1173038Z       %277 = ttg.convert_layout %276 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1173497Z       %278 = tt.dot %260, %277, %258#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1173985Z       %279 = ttg.local_load %258#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1174406Z       %280 = arith.extf %279 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1174712Z       %281 = arith.addi %102, %244 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1174906Z       %282 = tt.addptr %10, %281 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1175106Z       %283 = arith.andi %108, %248 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1175271Z       %284 = tt.load %282, %283, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1175524Z       %285 = ttg.convert_layout %284 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1175801Z       %286 = arith.shli %285, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1176055Z       %287 = arith.shrsi %286, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1176289Z       %288 = arith.shrsi %285, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1176573Z       %289 = tt.expand_dims %287 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1176923Z       %290 = tt.expand_dims %288 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1177199Z       %291 = tt.broadcast %289 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1177431Z       %292 = arith.select %18, %291, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1177664Z       %293 = tt.broadcast %290 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1177891Z       %294 = arith.select %20, %293, %292 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1178115Z       %295 = tt.reshape %294 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1178331Z       %296 = arith.sitofp %295 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1178620Z       %297 = ttg.convert_layout %296 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1179075Z       %298 = tt.dot %280, %297, %278, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1179453Z       ttg.local_dealloc %249 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:45.1179682Z       %299 = arith.truncf %298 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:45.1179943Z       %300 = tt.expand_dims %236 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:45.1180176Z       %301 = arith.muli %300, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:51:45.1180398Z       %302 = tt.expand_dims %231 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:45.1180653Z       %303 = tt.broadcast %301 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1180849Z       %304 = tt.broadcast %302 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1181024Z       %305 = arith.addi %303, %304 : tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1181205Z       %306 = tt.addptr %21, %305 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1181398Z       tt.store %306, %299 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:45.1181537Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:51:45.1181654Z     scf.for %arg3 = %25 to %3 step %c1_i32  : i32 {
2026-02-21T09:51:45.1181788Z       %26 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:51:45.1181906Z       %27 = arith.muli %26, %c2_i32 : i32
2026-02-21T09:51:45.1182022Z       %28 = arith.subi %c128_i32, %27 : i32
2026-02-21T09:51:45.1182135Z       %29 = arith.minsi %28, %c2_i32 : i32
2026-02-21T09:51:45.1182265Z       %30 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:51:45.1182382Z       %31 = arith.remsi %30, %29 : i32
2026-02-21T09:51:45.1182491Z       %32 = arith.addi %27, %31 : i32
2026-02-21T09:51:45.1182600Z       %33 = arith.divsi %30, %29 : i32
2026-02-21T09:51:45.1182709Z       %34 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:51:45.1182863Z       %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:45.1183063Z       %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:45.1183224Z       %37 = arith.muli %33, %c64_i32 : i32
2026-02-21T09:51:45.1183383Z       %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:45.1183610Z       %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:45.1183816Z       %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:45.1184018Z       %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:45.1184278Z       %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:45.1184532Z       %43 = arith.muli %42, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:45.1184720Z       %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1184888Z       %45 = arith.extsi %34 : i32 to i64
2026-02-21T09:51:45.1185047Z       %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:45.1185260Z       %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:45.1185530Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1185798Z       %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1185992Z       %50 = arith.cmpi sge, %48, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1186158Z       %51 = arith.cmpi slt, %48, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:45.1186317Z       %52 = arith.andi %50, %51 : tensor<1x64xi1, #blocked2>
2026-02-21T09:51:45.1186492Z       %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1186700Z       %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:45.1186983Z       %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:45.1187248Z       %56 = tt.broadcast %55 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1187431Z       %57 = arith.addi %44, %56 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1187619Z       %58 = tt.addptr %9, %57 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1187816Z       %59 = tt.load %58 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1188092Z       %60 = ttg.memdesc_index %54[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1188440Z       ttg.local_store %59, %60 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1188707Z       %61 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1188976Z       %62 = tt.expand_dims %61 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:45.1189243Z       %63 = tt.broadcast %62 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1189425Z       %64 = arith.addi %44, %63 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1189613Z       %65 = tt.addptr %9, %64 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1189823Z       %66 = tt.load %65 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1190096Z       %67 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1190439Z       ttg.local_store %66, %67 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1190946Z       %68:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %60, %arg8 = %67) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:51:45.1191375Z         %133 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:45.1191499Z         %134 = arith.muli %133, %c2_i32 : i32
2026-02-21T09:51:45.1191669Z         %135 = tt.splat %134 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1191893Z         %136 = arith.addi %135, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:45.1192181Z         %137 = tt.expand_dims %136 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:45.1192454Z         %138 = tt.broadcast %137 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1192645Z         %139 = arith.addi %44, %138 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1192840Z         %140 = tt.addptr %9, %139 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:45.1193044Z         %141 = tt.load %140 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:45.1193338Z         %142 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1193767Z         %143 = arith.extf %142 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1194047Z         %144 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:45.1194217Z         %145 = tt.splat %144 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1194437Z         %146 = arith.addi %145, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1194730Z         %147 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1194976Z         %148 = arith.muli %147, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1195164Z         %149 = tt.broadcast %148 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1195352Z         %150 = arith.addi %149, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1195548Z         %151 = tt.addptr %10, %150 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1195755Z         %152 = arith.cmpi sge, %147, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1195926Z         %153 = arith.cmpi slt, %147, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1196087Z         %154 = arith.andi %152, %153 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:45.1196271Z         %155 = tt.broadcast %154 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1196457Z         %156 = arith.andi %155, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1196622Z         %157 = tt.load %151, %156, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1196881Z         %158 = ttg.convert_layout %157 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1197161Z         %159 = arith.shli %158, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1197409Z         %160 = arith.shrsi %159, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1197645Z         %161 = arith.shrsi %158, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1197932Z         %162 = tt.expand_dims %160 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1198278Z         %163 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1198556Z         %164 = tt.broadcast %162 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1198813Z         %165 = arith.select %18, %164, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1199046Z         %166 = tt.broadcast %163 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1199275Z         %167 = arith.select %20, %166, %165 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1199500Z         %168 = tt.reshape %167 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1199734Z         %169 = arith.sitofp %168 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1200025Z         %170 = ttg.convert_layout %169 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1200486Z         %171 = tt.dot %143, %170, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1200831Z         %172 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:51:45.1200957Z         %173 = arith.cmpi slt, %172, %c2_i32 : i32
2026-02-21T09:51:45.1201086Z         %174 = arith.select %173, %172, %c0_i32 : i32
2026-02-21T09:51:45.1201347Z         %175 = ttg.memdesc_index %54[%174] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1201694Z         ttg.local_store %141, %175 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1202077Z         scf.yield %171, %174, %arg8, %175 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:45.1202425Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:45.1202781Z       %69 = ttg.local_load %68#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1203205Z       %70 = arith.extf %69 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1203529Z       %71 = arith.addi %12, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1203804Z       %72 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1204043Z       %73 = arith.muli %72, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1204228Z       %74 = tt.broadcast %73 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1204414Z       %75 = arith.addi %74, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1204599Z       %76 = tt.addptr %10, %75 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1204800Z       %77 = arith.cmpi sge, %72, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1204966Z       %78 = arith.cmpi slt, %72, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1205121Z       %79 = arith.andi %77, %78 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:45.1205324Z       %80 = tt.broadcast %79 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1205504Z       %81 = arith.andi %80, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1205663Z       %82 = tt.load %76, %81, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1205909Z       %83 = ttg.convert_layout %82 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1206180Z       %84 = arith.shli %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1206406Z       %85 = arith.shrsi %84, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1206651Z       %86 = arith.shrsi %83, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1206928Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1207251Z       %88 = tt.expand_dims %86 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1207536Z       %89 = tt.broadcast %87 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1207767Z       %90 = arith.select %18, %89, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1207992Z       %91 = tt.broadcast %88 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1208213Z       %92 = arith.select %20, %91, %90 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1208431Z       %93 = tt.reshape %92 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1208642Z       %94 = arith.sitofp %93 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1208926Z       %95 = ttg.convert_layout %94 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1209375Z       %96 = tt.dot %70, %95, %68#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1209855Z       %97 = ttg.local_load %68#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1210290Z       %98 = arith.extf %97 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1210615Z       %99 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:45.1210894Z       %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1211142Z       %101 = arith.muli %100, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1211333Z       %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1211526Z       %103 = arith.addi %102, %49 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1211723Z       %104 = tt.addptr %10, %103 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:45.1211933Z       %105 = arith.cmpi sge, %100, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1212102Z       %106 = arith.cmpi slt, %100, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:45.1212268Z       %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:45.1212453Z       %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1212638Z       %109 = arith.andi %108, %53 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:45.1212805Z       %110 = tt.load %104, %109, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:45.1213079Z       %111 = ttg.convert_layout %110 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1213360Z       %112 = arith.shli %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1213596Z       %113 = arith.shrsi %112, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1213832Z       %114 = arith.shrsi %111, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:45.1214119Z       %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1214468Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:45.1214747Z       %117 = tt.broadcast %115 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1214985Z       %118 = arith.select %18, %117, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1215221Z       %119 = tt.broadcast %116 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1215468Z       %120 = arith.select %20, %119, %118 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:45.1215692Z       %121 = tt.reshape %120 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:45.1215912Z       %122 = arith.sitofp %121 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:45.1216207Z       %123 = ttg.convert_layout %122 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:45.1216664Z       %124 = tt.dot %98, %123, %96, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:45.1217043Z       ttg.local_dealloc %54 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:45.1217256Z       %125 = arith.truncf %124 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:45.1217517Z       %126 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:45.1217754Z       %127 = arith.muli %126, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:51:45.1217998Z       %128 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:45.1218251Z       %129 = tt.broadcast %127 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1218449Z       %130 = tt.broadcast %128 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1218629Z       %131 = arith.addi %129, %130 : tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1218820Z       %132 = tt.addptr %21, %131 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:45.1219014Z       tt.store %132, %125 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:45.1219158Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:51:45.1219263Z     tt.return
2026-02-21T09:51:45.1219350Z   }
2026-02-21T09:51:45.1219431Z }
2026-02-21T09:51:45.1219481Z 
2026-02-21T09:51:45.1219514Z {-#
2026-02-21T09:51:45.1219601Z   external_resources: {
2026-02-21T09:51:45.1219704Z     mlir_reproducer: {
2026-02-21T09:51:45.1220718Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:51:45.1221732Z       disable_threading: false,
2026-02-21T09:51:45.1221841Z       verify_each: true
2026-02-21T09:51:45.1221939Z     }
2026-02-21T09:51:45.1222013Z   }
2026-02-21T09:51:45.1222092Z #-}
2026-02-21T09:51:45.1222374Z /tmp/torchinductor_root/lk/clkaq5pwstq5v3fgr6xdujxp46fgalgtiguqham6cdpakbbljfvv.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:51:45.1223063Z /tmp/torchinductor_root/lk/clkaq5pwstq5v3fgr6xdujxp46fgalgtiguqham6cdpakbbljfvv.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:51:45.1223649Z [435s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:51:45.1224443Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:51:45.1225144Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:51:45.1225320Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:51:46.2056911Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:51:46.2062048Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:51:46.2063289Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:51:46.2064153Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:51:46.2065115Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:51:46.2065892Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:46.2066974Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:46.2067364Z #smem = #ttg.shared_memory
2026-02-21T09:51:46.2067814Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:51:46.2068798Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:51:46.2069731Z     %cst = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma>
2026-02-21T09:51:46.2070214Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:51:46.2070523Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:51:46.2070786Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:51:46.2071132Z     %cst_0 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2071513Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:51:46.2071755Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:51:46.2071986Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T09:51:46.2072215Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:51:46.2072491Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:51:46.2072842Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:51:46.2073297Z     %cst_1 = arith.constant dense<0> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2073868Z     %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2074867Z     %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2075675Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2076061Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2076571Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2076961Z     %cst_7 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.2077630Z     %cst_8 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2078081Z     %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:46.2078328Z     %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:46.2078592Z     %cst_11 = arith.constant dense<8192> : tensor<64x1xi32, #mma>
2026-02-21T09:51:46.2078878Z     %0 = tt.get_program_id x : i32
2026-02-21T09:51:46.2079185Z     %1 = arith.muli %0, %c4_i32 : i32
2026-02-21T09:51:46.2079495Z     %2 = arith.addi %1, %c4_i32 : i32
2026-02-21T09:51:46.2079705Z     %3 = arith.minsi %2, %c32768_i32 : i32
2026-02-21T09:51:46.2080000Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.2080416Z     %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.2080959Z     %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2081389Z     %7 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.2081990Z     %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.2082537Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.2082982Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2083498Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2084322Z     %12 = arith.extsi %11 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2085032Z     %13 = arith.extsi %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2085676Z     %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:51:46.2086241Z     %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:51:46.2086688Z     %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:46.2086978Z     %17 = arith.cmpi eq, %16, %cst_9 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:46.2087227Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:51:46.2087506Z     %19 = arith.cmpi eq, %16, %cst_10 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:46.2087853Z     %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:51:46.2088310Z     %21 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:46.2088761Z     %22 = arith.subi %3, %1 : i32
2026-02-21T09:51:46.2088959Z     %23 = arith.remsi %22, %c3_i32 : i32
2026-02-21T09:51:46.2089270Z     %24 = arith.subi %22, %23 : i32
2026-02-21T09:51:46.2089623Z     %25 = arith.addi %1, %24 : i32
2026-02-21T09:51:46.2090033Z     scf.for %arg3 = %1 to %25 step %c3_i32  : i32 {
2026-02-21T09:51:46.2090308Z       %26 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:51:46.2090500Z       %27 = arith.muli %26, %c2_i32 : i32
2026-02-21T09:51:46.2090727Z       %28 = arith.subi %c128_i32, %27 : i32
2026-02-21T09:51:46.2090950Z       %29 = arith.minsi %28, %c2_i32 : i32
2026-02-21T09:51:46.2091211Z       %30 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:51:46.2091372Z       %31 = arith.remsi %30, %29 : i32
2026-02-21T09:51:46.2091637Z       %32 = arith.addi %27, %31 : i32
2026-02-21T09:51:46.2091882Z       %33 = arith.divsi %30, %29 : i32
2026-02-21T09:51:46.2092117Z       %34 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:51:46.2092304Z       %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.2092535Z       %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.2092795Z       %37 = arith.muli %33, %c64_i32 : i32
2026-02-21T09:51:46.2093104Z       %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.2093446Z       %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.2093692Z       %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.2093986Z       %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.2094368Z       %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.2094689Z       %43 = arith.muli %42, %cst_7 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.2094923Z       %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2095167Z       %45 = arith.extsi %34 : i32 to i64
2026-02-21T09:51:46.2095494Z       %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2095941Z       %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2096426Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2096956Z       %49 = tt.broadcast %48 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2097432Z       %50 = arith.cmpi sge, %48, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2097773Z       %51 = arith.cmpi slt, %48, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2098123Z       %52 = arith.andi %50, %51 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2098470Z       %53 = tt.broadcast %52 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2098853Z       %54 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<64x64xf32, #mma>)  : i32 {
2026-02-21T09:51:46.2099103Z         %139 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:51:46.2099401Z         %140 = tt.splat %139 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.2099753Z         %141 = arith.addi %140, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.2100065Z         %142 = tt.expand_dims %141 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.2100470Z         %143 = tt.broadcast %142 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2100706Z         %144 = arith.addi %44, %143 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2100909Z         %145 = tt.addptr %9, %144 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2101146Z         %146 = tt.load %145 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.2101421Z         %147 = ttg.local_alloc %146 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T09:51:46.2101862Z         %148 = ttg.local_load %147 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2102411Z         %149 = arith.extf %148 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2102696Z         %150 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:46.2103008Z         %151 = tt.splat %150 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2103323Z         %152 = arith.addi %151, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2103710Z         %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2104068Z         %154 = arith.muli %153, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2104377Z         %155 = tt.broadcast %154 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2104678Z         %156 = arith.addi %155, %49 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2104993Z         %157 = tt.addptr %10, %156 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2105311Z         %158 = arith.cmpi sge, %153, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2105553Z         %159 = arith.cmpi slt, %153, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2105804Z         %160 = arith.andi %158, %159 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2106102Z         %161 = tt.broadcast %160 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2106400Z         %162 = arith.andi %161, %53 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2106643Z         %163 = tt.load %157, %162, %cst_1 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2106890Z         %164 = arith.shli %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2107166Z         %165 = arith.shrsi %164, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2107408Z         %166 = arith.shrsi %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2107709Z         %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.2108048Z         %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.2108342Z         %169 = tt.broadcast %167 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2108595Z         %170 = arith.select %18, %169, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2108845Z         %171 = tt.broadcast %168 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2109076Z         %172 = arith.select %20, %171, %170 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2109303Z         %173 = tt.reshape %172 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:51:46.2109522Z         %174 = arith.sitofp %173 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:51:46.2109774Z         %175 = ttg.local_alloc %174 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:51:46.2110098Z         %176 = ttg.local_load %175 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2110598Z         %177 = tt.dot %149, %176, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.2110952Z         scf.yield %177 : tensor<64x64xf32, #mma>
2026-02-21T09:51:46.2111115Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:46.2111335Z       %55 = arith.truncf %54 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:46.2111590Z       %56 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:46.2111822Z       %57 = arith.muli %56, %cst_11 : tensor<64x1xi32, #mma>
2026-02-21T09:51:46.2112047Z       %58 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:46.2112294Z       %59 = tt.broadcast %57 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2112498Z       %60 = tt.broadcast %58 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2112682Z       %61 = arith.addi %59, %60 : tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2112861Z       %62 = tt.addptr %21, %61 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2113051Z       tt.store %62, %55 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:46.2113186Z       %63 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:51:46.2113307Z       %64 = arith.divsi %63, %c512_i32 : i32
2026-02-21T09:51:46.2113422Z       %65 = arith.muli %64, %c2_i32 : i32
2026-02-21T09:51:46.2113539Z       %66 = arith.subi %c128_i32, %65 : i32
2026-02-21T09:51:46.2113670Z       %67 = arith.minsi %66, %c2_i32 : i32
2026-02-21T09:51:46.2113785Z       %68 = arith.remsi %63, %c512_i32 : i32
2026-02-21T09:51:46.2113903Z       %69 = arith.remsi %68, %67 : i32
2026-02-21T09:51:46.2114015Z       %70 = arith.addi %65, %69 : i32
2026-02-21T09:51:46.2114141Z       %71 = arith.divsi %68, %67 : i32
2026-02-21T09:51:46.2114251Z       %72 = arith.muli %70, %c64_i32 : i32
2026-02-21T09:51:46.2114464Z       %73 = tt.splat %72 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.2114741Z       %74 = arith.addi %73, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.2114904Z       %75 = arith.muli %71, %c64_i32 : i32
2026-02-21T09:51:46.2115081Z       %76 = tt.splat %75 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.2115309Z       %77 = tt.splat %75 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.2115524Z       %78 = arith.addi %76, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.2115731Z       %79 = arith.addi %77, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.2115999Z       %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.2116250Z       %81 = arith.muli %80, %cst_7 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.2116450Z       %82 = tt.broadcast %81 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2116622Z       %83 = arith.extsi %72 : i32 to i64
2026-02-21T09:51:46.2116822Z       %84 = tt.splat %83 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2117131Z       %85 = arith.addi %84, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2117515Z       %86 = tt.expand_dims %85 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2117954Z       %87 = tt.broadcast %86 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2118262Z       %88 = arith.cmpi sge, %86, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2118501Z       %89 = arith.cmpi slt, %86, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2118745Z       %90 = arith.andi %88, %89 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2119164Z       %91 = tt.broadcast %90 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2119493Z       %92 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<64x64xf32, #mma>)  : i32 {
2026-02-21T09:51:46.2119704Z         %139 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:51:46.2119876Z         %140 = tt.splat %139 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.2120102Z         %141 = arith.addi %140, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.2120380Z         %142 = tt.expand_dims %141 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.2120666Z         %143 = tt.broadcast %142 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2120886Z         %144 = arith.addi %82, %143 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2121093Z         %145 = tt.addptr %9, %144 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2121472Z         %146 = tt.load %145 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.2121713Z         %147 = ttg.local_alloc %146 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T09:51:46.2122039Z         %148 = ttg.local_load %147 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2122448Z         %149 = arith.extf %148 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2122990Z         %150 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:46.2123271Z         %151 = tt.splat %150 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2123571Z         %152 = arith.addi %151, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2123976Z         %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2124330Z         %154 = arith.muli %153, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2124640Z         %155 = tt.broadcast %154 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2124939Z         %156 = arith.addi %155, %87 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2125276Z         %157 = tt.addptr %10, %156 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2125591Z         %158 = arith.cmpi sge, %153, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2125836Z         %159 = arith.cmpi slt, %153, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2126082Z         %160 = arith.andi %158, %159 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2126397Z         %161 = tt.broadcast %160 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2126733Z         %162 = arith.andi %161, %91 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2127076Z         %163 = tt.load %157, %162, %cst_1 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2127321Z         %164 = arith.shli %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2127578Z         %165 = arith.shrsi %164, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2127810Z         %166 = arith.shrsi %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2128097Z         %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.2128596Z         %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.2128905Z         %169 = tt.broadcast %167 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2129143Z         %170 = arith.select %18, %169, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2129376Z         %171 = tt.broadcast %168 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2129608Z         %172 = arith.select %20, %171, %170 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2129839Z         %173 = tt.reshape %172 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:51:46.2130057Z         %174 = arith.sitofp %173 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:51:46.2130330Z         %175 = ttg.local_alloc %174 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:51:46.2130649Z         %176 = ttg.local_load %175 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2131119Z         %177 = tt.dot %149, %176, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.2131469Z         scf.yield %177 : tensor<64x64xf32, #mma>
2026-02-21T09:51:46.2131651Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:46.2131855Z       %93 = arith.truncf %92 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:46.2132110Z       %94 = tt.expand_dims %79 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:46.2132344Z       %95 = arith.muli %94, %cst_11 : tensor<64x1xi32, #mma>
2026-02-21T09:51:46.2132625Z       %96 = tt.expand_dims %74 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:46.2132871Z       %97 = tt.broadcast %95 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2133092Z       %98 = tt.broadcast %96 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2133262Z       %99 = arith.addi %97, %98 : tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2133457Z       %100 = tt.addptr %21, %99 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2133649Z       tt.store %100, %93 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:46.2133786Z       %101 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:51:46.2133911Z       %102 = arith.divsi %101, %c512_i32 : i32
2026-02-21T09:51:46.2134031Z       %103 = arith.muli %102, %c2_i32 : i32
2026-02-21T09:51:46.2134172Z       %104 = arith.subi %c128_i32, %103 : i32
2026-02-21T09:51:46.2134290Z       %105 = arith.minsi %104, %c2_i32 : i32
2026-02-21T09:51:46.2134421Z       %106 = arith.remsi %101, %c512_i32 : i32
2026-02-21T09:51:46.2134567Z       %107 = arith.remsi %106, %105 : i32
2026-02-21T09:51:46.2134699Z       %108 = arith.addi %103, %107 : i32
2026-02-21T09:51:46.2134816Z       %109 = arith.divsi %106, %105 : i32
2026-02-21T09:51:46.2134931Z       %110 = arith.muli %108, %c64_i32 : i32
2026-02-21T09:51:46.2135092Z       %111 = tt.splat %110 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.2135305Z       %112 = arith.addi %111, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.2135473Z       %113 = arith.muli %109, %c64_i32 : i32
2026-02-21T09:51:46.2135652Z       %114 = tt.splat %113 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.2135867Z       %115 = tt.splat %113 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.2136082Z       %116 = arith.addi %114, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.2136292Z       %117 = arith.addi %115, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.2136564Z       %118 = tt.expand_dims %116 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.2136815Z       %119 = arith.muli %118, %cst_7 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.2137014Z       %120 = tt.broadcast %119 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2137196Z       %121 = arith.extsi %110 : i32 to i64
2026-02-21T09:51:46.2137408Z       %122 = tt.splat %121 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2137708Z       %123 = arith.addi %122, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2138112Z       %124 = tt.expand_dims %123 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2138570Z       %125 = tt.broadcast %124 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2138884Z       %126 = arith.cmpi sge, %124, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2139128Z       %127 = arith.cmpi slt, %124, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2139363Z       %128 = arith.andi %126, %127 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2139666Z       %129 = tt.broadcast %128 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2140001Z       %130 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<64x64xf32, #mma>)  : i32 {
2026-02-21T09:51:46.2140213Z         %139 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:51:46.2140386Z         %140 = tt.splat %139 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.2140614Z         %141 = arith.addi %140, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.2140891Z         %142 = tt.expand_dims %141 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.2141219Z         %143 = tt.broadcast %142 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2141417Z         %144 = arith.addi %120, %143 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2141617Z         %145 = tt.addptr %9, %144 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2141824Z         %146 = tt.load %145 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.2142049Z         %147 = ttg.local_alloc %146 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T09:51:46.2142375Z         %148 = ttg.local_load %147 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2142796Z         %149 = arith.extf %148 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2143076Z         %150 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:46.2143288Z         %151 = tt.splat %150 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2143609Z         %152 = arith.addi %151, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2143996Z         %153 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2144353Z         %154 = arith.muli %153, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2144658Z         %155 = tt.broadcast %154 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2144983Z         %156 = arith.addi %155, %125 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2145294Z         %157 = tt.addptr %10, %156 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2145610Z         %158 = arith.cmpi sge, %153, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2145877Z         %159 = arith.cmpi slt, %153, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2146130Z         %160 = arith.andi %158, %159 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2146494Z         %161 = tt.broadcast %160 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2146795Z         %162 = arith.andi %161, %129 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2147036Z         %163 = tt.load %157, %162, %cst_1 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2147283Z         %164 = arith.shli %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2147519Z         %165 = arith.shrsi %164, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2147752Z         %166 = arith.shrsi %163, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2148040Z         %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.2148372Z         %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.2148658Z         %169 = tt.broadcast %167 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2148896Z         %170 = arith.select %18, %169, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2149146Z         %171 = tt.broadcast %168 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2149377Z         %172 = arith.select %20, %171, %170 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2149611Z         %173 = tt.reshape %172 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:51:46.2149852Z         %174 = arith.sitofp %173 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:51:46.2150192Z         %175 = ttg.local_alloc %174 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:51:46.2150511Z         %176 = ttg.local_load %175 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2150994Z         %177 = tt.dot %149, %176, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.2151359Z         scf.yield %177 : tensor<64x64xf32, #mma>
2026-02-21T09:51:46.2151521Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:46.2151744Z       %131 = arith.truncf %130 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:46.2152041Z       %132 = tt.expand_dims %117 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:46.2152361Z       %133 = arith.muli %132, %cst_11 : tensor<64x1xi32, #mma>
2026-02-21T09:51:46.2152591Z       %134 = tt.expand_dims %112 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:46.2152848Z       %135 = tt.broadcast %133 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2153047Z       %136 = tt.broadcast %134 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2153225Z       %137 = arith.addi %135, %136 : tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2153415Z       %138 = tt.addptr %21, %137 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2153606Z       tt.store %138, %131 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:46.2153750Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:51:46.2153874Z     scf.for %arg3 = %25 to %3 step %c1_i32  : i32 {
2026-02-21T09:51:46.2154007Z       %26 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:51:46.2154131Z       %27 = arith.muli %26, %c2_i32 : i32
2026-02-21T09:51:46.2154264Z       %28 = arith.subi %c128_i32, %27 : i32
2026-02-21T09:51:46.2154383Z       %29 = arith.minsi %28, %c2_i32 : i32
2026-02-21T09:51:46.2154501Z       %30 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:51:46.2154619Z       %31 = arith.remsi %30, %29 : i32
2026-02-21T09:51:46.2154732Z       %32 = arith.addi %27, %31 : i32
2026-02-21T09:51:46.2154841Z       %33 = arith.divsi %30, %29 : i32
2026-02-21T09:51:46.2154969Z       %34 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:51:46.2155126Z       %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.2155337Z       %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.2155514Z       %37 = arith.muli %33, %c64_i32 : i32
2026-02-21T09:51:46.2155680Z       %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.2155887Z       %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.2156093Z       %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.2156301Z       %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.2156562Z       %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.2156821Z       %43 = arith.muli %42, %cst_7 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.2157013Z       %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2157185Z       %45 = arith.extsi %34 : i32 to i64
2026-02-21T09:51:46.2157432Z       %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2157722Z       %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2166081Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2166582Z       %49 = tt.broadcast %48 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2166888Z       %50 = arith.cmpi sge, %48, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2167128Z       %51 = arith.cmpi slt, %48, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2167395Z       %52 = arith.andi %50, %51 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2167689Z       %53 = tt.broadcast %52 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2168037Z       %54 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<64x64xf32, #mma>)  : i32 {
2026-02-21T09:51:46.2168275Z         %63 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:51:46.2168535Z         %64 = tt.splat %63 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.2168753Z         %65 = arith.addi %64, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.2169023Z         %66 = tt.expand_dims %65 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.2169314Z         %67 = tt.broadcast %66 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2169504Z         %68 = arith.addi %44, %67 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2169702Z         %69 = tt.addptr %9, %68 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.2169902Z         %70 = tt.load %69 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.2170137Z         %71 = ttg.local_alloc %70 : (tensor<64x4xbf16, #blocked1>) -> !ttg.memdesc<64x4xbf16, #shared, #smem>
2026-02-21T09:51:46.2170462Z         %72 = ttg.local_load %71 : !ttg.memdesc<64x4xbf16, #shared, #smem> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2170862Z         %73 = arith.extf %72 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2171141Z         %74 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:46.2171374Z         %75 = tt.splat %74 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2171669Z         %76 = arith.addi %75, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:51:46.2172060Z         %77 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2172447Z         %78 = arith.muli %77, %cst_6 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2172749Z         %79 = tt.broadcast %78 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2173047Z         %80 = arith.addi %79, %49 : tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2173370Z         %81 = tt.addptr %10, %80 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2173687Z         %82 = arith.cmpi sge, %77, %cst_5 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2174014Z         %83 = arith.cmpi slt, %77, %cst_4 : tensor<2x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2174267Z         %84 = arith.andi %82, %83 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2174558Z         %85 = tt.broadcast %84 : tensor<2x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2174877Z         %86 = arith.andi %85, %53 : tensor<2x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2175114Z         %87 = tt.load %81, %86, %cst_1 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2175353Z         %88 = arith.shli %87, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2175597Z         %89 = arith.shrsi %88, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2175828Z         %90 = arith.shrsi %87, %cst_8 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.2176108Z         %91 = tt.expand_dims %89 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.2176437Z         %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.2176749Z         %93 = tt.broadcast %91 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2176981Z         %94 = arith.select %18, %93, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2177243Z         %95 = tt.broadcast %92 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2177502Z         %96 = arith.select %20, %95, %94 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.2177801Z         %97 = tt.reshape %96 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:51:46.2178026Z         %98 = arith.sitofp %97 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:51:46.2178284Z         %99 = ttg.local_alloc %98 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:51:46.2178629Z         %100 = ttg.local_load %99 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.2179190Z         %101 = tt.dot %73, %100, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.2179573Z         scf.yield %101 : tensor<64x64xf32, #mma>
2026-02-21T09:51:46.2179740Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:46.2179943Z       %55 = arith.truncf %54 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:46.2180212Z       %56 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:46.2180447Z       %57 = arith.muli %56, %cst_11 : tensor<64x1xi32, #mma>
2026-02-21T09:51:46.2180686Z       %58 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:46.2180937Z       %59 = tt.broadcast %57 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2181130Z       %60 = tt.broadcast %58 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2181308Z       %61 = arith.addi %59, %60 : tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2181509Z       %62 = tt.addptr %21, %61 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:46.2181694Z       tt.store %62, %55 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:46.2181831Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:51:46.2181935Z     tt.return
2026-02-21T09:51:46.2182016Z   }
2026-02-21T09:51:46.2182095Z }
2026-02-21T09:51:46.2182141Z 
2026-02-21T09:51:46.2182172Z {-#
2026-02-21T09:51:46.2182255Z   external_resources: {
2026-02-21T09:51:46.2182357Z     mlir_reproducer: {
2026-02-21T09:51:46.2183359Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:51:46.2184486Z       disable_threading: false,
2026-02-21T09:51:46.2184805Z       verify_each: true
2026-02-21T09:51:46.2184894Z     }
2026-02-21T09:51:46.2185062Z   }
2026-02-21T09:51:46.2185134Z #-}
2026-02-21T09:51:46.2185488Z /tmp/torchinductor_root/iz/cizbuh3veqzxvqwr47uaeiykj2usnn4s6fiecuivt63my44owufy.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:51:46.2186172Z /tmp/torchinductor_root/iz/cizbuh3veqzxvqwr47uaeiykj2usnn4s6fiecuivt63my44owufy.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:51:46.2186734Z [437s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:51:46.2187507Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:51:46.2188263Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:51:46.2188430Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:51:46.7175299Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:51:46.7187220Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:51:46.7187765Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:51:46.7188374Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:51:46.7188936Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:51:46.7189565Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:51:46.7189978Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:51:46.7190335Z #smem = #ttg.shared_memory
2026-02-21T09:51:46.7190830Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:51:46.7191795Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:51:46.7192599Z     %cst = arith.constant dense<8192> : tensor<64x1xi32, #mma>
2026-02-21T09:51:46.7192924Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:46.7193295Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:46.7193572Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7193836Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:51:46.7194221Z     %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7194867Z     %cst_4 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7195420Z     %cst_5 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7195792Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7196028Z     %cst_7 = arith.constant dense<0> : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7196333Z     %cst_8 = arith.constant dense<512> : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7196551Z     %cst_9 = arith.constant dense<0> : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7196843Z     %cst_10 = arith.constant dense<8192> : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7197185Z     %cst_11 = arith.constant dense<1024> : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.7197474Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:51:46.7197650Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:51:46.7197873Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:51:46.7198101Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T09:51:46.7198322Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:51:46.7198529Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:51:46.7198795Z     %cst_12 = arith.constant dense<0> : tensor<2x64xi8, #blocked2>
2026-02-21T09:51:46.7199123Z     %cst_13 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7199395Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:51:46.7199601Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:51:46.7199899Z     %cst_14 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7200151Z     %0 = tt.get_program_id x : i32
2026-02-21T09:51:46.7200305Z     %1 = arith.muli %0, %c4_i32 : i32
2026-02-21T09:51:46.7200537Z     %2 = arith.addi %1, %c4_i32 : i32
2026-02-21T09:51:46.7200678Z     %3 = arith.minsi %2, %c32768_i32 : i32
2026-02-21T09:51:46.7200963Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.7201480Z     %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.7201887Z     %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:46.7202232Z     %7 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.7202681Z     %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7203025Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7203387Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7203837Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7204453Z     %12 = arith.extsi %11 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7205216Z     %13 = arith.extsi %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:46.7205790Z     %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:51:46.7206274Z     %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:51:46.7206691Z     %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:46.7206943Z     %17 = arith.cmpi eq, %16, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:46.7207263Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:51:46.7207560Z     %19 = arith.cmpi eq, %16, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:46.7207793Z     %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:51:46.7208002Z     %21 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:46.7208174Z     %22 = arith.subi %3, %1 : i32
2026-02-21T09:51:46.7208356Z     %23 = arith.remsi %22, %c3_i32 : i32
2026-02-21T09:51:46.7208529Z     %24 = arith.subi %22, %23 : i32
2026-02-21T09:51:46.7208696Z     %25 = arith.addi %1, %24 : i32
2026-02-21T09:51:46.7208878Z     scf.for %arg3 = %1 to %25 step %c3_i32  : i32 {
2026-02-21T09:51:46.7209089Z       %26 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:51:46.7209282Z       %27 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:51:46.7209465Z       %28 = arith.muli %26, %c64_i32 : i32
2026-02-21T09:51:46.7209714Z       %29 = tt.splat %28 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.7210025Z       %30 = arith.addi %29, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.7210278Z       %31 = arith.muli %27, %c64_i32 : i32
2026-02-21T09:51:46.7210532Z       %32 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.7210856Z       %33 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.7211180Z       %34 = arith.addi %32, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.7211496Z       %35 = arith.addi %33, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.7211885Z       %36 = tt.expand_dims %34 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.7212153Z       %37 = arith.muli %36, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.7212404Z       %38 = tt.broadcast %37 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7212668Z       %39 = arith.extsi %28 : i32 to i64
2026-02-21T09:51:46.7212914Z       %40 = tt.splat %39 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:46.7213255Z       %41 = arith.addi %40, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:46.7213681Z       %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7214104Z       %43 = tt.broadcast %42 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7214403Z       %44 = arith.cmpi sge, %42, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7214666Z       %45 = arith.cmpi slt, %42, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7214917Z       %46 = arith.andi %44, %45 : tensor<1x64xi1, #blocked2>
2026-02-21T09:51:46.7215197Z       %47 = tt.broadcast %46 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7215520Z       %48 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:46.7215934Z       %49 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.7216366Z       %50 = tt.broadcast %49 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7216653Z       %51 = arith.addi %38, %50 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7216946Z       %52 = tt.addptr %9, %51 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7217248Z       %53 = tt.load %52 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7217677Z       %54 = ttg.memdesc_index %48[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7218240Z       ttg.local_store %53, %54 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7218514Z       %55 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7218804Z       %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.7219170Z       %57 = tt.broadcast %56 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7219374Z       %58 = arith.addi %38, %57 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7219564Z       %59 = tt.addptr %9, %58 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7219763Z       %60 = tt.load %59 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7220079Z       %61 = ttg.memdesc_index %48[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7220498Z       ttg.local_store %60, %61 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7221014Z       %62:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %54, %arg8 = %61) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:51:46.7221429Z         %289 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:46.7221557Z         %290 = arith.muli %289, %c2_i32 : i32
2026-02-21T09:51:46.7221734Z         %291 = tt.splat %290 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7221979Z         %292 = arith.addi %291, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7222257Z         %293 = tt.expand_dims %292 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.7222537Z         %294 = tt.broadcast %293 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7222733Z         %295 = arith.addi %38, %294 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7222935Z         %296 = tt.addptr %9, %295 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7223140Z         %297 = tt.load %296 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7223444Z         %298 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7223991Z         %299 = arith.extf %298 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7224274Z         %300 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:46.7224449Z         %301 = tt.splat %300 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7224674Z         %302 = arith.addi %301, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7224953Z         %303 = tt.expand_dims %302 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7225213Z         %304 = arith.muli %303, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7225411Z         %305 = tt.broadcast %304 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7225604Z         %306 = arith.addi %305, %43 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7225802Z         %307 = tt.addptr %10, %306 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7226015Z         %308 = arith.cmpi sge, %303, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7226189Z         %309 = arith.cmpi slt, %303, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7226373Z         %310 = arith.andi %308, %309 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:46.7226562Z         %311 = tt.broadcast %310 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7226751Z         %312 = arith.andi %311, %47 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7226921Z         %313 = tt.load %307, %312, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7227178Z         %314 = ttg.convert_layout %313 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7227490Z         %315 = arith.shli %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7227728Z         %316 = arith.shrsi %315, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7227966Z         %317 = arith.shrsi %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7228254Z         %318 = tt.expand_dims %316 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7228588Z         %319 = tt.expand_dims %317 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7228872Z         %320 = tt.broadcast %318 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7229116Z         %321 = arith.select %18, %320, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7229361Z         %322 = tt.broadcast %319 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7229596Z         %323 = arith.select %20, %322, %321 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7229842Z         %324 = tt.reshape %323 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7230064Z         %325 = arith.sitofp %324 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7230362Z         %326 = ttg.convert_layout %325 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7230833Z         %327 = tt.dot %299, %326, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7231189Z         %328 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:51:46.7231319Z         %329 = arith.cmpi slt, %328, %c2_i32 : i32
2026-02-21T09:51:46.7231454Z         %330 = arith.select %329, %328, %c0_i32 : i32
2026-02-21T09:51:46.7231718Z         %331 = ttg.memdesc_index %48[%330] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7232069Z         ttg.local_store %297, %331 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7232464Z         scf.yield %327, %330, %arg8, %331 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7232799Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:46.7233128Z       %63 = ttg.local_load %62#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7233551Z       %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7233873Z       %65 = arith.addi %12, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7234149Z       %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7234407Z       %67 = arith.muli %66, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7234593Z       %68 = tt.broadcast %67 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7234784Z       %69 = arith.addi %68, %43 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7234972Z       %70 = tt.addptr %10, %69 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7235178Z       %71 = arith.cmpi sge, %66, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7235358Z       %72 = arith.cmpi slt, %66, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7235520Z       %73 = arith.andi %71, %72 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:46.7235704Z       %74 = tt.broadcast %73 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7235886Z       %75 = arith.andi %74, %47 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7236050Z       %76 = tt.load %70, %75, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7236300Z       %77 = ttg.convert_layout %76 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7236577Z       %78 = arith.shli %77, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7236814Z       %79 = arith.shrsi %78, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7237045Z       %80 = arith.shrsi %77, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7237326Z       %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7237650Z       %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7237943Z       %83 = tt.broadcast %81 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7238177Z       %84 = arith.select %18, %83, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7238407Z       %85 = tt.broadcast %82 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7238629Z       %86 = arith.select %20, %85, %84 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7238848Z       %87 = tt.reshape %86 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7239064Z       %88 = arith.sitofp %87 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7239353Z       %89 = ttg.convert_layout %88 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7239805Z       %90 = tt.dot %64, %89, %62#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7240288Z       %91 = ttg.local_load %62#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7240705Z       %92 = arith.extf %91 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7241043Z       %93 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7241321Z       %94 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7241565Z       %95 = arith.muli %94, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7241752Z       %96 = tt.broadcast %95 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7241940Z       %97 = arith.addi %96, %43 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7242130Z       %98 = tt.addptr %10, %97 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7242347Z       %99 = arith.cmpi sge, %94, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7242517Z       %100 = arith.cmpi slt, %94, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7242778Z       %101 = arith.andi %99, %100 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:46.7242964Z       %102 = tt.broadcast %101 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7243152Z       %103 = arith.andi %102, %47 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7243341Z       %104 = tt.load %98, %103, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7243593Z       %105 = ttg.convert_layout %104 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7243877Z       %106 = arith.shli %105, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7244112Z       %107 = arith.shrsi %106, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7244352Z       %108 = arith.shrsi %105, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7244641Z       %109 = tt.expand_dims %107 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7244974Z       %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7245256Z       %111 = tt.broadcast %109 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7245495Z       %112 = arith.select %18, %111, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7245729Z       %113 = tt.broadcast %110 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7245981Z       %114 = arith.select %20, %113, %112 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7246208Z       %115 = tt.reshape %114 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7246429Z       %116 = arith.sitofp %115 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7246723Z       %117 = ttg.convert_layout %116 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7247178Z       %118 = tt.dot %92, %117, %90, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7247555Z       ttg.local_dealloc %48 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:46.7247763Z       %119 = arith.truncf %118 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:46.7248027Z       %120 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:46.7248260Z       %121 = arith.muli %120, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:51:46.7248487Z       %122 = tt.expand_dims %30 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:46.7248758Z       %123 = tt.broadcast %121 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7248957Z       %124 = tt.broadcast %122 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7249137Z       %125 = arith.addi %123, %124 : tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7249327Z       %126 = tt.addptr %21, %125 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7249520Z       tt.store %126, %119 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:46.7249663Z       %127 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:51:46.7249786Z       %128 = arith.remsi %127, %c128_i32 : i32
2026-02-21T09:51:46.7249910Z       %129 = arith.divsi %127, %c128_i32 : i32
2026-02-21T09:51:46.7250046Z       %130 = arith.muli %128, %c64_i32 : i32
2026-02-21T09:51:46.7250209Z       %131 = tt.splat %130 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.7250420Z       %132 = arith.addi %131, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.7250586Z       %133 = arith.muli %129, %c64_i32 : i32
2026-02-21T09:51:46.7250757Z       %134 = tt.splat %133 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.7250989Z       %135 = tt.splat %133 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.7251204Z       %136 = arith.addi %134, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.7251415Z       %137 = arith.addi %135, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.7251689Z       %138 = tt.expand_dims %136 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.7251948Z       %139 = arith.muli %138, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.7252141Z       %140 = tt.broadcast %139 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7252320Z       %141 = arith.extsi %130 : i32 to i64
2026-02-21T09:51:46.7252486Z       %142 = tt.splat %141 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:46.7252710Z       %143 = arith.addi %142, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:46.7252991Z       %144 = tt.expand_dims %143 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7253270Z       %145 = tt.broadcast %144 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7253491Z       %146 = arith.cmpi sge, %144, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7253666Z       %147 = arith.cmpi slt, %144, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7253838Z       %148 = arith.andi %146, %147 : tensor<1x64xi1, #blocked2>
2026-02-21T09:51:46.7254028Z       %149 = tt.broadcast %148 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7254244Z       %150 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:46.7254432Z       %151 = arith.addi %140, %50 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7254630Z       %152 = tt.addptr %9, %151 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7254836Z       %153 = tt.load %152 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7255116Z       %154 = ttg.memdesc_index %150[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7255477Z       ttg.local_store %153, %154 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7255719Z       %155 = arith.addi %140, %57 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7255915Z       %156 = tt.addptr %9, %155 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7256121Z       %157 = tt.load %156 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7256416Z       %158 = ttg.memdesc_index %150[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7256770Z       ttg.local_store %157, %158 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7257291Z       %159:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %154, %arg8 = %158) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:51:46.7257704Z         %289 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:46.7257858Z         %290 = arith.muli %289, %c2_i32 : i32
2026-02-21T09:51:46.7258032Z         %291 = tt.splat %290 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7258257Z         %292 = arith.addi %291, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7258538Z         %293 = tt.expand_dims %292 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.7258831Z         %294 = tt.broadcast %293 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7259029Z         %295 = arith.addi %140, %294 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7259236Z         %296 = tt.addptr %9, %295 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7259444Z         %297 = tt.load %296 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7259740Z         %298 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7260174Z         %299 = arith.extf %298 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7260457Z         %300 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:46.7260633Z         %301 = tt.splat %300 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7260855Z         %302 = arith.addi %301, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7261132Z         %303 = tt.expand_dims %302 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7261400Z         %304 = arith.muli %303, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7261596Z         %305 = tt.broadcast %304 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7261793Z         %306 = arith.addi %305, %145 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7261989Z         %307 = tt.addptr %10, %306 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7262202Z         %308 = arith.cmpi sge, %303, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7262375Z         %309 = arith.cmpi slt, %303, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7262540Z         %310 = arith.andi %308, %309 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:46.7262730Z         %311 = tt.broadcast %310 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7262922Z         %312 = arith.andi %311, %149 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7263093Z         %313 = tt.load %307, %312, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7263352Z         %314 = ttg.convert_layout %313 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7263635Z         %315 = arith.shli %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7263875Z         %316 = arith.shrsi %315, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7264110Z         %317 = arith.shrsi %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7264417Z         %318 = tt.expand_dims %316 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7264752Z         %319 = tt.expand_dims %317 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7265035Z         %320 = tt.broadcast %318 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7265281Z         %321 = arith.select %18, %320, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7265535Z         %322 = tt.broadcast %319 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7265767Z         %323 = arith.select %20, %322, %321 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7265996Z         %324 = tt.reshape %323 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7266220Z         %325 = arith.sitofp %324 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7266540Z         %326 = ttg.convert_layout %325 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7267004Z         %327 = tt.dot %299, %326, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7267354Z         %328 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:51:46.7267482Z         %329 = arith.cmpi slt, %328, %c2_i32 : i32
2026-02-21T09:51:46.7267620Z         %330 = arith.select %329, %328, %c0_i32 : i32
2026-02-21T09:51:46.7267889Z         %331 = ttg.memdesc_index %150[%330] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7268243Z         ttg.local_store %297, %331 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7268633Z         scf.yield %327, %330, %arg8, %331 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7268964Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:46.7269293Z       %160 = ttg.local_load %159#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7269754Z       %161 = arith.extf %160 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7270052Z       %162 = arith.addi %68, %145 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7270254Z       %163 = tt.addptr %10, %162 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7270459Z       %164 = arith.andi %74, %149 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7270627Z       %165 = tt.load %163, %164, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7270886Z       %166 = ttg.convert_layout %165 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7271166Z       %167 = arith.shli %166, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7271405Z       %168 = arith.shrsi %167, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7271647Z       %169 = arith.shrsi %166, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7271935Z       %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7272270Z       %171 = tt.expand_dims %169 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7272562Z       %172 = tt.broadcast %170 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7272804Z       %173 = arith.select %18, %172, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7273042Z       %174 = tt.broadcast %171 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7273276Z       %175 = arith.select %20, %174, %173 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7273510Z       %176 = tt.reshape %175 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7273751Z       %177 = arith.sitofp %176 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7274050Z       %178 = ttg.convert_layout %177 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7274533Z       %179 = tt.dot %161, %178, %159#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7275027Z       %180 = ttg.local_load %159#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7275462Z       %181 = arith.extf %180 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7275766Z       %182 = arith.addi %96, %145 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7275966Z       %183 = tt.addptr %10, %182 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7276172Z       %184 = arith.andi %102, %149 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7276343Z       %185 = tt.load %183, %184, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7276607Z       %186 = ttg.convert_layout %185 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7276898Z       %187 = arith.shli %186, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7277138Z       %188 = arith.shrsi %187, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7277398Z       %189 = arith.shrsi %186, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7277687Z       %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7278027Z       %191 = tt.expand_dims %189 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7278316Z       %192 = tt.broadcast %190 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7278557Z       %193 = arith.select %18, %192, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7278801Z       %194 = tt.broadcast %191 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7279034Z       %195 = arith.select %20, %194, %193 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7279266Z       %196 = tt.reshape %195 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7279493Z       %197 = arith.sitofp %196 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7279791Z       %198 = ttg.convert_layout %197 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7280259Z       %199 = tt.dot %181, %198, %179, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7280659Z       ttg.local_dealloc %150 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:46.7280874Z       %200 = arith.truncf %199 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:46.7281144Z       %201 = tt.expand_dims %137 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:46.7281386Z       %202 = arith.muli %201, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:51:46.7281619Z       %203 = tt.expand_dims %132 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:46.7281897Z       %204 = tt.broadcast %202 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7282100Z       %205 = tt.broadcast %203 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7282285Z       %206 = arith.addi %204, %205 : tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7282475Z       %207 = tt.addptr %21, %206 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7282725Z       tt.store %207, %200 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:46.7282892Z       %208 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:51:46.7283024Z       %209 = arith.remsi %208, %c128_i32 : i32
2026-02-21T09:51:46.7283153Z       %210 = arith.divsi %208, %c128_i32 : i32
2026-02-21T09:51:46.7283275Z       %211 = arith.muli %209, %c64_i32 : i32
2026-02-21T09:51:46.7283444Z       %212 = tt.splat %211 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.7283657Z       %213 = arith.addi %212, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.7283830Z       %214 = arith.muli %210, %c64_i32 : i32
2026-02-21T09:51:46.7284002Z       %215 = tt.splat %214 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.7284222Z       %216 = tt.splat %214 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.7284442Z       %217 = arith.addi %215, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.7284657Z       %218 = arith.addi %216, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.7284933Z       %219 = tt.expand_dims %217 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.7285187Z       %220 = arith.muli %219, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.7285405Z       %221 = tt.broadcast %220 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7285585Z       %222 = arith.extsi %211 : i32 to i64
2026-02-21T09:51:46.7285755Z       %223 = tt.splat %222 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:46.7285985Z       %224 = arith.addi %223, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:46.7286265Z       %225 = tt.expand_dims %224 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7286549Z       %226 = tt.broadcast %225 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7286758Z       %227 = arith.cmpi sge, %225, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7286934Z       %228 = arith.cmpi slt, %225, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7287108Z       %229 = arith.andi %227, %228 : tensor<1x64xi1, #blocked2>
2026-02-21T09:51:46.7287299Z       %230 = tt.broadcast %229 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7287521Z       %231 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:46.7287710Z       %232 = arith.addi %221, %50 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7287915Z       %233 = tt.addptr %9, %232 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7288150Z       %234 = tt.load %233 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7288436Z       %235 = ttg.memdesc_index %231[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7288799Z       ttg.local_store %234, %235 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7289039Z       %236 = arith.addi %221, %57 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7289241Z       %237 = tt.addptr %9, %236 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7289449Z       %238 = tt.load %237 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7289758Z       %239 = ttg.memdesc_index %231[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7290119Z       ttg.local_store %238, %239 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7290658Z       %240:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %235, %arg8 = %239) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:51:46.7291074Z         %289 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:46.7291206Z         %290 = arith.muli %289, %c2_i32 : i32
2026-02-21T09:51:46.7291383Z         %291 = tt.splat %290 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7291614Z         %292 = arith.addi %291, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7291902Z         %293 = tt.expand_dims %292 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.7292183Z         %294 = tt.broadcast %293 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7292385Z         %295 = arith.addi %221, %294 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7292590Z         %296 = tt.addptr %9, %295 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7292804Z         %297 = tt.load %296 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7293110Z         %298 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7293564Z         %299 = arith.extf %298 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7293854Z         %300 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:46.7294031Z         %301 = tt.splat %300 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7294263Z         %302 = arith.addi %301, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7294548Z         %303 = tt.expand_dims %302 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7294803Z         %304 = arith.muli %303, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7295004Z         %305 = tt.broadcast %304 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7295204Z         %306 = arith.addi %305, %226 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7295412Z         %307 = tt.addptr %10, %306 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7295630Z         %308 = arith.cmpi sge, %303, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7295808Z         %309 = arith.cmpi slt, %303, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7295982Z         %310 = arith.andi %308, %309 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:46.7296189Z         %311 = tt.broadcast %310 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7296386Z         %312 = arith.andi %311, %230 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7296558Z         %313 = tt.load %307, %312, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7296822Z         %314 = ttg.convert_layout %313 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7297112Z         %315 = arith.shli %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7297354Z         %316 = arith.shrsi %315, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7297615Z         %317 = arith.shrsi %314, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7297906Z         %318 = tt.expand_dims %316 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7298249Z         %319 = tt.expand_dims %317 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7298553Z         %320 = tt.broadcast %318 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7298794Z         %321 = arith.select %18, %320, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7299040Z         %322 = tt.broadcast %319 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7299275Z         %323 = arith.select %20, %322, %321 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7299511Z         %324 = tt.reshape %323 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7299741Z         %325 = arith.sitofp %324 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7300040Z         %326 = ttg.convert_layout %325 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7300514Z         %327 = tt.dot %299, %326, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7300868Z         %328 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:51:46.7301000Z         %329 = arith.cmpi slt, %328, %c2_i32 : i32
2026-02-21T09:51:46.7301140Z         %330 = arith.select %329, %328, %c0_i32 : i32
2026-02-21T09:51:46.7301426Z         %331 = ttg.memdesc_index %231[%330] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7301786Z         ttg.local_store %297, %331 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7302180Z         scf.yield %327, %330, %arg8, %331 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7302514Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:46.7302839Z       %241 = ttg.local_load %240#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7303271Z       %242 = arith.extf %241 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7303577Z       %243 = arith.addi %68, %226 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7303785Z       %244 = tt.addptr %10, %243 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7303988Z       %245 = arith.andi %74, %230 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7304165Z       %246 = tt.load %244, %245, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7304436Z       %247 = ttg.convert_layout %246 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7304726Z       %248 = arith.shli %247, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7304970Z       %249 = arith.shrsi %248, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7305208Z       %250 = arith.shrsi %247, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7305501Z       %251 = tt.expand_dims %249 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7305852Z       %252 = tt.expand_dims %250 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7306140Z       %253 = tt.broadcast %251 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7306387Z       %254 = arith.select %18, %253, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7306626Z       %255 = tt.broadcast %252 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7306881Z       %256 = arith.select %20, %255, %254 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7307111Z       %257 = tt.reshape %256 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7307336Z       %258 = arith.sitofp %257 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7307634Z       %259 = ttg.convert_layout %258 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7308102Z       %260 = tt.dot %242, %259, %240#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7308598Z       %261 = ttg.local_load %240#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7309038Z       %262 = arith.extf %261 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7309337Z       %263 = arith.addi %96, %226 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7309561Z       %264 = tt.addptr %10, %263 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7309762Z       %265 = arith.andi %102, %230 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7309938Z       %266 = tt.load %264, %265, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7310201Z       %267 = ttg.convert_layout %266 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7310484Z       %268 = arith.shli %267, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7310727Z       %269 = arith.shrsi %268, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7310965Z       %270 = arith.shrsi %267, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7311260Z       %271 = tt.expand_dims %269 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7311608Z       %272 = tt.expand_dims %270 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7311891Z       %273 = tt.broadcast %271 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7312134Z       %274 = arith.select %18, %273, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7312371Z       %275 = tt.broadcast %272 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7312621Z       %276 = arith.select %20, %275, %274 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7312855Z       %277 = tt.reshape %276 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7313078Z       %278 = arith.sitofp %277 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7313379Z       %279 = ttg.convert_layout %278 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7313849Z       %280 = tt.dot %262, %279, %260, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7314246Z       ttg.local_dealloc %231 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:46.7314457Z       %281 = arith.truncf %280 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:46.7314719Z       %282 = tt.expand_dims %218 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:46.7314977Z       %283 = arith.muli %282, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:51:46.7315205Z       %284 = tt.expand_dims %213 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:46.7315462Z       %285 = tt.broadcast %283 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7315664Z       %286 = tt.broadcast %284 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7315842Z       %287 = arith.addi %285, %286 : tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7316032Z       %288 = tt.addptr %21, %287 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7316227Z       tt.store %288, %281 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:46.7316366Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:51:46.7316489Z     scf.for %arg3 = %25 to %3 step %c1_i32  : i32 {
2026-02-21T09:51:46.7316623Z       %26 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:51:46.7316748Z       %27 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:51:46.7316869Z       %28 = arith.muli %26, %c64_i32 : i32
2026-02-21T09:51:46.7317031Z       %29 = tt.splat %28 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.7317233Z       %30 = arith.addi %29, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:46.7317422Z       %31 = arith.muli %27, %c64_i32 : i32
2026-02-21T09:51:46.7317587Z       %32 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.7317799Z       %33 = tt.splat %31 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.7318010Z       %34 = arith.addi %32, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:46.7318214Z       %35 = arith.addi %33, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:46.7318482Z       %36 = tt.expand_dims %34 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.7318733Z       %37 = arith.muli %36, %cst_11 : tensor<64x1xi32, #blocked1>
2026-02-21T09:51:46.7318922Z       %38 = tt.broadcast %37 : tensor<64x1xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7319095Z       %39 = arith.extsi %28 : i32 to i64
2026-02-21T09:51:46.7319258Z       %40 = tt.splat %39 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:46.7319475Z       %41 = arith.addi %40, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:46.7319749Z       %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7320020Z       %43 = tt.broadcast %42 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7320234Z       %44 = arith.cmpi sge, %42, %cst_9 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7320403Z       %45 = arith.cmpi slt, %42, %cst_10 : tensor<1x64xi64, #blocked2>
2026-02-21T09:51:46.7320567Z       %46 = arith.andi %44, %45 : tensor<1x64xi1, #blocked2>
2026-02-21T09:51:46.7320744Z       %47 = tt.broadcast %46 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7320958Z       %48 = ttg.local_alloc : () -> !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:46.7321227Z       %49 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.7321509Z       %50 = tt.broadcast %49 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7321696Z       %51 = arith.addi %38, %50 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7321887Z       %52 = tt.addptr %9, %51 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7322089Z       %53 = tt.load %52 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7322386Z       %54 = ttg.memdesc_index %48[%c0_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7322779Z       ttg.local_store %53, %54 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7323052Z       %55 = arith.addi %8, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7323324Z       %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.7323595Z       %57 = tt.broadcast %56 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7323782Z       %58 = arith.addi %38, %57 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7323972Z       %59 = tt.addptr %9, %58 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7324172Z       %60 = tt.load %59 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7324446Z       %61 = ttg.memdesc_index %48[%c1_i32] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7324795Z       ttg.local_store %60, %61 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7325324Z       %62:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %54, %arg8 = %61) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>)  : i32 {
2026-02-21T09:51:46.7325740Z         %127 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:51:46.7325867Z         %128 = arith.muli %127, %c2_i32 : i32
2026-02-21T09:51:46.7326042Z         %129 = tt.splat %128 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7326267Z         %130 = arith.addi %129, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:46.7326547Z         %131 = tt.expand_dims %130 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:46.7326824Z         %132 = tt.broadcast %131 : tensor<1x4xi32, #blocked1> -> tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7327020Z         %133 = arith.addi %38, %132 : tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7327223Z         %134 = tt.addptr %9, %133 : tensor<64x4x!tt.ptr<bf16>, #blocked1>, tensor<64x4xi32, #blocked1>
2026-02-21T09:51:46.7327429Z         %135 = tt.load %134 : tensor<64x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:46.7327728Z         %136 = ttg.local_load %arg7 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7328178Z         %137 = arith.extf %136 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7328465Z         %138 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:51:46.7328641Z         %139 = tt.splat %138 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7328865Z         %140 = arith.addi %139, %12 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7329145Z         %141 = tt.expand_dims %140 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7329393Z         %142 = arith.muli %141, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7329607Z         %143 = tt.broadcast %142 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7329800Z         %144 = arith.addi %143, %43 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7329995Z         %145 = tt.addptr %10, %144 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7330207Z         %146 = arith.cmpi sge, %141, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7330397Z         %147 = arith.cmpi slt, %141, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7330563Z         %148 = arith.andi %146, %147 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:46.7330750Z         %149 = tt.broadcast %148 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7330943Z         %150 = arith.andi %149, %47 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7331114Z         %151 = tt.load %145, %150, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7331374Z         %152 = ttg.convert_layout %151 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7331658Z         %153 = arith.shli %152, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7331898Z         %154 = arith.shrsi %153, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7332140Z         %155 = arith.shrsi %152, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7332435Z         %156 = tt.expand_dims %154 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7332771Z         %157 = tt.expand_dims %155 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7333072Z         %158 = tt.broadcast %156 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7333310Z         %159 = arith.select %18, %158, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7333548Z         %160 = tt.broadcast %157 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7333782Z         %161 = arith.select %20, %160, %159 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7334011Z         %162 = tt.reshape %161 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7334234Z         %163 = arith.sitofp %162 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7334527Z         %164 = ttg.convert_layout %163 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7334997Z         %165 = tt.dot %137, %164, %arg5, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7335347Z         %166 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:51:46.7335474Z         %167 = arith.cmpi slt, %166, %c2_i32 : i32
2026-02-21T09:51:46.7335608Z         %168 = arith.select %167, %166, %c0_i32 : i32
2026-02-21T09:51:46.7335869Z         %169 = ttg.memdesc_index %48[%168] : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7336240Z         ttg.local_store %135, %169 : tensor<64x4xbf16, #blocked1> -> !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7336629Z         scf.yield %165, %168, %arg8, %169 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>, !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4>
2026-02-21T09:51:46.7336964Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:51:46.7337276Z       %63 = ttg.local_load %62#2 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7337716Z       %64 = arith.extf %63 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7338042Z       %65 = arith.addi %12, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7338336Z       %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7338581Z       %67 = arith.muli %66, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7338767Z       %68 = tt.broadcast %67 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7338956Z       %69 = arith.addi %68, %43 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7339145Z       %70 = tt.addptr %10, %69 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7339352Z       %71 = arith.cmpi sge, %66, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7339517Z       %72 = arith.cmpi slt, %66, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7339677Z       %73 = arith.andi %71, %72 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:46.7339858Z       %74 = tt.broadcast %73 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7340040Z       %75 = arith.andi %74, %47 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7340205Z       %76 = tt.load %70, %75, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7340452Z       %77 = ttg.convert_layout %76 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7340728Z       %78 = arith.shli %77, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7340974Z       %79 = arith.shrsi %78, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7341207Z       %80 = arith.shrsi %77, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7341490Z       %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7341816Z       %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7342090Z       %83 = tt.broadcast %81 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7342321Z       %84 = arith.select %18, %83, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7342548Z       %85 = tt.broadcast %82 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7342770Z       %86 = arith.select %20, %85, %84 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7342985Z       %87 = tt.reshape %86 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7343199Z       %88 = arith.sitofp %87 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7343483Z       %89 = ttg.convert_layout %88 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7343952Z       %90 = tt.dot %64, %89, %62#0, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7344443Z       %91 = ttg.local_load %62#3 : !ttg.memdesc<64x4xbf16, #shared, #smem, mutable, 2x64x4> -> tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7344868Z       %92 = arith.extf %91 : tensor<64x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7345195Z       %93 = arith.addi %12, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:46.7345492Z       %94 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7345733Z       %95 = arith.muli %94, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7345920Z       %96 = tt.broadcast %95 : tensor<2x1xi64, #blocked2> -> tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7346106Z       %97 = arith.addi %96, %43 : tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7346314Z       %98 = tt.addptr %10, %97 : tensor<2x64x!tt.ptr<i8>, #blocked2>, tensor<2x64xi64, #blocked2>
2026-02-21T09:51:46.7346518Z       %99 = arith.cmpi sge, %94, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7346688Z       %100 = arith.cmpi slt, %94, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:46.7346853Z       %101 = arith.andi %99, %100 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:46.7347034Z       %102 = tt.broadcast %101 : tensor<2x1xi1, #blocked2> -> tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7347228Z       %103 = arith.andi %102, %47 : tensor<2x64xi1, #blocked2>
2026-02-21T09:51:46.7347397Z       %104 = tt.load %98, %103, %cst_12 : tensor<2x64x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:46.7347648Z       %105 = ttg.convert_layout %104 : tensor<2x64xi8, #blocked2> -> tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7347932Z       %106 = arith.shli %105, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7348168Z       %107 = arith.shrsi %106, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7348410Z       %108 = arith.shrsi %105, %cst_14 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:46.7348720Z       %109 = tt.expand_dims %107 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7349054Z       %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:51:46.7349335Z       %111 = tt.broadcast %109 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7349571Z       %112 = arith.select %18, %111, %cst_13 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7349810Z       %113 = tt.broadcast %110 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7350043Z       %114 = arith.select %20, %113, %112 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:51:46.7350269Z       %115 = tt.reshape %114 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked3>
2026-02-21T09:51:46.7350487Z       %116 = arith.sitofp %115 : tensor<4x64xi8, #blocked3> to tensor<4x64xf32, #blocked3>
2026-02-21T09:51:46.7350779Z       %117 = ttg.convert_layout %116 : tensor<4x64xf32, #blocked3> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:46.7351242Z       %118 = tt.dot %92, %117, %90, inputPrecision = tf32 : tensor<64x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:51:46.7351620Z       ttg.local_dealloc %48 : !ttg.memdesc<2x64x4xbf16, #shared, #smem, mutable>
2026-02-21T09:51:46.7351843Z       %119 = arith.truncf %118 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:51:46.7352106Z       %120 = tt.expand_dims %35 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:51:46.7352340Z       %121 = arith.muli %120, %cst : tensor<64x1xi32, #mma>
2026-02-21T09:51:46.7352567Z       %122 = tt.expand_dims %30 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:51:46.7352823Z       %123 = tt.broadcast %121 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7353039Z       %124 = tt.broadcast %122 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7353218Z       %125 = arith.addi %123, %124 : tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7353404Z       %126 = tt.addptr %21, %125 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:51:46.7353597Z       tt.store %126, %119 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:46.7353735Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:51:46.7353836Z     tt.return
2026-02-21T09:51:46.7353918Z   }
2026-02-21T09:51:46.7353993Z }
2026-02-21T09:51:46.7354038Z 
2026-02-21T09:51:46.7354085Z {-#
2026-02-21T09:51:46.7354164Z   external_resources: {
2026-02-21T09:51:46.7354264Z     mlir_reproducer: {
2026-02-21T09:51:46.7355265Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:51:46.7356264Z       disable_threading: false,
2026-02-21T09:51:46.7356372Z       verify_each: true
2026-02-21T09:51:46.7356460Z     }
2026-02-21T09:51:46.7356531Z   }
2026-02-21T09:51:46.7356602Z #-}
2026-02-21T09:51:46.7356878Z /tmp/torchinductor_root/2y/c2y7gy7ipehhdbfa3y5ziol7zm3dkrshalpbtxjnxv6uzlhj6n2z.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:51:46.7357576Z /tmp/torchinductor_root/2y/c2y7gy7ipehhdbfa3y5ziol7zm3dkrshalpbtxjnxv6uzlhj6n2z.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:51:46.7358120Z [437s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:51:46.7358900Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:51:46.7359598Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:51:46.7359764Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:51:47.9714358Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:51:47.9716713Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 4], order = [2, 1, 0]}>
2026-02-21T09:51:47.9717216Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:51:47.9718025Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 4], order = [1, 0]}>
2026-02-21T09:51:47.9718473Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 4], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:51:47.9718857Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:47.9719216Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:47.9719486Z #smem = #ttg.shared_memory
2026-02-21T09:51:47.9726942Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:51:47.9727592Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:51:47.9727976Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:51:47.9728157Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:47.9728391Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:47.9728567Z     %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:51:47.9728751Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma>
2026-02-21T09:51:47.9728939Z     %cst_4 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:47.9729111Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:47.9729285Z     %cst_6 = arith.constant dense<512> : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:47.9729457Z     %cst_7 = arith.constant dense<0> : tensor<1x256xi64, #blocked2>
2026-02-21T09:51:47.9729629Z     %cst_8 = arith.constant dense<8192> : tensor<1x256xi64, #blocked2>
2026-02-21T09:51:47.9729803Z     %cst_9 = arith.constant dense<0> : tensor<2x256xi8, #blocked2>
2026-02-21T09:51:47.9729949Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:51:47.9730068Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:51:47.9730183Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:51:47.9730300Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T09:51:47.9730443Z     %cst_10 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked>
2026-02-21T09:51:47.9730587Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:51:47.9730704Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:51:47.9730980Z     %cst_11 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:47.9731177Z     %0 = tt.get_program_id x : i32
2026-02-21T09:51:47.9731287Z     %1 = arith.divsi %0, %c256_i32 : i32
2026-02-21T09:51:47.9731401Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:51:47.9731514Z     %3 = arith.subi %c32_i32, %2 : i32
2026-02-21T09:51:47.9731622Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:51:47.9731735Z     %5 = arith.remsi %0, %c256_i32 : i32
2026-02-21T09:51:47.9731846Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:51:47.9731955Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:51:47.9732058Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:51:47.9732170Z     %9 = arith.muli %7, %c256_i32 : i32
2026-02-21T09:51:47.9732361Z     %10 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:47.9732642Z     %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:47.9732892Z     %12 = tt.splat %9 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:47.9733099Z     %13 = arith.addi %12, %10 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:47.9733263Z     %14 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:51:47.9733458Z     %15 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:47.9733748Z     %16 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:47.9733992Z     %17 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:47.9734198Z     %18 = tt.splat %14 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:47.9734408Z     %19 = arith.addi %17, %15 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:47.9734613Z     %20 = arith.addi %18, %16 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:47.9734849Z     %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:47.9735178Z     %22 = tt.expand_dims %19 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:51:47.9735428Z     %23 = arith.muli %22, %cst_2 : tensor<128x1xi32, #blocked1>
2026-02-21T09:51:47.9735623Z     %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:47.9735853Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:47.9736016Z     %26 = arith.extsi %9 : i32 to i64
2026-02-21T09:51:47.9736165Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:47.9736401Z     %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:47.9736719Z     %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:47.9737010Z     %30 = tt.splat %26 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:47.9737303Z     %31 = arith.extsi %11 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:47.9737598Z     %32 = arith.addi %30, %31 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:47.9737870Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x256xi64, #blocked2>
2026-02-21T09:51:47.9738145Z     %34 = tt.broadcast %33 : tensor<1x256xi64, #blocked2> -> tensor<2x256xi64, #blocked2>
2026-02-21T09:51:47.9738343Z     %35 = arith.cmpi sge, %33, %cst_7 : tensor<1x256xi64, #blocked2>
2026-02-21T09:51:47.9738530Z     %36 = arith.cmpi slt, %33, %cst_8 : tensor<1x256xi64, #blocked2>
2026-02-21T09:51:47.9738691Z     %37 = arith.andi %35, %36 : tensor<1x256xi1, #blocked2>
2026-02-21T09:51:47.9738877Z     %38 = tt.broadcast %37 : tensor<1x256xi1, #blocked2> -> tensor<2x256xi1, #blocked2>
2026-02-21T09:51:47.9739161Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:51:47.9739572Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:51:47.9739975Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:47.9740227Z     %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:47.9740418Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:51:47.9740627Z     %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:51:47.9740814Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:51:47.9741074Z     %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst_3) -> (tensor<128x256xf32, #mma>)  : i32 {
2026-02-21T09:51:47.9741291Z       %56 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:51:47.9741475Z       %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:47.9741689Z       %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:51:47.9741958Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:51:47.9742224Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:51:47.9742414Z       %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked1>
2026-02-21T09:51:47.9742609Z       %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:51:47.9742832Z       %63 = tt.load %62 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:51:47.9743051Z       %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:51:47.9743377Z       %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:47.9743800Z       %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:47.9744077Z       %67 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:51:47.9744245Z       %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:47.9744460Z       %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:47.9744729Z       %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:51:47.9744973Z       %71 = arith.muli %70, %cst_4 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:47.9745157Z       %72 = tt.broadcast %71 : tensor<2x1xi64, #blocked2> -> tensor<2x256xi64, #blocked2>
2026-02-21T09:51:47.9745347Z       %73 = arith.addi %72, %34 : tensor<2x256xi64, #blocked2>
2026-02-21T09:51:47.9745541Z       %74 = tt.addptr %27, %73 : tensor<2x256x!tt.ptr<i8>, #blocked2>, tensor<2x256xi64, #blocked2>
2026-02-21T09:51:47.9745746Z       %75 = arith.cmpi sge, %70, %cst_5 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:47.9745913Z       %76 = arith.cmpi slt, %70, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:51:47.9746069Z       %77 = arith.andi %75, %76 : tensor<2x1xi1, #blocked2>
2026-02-21T09:51:47.9746272Z       %78 = tt.broadcast %77 : tensor<2x1xi1, #blocked2> -> tensor<2x256xi1, #blocked2>
2026-02-21T09:51:47.9746455Z       %79 = arith.andi %78, %38 : tensor<2x256xi1, #blocked2>
2026-02-21T09:51:47.9746622Z       %80 = tt.load %74, %79, %cst_9 : tensor<2x256x!tt.ptr<i8>, #blocked2>
2026-02-21T09:51:47.9746873Z       %81 = ttg.convert_layout %80 : tensor<2x256xi8, #blocked2> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:47.9747154Z       %82 = arith.shli %81, %cst_11 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:47.9747388Z       %83 = arith.shrsi %82, %cst_11 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:47.9747620Z       %84 = arith.shrsi %81, %cst_11 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:47.9747904Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:51:47.9748240Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:51:47.9748521Z       %87 = tt.broadcast %85 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:51:47.9748762Z       %88 = arith.select %43, %87, %cst_10 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:51:47.9748995Z       %89 = tt.broadcast %86 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:51:47.9749242Z       %90 = arith.select %45, %89, %88 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:51:47.9749470Z       %91 = tt.reshape %90 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked2>
2026-02-21T09:51:47.9749686Z       %92 = arith.sitofp %91 : tensor<4x256xi8, #blocked2> to tensor<4x256xf32, #blocked2>
2026-02-21T09:51:47.9749934Z       %93 = ttg.local_alloc %92 : (tensor<4x256xf32, #blocked2>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:51:47.9750253Z       %94 = ttg.local_load %93 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:47.9750745Z       %95 = tt.dot %66, %94, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:51:47.9751097Z       scf.yield %95 : tensor<128x256xf32, #mma>
2026-02-21T09:51:47.9751245Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32}
2026-02-21T09:51:47.9751433Z     %47 = arith.truncf %46 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:51:47.9751715Z     %48 = tt.expand_dims %20 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:51:47.9751946Z     %49 = arith.muli %48, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:51:47.9752171Z     %50 = tt.expand_dims %13 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:51:47.9752425Z     %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:51:47.9752629Z     %52 = tt.broadcast %50 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:51:47.9752806Z     %53 = arith.addi %51, %52 : tensor<128x256xi32, #mma>
2026-02-21T09:51:47.9752979Z     %54 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:47.9753193Z     %55 = tt.addptr %54, %53 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi32, #mma>
2026-02-21T09:51:47.9753385Z     tt.store %55, %47 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:47.9753517Z     tt.return
2026-02-21T09:51:47.9753595Z   }
2026-02-21T09:51:47.9753673Z }
2026-02-21T09:51:47.9753717Z 
2026-02-21T09:51:47.9753748Z {-#
2026-02-21T09:51:47.9753831Z   external_resources: {
2026-02-21T09:51:47.9753928Z     mlir_reproducer: {
2026-02-21T09:51:47.9754958Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:51:47.9755943Z       disable_threading: false,
2026-02-21T09:51:47.9756052Z       verify_each: true
2026-02-21T09:51:47.9756141Z     }
2026-02-21T09:51:47.9756214Z   }
2026-02-21T09:51:47.9756282Z #-}
2026-02-21T09:51:47.9756570Z /tmp/torchinductor_root/su/csukzv5zbzywkh36jczfxtklbfcnhvaid6ak4lorzwsrnck6yo75.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:51:47.9757250Z /tmp/torchinductor_root/su/csukzv5zbzywkh36jczfxtklbfcnhvaid6ak4lorzwsrnck6yo75.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:51:47.9757800Z [438s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:51:47.9758543Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:51:47.9759202Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:51:47.9759368Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:51:48.5776220Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:51:48.5779776Z #blocked = #ttg.blocked<{sizePerThread = [1, 16], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:51:48.5780702Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:51:48.5781646Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:51:48.5782456Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:51:48.5783213Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:51:48.5783905Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:51:48.5784543Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:51:48.5784985Z #smem = #ttg.shared_memory
2026-02-21T09:51:48.5785418Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:51:48.5786299Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:51:48.5787005Z     %cst = arith.constant dense<4> : tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5787288Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:51:48.5787499Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:51:48.5787710Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:51:48.5788033Z     %cst_0 = arith.constant dense<0> : tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5788362Z     %cst_1 = arith.constant dense<0> : tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5788627Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:51:48.5788839Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:51:48.5789051Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:51:48.5789322Z     %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:51:48.5789661Z     %cst_3 = arith.constant dense<8192> : tensor<1x256xi64, #blocked>
2026-02-21T09:51:48.5789990Z     %cst_4 = arith.constant dense<0> : tensor<1x256xi64, #blocked>
2026-02-21T09:51:48.5790308Z     %cst_5 = arith.constant dense<512> : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5790617Z     %cst_6 = arith.constant dense<0> : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5790930Z     %cst_7 = arith.constant dense<8192> : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5791322Z     %cst_8 = arith.constant dense<4> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:48.5791778Z     %cst_9 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:48.5792125Z     %c504_i32 = arith.constant 504 : i32
2026-02-21T09:51:48.5792336Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:51:48.5792557Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:51:48.5792836Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma>
2026-02-21T09:51:48.5793217Z     %cst_11 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:51:48.5793537Z     %cst_12 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:51:48.5793861Z     %cst_13 = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:51:48.5794131Z     %0 = tt.get_program_id x : i32
2026-02-21T09:51:48.5794338Z     %1 = arith.divsi %0, %c64_i32 : i32
2026-02-21T09:51:48.5794550Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:51:48.5794757Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:51:48.5794961Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:51:48.5795113Z     %5 = arith.remsi %0, %c64_i32 : i32
2026-02-21T09:51:48.5795289Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:51:48.5795439Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:51:48.5795585Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:51:48.5795730Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:51:48.5796001Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:48.5796390Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:48.5796758Z     %12 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:48.5797055Z     %13 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:48.5797342Z     %14 = arith.addi %12, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:51:48.5797636Z     %15 = arith.addi %13, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:51:48.5797863Z     %16 = arith.muli %8, %c256_i32 : i32
2026-02-21T09:51:48.5798135Z     %17 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:51:48.5798502Z     %18 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:48.5798831Z     %19 = tt.splat %16 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:48.5799117Z     %20 = arith.addi %19, %18 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:51:48.5799437Z     %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:48.5799883Z     %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:51:48.5800232Z     %23 = arith.muli %22, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:51:48.5800496Z     %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:51:48.5800797Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:51:48.5801011Z     %26 = arith.extsi %16 : i32 to i64
2026-02-21T09:51:48.5801219Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:51:48.5801530Z     %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:48.5801974Z     %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:48.5802357Z     %30 = tt.splat %26 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:51:48.5802856Z     %31 = arith.extsi %17 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:51:48.5803256Z     %32 = arith.addi %30, %31 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:51:48.5803623Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T09:51:48.5803995Z     %34 = tt.broadcast %33 : tensor<1x256xi64, #blocked> -> tensor<4x256xi64, #blocked>
2026-02-21T09:51:48.5804316Z     %35 = arith.cmpi sge, %33, %cst_4 : tensor<1x256xi64, #blocked>
2026-02-21T09:51:48.5804546Z     %36 = arith.cmpi slt, %33, %cst_3 : tensor<1x256xi64, #blocked>
2026-02-21T09:51:48.5804760Z     %37 = arith.andi %35, %36 : tensor<1x256xi1, #blocked>
2026-02-21T09:51:48.5805029Z     %38 = tt.broadcast %37 : tensor<1x256xi1, #blocked> -> tensor<4x256xi1, #blocked>
2026-02-21T09:51:48.5805337Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:51:48.5805791Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:51:48.5806248Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:51:48.5806527Z     %42 = arith.cmpi eq, %41, %cst_11 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:51:48.5806742Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked1> -> tensor<4x2x256xi1, #blocked1>
2026-02-21T09:51:48.5806977Z     %44 = arith.cmpi eq, %41, %cst_12 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:51:48.5807187Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked1> -> tensor<4x2x256xi1, #blocked1>
2026-02-21T09:51:48.5807423Z     %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:51:48.5807715Z     %47 = tt.expand_dims %21 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:51:48.5808012Z     %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:51:48.5808215Z     %49 = arith.addi %24, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:51:48.5808426Z     %50 = tt.addptr %25, %49 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:51:48.5808646Z     %51 = tt.load %50 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:51:48.5808899Z     %52 = tt.expand_dims %29 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5809157Z     %53 = arith.muli %52, %cst_7 : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5809350Z     %54 = tt.broadcast %53 : tensor<4x1xi64, #blocked> -> tensor<4x256xi64, #blocked>
2026-02-21T09:51:48.5809569Z     %55 = arith.addi %54, %34 : tensor<4x256xi64, #blocked>
2026-02-21T09:51:48.5809776Z     %56 = tt.addptr %27, %55 : tensor<4x256x!tt.ptr<i8>, #blocked>, tensor<4x256xi64, #blocked>
2026-02-21T09:51:48.5809990Z     %57 = arith.cmpi sge, %52, %cst_6 : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5810171Z     %58 = arith.cmpi slt, %52, %cst_5 : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5810336Z     %59 = arith.andi %57, %58 : tensor<4x1xi1, #blocked>
2026-02-21T09:51:48.5810528Z     %60 = tt.broadcast %59 : tensor<4x1xi1, #blocked> -> tensor<4x256xi1, #blocked>
2026-02-21T09:51:48.5810721Z     %61 = arith.andi %60, %38 : tensor<4x256xi1, #blocked>
2026-02-21T09:51:48.5810945Z     %62 = tt.load %56, %61, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:51:48.5811310Z     %63 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:51:48.5811703Z     ttg.local_store %51, %63 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:51:48.5811998Z     %64 = arith.addi %21, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:48.5812303Z     %65 = tt.expand_dims %64 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:51:48.5812598Z     %66 = tt.broadcast %65 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:51:48.5812819Z     %67 = arith.addi %24, %66 : tensor<128x8xi32, #blocked2>
2026-02-21T09:51:48.5813029Z     %68 = tt.addptr %25, %67 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:51:48.5813249Z     %69 = tt.load %68 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:51:48.5813451Z     %70 = arith.addi %29, %cst_8 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:48.5813739Z     %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5813997Z     %72 = arith.muli %71, %cst_7 : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5814211Z     %73 = tt.broadcast %72 : tensor<4x1xi64, #blocked> -> tensor<4x256xi64, #blocked>
2026-02-21T09:51:48.5814405Z     %74 = arith.addi %73, %34 : tensor<4x256xi64, #blocked>
2026-02-21T09:51:48.5814609Z     %75 = tt.addptr %27, %74 : tensor<4x256x!tt.ptr<i8>, #blocked>, tensor<4x256xi64, #blocked>
2026-02-21T09:51:48.5814829Z     %76 = arith.cmpi sge, %71, %cst_6 : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5814997Z     %77 = arith.cmpi slt, %71, %cst_5 : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5815166Z     %78 = arith.andi %76, %77 : tensor<4x1xi1, #blocked>
2026-02-21T09:51:48.5815340Z     %79 = tt.broadcast %78 : tensor<4x1xi1, #blocked> -> tensor<4x256xi1, #blocked>
2026-02-21T09:51:48.5815517Z     %80 = arith.andi %79, %38 : tensor<4x256xi1, #blocked>
2026-02-21T09:51:48.5815719Z     %81 = tt.load %75, %80, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:51:48.5816047Z     %82 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:51:48.5816403Z     ttg.local_store %69, %82 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:51:48.5817026Z     %83:6 = scf.for %arg3 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg4 = %cst_10, %arg5 = %c1_i32, %arg6 = %63, %arg7 = %82, %arg8 = %62, %arg9 = %81) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x256xi8, #blocked>, tensor<4x256xi8, #blocked>)  : i32 {
2026-02-21T09:51:48.5817541Z       %129 = arith.addi %arg3, %c8_i32 : i32
2026-02-21T09:51:48.5817662Z       %130 = arith.muli %129, %c2_i32 : i32
2026-02-21T09:51:48.5817863Z       %131 = tt.splat %130 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:48.5818084Z       %132 = arith.addi %131, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:51:48.5818363Z       %133 = tt.expand_dims %132 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:51:48.5818642Z       %134 = tt.broadcast %133 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:51:48.5818835Z       %135 = arith.addi %24, %134 : tensor<128x8xi32, #blocked2>
2026-02-21T09:51:48.5819039Z       %136 = tt.addptr %25, %135 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:51:48.5819247Z       %137 = tt.load %136 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:51:48.5819552Z       %138 = ttg.local_load %arg6 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:48.5819993Z       %139 = arith.extf %138 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:48.5820274Z       %140 = arith.extsi %129 : i32 to i64
2026-02-21T09:51:48.5820443Z       %141 = tt.splat %140 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:48.5820659Z       %142 = arith.addi %141, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:51:48.5820945Z       %143 = tt.expand_dims %142 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5821188Z       %144 = arith.muli %143, %cst_7 : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5821374Z       %145 = tt.broadcast %144 : tensor<4x1xi64, #blocked> -> tensor<4x256xi64, #blocked>
2026-02-21T09:51:48.5821564Z       %146 = arith.addi %145, %34 : tensor<4x256xi64, #blocked>
2026-02-21T09:51:48.5821756Z       %147 = tt.addptr %27, %146 : tensor<4x256x!tt.ptr<i8>, #blocked>, tensor<4x256xi64, #blocked>
2026-02-21T09:51:48.5821964Z       %148 = arith.cmpi sge, %143, %cst_6 : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5822150Z       %149 = arith.cmpi slt, %143, %cst_5 : tensor<4x1xi64, #blocked>
2026-02-21T09:51:48.5822311Z       %150 = arith.andi %148, %149 : tensor<4x1xi1, #blocked>
2026-02-21T09:51:48.5822493Z       %151 = tt.broadcast %150 : tensor<4x1xi1, #blocked> -> tensor<4x256xi1, #blocked>
2026-02-21T09:51:48.5822679Z       %152 = arith.andi %151, %38 : tensor<4x256xi1, #blocked>
2026-02-21T09:51:48.5822844Z       %153 = tt.load %147, %152, %cst_1 : tensor<4x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:51:48.5823031Z       %154 = arith.shli %arg8, %cst : tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5823189Z       %155 = arith.shrsi %154, %cst : tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5823430Z       %156 = ttg.convert_layout %155 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:48.5823677Z       %157 = arith.shrsi %arg8, %cst : tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5823918Z       %158 = ttg.convert_layout %157 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:48.5824255Z       %159 = tt.expand_dims %156 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x256xi8, #blocked1>
2026-02-21T09:51:48.5824596Z       %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x256xi8, #blocked1>
2026-02-21T09:51:48.5824885Z       %161 = tt.broadcast %159 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5825130Z       %162 = arith.select %43, %161, %cst_0 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5825374Z       %163 = tt.broadcast %160 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5825626Z       %164 = arith.select %45, %163, %162 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5825863Z       %165 = tt.reshape %164 : tensor<4x2x256xi8, #blocked1> -> tensor<8x256xi8, #blocked3>
2026-02-21T09:51:48.5826092Z       %166 = arith.sitofp %165 : tensor<8x256xi8, #blocked3> to tensor<8x256xf32, #blocked3>
2026-02-21T09:51:48.5826344Z       %167 = ttg.local_alloc %166 : (tensor<8x256xf32, #blocked3>) -> !ttg.memdesc<8x256xf32, #shared1, #smem>
2026-02-21T09:51:48.5826676Z       %168 = ttg.local_load %167 : !ttg.memdesc<8x256xf32, #shared1, #smem> -> tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:48.5827159Z       %169 = tt.dot %139, %168, %arg4, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:51:48.5827509Z       %170 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T09:51:48.5827637Z       %171 = arith.cmpi slt, %170, %c2_i32 : i32
2026-02-21T09:51:48.5827764Z       %172 = arith.select %171, %170, %c0_i32 : i32
2026-02-21T09:51:48.5828031Z       %173 = ttg.memdesc_index %46[%172] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:51:48.5828388Z       ttg.local_store %137, %173 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:51:48.5828889Z       scf.yield %169, %172, %arg7, %173, %arg9, %153 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x256xi8, #blocked>, tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5829294Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 3 : i32}
2026-02-21T09:51:48.5829586Z     %84 = ttg.local_load %83#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:48.5830013Z     %85 = arith.extf %84 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:48.5830337Z     %86 = arith.shli %83#4, %cst : tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5830490Z     %87 = arith.shrsi %86, %cst : tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5830728Z     %88 = ttg.convert_layout %87 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:48.5830967Z     %89 = arith.shrsi %83#4, %cst : tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5831224Z     %90 = ttg.convert_layout %89 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:48.5831556Z     %91 = tt.expand_dims %88 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x256xi8, #blocked1>
2026-02-21T09:51:48.5831892Z     %92 = tt.expand_dims %90 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x256xi8, #blocked1>
2026-02-21T09:51:48.5832174Z     %93 = tt.broadcast %91 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5832410Z     %94 = arith.select %43, %93, %cst_0 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5832648Z     %95 = tt.broadcast %92 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5832879Z     %96 = arith.select %45, %95, %94 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5833103Z     %97 = tt.reshape %96 : tensor<4x2x256xi8, #blocked1> -> tensor<8x256xi8, #blocked3>
2026-02-21T09:51:48.5833322Z     %98 = arith.sitofp %97 : tensor<8x256xi8, #blocked3> to tensor<8x256xf32, #blocked3>
2026-02-21T09:51:48.5833565Z     %99 = ttg.local_alloc %98 : (tensor<8x256xf32, #blocked3>) -> !ttg.memdesc<8x256xf32, #shared1, #smem>
2026-02-21T09:51:48.5833902Z     %100 = ttg.local_load %99 : !ttg.memdesc<8x256xf32, #shared1, #smem> -> tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:48.5834369Z     %101 = tt.dot %85, %100, %83#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:51:48.5834854Z     %102 = ttg.local_load %83#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:48.5835282Z     %103 = arith.extf %102 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:48.5835581Z     %104 = arith.shli %83#5, %cst : tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5835737Z     %105 = arith.shrsi %104, %cst : tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5835983Z     %106 = ttg.convert_layout %105 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:48.5836228Z     %107 = arith.shrsi %83#5, %cst : tensor<4x256xi8, #blocked>
2026-02-21T09:51:48.5836469Z     %108 = ttg.convert_layout %107 : tensor<4x256xi8, #blocked> -> tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:51:48.5836803Z     %109 = tt.expand_dims %106 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x256xi8, #blocked1>
2026-02-21T09:51:48.5837159Z     %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<4x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x256xi8, #blocked1>
2026-02-21T09:51:48.5837450Z     %111 = tt.broadcast %109 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5837691Z     %112 = arith.select %43, %111, %cst_0 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5837933Z     %113 = tt.broadcast %110 : tensor<4x1x256xi8, #blocked1> -> tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5838173Z     %114 = arith.select %45, %113, %112 : tensor<4x2x256xi1, #blocked1>, tensor<4x2x256xi8, #blocked1>
2026-02-21T09:51:48.5838423Z     %115 = tt.reshape %114 : tensor<4x2x256xi8, #blocked1> -> tensor<8x256xi8, #blocked3>
2026-02-21T09:51:48.5838648Z     %116 = arith.sitofp %115 : tensor<8x256xi8, #blocked3> to tensor<8x256xf32, #blocked3>
2026-02-21T09:51:48.5838898Z     %117 = ttg.local_alloc %116 : (tensor<8x256xf32, #blocked3>) -> !ttg.memdesc<8x256xf32, #shared1, #smem>
2026-02-21T09:51:48.5839238Z     %118 = ttg.local_load %117 : !ttg.memdesc<8x256xf32, #shared1, #smem> -> tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:51:48.5839704Z     %119 = tt.dot %103, %118, %101, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:51:48.5840084Z     ttg.local_dealloc %46 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:51:48.5840301Z     %120 = arith.truncf %119 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:51:48.5840575Z     %121 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:51:48.5840810Z     %122 = arith.muli %121, %cst_13 : tensor<128x1xi32, #mma>
2026-02-21T09:51:48.5841046Z     %123 = tt.expand_dims %20 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:51:48.5841303Z     %124 = tt.broadcast %122 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:51:48.5841508Z     %125 = tt.broadcast %123 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:51:48.5841688Z     %126 = arith.addi %124, %125 : tensor<128x256xi32, #mma>
2026-02-21T09:51:48.5841883Z     %127 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:48.5842102Z     %128 = tt.addptr %127, %126 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi32, #mma>
2026-02-21T09:51:48.5842303Z     tt.store %128, %120 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:51:48.5842434Z     tt.return
2026-02-21T09:51:48.5842513Z   }
2026-02-21T09:51:48.5842634Z }
2026-02-21T09:51:48.5842676Z 
2026-02-21T09:51:48.5842706Z {-#
2026-02-21T09:51:48.5842788Z   external_resources: {
2026-02-21T09:51:48.5842889Z     mlir_reproducer: {
2026-02-21T09:51:48.5843877Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:51:48.5844858Z       disable_threading: false,
2026-02-21T09:51:48.5844966Z       verify_each: true
2026-02-21T09:51:48.5845054Z     }
2026-02-21T09:51:48.5845129Z   }
2026-02-21T09:51:48.5845198Z #-}
2026-02-21T09:51:48.5845480Z /tmp/torchinductor_root/os/cosunfwyrn6frbsp3rjael2doabicbr5htzyug5z4hpwzysdspuh.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:51:48.5846178Z /tmp/torchinductor_root/os/cosunfwyrn6frbsp3rjael2doabicbr5htzyug5z4hpwzysdspuh.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:51:48.5846727Z [439s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:51:48.5847455Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:51:48.5848126Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:51:48.5848292Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:51:48.7037180Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 80/80 8.2 configs/s
2026-02-21T09:51:59.0282559Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 193/193 13.7 configs/s
2026-02-21T09:52:01.9061790Z [452s] Generation 5 complete: 
2026-02-21T09:52:01.9062216Z error=5
2026-02-21T09:52:01.9062424Z ok=79
2026-02-21T09:52:01.9062629Z min=1.0552
2026-02-21T09:52:01.9062834Z mid=1.4344
2026-02-21T09:52:01.9063083Z max=66.5692
2026-02-21T09:52:01.9063312Z best={'block_sizes': [8, 128, 128],
2026-02-21T09:52:01.9063712Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:52:01.9064072Z  'l2_groupings': [2],
2026-02-21T09:52:01.9064344Z  'load_eviction_policies': ['', ''],
2026-02-21T09:52:01.9064649Z  'loop_orders': [[0, 1]],
2026-02-21T09:52:01.9064927Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:52:01.9065201Z  'num_stages': 1,
2026-02-21T09:52:01.9065431Z  'num_warps': 4,
2026-02-21T09:52:01.9065665Z  'pid_type': 'flat',
2026-02-21T09:52:01.9065919Z  'range_flattens': [None, None],
2026-02-21T09:52:01.9066224Z  'range_multi_buffers': [None, False],
2026-02-21T09:52:01.9066536Z  'range_num_stages': [0, 2],
2026-02-21T09:52:01.9066815Z  'range_unroll_factors': [0, 0],
2026-02-21T09:52:01.9067120Z  'range_warp_specializes': [],
2026-02-21T09:52:01.9067396Z  'waves_per_eu': 2}
2026-02-21T09:52:01.9199355Z [452s] Fitting surrogate: 590 points, 590 targets
2026-02-21T09:52:02.7329414Z [453s] Generation 6 starting: 81 neighbors, 4 active search path(s)
2026-02-21T09:52:20.9076300Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 0.6 configs/s
2026-02-21T09:52:25.4079851Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:52:25.4094939Z #blocked = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:52:25.4095486Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:52:25.4095955Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:52:25.4096413Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:52:25.4096850Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:52:25.4097250Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:52:25.4097534Z #smem = #ttg.shared_memory
2026-02-21T09:52:25.4097873Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:52:25.4099049Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:52:25.4099645Z     %cst = arith.constant dense<4> : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4099892Z     %c9728_i32 = arith.constant 9728 : i32
2026-02-21T09:52:25.4100080Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:52:25.4100308Z     %cst_0 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4100569Z     %c29184_i32 = arith.constant 29184 : i32
2026-02-21T09:52:25.4100750Z     %c19456_i32 = arith.constant 19456 : i32
2026-02-21T09:52:25.4101022Z     %cst_1 = arith.constant dense<0> : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4101195Z     %c38912_i32 = arith.constant 38912 : i32
2026-02-21T09:52:25.4101335Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:52:25.4101478Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:52:25.4101615Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:52:25.4101758Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:52:25.4101896Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:52:25.4102097Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:52:25.4102269Z     %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.4102481Z     %cst_3 = arith.constant dense<8192> : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4102738Z     %cst_4 = arith.constant dense<4> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4103031Z     %cst_5 = arith.constant dense<8> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4103325Z     %cst_6 = arith.constant dense<2> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4103605Z     %cst_7 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4103824Z     %c6_i32 = arith.constant 6 : i32
2026-02-21T09:52:25.4103961Z     %c506_i32 = arith.constant 506 : i32
2026-02-21T09:52:25.4104099Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:52:25.4104234Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:52:25.4104370Z     %c13823_i32 = arith.constant 13823 : i32
2026-02-21T09:52:25.4104563Z     %cst_8 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4104811Z     %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:25.4105120Z     %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:25.4105327Z     %cst_11 = arith.constant dense<8192> : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4105537Z     %cst_12 = arith.constant dense<0> : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4105739Z     %cst_13 = arith.constant dense<512> : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4105947Z     %cst_14 = arith.constant dense<0> : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4106146Z     %cst_15 = arith.constant dense<0> : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4106351Z     %cst_16 = arith.constant dense<8192> : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4106553Z     %cst_17 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4106753Z     %cst_18 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4106953Z     %cst_19 = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4107121Z     %0 = tt.get_program_id x : i32
2026-02-21T09:52:25.4107359Z     %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.4107694Z     %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4108010Z     %3 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4108302Z     %4 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4108556Z     %5 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4108825Z     %6 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4109195Z     %7 = arith.extsi %6 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4109572Z     %8 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4109946Z     %9 = arith.extsi %8 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4110391Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:52:25.4110829Z     %11 = tt.expand_dims %10 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:52:25.4111253Z     %12 = tt.expand_dims %11 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:25.4111507Z     %13 = arith.cmpi eq, %12, %cst_9 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:25.4111712Z     %14 = tt.broadcast %13 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1>
2026-02-21T09:52:25.4111920Z     %15 = arith.cmpi eq, %12, %cst_10 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:25.4112115Z     %16 = tt.broadcast %15 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1>
2026-02-21T09:52:25.4112333Z     %17 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.4112604Z     %18 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4112910Z     %19 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4113217Z     %20 = arith.extsi %19 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4113449Z     %21 = arith.subi %c13823_i32, %0 : i32
2026-02-21T09:52:25.4113575Z     %22 = arith.divui %21, %c9728_i32 : i32
2026-02-21T09:52:25.4113695Z     %23 = arith.remsi %22, %c4_i32 : i32
2026-02-21T09:52:25.4113840Z     %24 = arith.subi %22, %23 : i32
2026-02-21T09:52:25.4113954Z     %25 = arith.muli %24, %c9728_i32 : i32
2026-02-21T09:52:25.4114076Z     %26 = arith.addi %0, %25 : i32
2026-02-21T09:52:25.4114205Z     scf.for %arg3 = %0 to %26 step %c38912_i32  : i32 {
2026-02-21T09:52:25.4114349Z       %27 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:52:25.4114476Z       %28 = arith.muli %27, %c8_i32 : i32
2026-02-21T09:52:25.4114595Z       %29 = arith.subi %c128_i32, %28 : i32
2026-02-21T09:52:25.4114718Z       %30 = arith.minsi %29, %c8_i32 : i32
2026-02-21T09:52:25.4114839Z       %31 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:52:25.4114965Z       %32 = arith.remsi %31, %30 : i32
2026-02-21T09:52:25.4115079Z       %33 = arith.addi %28, %32 : i32
2026-02-21T09:52:25.4115196Z       %34 = arith.divsi %31, %30 : i32
2026-02-21T09:52:25.4115311Z       %35 = arith.muli %33, %c128_i32 : i32
2026-02-21T09:52:25.4115484Z       %36 = tt.splat %35 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.4115714Z       %37 = arith.addi %36, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.4115888Z       %38 = arith.muli %34, %c256_i32 : i32
2026-02-21T09:52:25.4116115Z       %39 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.4116367Z       %40 = arith.muli %39, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.4116584Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4116761Z       %42 = arith.extsi %38 : i32 to i64
2026-02-21T09:52:25.4116927Z       %43 = tt.splat %42 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4117145Z       %44 = arith.addi %43, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4117415Z       %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4117690Z       %46 = tt.broadcast %45 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4117911Z       %47 = arith.cmpi sge, %45, %cst_14 : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4118081Z       %48 = arith.cmpi slt, %45, %cst_3 : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4118246Z       %49 = arith.andi %47, %48 : tensor<1x256xi1, #blocked>
2026-02-21T09:52:25.4118428Z       %50 = tt.broadcast %49 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4118649Z       %51 = ttg.local_alloc : () -> !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.4118932Z       %52 = tt.expand_dims %3 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.4119207Z       %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4119400Z       %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4119596Z       %55 = tt.addptr %4, %54 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4119805Z       %56 = tt.load %55 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4120042Z       %57 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4120287Z       %58 = arith.muli %57, %cst_11 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4120475Z       %59 = tt.broadcast %58 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4120660Z       %60 = arith.addi %59, %46 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4120858Z       %61 = tt.addptr %5, %60 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4121061Z       %62 = arith.cmpi sge, %57, %cst_12 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4121238Z       %63 = arith.cmpi slt, %57, %cst_13 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4121422Z       %64 = arith.andi %62, %63 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:25.4121598Z       %65 = tt.broadcast %64 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4121786Z       %66 = arith.andi %65, %50 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4121993Z       %67 = tt.load %61, %66, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4122336Z       %68 = ttg.memdesc_index %51[%c0_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4122772Z       ttg.local_store %56, %68 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4123045Z       %69 = arith.addi %3, %cst_7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4123329Z       %70 = tt.expand_dims %69 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.4123600Z       %71 = tt.broadcast %70 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4123796Z       %72 = arith.addi %41, %71 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4123994Z       %73 = tt.addptr %4, %72 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4124196Z       %74 = tt.load %73 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4124405Z       %75 = arith.addi %7, %cst_6 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4124675Z       %76 = tt.expand_dims %75 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4124920Z       %77 = arith.muli %76, %cst_11 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4125104Z       %78 = tt.broadcast %77 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4125296Z       %79 = arith.addi %78, %46 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4125504Z       %80 = tt.addptr %5, %79 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4125724Z       %81 = arith.cmpi sge, %76, %cst_12 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4125896Z       %82 = arith.cmpi slt, %76, %cst_13 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4126052Z       %83 = arith.andi %81, %82 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:25.4126228Z       %84 = tt.broadcast %83 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4126409Z       %85 = arith.andi %84, %50 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4126643Z       %86 = tt.load %80, %85, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4126978Z       %87 = ttg.memdesc_index %51[%c1_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4127335Z       ttg.local_store %74, %87 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4127608Z       %88 = arith.addi %3, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4127878Z       %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.4128148Z       %90 = tt.broadcast %89 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4128336Z       %91 = arith.addi %41, %90 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4128529Z       %92 = tt.addptr %4, %91 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4128731Z       %93 = tt.load %92 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4128917Z       %94 = arith.addi %7, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4129207Z       %95 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4129444Z       %96 = arith.muli %95, %cst_11 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4129627Z       %97 = tt.broadcast %96 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4129813Z       %98 = arith.addi %97, %46 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4129997Z       %99 = tt.addptr %5, %98 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4130200Z       %100 = arith.cmpi sge, %95, %cst_12 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4130368Z       %101 = arith.cmpi slt, %95, %cst_13 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4130529Z       %102 = arith.andi %100, %101 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:25.4130711Z       %103 = tt.broadcast %102 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4130895Z       %104 = arith.andi %103, %50 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4131108Z       %105 = tt.load %99, %104, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4131449Z       %106 = ttg.memdesc_index %51[%c2_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4131812Z       ttg.local_store %93, %106 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4132604Z       %107:8 = scf.for %arg4 = %c0_i32 to %c506_i32 step %c2_i32 iter_args(%arg5 = %cst_8, %arg6 = %c2_i32, %arg7 = %68, %arg8 = %87, %arg9 = %106, %arg10 = %67, %arg11 = %86, %arg12 = %105) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:52:25.4133271Z         %553 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:52:25.4133395Z         %554 = arith.muli %553, %c2_i32 : i32
2026-02-21T09:52:25.4133586Z         %555 = tt.splat %554 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4133807Z         %556 = arith.addi %555, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4134084Z         %557 = tt.expand_dims %556 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.4134359Z         %558 = tt.broadcast %557 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4134570Z         %559 = arith.addi %41, %558 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4134773Z         %560 = tt.addptr %4, %559 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4134978Z         %561 = tt.load %560 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4135282Z         %562 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4135724Z         %563 = arith.extf %562 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4136009Z         %564 = arith.extsi %553 : i32 to i64
2026-02-21T09:52:25.4136182Z         %565 = tt.splat %564 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4136404Z         %566 = arith.addi %565, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4136681Z         %567 = tt.expand_dims %566 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4136926Z         %568 = arith.muli %567, %cst_11 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4137135Z         %569 = tt.broadcast %568 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4137331Z         %570 = arith.addi %569, %46 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4137527Z         %571 = tt.addptr %5, %570 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4137736Z         %572 = arith.cmpi sge, %567, %cst_12 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4137905Z         %573 = arith.cmpi slt, %567, %cst_13 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4138070Z         %574 = arith.andi %572, %573 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:25.4138256Z         %575 = tt.broadcast %574 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4138443Z         %576 = arith.andi %575, %50 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4138610Z         %577 = tt.load %571, %576, %cst_1 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4138782Z         %578 = arith.shli %arg10, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4138947Z         %579 = arith.shrsi %578, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4139192Z         %580 = ttg.convert_layout %579 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4139444Z         %581 = arith.shrsi %arg10, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4139690Z         %582 = ttg.convert_layout %581 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4140040Z         %583 = tt.expand_dims %580 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4140389Z         %584 = tt.expand_dims %582 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4140681Z         %585 = tt.broadcast %583 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4140927Z         %586 = arith.select %14, %585, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4141172Z         %587 = tt.broadcast %584 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4141430Z         %588 = arith.select %16, %587, %586 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4141666Z         %589 = tt.reshape %588 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4141893Z         %590 = arith.sitofp %589 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4142206Z         %591 = ttg.convert_layout %590 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4142682Z         %592 = tt.dot %563, %591, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4143041Z         %593 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:25.4143169Z         %594 = arith.cmpi slt, %593, %c3_i32 : i32
2026-02-21T09:52:25.4143307Z         %595 = arith.select %594, %593, %c0_i32 : i32
2026-02-21T09:52:25.4143580Z         %596 = ttg.memdesc_index %51[%595] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4143947Z         ttg.local_store %561, %596 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4144568Z         scf.yield %592, %595, %arg8, %arg9, %596, %arg11, %arg12, %577 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4145109Z       } {tt.flatten, tt.num_stages = 4 : i32}
2026-02-21T09:52:25.4145392Z       %108 = ttg.local_load %107#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4145826Z       %109 = arith.extf %108 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4146129Z       %110 = arith.shli %107#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4146291Z       %111 = arith.shrsi %110, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4146535Z       %112 = ttg.convert_layout %111 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4146785Z       %113 = arith.shrsi %107#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4147029Z       %114 = ttg.convert_layout %113 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4147365Z       %115 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4147713Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4148004Z       %117 = tt.broadcast %115 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4148264Z       %118 = arith.select %14, %117, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4148510Z       %119 = tt.broadcast %116 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4148747Z       %120 = arith.select %16, %119, %118 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4148982Z       %121 = tt.reshape %120 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4149207Z       %122 = arith.sitofp %121 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4149519Z       %123 = ttg.convert_layout %122 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4149991Z       %124 = tt.dot %109, %123, %107#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4150503Z       %125 = ttg.local_load %107#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4150934Z       %126 = arith.extf %125 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4151235Z       %127 = arith.shli %107#6, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4151394Z       %128 = arith.shrsi %127, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4151643Z       %129 = ttg.convert_layout %128 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4151888Z       %130 = arith.shrsi %107#6, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4152131Z       %131 = ttg.convert_layout %130 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4152469Z       %132 = tt.expand_dims %129 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4152811Z       %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4153166Z       %134 = tt.broadcast %132 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4153427Z       %135 = arith.select %14, %134, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4153674Z       %136 = tt.broadcast %133 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4153915Z       %137 = arith.select %16, %136, %135 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4154148Z       %138 = tt.reshape %137 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4154376Z       %139 = arith.sitofp %138 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4154670Z       %140 = ttg.convert_layout %139 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4155138Z       %141 = tt.dot %126, %140, %124, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4155631Z       %142 = ttg.local_load %107#4 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4156059Z       %143 = arith.extf %142 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4156371Z       %144 = arith.shli %107#7, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4156533Z       %145 = arith.shrsi %144, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4156775Z       %146 = ttg.convert_layout %145 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4157023Z       %147 = arith.shrsi %107#7, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4157266Z       %148 = ttg.convert_layout %147 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4157603Z       %149 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4157968Z       %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4158255Z       %151 = tt.broadcast %149 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4158502Z       %152 = arith.select %14, %151, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4158763Z       %153 = tt.broadcast %150 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4159004Z       %154 = arith.select %16, %153, %152 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4159241Z       %155 = tt.reshape %154 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4159464Z       %156 = arith.sitofp %155 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4159761Z       %157 = ttg.convert_layout %156 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4160228Z       %158 = tt.dot %143, %157, %141, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4160615Z       ttg.local_dealloc %51 : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.4160833Z       %159 = arith.truncf %158 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:52:25.4161005Z       %160 = arith.extsi %35 : i32 to i64
2026-02-21T09:52:25.4161171Z       %161 = tt.splat %160 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4161404Z       %162 = arith.addi %161, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4161671Z       %163 = tt.expand_dims %162 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4161915Z       %164 = arith.muli %163, %cst_17 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4162096Z       %165 = tt.broadcast %164 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4162305Z       %166 = tt.splat %42 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4162516Z       %167 = arith.addi %166, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4162831Z       %168 = tt.expand_dims %167 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4163094Z       %169 = tt.broadcast %168 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4163277Z       %170 = arith.addi %165, %169 : tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4163476Z       %171 = tt.addptr %17, %170 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4163678Z       %172 = arith.cmpi sge, %163, %cst_18 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4163847Z       %173 = arith.cmpi slt, %163, %cst_19 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4164009Z       %174 = arith.andi %172, %173 : tensor<128x1xi1, #mma>
2026-02-21T09:52:25.4164204Z       %175 = tt.broadcast %174 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4164392Z       %176 = arith.cmpi sge, %168, %cst_15 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4164555Z       %177 = arith.cmpi slt, %168, %cst_16 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4164712Z       %178 = arith.andi %176, %177 : tensor<1x256xi1, #mma>
2026-02-21T09:52:25.4164887Z       %179 = tt.broadcast %178 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4165065Z       %180 = arith.andi %175, %179 : tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4165230Z       tt.store %171, %159, %180 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.4165399Z       %181 = arith.addi %arg3, %c9728_i32 : i32
2026-02-21T09:52:25.4165526Z       %182 = arith.divsi %181, %c256_i32 : i32
2026-02-21T09:52:25.4165645Z       %183 = arith.muli %182, %c8_i32 : i32
2026-02-21T09:52:25.4165766Z       %184 = arith.subi %c128_i32, %183 : i32
2026-02-21T09:52:25.4165882Z       %185 = arith.minsi %184, %c8_i32 : i32
2026-02-21T09:52:25.4166004Z       %186 = arith.remsi %181, %c256_i32 : i32
2026-02-21T09:52:25.4166123Z       %187 = arith.remsi %186, %185 : i32
2026-02-21T09:52:25.4166235Z       %188 = arith.addi %183, %187 : i32
2026-02-21T09:52:25.4166367Z       %189 = arith.divsi %186, %185 : i32
2026-02-21T09:52:25.4166481Z       %190 = arith.muli %188, %c128_i32 : i32
2026-02-21T09:52:25.4166653Z       %191 = tt.splat %190 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.4166878Z       %192 = arith.addi %191, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.4167056Z       %193 = arith.muli %189, %c256_i32 : i32
2026-02-21T09:52:25.4167290Z       %194 = tt.expand_dims %192 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.4167545Z       %195 = arith.muli %194, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.4167746Z       %196 = tt.broadcast %195 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4167921Z       %197 = arith.extsi %193 : i32 to i64
2026-02-21T09:52:25.4168093Z       %198 = tt.splat %197 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4168314Z       %199 = arith.addi %198, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4168587Z       %200 = tt.expand_dims %199 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4168888Z       %201 = tt.broadcast %200 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4169092Z       %202 = arith.cmpi sge, %200, %cst_14 : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4169269Z       %203 = arith.cmpi slt, %200, %cst_3 : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4169437Z       %204 = arith.andi %202, %203 : tensor<1x256xi1, #blocked>
2026-02-21T09:52:25.4169622Z       %205 = tt.broadcast %204 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4169840Z       %206 = ttg.local_alloc : () -> !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.4170028Z       %207 = arith.addi %196, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4170229Z       %208 = tt.addptr %4, %207 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4170434Z       %209 = tt.load %208 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4170596Z       %210 = arith.addi %59, %201 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4170790Z       %211 = tt.addptr %5, %210 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4170990Z       %212 = arith.andi %65, %205 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4171205Z       %213 = tt.load %211, %212, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4171546Z       %214 = ttg.memdesc_index %206[%c0_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4171933Z       ttg.local_store %209, %214 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4172177Z       %215 = arith.addi %196, %71 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4172375Z       %216 = tt.addptr %4, %215 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4172588Z       %217 = tt.load %216 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4172744Z       %218 = arith.addi %78, %201 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4172951Z       %219 = tt.addptr %5, %218 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4173150Z       %220 = arith.andi %84, %205 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4173362Z       %221 = tt.load %219, %220, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4173703Z       %222 = ttg.memdesc_index %206[%c1_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4174080Z       ttg.local_store %217, %222 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4174323Z       %223 = arith.addi %196, %90 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4174525Z       %224 = tt.addptr %4, %223 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4174727Z       %225 = tt.load %224 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4174890Z       %226 = arith.addi %97, %201 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4175078Z       %227 = tt.addptr %5, %226 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4175274Z       %228 = arith.andi %103, %205 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4175482Z       %229 = tt.load %227, %228, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4175820Z       %230 = ttg.memdesc_index %206[%c2_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4176180Z       ttg.local_store %225, %230 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4176970Z       %231:8 = scf.for %arg4 = %c0_i32 to %c506_i32 step %c2_i32 iter_args(%arg5 = %cst_8, %arg6 = %c2_i32, %arg7 = %214, %arg8 = %222, %arg9 = %230, %arg10 = %213, %arg11 = %221, %arg12 = %229) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:52:25.4177642Z         %553 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:52:25.4186049Z         %554 = arith.muli %553, %c2_i32 : i32
2026-02-21T09:52:25.4186250Z         %555 = tt.splat %554 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4186482Z         %556 = arith.addi %555, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4186762Z         %557 = tt.expand_dims %556 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.4187046Z         %558 = tt.broadcast %557 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4187251Z         %559 = arith.addi %196, %558 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4187456Z         %560 = tt.addptr %4, %559 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4187666Z         %561 = tt.load %560 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4188013Z         %562 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4188457Z         %563 = arith.extf %562 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4188745Z         %564 = arith.extsi %553 : i32 to i64
2026-02-21T09:52:25.4188918Z         %565 = tt.splat %564 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4189144Z         %566 = arith.addi %565, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4189440Z         %567 = tt.expand_dims %566 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4189688Z         %568 = arith.muli %567, %cst_11 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4189885Z         %569 = tt.broadcast %568 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4190078Z         %570 = arith.addi %569, %201 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4190294Z         %571 = tt.addptr %5, %570 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4190500Z         %572 = arith.cmpi sge, %567, %cst_12 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4190675Z         %573 = arith.cmpi slt, %567, %cst_13 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4190843Z         %574 = arith.andi %572, %573 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:25.4191031Z         %575 = tt.broadcast %574 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4191229Z         %576 = arith.andi %575, %205 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4191397Z         %577 = tt.load %571, %576, %cst_1 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4191574Z         %578 = arith.shli %arg10, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4191738Z         %579 = arith.shrsi %578, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4191985Z         %580 = ttg.convert_layout %579 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4192237Z         %581 = arith.shrsi %arg10, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4192481Z         %582 = ttg.convert_layout %581 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4192842Z         %583 = tt.expand_dims %580 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4193193Z         %584 = tt.expand_dims %582 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4193487Z         %585 = tt.broadcast %583 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4193739Z         %586 = arith.select %14, %585, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4193983Z         %587 = tt.broadcast %584 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4194230Z         %588 = arith.select %16, %587, %586 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4194467Z         %589 = tt.reshape %588 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4194700Z         %590 = arith.sitofp %589 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4195001Z         %591 = ttg.convert_layout %590 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4195477Z         %592 = tt.dot %563, %591, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4195845Z         %593 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:25.4195977Z         %594 = arith.cmpi slt, %593, %c3_i32 : i32
2026-02-21T09:52:25.4196111Z         %595 = arith.select %594, %593, %c0_i32 : i32
2026-02-21T09:52:25.4196386Z         %596 = ttg.memdesc_index %206[%595] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4196751Z         ttg.local_store %561, %596 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4197377Z         scf.yield %592, %595, %arg8, %arg9, %596, %arg11, %arg12, %577 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4197919Z       } {tt.flatten, tt.num_stages = 4 : i32}
2026-02-21T09:52:25.4198212Z       %232 = ttg.local_load %231#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4198650Z       %233 = arith.extf %232 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4198955Z       %234 = arith.shli %231#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4199118Z       %235 = arith.shrsi %234, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4199365Z       %236 = ttg.convert_layout %235 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4199614Z       %237 = arith.shrsi %231#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4199858Z       %238 = ttg.convert_layout %237 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4200197Z       %239 = tt.expand_dims %236 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4200542Z       %240 = tt.expand_dims %238 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4200834Z       %241 = tt.broadcast %239 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4201094Z       %242 = arith.select %14, %241, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4201340Z       %243 = tt.broadcast %240 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4201583Z       %244 = arith.select %16, %243, %242 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4201817Z       %245 = tt.reshape %244 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4202047Z       %246 = arith.sitofp %245 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4202348Z       %247 = ttg.convert_layout %246 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4202885Z       %248 = tt.dot %233, %247, %231#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4203390Z       %249 = ttg.local_load %231#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4203822Z       %250 = arith.extf %249 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4204143Z       %251 = arith.shli %231#6, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4204305Z       %252 = arith.shrsi %251, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4204552Z       %253 = ttg.convert_layout %252 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4204802Z       %254 = arith.shrsi %231#6, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4205045Z       %255 = ttg.convert_layout %254 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4205383Z       %256 = tt.expand_dims %253 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4205746Z       %257 = tt.expand_dims %255 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4206034Z       %258 = tt.broadcast %256 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4206280Z       %259 = arith.select %14, %258, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4206538Z       %260 = tt.broadcast %257 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4206780Z       %261 = arith.select %16, %260, %259 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4207019Z       %262 = tt.reshape %261 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4207246Z       %263 = arith.sitofp %262 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4207547Z       %264 = ttg.convert_layout %263 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4208012Z       %265 = tt.dot %250, %264, %248, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4208511Z       %266 = ttg.local_load %231#4 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4208949Z       %267 = arith.extf %266 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4209246Z       %268 = arith.shli %231#7, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4209426Z       %269 = arith.shrsi %268, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4209671Z       %270 = ttg.convert_layout %269 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4209918Z       %271 = arith.shrsi %231#7, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4210162Z       %272 = ttg.convert_layout %271 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4210499Z       %273 = tt.expand_dims %270 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4210844Z       %274 = tt.expand_dims %272 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4211135Z       %275 = tt.broadcast %273 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4211379Z       %276 = arith.select %14, %275, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4211628Z       %277 = tt.broadcast %274 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4211867Z       %278 = arith.select %16, %277, %276 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4212105Z       %279 = tt.reshape %278 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4212346Z       %280 = arith.sitofp %279 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4212644Z       %281 = ttg.convert_layout %280 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4213107Z       %282 = tt.dot %267, %281, %265, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4213493Z       ttg.local_dealloc %206 : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.4213726Z       %283 = arith.truncf %282 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:52:25.4213905Z       %284 = arith.extsi %190 : i32 to i64
2026-02-21T09:52:25.4214072Z       %285 = tt.splat %284 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4214288Z       %286 = arith.addi %285, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4214559Z       %287 = tt.expand_dims %286 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4214821Z       %288 = arith.muli %287, %cst_17 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4215006Z       %289 = tt.broadcast %288 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4215214Z       %290 = tt.splat %197 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4215426Z       %291 = arith.addi %290, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4215691Z       %292 = tt.expand_dims %291 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4215955Z       %293 = tt.broadcast %292 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4216142Z       %294 = arith.addi %289, %293 : tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4216335Z       %295 = tt.addptr %17, %294 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4216541Z       %296 = arith.cmpi sge, %287, %cst_18 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4216711Z       %297 = arith.cmpi slt, %287, %cst_19 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4216873Z       %298 = arith.andi %296, %297 : tensor<128x1xi1, #mma>
2026-02-21T09:52:25.4217051Z       %299 = tt.broadcast %298 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4217252Z       %300 = arith.cmpi sge, %292, %cst_15 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4217421Z       %301 = arith.cmpi slt, %292, %cst_16 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4217580Z       %302 = arith.andi %300, %301 : tensor<1x256xi1, #mma>
2026-02-21T09:52:25.4217756Z       %303 = tt.broadcast %302 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4217937Z       %304 = arith.andi %299, %303 : tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4218101Z       tt.store %295, %283, %304 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.4218256Z       %305 = arith.addi %arg3, %c19456_i32 : i32
2026-02-21T09:52:25.4218383Z       %306 = arith.divsi %305, %c256_i32 : i32
2026-02-21T09:52:25.4218508Z       %307 = arith.muli %306, %c8_i32 : i32
2026-02-21T09:52:25.4218630Z       %308 = arith.subi %c128_i32, %307 : i32
2026-02-21T09:52:25.4218751Z       %309 = arith.minsi %308, %c8_i32 : i32
2026-02-21T09:52:25.4218871Z       %310 = arith.remsi %305, %c256_i32 : i32
2026-02-21T09:52:25.4218992Z       %311 = arith.remsi %310, %309 : i32
2026-02-21T09:52:25.4219112Z       %312 = arith.addi %307, %311 : i32
2026-02-21T09:52:25.4219227Z       %313 = arith.divsi %310, %309 : i32
2026-02-21T09:52:25.4219346Z       %314 = arith.muli %312, %c128_i32 : i32
2026-02-21T09:52:25.4219517Z       %315 = tt.splat %314 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.4219765Z       %316 = arith.addi %315, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.4219937Z       %317 = arith.muli %313, %c256_i32 : i32
2026-02-21T09:52:25.4220168Z       %318 = tt.expand_dims %316 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.4220424Z       %319 = arith.muli %318, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.4220622Z       %320 = tt.broadcast %319 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4220802Z       %321 = arith.extsi %317 : i32 to i64
2026-02-21T09:52:25.4220969Z       %322 = tt.splat %321 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4221207Z       %323 = arith.addi %322, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4221486Z       %324 = tt.expand_dims %323 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4221765Z       %325 = tt.broadcast %324 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4221971Z       %326 = arith.cmpi sge, %324, %cst_14 : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4222161Z       %327 = arith.cmpi slt, %324, %cst_3 : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4222328Z       %328 = arith.andi %326, %327 : tensor<1x256xi1, #blocked>
2026-02-21T09:52:25.4222514Z       %329 = tt.broadcast %328 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4222733Z       %330 = ttg.local_alloc : () -> !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.4222926Z       %331 = arith.addi %320, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4223127Z       %332 = tt.addptr %4, %331 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4223336Z       %333 = tt.load %332 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4223494Z       %334 = arith.addi %59, %325 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4223688Z       %335 = tt.addptr %5, %334 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4223887Z       %336 = arith.andi %65, %329 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4224100Z       %337 = tt.load %335, %336, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4224464Z       %338 = ttg.memdesc_index %330[%c0_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4224828Z       ttg.local_store %333, %338 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4225075Z       %339 = arith.addi %320, %71 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4225275Z       %340 = tt.addptr %4, %339 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4225479Z       %341 = tt.load %340 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4225638Z       %342 = arith.addi %78, %325 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4225829Z       %343 = tt.addptr %5, %342 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4226028Z       %344 = arith.andi %84, %329 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4226241Z       %345 = tt.load %343, %344, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4226579Z       %346 = ttg.memdesc_index %330[%c1_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4226941Z       ttg.local_store %341, %346 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4227180Z       %347 = arith.addi %320, %90 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4227383Z       %348 = tt.addptr %4, %347 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4227605Z       %349 = tt.load %348 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4227764Z       %350 = arith.addi %97, %325 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4227958Z       %351 = tt.addptr %5, %350 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4228154Z       %352 = arith.andi %103, %329 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4228367Z       %353 = tt.load %351, %352, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4228702Z       %354 = ttg.memdesc_index %330[%c2_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4229080Z       ttg.local_store %349, %354 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4229878Z       %355:8 = scf.for %arg4 = %c0_i32 to %c506_i32 step %c2_i32 iter_args(%arg5 = %cst_8, %arg6 = %c2_i32, %arg7 = %338, %arg8 = %346, %arg9 = %354, %arg10 = %337, %arg11 = %345, %arg12 = %353) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:52:25.4230550Z         %553 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:52:25.4230676Z         %554 = arith.muli %553, %c2_i32 : i32
2026-02-21T09:52:25.4230848Z         %555 = tt.splat %554 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4231077Z         %556 = arith.addi %555, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4231359Z         %557 = tt.expand_dims %556 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.4231638Z         %558 = tt.broadcast %557 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4231835Z         %559 = arith.addi %320, %558 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4232037Z         %560 = tt.addptr %4, %559 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4232246Z         %561 = tt.load %560 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4232562Z         %562 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4233000Z         %563 = arith.extf %562 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4233284Z         %564 = arith.extsi %553 : i32 to i64
2026-02-21T09:52:25.4233453Z         %565 = tt.splat %564 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4233674Z         %566 = arith.addi %565, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4233950Z         %567 = tt.expand_dims %566 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4234195Z         %568 = arith.muli %567, %cst_11 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4234391Z         %569 = tt.broadcast %568 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4234584Z         %570 = arith.addi %569, %325 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4234784Z         %571 = tt.addptr %5, %570 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4234994Z         %572 = arith.cmpi sge, %567, %cst_12 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4235166Z         %573 = arith.cmpi slt, %567, %cst_13 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4235345Z         %574 = arith.andi %572, %573 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:25.4235530Z         %575 = tt.broadcast %574 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4235721Z         %576 = arith.andi %575, %329 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4235888Z         %577 = tt.load %571, %576, %cst_1 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4236067Z         %578 = arith.shli %arg10, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4236231Z         %579 = arith.shrsi %578, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4236476Z         %580 = ttg.convert_layout %579 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4236767Z         %581 = arith.shrsi %arg10, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4237012Z         %582 = ttg.convert_layout %581 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4237351Z         %583 = tt.expand_dims %580 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4237728Z         %584 = tt.expand_dims %582 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4238020Z         %585 = tt.broadcast %583 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4238271Z         %586 = arith.select %14, %585, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4238517Z         %587 = tt.broadcast %584 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4238761Z         %588 = arith.select %16, %587, %586 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4238999Z         %589 = tt.reshape %588 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4239226Z         %590 = arith.sitofp %589 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4239528Z         %591 = ttg.convert_layout %590 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4240004Z         %592 = tt.dot %563, %591, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4240372Z         %593 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:25.4240503Z         %594 = arith.cmpi slt, %593, %c3_i32 : i32
2026-02-21T09:52:25.4240637Z         %595 = arith.select %594, %593, %c0_i32 : i32
2026-02-21T09:52:25.4240913Z         %596 = ttg.memdesc_index %330[%595] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4241277Z         ttg.local_store %561, %596 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4241900Z         scf.yield %592, %595, %arg8, %arg9, %596, %arg11, %arg12, %577 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4242421Z       } {tt.flatten, tt.num_stages = 4 : i32}
2026-02-21T09:52:25.4242748Z       %356 = ttg.local_load %355#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4243185Z       %357 = arith.extf %356 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4243489Z       %358 = arith.shli %355#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4243669Z       %359 = arith.shrsi %358, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4243916Z       %360 = ttg.convert_layout %359 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4244163Z       %361 = arith.shrsi %355#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4244406Z       %362 = ttg.convert_layout %361 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4244746Z       %363 = tt.expand_dims %360 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4245115Z       %364 = tt.expand_dims %362 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4245408Z       %365 = tt.broadcast %363 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4245656Z       %366 = arith.select %14, %365, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4245900Z       %367 = tt.broadcast %364 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4246158Z       %368 = arith.select %16, %367, %366 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4246394Z       %369 = tt.reshape %368 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4246619Z       %370 = arith.sitofp %369 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4246917Z       %371 = ttg.convert_layout %370 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4247389Z       %372 = tt.dot %357, %371, %355#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4247891Z       %373 = ttg.local_load %355#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4248329Z       %374 = arith.extf %373 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4248635Z       %375 = arith.shli %355#6, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4248814Z       %376 = arith.shrsi %375, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4249057Z       %377 = ttg.convert_layout %376 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4249305Z       %378 = arith.shrsi %355#6, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4249548Z       %379 = ttg.convert_layout %378 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4249884Z       %380 = tt.expand_dims %377 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4250232Z       %381 = tt.expand_dims %379 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4250519Z       %382 = tt.broadcast %380 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4250767Z       %383 = arith.select %14, %382, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4251010Z       %384 = tt.broadcast %381 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4251248Z       %385 = arith.select %16, %384, %383 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4251483Z       %386 = tt.reshape %385 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4251727Z       %387 = arith.sitofp %386 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4252026Z       %388 = ttg.convert_layout %387 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4252495Z       %389 = tt.dot %374, %388, %372, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4252991Z       %390 = ttg.local_load %355#4 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4253439Z       %391 = arith.extf %390 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4253738Z       %392 = arith.shli %355#7, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4253899Z       %393 = arith.shrsi %392, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4254142Z       %394 = ttg.convert_layout %393 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4254405Z       %395 = arith.shrsi %355#7, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4254647Z       %396 = ttg.convert_layout %395 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4254983Z       %397 = tt.expand_dims %394 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4255327Z       %398 = tt.expand_dims %396 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4255616Z       %399 = tt.broadcast %397 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4255864Z       %400 = arith.select %14, %399, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4256104Z       %401 = tt.broadcast %398 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4256345Z       %402 = arith.select %16, %401, %400 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4256578Z       %403 = tt.reshape %402 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4256823Z       %404 = arith.sitofp %403 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4257122Z       %405 = ttg.convert_layout %404 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4257590Z       %406 = tt.dot %391, %405, %389, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4257977Z       ttg.local_dealloc %330 : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.4258195Z       %407 = arith.truncf %406 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:52:25.4258371Z       %408 = arith.extsi %314 : i32 to i64
2026-02-21T09:52:25.4258538Z       %409 = tt.splat %408 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4258752Z       %410 = arith.addi %409, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4259022Z       %411 = tt.expand_dims %410 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4259265Z       %412 = arith.muli %411, %cst_17 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4259449Z       %413 = tt.broadcast %412 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4259660Z       %414 = tt.splat %321 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4259884Z       %415 = arith.addi %414, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4260150Z       %416 = tt.expand_dims %415 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4260413Z       %417 = tt.broadcast %416 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4260596Z       %418 = arith.addi %413, %417 : tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4260793Z       %419 = tt.addptr %17, %418 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4260995Z       %420 = arith.cmpi sge, %411, %cst_18 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4261180Z       %421 = arith.cmpi slt, %411, %cst_19 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4261343Z       %422 = arith.andi %420, %421 : tensor<128x1xi1, #mma>
2026-02-21T09:52:25.4261521Z       %423 = tt.broadcast %422 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4261710Z       %424 = arith.cmpi sge, %416, %cst_15 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4261876Z       %425 = arith.cmpi slt, %416, %cst_16 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4262048Z       %426 = arith.andi %424, %425 : tensor<1x256xi1, #mma>
2026-02-21T09:52:25.4262220Z       %427 = tt.broadcast %426 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4262401Z       %428 = arith.andi %423, %427 : tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4262562Z       tt.store %419, %407, %428 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.4262716Z       %429 = arith.addi %arg3, %c29184_i32 : i32
2026-02-21T09:52:25.4262844Z       %430 = arith.divsi %429, %c256_i32 : i32
2026-02-21T09:52:25.4262966Z       %431 = arith.muli %430, %c8_i32 : i32
2026-02-21T09:52:25.4263087Z       %432 = arith.subi %c128_i32, %431 : i32
2026-02-21T09:52:25.4263205Z       %433 = arith.minsi %432, %c8_i32 : i32
2026-02-21T09:52:25.4263324Z       %434 = arith.remsi %429, %c256_i32 : i32
2026-02-21T09:52:25.4263441Z       %435 = arith.remsi %434, %433 : i32
2026-02-21T09:52:25.4263556Z       %436 = arith.addi %431, %435 : i32
2026-02-21T09:52:25.4263669Z       %437 = arith.divsi %434, %433 : i32
2026-02-21T09:52:25.4263788Z       %438 = arith.muli %436, %c128_i32 : i32
2026-02-21T09:52:25.4263959Z       %439 = tt.splat %438 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.4264184Z       %440 = arith.addi %439, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.4264378Z       %441 = arith.muli %437, %c256_i32 : i32
2026-02-21T09:52:25.4264604Z       %442 = tt.expand_dims %440 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.4264861Z       %443 = arith.muli %442, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.4265057Z       %444 = tt.broadcast %443 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4265235Z       %445 = arith.extsi %441 : i32 to i64
2026-02-21T09:52:25.4265405Z       %446 = tt.splat %445 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4265623Z       %447 = arith.addi %446, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4265900Z       %448 = tt.expand_dims %447 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4266180Z       %449 = tt.broadcast %448 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4266385Z       %450 = arith.cmpi sge, %448, %cst_14 : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4266564Z       %451 = arith.cmpi slt, %448, %cst_3 : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4266729Z       %452 = arith.andi %450, %451 : tensor<1x256xi1, #blocked>
2026-02-21T09:52:25.4266917Z       %453 = tt.broadcast %452 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4267148Z       %454 = ttg.local_alloc : () -> !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.4267337Z       %455 = arith.addi %444, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4267539Z       %456 = tt.addptr %4, %455 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4267745Z       %457 = tt.load %456 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4267905Z       %458 = arith.addi %59, %449 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4268097Z       %459 = tt.addptr %5, %458 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4268293Z       %460 = arith.andi %65, %453 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4268524Z       %461 = tt.load %459, %460, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4268867Z       %462 = ttg.memdesc_index %454[%c0_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4269235Z       ttg.local_store %457, %462 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4269491Z       %463 = arith.addi %444, %71 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4269692Z       %464 = tt.addptr %4, %463 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4269895Z       %465 = tt.load %464 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4270054Z       %466 = arith.addi %78, %449 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4270248Z       %467 = tt.addptr %5, %466 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4270446Z       %468 = arith.andi %84, %453 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4270658Z       %469 = tt.load %467, %468, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4270994Z       %470 = ttg.memdesc_index %454[%c1_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4271356Z       ttg.local_store %465, %470 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4271596Z       %471 = arith.addi %444, %90 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4271793Z       %472 = tt.addptr %4, %471 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4272012Z       %473 = tt.load %472 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4272168Z       %474 = arith.addi %97, %449 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4272360Z       %475 = tt.addptr %5, %474 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4272558Z       %476 = arith.andi %103, %453 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4272768Z       %477 = tt.load %475, %476, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4273107Z       %478 = ttg.memdesc_index %454[%c2_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4273466Z       ttg.local_store %473, %478 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4274246Z       %479:8 = scf.for %arg4 = %c0_i32 to %c506_i32 step %c2_i32 iter_args(%arg5 = %cst_8, %arg6 = %c2_i32, %arg7 = %462, %arg8 = %470, %arg9 = %478, %arg10 = %461, %arg11 = %469, %arg12 = %477) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:52:25.4274918Z         %553 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:52:25.4275060Z         %554 = arith.muli %553, %c2_i32 : i32
2026-02-21T09:52:25.4275232Z         %555 = tt.splat %554 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4275458Z         %556 = arith.addi %555, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4275735Z         %557 = tt.expand_dims %556 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.4276018Z         %558 = tt.broadcast %557 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4276213Z         %559 = arith.addi %444, %558 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4276434Z         %560 = tt.addptr %4, %559 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4276643Z         %561 = tt.load %560 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4276945Z         %562 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4277402Z         %563 = arith.extf %562 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4277687Z         %564 = arith.extsi %553 : i32 to i64
2026-02-21T09:52:25.4277858Z         %565 = tt.splat %564 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4278082Z         %566 = arith.addi %565, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4278353Z         %567 = tt.expand_dims %566 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4278605Z         %568 = arith.muli %567, %cst_11 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4278797Z         %569 = tt.broadcast %568 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4278990Z         %570 = arith.addi %569, %449 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4279188Z         %571 = tt.addptr %5, %570 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4279396Z         %572 = arith.cmpi sge, %567, %cst_12 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4279571Z         %573 = arith.cmpi slt, %567, %cst_13 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4279733Z         %574 = arith.andi %572, %573 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:25.4279934Z         %575 = tt.broadcast %574 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4280125Z         %576 = arith.andi %575, %453 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4280295Z         %577 = tt.load %571, %576, %cst_1 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4280473Z         %578 = arith.shli %arg10, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4280634Z         %579 = arith.shrsi %578, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4280884Z         %580 = ttg.convert_layout %579 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4281134Z         %581 = arith.shrsi %arg10, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4281376Z         %582 = ttg.convert_layout %581 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4281717Z         %583 = tt.expand_dims %580 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4282066Z         %584 = tt.expand_dims %582 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4282365Z         %585 = tt.broadcast %583 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4282664Z         %586 = arith.select %14, %585, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4282926Z         %587 = tt.broadcast %584 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4283172Z         %588 = arith.select %16, %587, %586 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4283408Z         %589 = tt.reshape %588 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4283637Z         %590 = arith.sitofp %589 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4283940Z         %591 = ttg.convert_layout %590 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4284428Z         %592 = tt.dot %563, %591, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4284781Z         %593 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:25.4284908Z         %594 = arith.cmpi slt, %593, %c3_i32 : i32
2026-02-21T09:52:25.4285043Z         %595 = arith.select %594, %593, %c0_i32 : i32
2026-02-21T09:52:25.4285331Z         %596 = ttg.memdesc_index %454[%595] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4285692Z         ttg.local_store %561, %596 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4286320Z         scf.yield %592, %595, %arg8, %arg9, %596, %arg11, %arg12, %577 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4286845Z       } {tt.flatten, tt.num_stages = 4 : i32}
2026-02-21T09:52:25.4287125Z       %480 = ttg.local_load %479#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4287566Z       %481 = arith.extf %480 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4287873Z       %482 = arith.shli %479#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4288051Z       %483 = arith.shrsi %482, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4288296Z       %484 = ttg.convert_layout %483 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4288545Z       %485 = arith.shrsi %479#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4288789Z       %486 = ttg.convert_layout %485 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4289126Z       %487 = tt.expand_dims %484 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4289474Z       %488 = tt.expand_dims %486 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4289769Z       %489 = tt.broadcast %487 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4290020Z       %490 = arith.select %14, %489, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4290271Z       %491 = tt.broadcast %488 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4290521Z       %492 = arith.select %16, %491, %490 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4290758Z       %493 = tt.reshape %492 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4290996Z       %494 = arith.sitofp %493 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4291313Z       %495 = ttg.convert_layout %494 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4291786Z       %496 = tt.dot %481, %495, %479#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4292292Z       %497 = ttg.local_load %479#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4292748Z       %498 = arith.extf %497 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4293059Z       %499 = arith.shli %479#6, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4293229Z       %500 = arith.shrsi %499, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4293480Z       %501 = ttg.convert_layout %500 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4293748Z       %502 = arith.shrsi %479#6, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4293992Z       %503 = ttg.convert_layout %502 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4294334Z       %504 = tt.expand_dims %501 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4294682Z       %505 = tt.expand_dims %503 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4294976Z       %506 = tt.broadcast %504 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4295228Z       %507 = arith.select %14, %506, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4295477Z       %508 = tt.broadcast %505 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4295725Z       %509 = arith.select %16, %508, %507 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4295967Z       %510 = tt.reshape %509 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4296196Z       %511 = arith.sitofp %510 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4296517Z       %512 = ttg.convert_layout %511 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4296986Z       %513 = tt.dot %498, %512, %496, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4297487Z       %514 = ttg.local_load %479#4 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4297927Z       %515 = arith.extf %514 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4298229Z       %516 = arith.shli %479#7, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4298396Z       %517 = arith.shrsi %516, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4298644Z       %518 = ttg.convert_layout %517 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4298894Z       %519 = arith.shrsi %479#7, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4299147Z       %520 = ttg.convert_layout %519 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4299499Z       %521 = tt.expand_dims %518 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4299848Z       %522 = tt.expand_dims %520 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4300144Z       %523 = tt.broadcast %521 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4300391Z       %524 = arith.select %14, %523, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4300643Z       %525 = tt.broadcast %522 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4300901Z       %526 = arith.select %16, %525, %524 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4301145Z       %527 = tt.reshape %526 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4301375Z       %528 = arith.sitofp %527 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4301677Z       %529 = ttg.convert_layout %528 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4302159Z       %530 = tt.dot %515, %529, %513, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4302553Z       ttg.local_dealloc %454 : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.4302775Z       %531 = arith.truncf %530 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:52:25.4302959Z       %532 = arith.extsi %438 : i32 to i64
2026-02-21T09:52:25.4303126Z       %533 = tt.splat %532 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4303348Z       %534 = arith.addi %533, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4303621Z       %535 = tt.expand_dims %534 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4303869Z       %536 = arith.muli %535, %cst_17 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4304057Z       %537 = tt.broadcast %536 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4304268Z       %538 = tt.splat %445 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4304505Z       %539 = arith.addi %538, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4304774Z       %540 = tt.expand_dims %539 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4305042Z       %541 = tt.broadcast %540 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4305231Z       %542 = arith.addi %537, %541 : tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4305423Z       %543 = tt.addptr %17, %542 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4305634Z       %544 = arith.cmpi sge, %535, %cst_18 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4305807Z       %545 = arith.cmpi slt, %535, %cst_19 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4305972Z       %546 = arith.andi %544, %545 : tensor<128x1xi1, #mma>
2026-02-21T09:52:25.4306154Z       %547 = tt.broadcast %546 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4306345Z       %548 = arith.cmpi sge, %540, %cst_15 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4306518Z       %549 = arith.cmpi slt, %540, %cst_16 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4306676Z       %550 = arith.andi %548, %549 : tensor<1x256xi1, #mma>
2026-02-21T09:52:25.4306855Z       %551 = tt.broadcast %550 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4307037Z       %552 = arith.andi %547, %551 : tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4307206Z       tt.store %543, %531, %552 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.4307377Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:52:25.4307513Z     scf.for %arg3 = %26 to %c4096_i32 step %c9728_i32  : i32 {
2026-02-21T09:52:25.4307666Z       %27 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:52:25.4307791Z       %28 = arith.muli %27, %c8_i32 : i32
2026-02-21T09:52:25.4307914Z       %29 = arith.subi %c128_i32, %28 : i32
2026-02-21T09:52:25.4308032Z       %30 = arith.minsi %29, %c8_i32 : i32
2026-02-21T09:52:25.4308161Z       %31 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:52:25.4308283Z       %32 = arith.remsi %31, %30 : i32
2026-02-21T09:52:25.4308401Z       %33 = arith.addi %28, %32 : i32
2026-02-21T09:52:25.4308536Z       %34 = arith.divsi %31, %30 : i32
2026-02-21T09:52:25.4308648Z       %35 = arith.muli %33, %c128_i32 : i32
2026-02-21T09:52:25.4308820Z       %36 = tt.splat %35 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.4309045Z       %37 = arith.addi %36, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.4309222Z       %38 = arith.muli %34, %c256_i32 : i32
2026-02-21T09:52:25.4309460Z       %39 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.4309718Z       %40 = arith.muli %39, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.4309918Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4310097Z       %42 = arith.extsi %38 : i32 to i64
2026-02-21T09:52:25.4310268Z       %43 = tt.splat %42 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4310486Z       %44 = arith.addi %43, %9 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:25.4310763Z       %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4311045Z       %46 = tt.broadcast %45 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4311245Z       %47 = arith.cmpi sge, %45, %cst_14 : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4311423Z       %48 = arith.cmpi slt, %45, %cst_3 : tensor<1x256xi64, #blocked>
2026-02-21T09:52:25.4311586Z       %49 = arith.andi %47, %48 : tensor<1x256xi1, #blocked>
2026-02-21T09:52:25.4311769Z       %50 = tt.broadcast %49 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4312004Z       %51 = ttg.local_alloc : () -> !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.4312275Z       %52 = tt.expand_dims %3 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.4312552Z       %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4312741Z       %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4312942Z       %55 = tt.addptr %4, %54 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4313146Z       %56 = tt.load %55 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4313384Z       %57 = tt.expand_dims %7 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4313629Z       %58 = arith.muli %57, %cst_11 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4313816Z       %59 = tt.broadcast %58 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4314007Z       %60 = arith.addi %59, %46 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4314197Z       %61 = tt.addptr %5, %60 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4314403Z       %62 = arith.cmpi sge, %57, %cst_12 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4314578Z       %63 = arith.cmpi slt, %57, %cst_13 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4314737Z       %64 = arith.andi %62, %63 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:25.4314930Z       %65 = tt.broadcast %64 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4315112Z       %66 = arith.andi %65, %50 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4315323Z       %67 = tt.load %61, %66, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4315666Z       %68 = ttg.memdesc_index %51[%c0_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4316029Z       ttg.local_store %56, %68 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4316325Z       %69 = arith.addi %3, %cst_7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4316600Z       %70 = tt.expand_dims %69 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.4316881Z       %71 = tt.broadcast %70 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4317078Z       %72 = arith.addi %41, %71 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4317292Z       %73 = tt.addptr %4, %72 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4317499Z       %74 = tt.load %73 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4317689Z       %75 = arith.addi %7, %cst_6 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4317966Z       %76 = tt.expand_dims %75 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4318214Z       %77 = arith.muli %76, %cst_11 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4318398Z       %78 = tt.broadcast %77 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4318589Z       %79 = arith.addi %78, %46 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4318779Z       %80 = tt.addptr %5, %79 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4318983Z       %81 = arith.cmpi sge, %76, %cst_12 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4319153Z       %82 = arith.cmpi slt, %76, %cst_13 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4319317Z       %83 = arith.andi %81, %82 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:25.4319498Z       %84 = tt.broadcast %83 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4319697Z       %85 = arith.andi %84, %50 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4319906Z       %86 = tt.load %80, %85, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4320241Z       %87 = ttg.memdesc_index %51[%c1_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4320607Z       ttg.local_store %74, %87 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4320885Z       %88 = arith.addi %3, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4321160Z       %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.4321436Z       %90 = tt.broadcast %89 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4321625Z       %91 = arith.addi %41, %90 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4321829Z       %92 = tt.addptr %4, %91 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4322038Z       %93 = tt.load %92 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4322227Z       %94 = arith.addi %7, %cst_4 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4322502Z       %95 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4322796Z       %96 = arith.muli %95, %cst_11 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4322986Z       %97 = tt.broadcast %96 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4323180Z       %98 = arith.addi %97, %46 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4323368Z       %99 = tt.addptr %5, %98 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4323575Z       %100 = arith.cmpi sge, %95, %cst_12 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4323749Z       %101 = arith.cmpi slt, %95, %cst_13 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4323913Z       %102 = arith.andi %100, %101 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:25.4324116Z       %103 = tt.broadcast %102 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4324305Z       %104 = arith.andi %103, %50 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4324524Z       %105 = tt.load %99, %104, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4324884Z       %106 = ttg.memdesc_index %51[%c2_i32] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4325251Z       ttg.local_store %93, %106 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4326028Z       %107:8 = scf.for %arg4 = %c0_i32 to %c506_i32 step %c2_i32 iter_args(%arg5 = %cst_8, %arg6 = %c2_i32, %arg7 = %68, %arg8 = %87, %arg9 = %106, %arg10 = %67, %arg11 = %86, %arg12 = %105) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:52:25.4326695Z         %181 = arith.addi %arg4, %c6_i32 : i32
2026-02-21T09:52:25.4326828Z         %182 = arith.muli %181, %c2_i32 : i32
2026-02-21T09:52:25.4327007Z         %183 = tt.splat %182 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4327233Z         %184 = arith.addi %183, %3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.4327511Z         %185 = tt.expand_dims %184 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.4327806Z         %186 = tt.broadcast %185 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4328005Z         %187 = arith.addi %41, %186 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4328212Z         %188 = tt.addptr %4, %187 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.4328418Z         %189 = tt.load %188 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.4328722Z         %190 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4329162Z         %191 = arith.extf %190 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4329452Z         %192 = arith.extsi %181 : i32 to i64
2026-02-21T09:52:25.4329623Z         %193 = tt.splat %192 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4329842Z         %194 = arith.addi %193, %7 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.4330117Z         %195 = tt.expand_dims %194 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4330365Z         %196 = arith.muli %195, %cst_11 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4330560Z         %197 = tt.broadcast %196 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4330771Z         %198 = arith.addi %197, %46 : tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4330966Z         %199 = tt.addptr %5, %198 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:52:25.4331176Z         %200 = arith.cmpi sge, %195, %cst_12 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4331349Z         %201 = arith.cmpi slt, %195, %cst_13 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:25.4331517Z         %202 = arith.andi %200, %201 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:25.4331702Z         %203 = tt.broadcast %202 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4331892Z         %204 = arith.andi %203, %50 : tensor<2x256xi1, #blocked>
2026-02-21T09:52:25.4332080Z         %205 = tt.load %199, %204, %cst_1 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:25.4332253Z         %206 = arith.shli %arg10, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4332419Z         %207 = arith.shrsi %206, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4332667Z         %208 = ttg.convert_layout %207 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4332941Z         %209 = arith.shrsi %arg10, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4333187Z         %210 = ttg.convert_layout %209 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4333527Z         %211 = tt.expand_dims %208 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4333878Z         %212 = tt.expand_dims %210 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4334173Z         %213 = tt.broadcast %211 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4334423Z         %214 = arith.select %14, %213, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4334674Z         %215 = tt.broadcast %212 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4334918Z         %216 = arith.select %16, %215, %214 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4335157Z         %217 = tt.reshape %216 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4335384Z         %218 = arith.sitofp %217 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4335704Z         %219 = ttg.convert_layout %218 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4336179Z         %220 = tt.dot %191, %219, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4336528Z         %221 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:25.4336658Z         %222 = arith.cmpi slt, %221, %c3_i32 : i32
2026-02-21T09:52:25.4336790Z         %223 = arith.select %222, %221, %c0_i32 : i32
2026-02-21T09:52:25.4337059Z         %224 = ttg.memdesc_index %51[%223] : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4337427Z         ttg.local_store %189, %224 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>
2026-02-21T09:52:25.4338050Z         scf.yield %220, %223, %arg8, %arg9, %224, %arg11, %arg12, %205 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4338577Z       } {tt.flatten, tt.num_stages = 4 : i32}
2026-02-21T09:52:25.4338874Z       %108 = ttg.local_load %107#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4339309Z       %109 = arith.extf %108 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4339616Z       %110 = arith.shli %107#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4339781Z       %111 = arith.shrsi %110, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4340025Z       %112 = ttg.convert_layout %111 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4340291Z       %113 = arith.shrsi %107#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4340532Z       %114 = ttg.convert_layout %113 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4340873Z       %115 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4341237Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4341527Z       %117 = tt.broadcast %115 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4341776Z       %118 = arith.select %14, %117, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4342020Z       %119 = tt.broadcast %116 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4342265Z       %120 = arith.select %16, %119, %118 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4342506Z       %121 = tt.reshape %120 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4342734Z       %122 = arith.sitofp %121 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4343034Z       %123 = ttg.convert_layout %122 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4343503Z       %124 = tt.dot %109, %123, %107#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4344021Z       %125 = ttg.local_load %107#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4344456Z       %126 = arith.extf %125 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4344757Z       %127 = arith.shli %107#6, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4344923Z       %128 = arith.shrsi %127, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4345168Z       %129 = ttg.convert_layout %128 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4345415Z       %130 = arith.shrsi %107#6, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4345660Z       %131 = ttg.convert_layout %130 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4345998Z       %132 = tt.expand_dims %129 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4346347Z       %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4346641Z       %134 = tt.broadcast %132 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4346887Z       %135 = arith.select %14, %134, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4347147Z       %136 = tt.broadcast %133 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4347386Z       %137 = arith.select %16, %136, %135 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4347623Z       %138 = tt.reshape %137 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4347852Z       %139 = arith.sitofp %138 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4348149Z       %140 = ttg.convert_layout %139 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4348632Z       %141 = tt.dot %126, %140, %124, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4349132Z       %142 = ttg.local_load %107#4 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 3x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4349582Z       %143 = arith.extf %142 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4349884Z       %144 = arith.shli %107#7, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4350047Z       %145 = arith.shrsi %144, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4350292Z       %146 = ttg.convert_layout %145 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4350544Z       %147 = arith.shrsi %107#7, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:52:25.4350787Z       %148 = ttg.convert_layout %147 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.4351125Z       %149 = tt.expand_dims %146 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4351468Z       %150 = tt.expand_dims %148 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:52:25.4351758Z       %151 = tt.broadcast %149 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4352002Z       %152 = arith.select %14, %151, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4352261Z       %153 = tt.broadcast %150 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4352504Z       %154 = arith.select %16, %153, %152 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:52:25.4352741Z       %155 = tt.reshape %154 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.4352968Z       %156 = arith.sitofp %155 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.4353267Z       %157 = ttg.convert_layout %156 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.4353729Z       %158 = tt.dot %143, %157, %141, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.4354117Z       ttg.local_dealloc %51 : !ttg.memdesc<3x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.4354337Z       %159 = arith.truncf %158 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:52:25.4354510Z       %160 = arith.extsi %35 : i32 to i64
2026-02-21T09:52:25.4354680Z       %161 = tt.splat %160 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4354895Z       %162 = arith.addi %161, %18 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.4355180Z       %163 = tt.expand_dims %162 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4355427Z       %164 = arith.muli %163, %cst_17 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4355609Z       %165 = tt.broadcast %164 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4355821Z       %166 = tt.splat %42 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4356032Z       %167 = arith.addi %166, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.4356300Z       %168 = tt.expand_dims %167 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4356584Z       %169 = tt.broadcast %168 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4356768Z       %170 = arith.addi %165, %169 : tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4356966Z       %171 = tt.addptr %17, %170 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:52:25.4357168Z       %172 = arith.cmpi sge, %163, %cst_18 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4357351Z       %173 = arith.cmpi slt, %163, %cst_19 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.4357512Z       %174 = arith.andi %172, %173 : tensor<128x1xi1, #mma>
2026-02-21T09:52:25.4357691Z       %175 = tt.broadcast %174 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4357880Z       %176 = arith.cmpi sge, %168, %cst_15 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4358045Z       %177 = arith.cmpi slt, %168, %cst_16 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.4358206Z       %178 = arith.andi %176, %177 : tensor<1x256xi1, #mma>
2026-02-21T09:52:25.4358379Z       %179 = tt.broadcast %178 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4358561Z       %180 = arith.andi %175, %179 : tensor<128x256xi1, #mma>
2026-02-21T09:52:25.4358723Z       tt.store %171, %159, %180 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.4358873Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:52:25.4358979Z     tt.return
2026-02-21T09:52:25.4359058Z   }
2026-02-21T09:52:25.4359138Z }
2026-02-21T09:52:25.4359183Z 
2026-02-21T09:52:25.4359215Z {-#
2026-02-21T09:52:25.4359299Z   external_resources: {
2026-02-21T09:52:25.4359397Z     mlir_reproducer: {
2026-02-21T09:52:25.4360416Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:52:25.4360462Z       disable_threading: false,
2026-02-21T09:52:25.4360502Z       verify_each: true
2026-02-21T09:52:25.4360534Z     }
2026-02-21T09:52:25.4360565Z   }
2026-02-21T09:52:25.4360598Z #-}
2026-02-21T09:52:25.4360839Z /tmp/torchinductor_root/5y/c5yo5boeljmnkfo5b4u2etg2hep4hcalbcy3we66musd3ytrm6yd.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:52:25.4361264Z /tmp/torchinductor_root/5y/c5yo5boeljmnkfo5b4u2etg2hep4hcalbcy3we66musd3ytrm6yd.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:52:25.4361380Z [476s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:52:25.4362013Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:52:25.4362082Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:52:25.4362166Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:52:25.7094585Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:52:25.7113177Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}>
2026-02-21T09:52:25.7113479Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:52:25.7113686Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:52:25.7114011Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}>
2026-02-21T09:52:25.7114158Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:52:25.7114294Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:52:25.7114427Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:52:25.7114486Z #smem = #ttg.shared_memory
2026-02-21T09:52:25.7114701Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:52:25.7115061Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:52:25.7115159Z     %cst = arith.constant dense<8192> : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7115252Z     %cst_0 = arith.constant dense<0> : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7115348Z     %cst_1 = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7115422Z     %cst_2 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7115621Z     %cst_3 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7115718Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:52:25.7115810Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:52:25.7115912Z     %cst_6 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7115975Z     %c13823_i32 = arith.constant 13823 : i32
2026-02-21T09:52:25.7116033Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:52:25.7116089Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:52:25.7116239Z     %cst_7 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7116380Z     %cst_8 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7116513Z     %cst_9 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7116612Z     %cst_10 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7116707Z     %cst_11 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.7116765Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:52:25.7116821Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:52:25.7116876Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:52:25.7116941Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:52:25.7116997Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:52:25.7117114Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:52:25.7117174Z     %c38912_i32 = arith.constant 38912 : i32
2026-02-21T09:52:25.7117235Z     %c19456_i32 = arith.constant 19456 : i32
2026-02-21T09:52:25.7117295Z     %c29184_i32 = arith.constant 29184 : i32
2026-02-21T09:52:25.7117384Z     %cst_12 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7117438Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:52:25.7117497Z     %c9728_i32 = arith.constant 9728 : i32
2026-02-21T09:52:25.7117636Z     %cst_13 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7117697Z     %0 = tt.get_program_id x : i32
2026-02-21T09:52:25.7117892Z     %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.7118040Z     %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7118196Z     %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:52:25.7118392Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7118542Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7118686Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7118804Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7118907Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7119112Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:52:25.7119412Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:52:25.7119609Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:52:25.7119702Z     %12 = arith.cmpi eq, %11, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:52:25.7119824Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:52:25.7119930Z     %14 = arith.cmpi eq, %11, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:52:25.7120046Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:52:25.7120157Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.7120354Z     %17 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7120576Z     %18 = arith.extsi %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7120631Z     %19 = arith.subi %c13823_i32, %0 : i32
2026-02-21T09:52:25.7120684Z     %20 = arith.divui %19, %c9728_i32 : i32
2026-02-21T09:52:25.7120732Z     %21 = arith.remsi %20, %c4_i32 : i32
2026-02-21T09:52:25.7120780Z     %22 = arith.subi %20, %21 : i32
2026-02-21T09:52:25.7120831Z     %23 = arith.muli %22, %c9728_i32 : i32
2026-02-21T09:52:25.7120879Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:52:25.7120945Z     scf.for %arg3 = %0 to %24 step %c38912_i32  : i32 {
2026-02-21T09:52:25.7121003Z       %25 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:52:25.7121053Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:52:25.7121103Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:52:25.7121151Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:52:25.7121209Z       %29 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:52:25.7121275Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:52:25.7121321Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:52:25.7121371Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:52:25.7121421Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:52:25.7121537Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.7121653Z       %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.7121702Z       %36 = arith.muli %32, %c256_i32 : i32
2026-02-21T09:52:25.7121808Z       %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:52:25.7121930Z       %38 = arith.addi %37, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:52:25.7122109Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.7122183Z       %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.7122296Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7122494Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:52:25.7122675Z       %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7122784Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.7122950Z       %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.7123057Z       %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7123134Z       %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7123257Z       %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7123336Z       %49 = tt.load %48 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7123560Z       %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7123728Z       ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7123866Z       %51 = arith.addi %6, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7124034Z       %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.7124149Z       %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7124220Z       %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7124338Z       %55 = tt.addptr %7, %54 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7124416Z       %56 = tt.load %55 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7124630Z       %57 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7124792Z       ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7125210Z       %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:52:25.7125335Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7125460Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7125518Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:52:25.7125573Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:52:25.7125689Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7125799Z         %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7125971Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.7126110Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7126183Z         %417 = arith.addi %41, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7126312Z         %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7126388Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7126660Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7126898Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7127075Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7127171Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7127284Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7127353Z         %425 = arith.addi %424, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7127479Z         %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7127571Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7127752Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7127868Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7128010Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7128128Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7128314Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7128487Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7128614Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7128738Z         %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7128849Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7128978Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7129084Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7129197Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7129349Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7129587Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7129911Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7129969Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:25.7130026Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:52:25.7130092Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:52:25.7130310Z         %446 = ttg.memdesc_index %44[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7130498Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7130720Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7130785Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:52:25.7130878Z       %59 = arith.addi %5, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7131075Z       %60 = ttg.local_load %58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7131273Z       %61 = arith.extf %60 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7131415Z       %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7131477Z       %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7131569Z       %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7131629Z       %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7131730Z       %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7131791Z       %67 = tt.load %66 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7131949Z       %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7132045Z       %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7132147Z       %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7132243Z       %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7132390Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7132537Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7132631Z       %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7132733Z       %75 = arith.select %13, %74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7132827Z       %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7132923Z       %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7133011Z       %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7133106Z       %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7141898Z       %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7142093Z       %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7142360Z       %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7142455Z       %83 = arith.addi %5, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7142691Z       %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7142884Z       %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7143048Z       %86 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7143115Z       %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7143205Z       %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7143267Z       %89 = arith.addi %88, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7143369Z       %90 = tt.addptr %8, %89 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7143430Z       %91 = tt.load %90 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7143574Z       %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7143671Z       %93 = arith.shli %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7143770Z       %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7143866Z       %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7144015Z       %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7144178Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7144274Z       %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7144384Z       %99 = arith.select %13, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7144477Z       %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7144578Z       %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7144673Z       %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7144766Z       %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7144885Z       %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7145058Z       %105 = ttg.local_load %104 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7145319Z       %106 = tt.dot %85, %105, %82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7145407Z       ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.7145517Z       %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:52:25.7145561Z       %108 = arith.extsi %33 : i32 to i64
2026-02-21T09:52:25.7145603Z       %109 = arith.extsi %36 : i32 to i64
2026-02-21T09:52:25.7145694Z       %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7145780Z       %111 = arith.addi %110, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7145925Z       %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7146006Z       %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7146091Z       %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7146175Z       %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7146261Z       %116 = arith.addi %115, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7146420Z       %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7146503Z       %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7146562Z       %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7146662Z       %120 = tt.addptr %16, %119 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7146731Z       %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7146798Z       %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7146861Z       %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma>
2026-02-21T09:52:25.7146943Z       %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7147005Z       %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7147070Z       %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7147128Z       %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma>
2026-02-21T09:52:25.7147208Z       %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7147271Z       %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7147359Z       tt.store %120, %107, %129 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.7147407Z       %130 = arith.addi %arg3, %c9728_i32 : i32
2026-02-21T09:52:25.7147454Z       %131 = arith.divsi %130, %c256_i32 : i32
2026-02-21T09:52:25.7147501Z       %132 = arith.muli %131, %c8_i32 : i32
2026-02-21T09:52:25.7147546Z       %133 = arith.subi %c128_i32, %132 : i32
2026-02-21T09:52:25.7147590Z       %134 = arith.minsi %133, %c8_i32 : i32
2026-02-21T09:52:25.7147636Z       %135 = arith.remsi %130, %c256_i32 : i32
2026-02-21T09:52:25.7147678Z       %136 = arith.remsi %135, %134 : i32
2026-02-21T09:52:25.7147718Z       %137 = arith.addi %132, %136 : i32
2026-02-21T09:52:25.7147760Z       %138 = arith.divsi %135, %134 : i32
2026-02-21T09:52:25.7147804Z       %139 = arith.muli %137, %c128_i32 : i32
2026-02-21T09:52:25.7147899Z       %140 = tt.splat %139 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.7147993Z       %141 = arith.addi %140, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.7148048Z       %142 = arith.muli %138, %c256_i32 : i32
2026-02-21T09:52:25.7148140Z       %143 = tt.splat %142 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:52:25.7148233Z       %144 = arith.addi %143, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:52:25.7148386Z       %145 = tt.expand_dims %141 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.7148467Z       %146 = arith.muli %145, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.7148562Z       %147 = tt.broadcast %146 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7148713Z       %148 = tt.expand_dims %144 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:52:25.7148807Z       %149 = tt.broadcast %148 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7148897Z       %150 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.7148963Z       %151 = arith.addi %147, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7149085Z       %152 = tt.addptr %7, %151 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7149150Z       %153 = tt.load %152 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7149340Z       %154 = ttg.memdesc_index %150[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7149499Z       ttg.local_store %153, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7149560Z       %155 = arith.addi %147, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7149664Z       %156 = tt.addptr %7, %155 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7149729Z       %157 = tt.load %156 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7149910Z       %158 = ttg.memdesc_index %150[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7150055Z       ttg.local_store %157, %158 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7150403Z       %159:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %154, %arg8 = %158) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:52:25.7150502Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7150598Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7150658Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:52:25.7150702Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:52:25.7150797Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7150886Z         %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7151033Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.7151129Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7151193Z         %417 = arith.addi %147, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7151297Z         %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7151364Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7151566Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7151767Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7151914Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7151992Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7152085Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7152149Z         %425 = arith.addi %424, %149 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7152248Z         %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7152311Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7152461Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7152585Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7152684Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7152786Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7152951Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7153099Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7153200Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7153308Z         %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7153406Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7153508Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7153601Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7153696Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7153818Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7153992Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7154283Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7154334Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:25.7154384Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:52:25.7154436Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:52:25.7154619Z         %446 = ttg.memdesc_index %150[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7154765Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7154985Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7155033Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:52:25.7155234Z       %160 = ttg.local_load %159#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7155432Z       %161 = arith.extf %160 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7155510Z       %162 = arith.addi %64, %149 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7155613Z       %163 = tt.addptr %8, %162 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7155675Z       %164 = tt.load %163 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7155819Z       %165 = ttg.convert_layout %164 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7155922Z       %166 = arith.shli %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7156037Z       %167 = arith.shrsi %166, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7156134Z       %168 = arith.shrsi %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7156287Z       %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7156435Z       %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7156545Z       %171 = tt.broadcast %169 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7156653Z       %172 = arith.select %13, %171, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7156748Z       %173 = tt.broadcast %170 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7156848Z       %174 = arith.select %15, %173, %172 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7156944Z       %175 = tt.reshape %174 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7157037Z       %176 = arith.sitofp %175 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7157157Z       %177 = ttg.local_alloc %176 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7157328Z       %178 = ttg.local_load %177 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7157593Z       %179 = tt.dot %161, %178, %159#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7157803Z       %180 = ttg.local_load %159#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7158002Z       %181 = arith.extf %180 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7158063Z       %182 = arith.addi %88, %149 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7158165Z       %183 = tt.addptr %8, %182 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7158229Z       %184 = tt.load %183 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7158373Z       %185 = ttg.convert_layout %184 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7158471Z       %186 = arith.shli %185, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7158573Z       %187 = arith.shrsi %186, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7158671Z       %188 = arith.shrsi %185, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7158823Z       %189 = tt.expand_dims %187 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7158995Z       %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7159091Z       %191 = tt.broadcast %189 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7159196Z       %192 = arith.select %13, %191, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7159291Z       %193 = tt.broadcast %190 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7159427Z       %194 = arith.select %15, %193, %192 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7159517Z       %195 = tt.reshape %194 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7159629Z       %196 = arith.sitofp %195 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7159745Z       %197 = ttg.local_alloc %196 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7159913Z       %198 = ttg.local_load %197 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7160190Z       %199 = tt.dot %181, %198, %179, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7160278Z       ttg.local_dealloc %150 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.7160368Z       %200 = arith.truncf %199 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:52:25.7160418Z       %201 = arith.extsi %139 : i32 to i64
2026-02-21T09:52:25.7160460Z       %202 = arith.extsi %142 : i32 to i64
2026-02-21T09:52:25.7160547Z       %203 = tt.splat %201 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7160635Z       %204 = arith.addi %203, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7160776Z       %205 = tt.expand_dims %204 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7160837Z       %206 = arith.muli %205, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7160924Z       %207 = tt.broadcast %206 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7161008Z       %208 = tt.splat %202 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7161105Z       %209 = arith.addi %208, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7161245Z       %210 = tt.expand_dims %209 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7161329Z       %211 = tt.broadcast %210 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7161387Z       %212 = arith.addi %207, %211 : tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7161488Z       %213 = tt.addptr %16, %212 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7161554Z       %214 = arith.cmpi sge, %205, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7161620Z       %215 = arith.cmpi slt, %205, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7161681Z       %216 = arith.andi %214, %215 : tensor<128x1xi1, #mma>
2026-02-21T09:52:25.7161763Z       %217 = tt.broadcast %216 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7161826Z       %218 = arith.cmpi sge, %210, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7161889Z       %219 = arith.cmpi slt, %210, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7161949Z       %220 = arith.andi %218, %219 : tensor<1x256xi1, #mma>
2026-02-21T09:52:25.7162030Z       %221 = tt.broadcast %220 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7162088Z       %222 = arith.andi %217, %221 : tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7162172Z       tt.store %213, %200, %222 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.7162220Z       %223 = arith.addi %arg3, %c19456_i32 : i32
2026-02-21T09:52:25.7162265Z       %224 = arith.divsi %223, %c256_i32 : i32
2026-02-21T09:52:25.7162311Z       %225 = arith.muli %224, %c8_i32 : i32
2026-02-21T09:52:25.7162355Z       %226 = arith.subi %c128_i32, %225 : i32
2026-02-21T09:52:25.7162399Z       %227 = arith.minsi %226, %c8_i32 : i32
2026-02-21T09:52:25.7162442Z       %228 = arith.remsi %223, %c256_i32 : i32
2026-02-21T09:52:25.7162487Z       %229 = arith.remsi %228, %227 : i32
2026-02-21T09:52:25.7162529Z       %230 = arith.addi %225, %229 : i32
2026-02-21T09:52:25.7162616Z       %231 = arith.divsi %228, %227 : i32
2026-02-21T09:52:25.7162685Z       %232 = arith.muli %230, %c128_i32 : i32
2026-02-21T09:52:25.7162779Z       %233 = tt.splat %232 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.7162872Z       %234 = arith.addi %233, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.7162919Z       %235 = arith.muli %231, %c256_i32 : i32
2026-02-21T09:52:25.7163010Z       %236 = tt.splat %235 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:52:25.7163123Z       %237 = arith.addi %236, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:52:25.7163276Z       %238 = tt.expand_dims %234 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.7163344Z       %239 = arith.muli %238, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.7163439Z       %240 = tt.broadcast %239 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7163591Z       %241 = tt.expand_dims %237 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:52:25.7163684Z       %242 = tt.broadcast %241 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7163773Z       %243 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.7163837Z       %244 = arith.addi %240, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7163941Z       %245 = tt.addptr %7, %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7164006Z       %246 = tt.load %245 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7164210Z       %247 = ttg.memdesc_index %243[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7164356Z       ttg.local_store %246, %247 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7164417Z       %248 = arith.addi %240, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7164518Z       %249 = tt.addptr %7, %248 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7164584Z       %250 = tt.load %249 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7164766Z       %251 = ttg.memdesc_index %243[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7164906Z       ttg.local_store %250, %251 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7165254Z       %252:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %247, %arg8 = %251) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:52:25.7165354Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7165448Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7165511Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:52:25.7165554Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:52:25.7165646Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7165738Z         %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7165882Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.7165976Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7166057Z         %417 = arith.addi %240, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7166160Z         %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7166224Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7166427Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7166643Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7166787Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7166854Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7166948Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7167010Z         %425 = arith.addi %424, %242 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7167112Z         %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7167174Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7167322Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7167424Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7167524Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7167635Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7167788Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7167937Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7168035Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7168144Z         %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7168239Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7168338Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7168433Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7168526Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7168646Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7168819Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7169100Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7169148Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:25.7169200Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:52:25.7169250Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:52:25.7169432Z         %446 = ttg.memdesc_index %243[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7169595Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7169814Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7169861Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:52:25.7170075Z       %253 = ttg.local_load %252#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7170270Z       %254 = arith.extf %253 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7170333Z       %255 = arith.addi %64, %242 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7170437Z       %256 = tt.addptr %8, %255 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7170498Z       %257 = tt.load %256 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7170641Z       %258 = ttg.convert_layout %257 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7170745Z       %259 = arith.shli %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7170844Z       %260 = arith.shrsi %259, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7170942Z       %261 = arith.shrsi %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7171109Z       %262 = tt.expand_dims %260 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7171257Z       %263 = tt.expand_dims %261 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7171354Z       %264 = tt.broadcast %262 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7171462Z       %265 = arith.select %13, %264, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7171556Z       %266 = tt.broadcast %263 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7171654Z       %267 = arith.select %15, %266, %265 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7171748Z       %268 = tt.reshape %267 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7171840Z       %269 = arith.sitofp %268 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7171959Z       %270 = ttg.local_alloc %269 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7172130Z       %271 = ttg.local_load %270 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7172394Z       %272 = tt.dot %254, %271, %252#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7172604Z       %273 = ttg.local_load %252#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7172803Z       %274 = arith.extf %273 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7172865Z       %275 = arith.addi %88, %242 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7172967Z       %276 = tt.addptr %8, %275 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7173045Z       %277 = tt.load %276 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7173190Z       %278 = ttg.convert_layout %277 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7173288Z       %279 = arith.shli %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7173390Z       %280 = arith.shrsi %279, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7173511Z       %281 = arith.shrsi %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7173659Z       %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7173808Z       %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7173903Z       %284 = tt.broadcast %282 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7174008Z       %285 = arith.select %13, %284, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7174103Z       %286 = tt.broadcast %283 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7174202Z       %287 = arith.select %15, %286, %285 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7174293Z       %288 = tt.reshape %287 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7174392Z       %289 = arith.sitofp %288 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7174510Z       %290 = ttg.local_alloc %289 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7174693Z       %291 = ttg.local_load %290 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7174957Z       %292 = tt.dot %274, %291, %272, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7175045Z       ttg.local_dealloc %243 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.7175134Z       %293 = arith.truncf %292 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:52:25.7175181Z       %294 = arith.extsi %232 : i32 to i64
2026-02-21T09:52:25.7175224Z       %295 = arith.extsi %235 : i32 to i64
2026-02-21T09:52:25.7175310Z       %296 = tt.splat %294 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7175398Z       %297 = arith.addi %296, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7175544Z       %298 = tt.expand_dims %297 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7175606Z       %299 = arith.muli %298, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7175689Z       %300 = tt.broadcast %299 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7175775Z       %301 = tt.splat %295 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7175874Z       %302 = arith.addi %301, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7176012Z       %303 = tt.expand_dims %302 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7176097Z       %304 = tt.broadcast %303 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7176156Z       %305 = arith.addi %300, %304 : tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7176255Z       %306 = tt.addptr %16, %305 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7176323Z       %307 = arith.cmpi sge, %298, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7176401Z       %308 = arith.cmpi slt, %298, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7176457Z       %309 = arith.andi %307, %308 : tensor<128x1xi1, #mma>
2026-02-21T09:52:25.7176538Z       %310 = tt.broadcast %309 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7176605Z       %311 = arith.cmpi sge, %303, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7176669Z       %312 = arith.cmpi slt, %303, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7176739Z       %313 = arith.andi %311, %312 : tensor<1x256xi1, #mma>
2026-02-21T09:52:25.7176823Z       %314 = tt.broadcast %313 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7176881Z       %315 = arith.andi %310, %314 : tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7176949Z       tt.store %306, %293, %315 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.7177000Z       %316 = arith.addi %arg3, %c29184_i32 : i32
2026-02-21T09:52:25.7177045Z       %317 = arith.divsi %316, %c256_i32 : i32
2026-02-21T09:52:25.7177088Z       %318 = arith.muli %317, %c8_i32 : i32
2026-02-21T09:52:25.7177134Z       %319 = arith.subi %c128_i32, %318 : i32
2026-02-21T09:52:25.7177177Z       %320 = arith.minsi %319, %c8_i32 : i32
2026-02-21T09:52:25.7177220Z       %321 = arith.remsi %316, %c256_i32 : i32
2026-02-21T09:52:25.7177261Z       %322 = arith.remsi %321, %320 : i32
2026-02-21T09:52:25.7177304Z       %323 = arith.addi %318, %322 : i32
2026-02-21T09:52:25.7177344Z       %324 = arith.divsi %321, %320 : i32
2026-02-21T09:52:25.7177387Z       %325 = arith.muli %323, %c128_i32 : i32
2026-02-21T09:52:25.7177484Z       %326 = tt.splat %325 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.7177576Z       %327 = arith.addi %326, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.7177632Z       %328 = arith.muli %324, %c256_i32 : i32
2026-02-21T09:52:25.7177727Z       %329 = tt.splat %328 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:52:25.7177819Z       %330 = arith.addi %329, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:52:25.7177969Z       %331 = tt.expand_dims %327 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.7178039Z       %332 = arith.muli %331, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.7178133Z       %333 = tt.broadcast %332 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7178280Z       %334 = tt.expand_dims %330 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:52:25.7178375Z       %335 = tt.broadcast %334 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7178462Z       %336 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.7178522Z       %337 = arith.addi %333, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7178625Z       %338 = tt.addptr %7, %337 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7178693Z       %339 = tt.load %338 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7178878Z       %340 = ttg.memdesc_index %336[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7179036Z       ttg.local_store %339, %340 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7179098Z       %341 = arith.addi %333, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7179199Z       %342 = tt.addptr %7, %341 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7179264Z       %343 = tt.load %342 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7179452Z       %344 = ttg.memdesc_index %336[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7179615Z       ttg.local_store %343, %344 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7179959Z       %345:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %340, %arg8 = %344) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:52:25.7180075Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7180166Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7180212Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:52:25.7180258Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:52:25.7180351Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7180442Z         %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7180588Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.7180681Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7180744Z         %417 = arith.addi %333, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7180852Z         %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7180915Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7181129Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7181334Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7181477Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7181543Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7181638Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7181700Z         %425 = arith.addi %424, %335 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7181800Z         %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7181865Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7182013Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7182113Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7182216Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7182329Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7182483Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7182632Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7182729Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7182836Z         %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7182950Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7183050Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7183140Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7183237Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7183367Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7183538Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7183806Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7183855Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:25.7183903Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:52:25.7183956Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:52:25.7184138Z         %446 = ttg.memdesc_index %336[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7184284Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7184519Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7184566Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:52:25.7184765Z       %346 = ttg.local_load %345#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7184965Z       %347 = arith.extf %346 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7185027Z       %348 = arith.addi %64, %335 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7185130Z       %349 = tt.addptr %8, %348 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7185193Z       %350 = tt.load %349 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7185337Z       %351 = ttg.convert_layout %350 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7185439Z       %352 = arith.shli %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7185540Z       %353 = arith.shrsi %352, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7185639Z       %354 = arith.shrsi %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7185789Z       %355 = tt.expand_dims %353 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7185951Z       %356 = tt.expand_dims %354 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7186048Z       %357 = tt.broadcast %355 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7186152Z       %358 = arith.select %13, %357, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7186249Z       %359 = tt.broadcast %356 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7186348Z       %360 = arith.select %15, %359, %358 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7186457Z       %361 = tt.reshape %360 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7186551Z       %362 = arith.sitofp %361 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7186670Z       %363 = ttg.local_alloc %362 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7186853Z       %364 = ttg.local_load %363 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7187121Z       %365 = tt.dot %347, %364, %345#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7187317Z       %366 = ttg.local_load %345#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7187513Z       %367 = arith.extf %366 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7187577Z       %368 = arith.addi %88, %335 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7187679Z       %369 = tt.addptr %8, %368 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7187739Z       %370 = tt.load %369 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7187887Z       %371 = ttg.convert_layout %370 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7187985Z       %372 = arith.shli %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7188101Z       %373 = arith.shrsi %372, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7188201Z       %374 = arith.shrsi %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7188351Z       %375 = tt.expand_dims %373 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7188496Z       %376 = tt.expand_dims %374 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7188596Z       %377 = tt.broadcast %375 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7188701Z       %378 = arith.select %13, %377, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7188795Z       %379 = tt.broadcast %376 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7188898Z       %380 = arith.select %15, %379, %378 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7188988Z       %381 = tt.reshape %380 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7189081Z       %382 = arith.sitofp %381 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7189200Z       %383 = ttg.local_alloc %382 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7189382Z       %384 = ttg.local_load %383 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7189644Z       %385 = tt.dot %367, %384, %365, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7189733Z       ttg.local_dealloc %336 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.7189824Z       %386 = arith.truncf %385 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:52:25.7189867Z       %387 = arith.extsi %325 : i32 to i64
2026-02-21T09:52:25.7189928Z       %388 = arith.extsi %328 : i32 to i64
2026-02-21T09:52:25.7190016Z       %389 = tt.splat %387 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7190103Z       %390 = arith.addi %389, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7190244Z       %391 = tt.expand_dims %390 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7190320Z       %392 = arith.muli %391, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7190404Z       %393 = tt.broadcast %392 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7190488Z       %394 = tt.splat %388 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7190576Z       %395 = arith.addi %394, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7190713Z       %396 = tt.expand_dims %395 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7190796Z       %397 = tt.broadcast %396 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7190859Z       %398 = arith.addi %393, %397 : tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7190956Z       %399 = tt.addptr %16, %398 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7191022Z       %400 = arith.cmpi sge, %391, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7191090Z       %401 = arith.cmpi slt, %391, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7191147Z       %402 = arith.andi %400, %401 : tensor<128x1xi1, #mma>
2026-02-21T09:52:25.7191229Z       %403 = tt.broadcast %402 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7191308Z       %404 = arith.cmpi sge, %396, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7191372Z       %405 = arith.cmpi slt, %396, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7191429Z       %406 = arith.andi %404, %405 : tensor<1x256xi1, #mma>
2026-02-21T09:52:25.7191512Z       %407 = tt.broadcast %406 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7191570Z       %408 = arith.andi %403, %407 : tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7191638Z       tt.store %399, %386, %408 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.7191682Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:52:25.7191747Z     scf.for %arg3 = %24 to %c4096_i32 step %c9728_i32  : i32 {
2026-02-21T09:52:25.7191794Z       %25 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:52:25.7191836Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:52:25.7191881Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:52:25.7191924Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:52:25.7191968Z       %29 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:52:25.7192016Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:52:25.7192056Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:52:25.7192099Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:52:25.7192141Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:52:25.7192235Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.7192326Z       %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:25.7192386Z       %36 = arith.muli %32, %c256_i32 : i32
2026-02-21T09:52:25.7192480Z       %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:52:25.7192568Z       %38 = arith.addi %37, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:52:25.7192715Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.7192784Z       %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:25.7192875Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7193034Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:52:25.7193128Z       %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7193214Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.7193367Z       %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.7193459Z       %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7193517Z       %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7193619Z       %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7193681Z       %49 = tt.load %48 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7193868Z       %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7194008Z       ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7194103Z       %51 = arith.addi %6, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7194243Z       %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.7194330Z       %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7194389Z       %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7194506Z       %55 = tt.addptr %7, %54 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7194567Z       %56 = tt.load %55 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7194747Z       %57 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7194886Z       ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7195228Z       %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:52:25.7195329Z         %130 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7195422Z         %131 = arith.addi %130, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7195472Z         %132 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:52:25.7195517Z         %133 = arith.muli %132, %c2_i32 : i32
2026-02-21T09:52:25.7195612Z         %134 = tt.splat %133 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7195702Z         %135 = arith.addi %134, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:25.7195860Z         %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:25.7195955Z         %137 = tt.broadcast %136 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7196017Z         %138 = arith.addi %41, %137 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7196123Z         %139 = tt.addptr %7, %138 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:25.7196190Z         %140 = tt.load %139 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:25.7196405Z         %141 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7196602Z         %142 = arith.extf %141 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7196748Z         %143 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7196828Z         %144 = arith.muli %143, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7196920Z         %145 = tt.broadcast %144 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7196984Z         %146 = arith.addi %145, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7197086Z         %147 = tt.addptr %8, %146 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7197147Z         %148 = tt.load %147 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7197296Z         %149 = ttg.convert_layout %148 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7197395Z         %150 = arith.shli %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7197495Z         %151 = arith.shrsi %150, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7197598Z         %152 = arith.shrsi %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7197748Z         %153 = tt.expand_dims %151 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7197912Z         %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7198012Z         %155 = tt.broadcast %153 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7198120Z         %156 = arith.select %13, %155, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7198216Z         %157 = tt.broadcast %154 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7198321Z         %158 = arith.select %15, %157, %156 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7198412Z         %159 = tt.reshape %158 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7198506Z         %160 = arith.sitofp %159 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7198632Z         %161 = ttg.local_alloc %160 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7198804Z         %162 = ttg.local_load %161 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7199071Z         %163 = tt.dot %142, %162, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7199133Z         %164 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:25.7199182Z         %165 = arith.cmpi slt, %164, %c2_i32 : i32
2026-02-21T09:52:25.7199233Z         %166 = arith.select %165, %164, %c0_i32 : i32
2026-02-21T09:52:25.7199418Z         %167 = ttg.memdesc_index %44[%166] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7199562Z         ttg.local_store %140, %167 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7199785Z         scf.yield %163, %166, %arg8, %167 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:52:25.7199851Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:52:25.7199943Z       %59 = arith.addi %5, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7200137Z       %60 = ttg.local_load %58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7200348Z       %61 = arith.extf %60 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7200489Z       %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7200551Z       %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7200644Z       %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7200704Z       %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7200802Z       %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7200865Z       %67 = tt.load %66 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7201007Z       %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7201104Z       %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7201205Z       %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7201300Z       %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7201469Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7201616Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7201710Z       %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7201812Z       %75 = arith.select %13, %74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7201906Z       %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7202001Z       %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7202088Z       %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7202180Z       %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7202296Z       %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7202464Z       %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7202763Z       %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7202873Z       %83 = arith.addi %5, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:25.7203065Z       %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7203264Z       %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7203423Z       %86 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7203484Z       %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:52:25.7203573Z       %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7203632Z       %89 = arith.addi %88, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7203744Z       %90 = tt.addptr %8, %89 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:52:25.7203805Z       %91 = tt.load %90 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:52:25.7203946Z       %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7204041Z       %93 = arith.shli %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7204138Z       %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7204236Z       %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:25.7204380Z       %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7204523Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:52:25.7204619Z       %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7204719Z       %99 = arith.select %13, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7204837Z       %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7204939Z       %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:52:25.7205030Z       %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:52:25.7205123Z       %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:52:25.7205245Z       %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:52:25.7205414Z       %105 = ttg.local_load %104 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:25.7205671Z       %106 = tt.dot %85, %105, %82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:52:25.7205760Z       ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:25.7205849Z       %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:52:25.7205894Z       %108 = arith.extsi %33 : i32 to i64
2026-02-21T09:52:25.7205938Z       %109 = arith.extsi %36 : i32 to i64
2026-02-21T09:52:25.7206025Z       %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7206123Z       %111 = arith.addi %110, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:25.7206264Z       %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7206324Z       %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7206407Z       %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7206493Z       %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7206578Z       %116 = arith.addi %115, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:25.7206731Z       %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7206817Z       %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7206875Z       %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7206971Z       %120 = tt.addptr %16, %119 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:52:25.7207053Z       %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7207118Z       %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:52:25.7207174Z       %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma>
2026-02-21T09:52:25.7207256Z       %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7207324Z       %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7207390Z       %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:52:25.7207446Z       %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma>
2026-02-21T09:52:25.7207528Z       %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7207587Z       %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma>
2026-02-21T09:52:25.7207654Z       tt.store %120, %107, %129 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:25.7207697Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:52:25.7207731Z     tt.return
2026-02-21T09:52:25.7207764Z   }
2026-02-21T09:52:25.7207797Z }
2026-02-21T09:52:25.7207803Z 
2026-02-21T09:52:25.7207833Z {-#
2026-02-21T09:52:25.7207874Z   external_resources: {
2026-02-21T09:52:25.7207911Z     mlir_reproducer: {
2026-02-21T09:52:25.7208863Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:52:25.7208908Z       disable_threading: false,
2026-02-21T09:52:25.7208947Z       verify_each: true
2026-02-21T09:52:25.7208979Z     }
2026-02-21T09:52:25.7209010Z   }
2026-02-21T09:52:25.7209040Z #-}
2026-02-21T09:52:25.7209280Z /tmp/torchinductor_root/su/csuirppd7sjaegtojwbrn3szezptbqy4dkrx3wujv6lrttfuq6gg.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:52:25.7209705Z /tmp/torchinductor_root/su/csuirppd7sjaegtojwbrn3szezptbqy4dkrx3wujv6lrttfuq6gg.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:52:25.7209819Z [476s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:52:25.7210455Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:52:25.7210532Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:52:25.7210616Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:52:28.7390219Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:52:28.7398056Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:52:28.7398400Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:52:28.7398805Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:52:28.7399103Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:52:28.7399406Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:52:28.7399670Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:52:28.7399904Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:52:28.7400089Z #smem = #ttg.shared_memory
2026-02-21T09:52:28.7400327Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:52:28.7400798Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:52:28.7401261Z     %cst = arith.constant dense<4> : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7401405Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:52:28.7401522Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:52:28.7401634Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:52:28.7401864Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:52:28.7402007Z     %cst_0 = arith.constant dense<0> : tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7402190Z     %cst_1 = arith.constant dense<0> : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7402339Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T09:52:28.7402457Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:52:28.7402645Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:52:28.7402757Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:52:28.7402902Z     %cst_2 = arith.constant dense<1024> : tensor<64x1xi32, #blocked2>
2026-02-21T09:52:28.7403079Z     %cst_3 = arith.constant dense<8192> : tensor<1x64xi64, #blocked>
2026-02-21T09:52:28.7403257Z     %cst_4 = arith.constant dense<0> : tensor<1x64xi64, #blocked>
2026-02-21T09:52:28.7403427Z     %cst_5 = arith.constant dense<512> : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7403594Z     %cst_6 = arith.constant dense<0> : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7403767Z     %cst_7 = arith.constant dense<8192> : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7403974Z     %cst_8 = arith.constant dense<4> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:28.7404227Z     %cst_9 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:28.7404420Z     %c504_i32 = arith.constant 504 : i32
2026-02-21T09:52:28.7404538Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:52:28.7404748Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<64x64xf32, #mma>
2026-02-21T09:52:28.7404935Z     %cst_11 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:28.7405116Z     %cst_12 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:28.7405290Z     %cst_13 = arith.constant dense<8192> : tensor<64x1xi32, #mma>
2026-02-21T09:52:28.7405439Z     %0 = tt.get_program_id x : i32
2026-02-21T09:52:28.7405558Z     %1 = arith.muli %0, %c2_i32 : i32
2026-02-21T09:52:28.7405671Z     %2 = arith.addi %1, %c2_i32 : i32
2026-02-21T09:52:28.7405792Z     %3 = arith.minsi %2, %c32768_i32 : i32
2026-02-21T09:52:28.7405993Z     %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:28.7406297Z     %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:28.7406562Z     %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:28.7406829Z     %7 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:28.7407115Z     %8 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:28.7407357Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<64x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:28.7407564Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:28.7407800Z     %11 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:28.7408122Z     %12 = arith.extsi %11 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:28.7408484Z     %13 = arith.extsi %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:28.7408851Z     %14 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:52:28.7409279Z     %15 = tt.expand_dims %14 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:52:28.7409711Z     %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:28.7409973Z     %17 = arith.cmpi eq, %16, %cst_11 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:28.7410191Z     %18 = tt.broadcast %17 : tensor<1x2x1xi1, #blocked1> -> tensor<4x2x64xi1, #blocked1>
2026-02-21T09:52:28.7410408Z     %19 = arith.cmpi eq, %16, %cst_12 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:28.7410615Z     %20 = tt.broadcast %19 : tensor<1x2x1xi1, #blocked1> -> tensor<4x2x64xi1, #blocked1>
2026-02-21T09:52:28.7410840Z     %21 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:28.7411010Z     %22 = arith.subi %3, %1 : i32
2026-02-21T09:52:28.7411132Z     %23 = arith.remsi %22, %c2_i32 : i32
2026-02-21T09:52:28.7411260Z     %24 = arith.subi %22, %23 : i32
2026-02-21T09:52:28.7411382Z     %25 = arith.addi %1, %24 : i32
2026-02-21T09:52:28.7411516Z     scf.for %arg3 = %1 to %25 step %c2_i32  : i32 {
2026-02-21T09:52:28.7411669Z       %26 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:52:28.7411795Z       %27 = arith.muli %26, %c2_i32 : i32
2026-02-21T09:52:28.7411922Z       %28 = arith.subi %c128_i32, %27 : i32
2026-02-21T09:52:28.7412045Z       %29 = arith.minsi %28, %c2_i32 : i32
2026-02-21T09:52:28.7412170Z       %30 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:52:28.7412298Z       %31 = arith.remsi %30, %29 : i32
2026-02-21T09:52:28.7412412Z       %32 = arith.addi %27, %31 : i32
2026-02-21T09:52:28.7412530Z       %33 = arith.divsi %30, %29 : i32
2026-02-21T09:52:28.7412665Z       %34 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:52:28.7412828Z       %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:28.7413040Z       %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:28.7413205Z       %37 = arith.muli %33, %c64_i32 : i32
2026-02-21T09:52:28.7413380Z       %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:28.7413595Z       %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:28.7413811Z       %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:28.7414043Z       %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:28.7414316Z       %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:52:28.7414603Z       %43 = arith.muli %42, %cst_2 : tensor<64x1xi32, #blocked2>
2026-02-21T09:52:28.7414798Z       %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked2> -> tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7414995Z       %45 = arith.extsi %34 : i32 to i64
2026-02-21T09:52:28.7415164Z       %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:28.7415388Z       %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:28.7415663Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi64, #blocked>
2026-02-21T09:52:28.7415935Z       %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked> -> tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7416137Z       %50 = arith.cmpi sge, %48, %cst_4 : tensor<1x64xi64, #blocked>
2026-02-21T09:52:28.7416307Z       %51 = arith.cmpi slt, %48, %cst_3 : tensor<1x64xi64, #blocked>
2026-02-21T09:52:28.7416473Z       %52 = arith.andi %50, %51 : tensor<1x64xi1, #blocked>
2026-02-21T09:52:28.7416656Z       %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked> -> tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7416868Z       %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable>
2026-02-21T09:52:28.7417147Z       %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:52:28.7417419Z       %56 = tt.broadcast %55 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7417633Z       %57 = arith.addi %44, %56 : tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7417835Z       %58 = tt.addptr %9, %57 : tensor<64x8x!tt.ptr<bf16>, #blocked2>, tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7418042Z       %59 = tt.load %58 : tensor<64x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:28.7418289Z       %60 = tt.expand_dims %12 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7418535Z       %61 = arith.muli %60, %cst_7 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7418725Z       %62 = tt.broadcast %61 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7418915Z       %63 = arith.addi %62, %49 : tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7419111Z       %64 = tt.addptr %10, %63 : tensor<4x64x!tt.ptr<i8>, #blocked>, tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7419318Z       %65 = arith.cmpi sge, %60, %cst_6 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7419489Z       %66 = arith.cmpi slt, %60, %cst_5 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7419654Z       %67 = arith.andi %65, %66 : tensor<4x1xi1, #blocked>
2026-02-21T09:52:28.7419834Z       %68 = tt.broadcast %67 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7420023Z       %69 = arith.andi %68, %53 : tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7420239Z       %70 = tt.load %64, %69, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:28.7420611Z       %71 = ttg.memdesc_index %54[%c0_i32] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7420995Z       ttg.local_store %59, %71 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7421284Z       %72 = arith.addi %8, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:28.7421587Z       %73 = tt.expand_dims %72 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:52:28.7421870Z       %74 = tt.broadcast %73 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7422078Z       %75 = arith.addi %44, %74 : tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7422276Z       %76 = tt.addptr %9, %75 : tensor<64x8x!tt.ptr<bf16>, #blocked2>, tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7422477Z       %77 = tt.load %76 : tensor<64x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:28.7422671Z       %78 = arith.addi %12, %cst_8 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:28.7422962Z       %79 = tt.expand_dims %78 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7423201Z       %80 = arith.muli %79, %cst_7 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7423389Z       %81 = tt.broadcast %80 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7423576Z       %82 = arith.addi %81, %49 : tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7423766Z       %83 = tt.addptr %10, %82 : tensor<4x64x!tt.ptr<i8>, #blocked>, tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7423975Z       %84 = arith.cmpi sge, %79, %cst_6 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7424141Z       %85 = arith.cmpi slt, %79, %cst_5 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7424305Z       %86 = arith.andi %84, %85 : tensor<4x1xi1, #blocked>
2026-02-21T09:52:28.7424481Z       %87 = tt.broadcast %86 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7424667Z       %88 = arith.andi %87, %53 : tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7424873Z       %89 = tt.load %83, %88, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:28.7425206Z       %90 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7425590Z       ttg.local_store %77, %90 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7426205Z       %91:6 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %71, %arg8 = %90, %arg9 = %70, %arg10 = %89) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked>)  : i32 {
2026-02-21T09:52:28.7426725Z         %227 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:52:28.7426858Z         %228 = arith.muli %227, %c2_i32 : i32
2026-02-21T09:52:28.7427037Z         %229 = tt.splat %228 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:28.7427269Z         %230 = arith.addi %229, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:28.7427549Z         %231 = tt.expand_dims %230 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:52:28.7427833Z         %232 = tt.broadcast %231 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7428033Z         %233 = arith.addi %44, %232 : tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7428234Z         %234 = tt.addptr %9, %233 : tensor<64x8x!tt.ptr<bf16>, #blocked2>, tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7428461Z         %235 = tt.load %234 : tensor<64x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:28.7428762Z         %236 = ttg.local_load %arg7 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7429203Z         %237 = arith.extf %236 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7429491Z         %238 = arith.extsi %227 : i32 to i64
2026-02-21T09:52:28.7429670Z         %239 = tt.splat %238 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:28.7429919Z         %240 = arith.addi %239, %12 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:28.7430214Z         %241 = tt.expand_dims %240 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7430470Z         %242 = arith.muli %241, %cst_7 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7430668Z         %243 = tt.broadcast %242 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7430862Z         %244 = arith.addi %243, %49 : tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7431080Z         %245 = tt.addptr %10, %244 : tensor<4x64x!tt.ptr<i8>, #blocked>, tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7431289Z         %246 = arith.cmpi sge, %241, %cst_6 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7431472Z         %247 = arith.cmpi slt, %241, %cst_5 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7431642Z         %248 = arith.andi %246, %247 : tensor<4x1xi1, #blocked>
2026-02-21T09:52:28.7431830Z         %249 = tt.broadcast %248 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7432024Z         %250 = arith.andi %249, %53 : tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7432193Z         %251 = tt.load %245, %250, %cst_1 : tensor<4x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:28.7432376Z         %252 = arith.shli %arg9, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7432537Z         %253 = arith.shrsi %252, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7432787Z         %254 = ttg.convert_layout %253 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7433040Z         %255 = arith.shrsi %arg9, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7433282Z         %256 = ttg.convert_layout %255 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7433639Z         %257 = tt.expand_dims %254 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7433983Z         %258 = tt.expand_dims %256 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7434274Z         %259 = tt.broadcast %257 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7434529Z         %260 = arith.select %18, %259, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7434776Z         %261 = tt.broadcast %258 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7435021Z         %262 = arith.select %20, %261, %260 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7435255Z         %263 = tt.reshape %262 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:52:28.7435486Z         %264 = arith.sitofp %263 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:52:28.7435745Z         %265 = ttg.local_alloc %264 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:52:28.7436073Z         %266 = ttg.local_load %265 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7436573Z         %267 = tt.dot %237, %266, %arg5, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:52:28.7436929Z         %268 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:28.7437060Z         %269 = arith.cmpi slt, %268, %c2_i32 : i32
2026-02-21T09:52:28.7437203Z         %270 = arith.select %269, %268, %c0_i32 : i32
2026-02-21T09:52:28.7437470Z         %271 = ttg.memdesc_index %54[%270] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7437844Z         ttg.local_store %235, %271 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7438336Z         scf.yield %267, %270, %arg8, %271, %arg10, %251 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7438757Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:52:28.7439081Z       %92 = ttg.local_load %91#2 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7439504Z       %93 = arith.extf %92 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7439797Z       %94 = arith.shli %91#4, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7439955Z       %95 = arith.shrsi %94, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7440194Z       %96 = ttg.convert_layout %95 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7440431Z       %97 = arith.shrsi %91#4, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7440667Z       %98 = ttg.convert_layout %97 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7440995Z       %99 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7441333Z       %100 = tt.expand_dims %98 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7441638Z       %101 = tt.broadcast %99 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7441876Z       %102 = arith.select %18, %101, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7442119Z       %103 = tt.broadcast %100 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7442354Z       %104 = arith.select %20, %103, %102 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7442620Z       %105 = tt.reshape %104 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:52:28.7442845Z       %106 = arith.sitofp %105 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:52:28.7443097Z       %107 = ttg.local_alloc %106 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:52:28.7443422Z       %108 = ttg.local_load %107 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7443887Z       %109 = tt.dot %93, %108, %91#0, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:52:28.7444371Z       %110 = ttg.local_load %91#3 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7444796Z       %111 = arith.extf %110 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7445109Z       %112 = arith.shli %91#5, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7445267Z       %113 = arith.shrsi %112, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7445509Z       %114 = ttg.convert_layout %113 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7445751Z       %115 = arith.shrsi %91#5, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7445991Z       %116 = ttg.convert_layout %115 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7446342Z       %117 = tt.expand_dims %114 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7446681Z       %118 = tt.expand_dims %116 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7446970Z       %119 = tt.broadcast %117 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7447242Z       %120 = arith.select %18, %119, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7447482Z       %121 = tt.broadcast %118 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7447719Z       %122 = arith.select %20, %121, %120 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7447949Z       %123 = tt.reshape %122 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:52:28.7448172Z       %124 = arith.sitofp %123 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:52:28.7448419Z       %125 = ttg.local_alloc %124 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:52:28.7448739Z       %126 = ttg.local_load %125 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7449205Z       %127 = tt.dot %111, %126, %109, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:52:28.7449583Z       ttg.local_dealloc %54 : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable>
2026-02-21T09:52:28.7449811Z       %128 = arith.truncf %127 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:52:28.7450071Z       %129 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:52:28.7450309Z       %130 = arith.muli %129, %cst_13 : tensor<64x1xi32, #mma>
2026-02-21T09:52:28.7450540Z       %131 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:52:28.7450792Z       %132 = tt.broadcast %130 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7450992Z       %133 = tt.broadcast %131 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7451168Z       %134 = arith.addi %132, %133 : tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7451356Z       %135 = tt.addptr %21, %134 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7451552Z       tt.store %135, %128 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:28.7451693Z       %136 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:52:28.7451819Z       %137 = arith.divsi %136, %c512_i32 : i32
2026-02-21T09:52:28.7451938Z       %138 = arith.muli %137, %c2_i32 : i32
2026-02-21T09:52:28.7452060Z       %139 = arith.subi %c128_i32, %138 : i32
2026-02-21T09:52:28.7452177Z       %140 = arith.minsi %139, %c2_i32 : i32
2026-02-21T09:52:28.7452297Z       %141 = arith.remsi %136, %c512_i32 : i32
2026-02-21T09:52:28.7452414Z       %142 = arith.remsi %141, %140 : i32
2026-02-21T09:52:28.7452543Z       %143 = arith.addi %138, %142 : i32
2026-02-21T09:52:28.7452656Z       %144 = arith.divsi %141, %140 : i32
2026-02-21T09:52:28.7452771Z       %145 = arith.muli %143, %c64_i32 : i32
2026-02-21T09:52:28.7452933Z       %146 = tt.splat %145 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:28.7453142Z       %147 = arith.addi %146, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:28.7453311Z       %148 = arith.muli %144, %c64_i32 : i32
2026-02-21T09:52:28.7453477Z       %149 = tt.splat %148 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:28.7453691Z       %150 = tt.splat %148 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:28.7453928Z       %151 = arith.addi %149, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:28.7454139Z       %152 = arith.addi %150, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:28.7454413Z       %153 = tt.expand_dims %151 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:52:28.7454664Z       %154 = arith.muli %153, %cst_2 : tensor<64x1xi32, #blocked2>
2026-02-21T09:52:28.7454882Z       %155 = tt.broadcast %154 : tensor<64x1xi32, #blocked2> -> tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7455060Z       %156 = arith.extsi %145 : i32 to i64
2026-02-21T09:52:28.7455225Z       %157 = tt.splat %156 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:28.7455444Z       %158 = arith.addi %157, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:28.7455718Z       %159 = tt.expand_dims %158 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi64, #blocked>
2026-02-21T09:52:28.7455994Z       %160 = tt.broadcast %159 : tensor<1x64xi64, #blocked> -> tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7456194Z       %161 = arith.cmpi sge, %159, %cst_4 : tensor<1x64xi64, #blocked>
2026-02-21T09:52:28.7456366Z       %162 = arith.cmpi slt, %159, %cst_3 : tensor<1x64xi64, #blocked>
2026-02-21T09:52:28.7456530Z       %163 = arith.andi %161, %162 : tensor<1x64xi1, #blocked>
2026-02-21T09:52:28.7456712Z       %164 = tt.broadcast %163 : tensor<1x64xi1, #blocked> -> tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7456926Z       %165 = ttg.local_alloc : () -> !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable>
2026-02-21T09:52:28.7457112Z       %166 = arith.addi %155, %56 : tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7457333Z       %167 = tt.addptr %9, %166 : tensor<64x8x!tt.ptr<bf16>, #blocked2>, tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7457541Z       %168 = tt.load %167 : tensor<64x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:28.7457697Z       %169 = arith.addi %62, %160 : tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7457890Z       %170 = tt.addptr %10, %169 : tensor<4x64x!tt.ptr<i8>, #blocked>, tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7458085Z       %171 = arith.andi %68, %164 : tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7458299Z       %172 = tt.load %170, %171, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:28.7458639Z       %173 = ttg.memdesc_index %165[%c0_i32] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7458997Z       ttg.local_store %168, %173 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7459237Z       %174 = arith.addi %155, %74 : tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7459433Z       %175 = tt.addptr %9, %174 : tensor<64x8x!tt.ptr<bf16>, #blocked2>, tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7459641Z       %176 = tt.load %175 : tensor<64x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:28.7459798Z       %177 = arith.addi %81, %160 : tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7459989Z       %178 = tt.addptr %10, %177 : tensor<4x64x!tt.ptr<i8>, #blocked>, tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7468589Z       %179 = arith.andi %87, %164 : tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7468800Z       %180 = tt.load %178, %179, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:28.7469135Z       %181 = ttg.memdesc_index %165[%c1_i32] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7469491Z       ttg.local_store %176, %181 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7470108Z       %182:6 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %173, %arg8 = %181, %arg9 = %172, %arg10 = %180) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked>)  : i32 {
2026-02-21T09:52:28.7470660Z         %227 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:52:28.7470790Z         %228 = arith.muli %227, %c2_i32 : i32
2026-02-21T09:52:28.7470979Z         %229 = tt.splat %228 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:28.7471206Z         %230 = arith.addi %229, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:28.7471483Z         %231 = tt.expand_dims %230 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:52:28.7471762Z         %232 = tt.broadcast %231 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7471962Z         %233 = arith.addi %155, %232 : tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7472162Z         %234 = tt.addptr %9, %233 : tensor<64x8x!tt.ptr<bf16>, #blocked2>, tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7472368Z         %235 = tt.load %234 : tensor<64x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:28.7472669Z         %236 = ttg.local_load %arg7 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7473112Z         %237 = arith.extf %236 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7473397Z         %238 = arith.extsi %227 : i32 to i64
2026-02-21T09:52:28.7473583Z         %239 = tt.splat %238 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:28.7473804Z         %240 = arith.addi %239, %12 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:28.7474081Z         %241 = tt.expand_dims %240 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7474325Z         %242 = arith.muli %241, %cst_7 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7474517Z         %243 = tt.broadcast %242 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7474707Z         %244 = arith.addi %243, %160 : tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7474907Z         %245 = tt.addptr %10, %244 : tensor<4x64x!tt.ptr<i8>, #blocked>, tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7475113Z         %246 = arith.cmpi sge, %241, %cst_6 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7475285Z         %247 = arith.cmpi slt, %241, %cst_5 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7475449Z         %248 = arith.andi %246, %247 : tensor<4x1xi1, #blocked>
2026-02-21T09:52:28.7475632Z         %249 = tt.broadcast %248 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7475822Z         %250 = arith.andi %249, %164 : tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7475986Z         %251 = tt.load %245, %250, %cst_1 : tensor<4x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:28.7476161Z         %252 = arith.shli %arg9, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7476332Z         %253 = arith.shrsi %252, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7476573Z         %254 = ttg.convert_layout %253 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7476819Z         %255 = arith.shrsi %arg9, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7477059Z         %256 = ttg.convert_layout %255 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7477395Z         %257 = tt.expand_dims %254 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7477748Z         %258 = tt.expand_dims %256 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7478035Z         %259 = tt.broadcast %257 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7478282Z         %260 = arith.select %18, %259, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7478524Z         %261 = tt.broadcast %258 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7478781Z         %262 = arith.select %20, %261, %260 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7479012Z         %263 = tt.reshape %262 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:52:28.7479237Z         %264 = arith.sitofp %263 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:52:28.7479489Z         %265 = ttg.local_alloc %264 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:52:28.7479813Z         %266 = ttg.local_load %265 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7480288Z         %267 = tt.dot %237, %266, %arg5, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:52:28.7480638Z         %268 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:28.7480767Z         %269 = arith.cmpi slt, %268, %c2_i32 : i32
2026-02-21T09:52:28.7480901Z         %270 = arith.select %269, %268, %c0_i32 : i32
2026-02-21T09:52:28.7481164Z         %271 = ttg.memdesc_index %165[%270] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7481536Z         ttg.local_store %235, %271 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7482014Z         scf.yield %267, %270, %arg8, %271, %arg10, %251 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7482428Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:52:28.7482779Z       %183 = ttg.local_load %182#2 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7483206Z       %184 = arith.extf %183 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7483505Z       %185 = arith.shli %182#4, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7483667Z       %186 = arith.shrsi %185, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7483908Z       %187 = ttg.convert_layout %186 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7484156Z       %188 = arith.shrsi %182#4, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7484394Z       %189 = ttg.convert_layout %188 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7484745Z       %190 = tt.expand_dims %187 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7485085Z       %191 = tt.expand_dims %189 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7485367Z       %192 = tt.broadcast %190 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7485610Z       %193 = arith.select %18, %192, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7485870Z       %194 = tt.broadcast %191 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7486101Z       %195 = arith.select %20, %194, %193 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7486332Z       %196 = tt.reshape %195 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:52:28.7486552Z       %197 = arith.sitofp %196 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:52:28.7486822Z       %198 = ttg.local_alloc %197 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:52:28.7487148Z       %199 = ttg.local_load %198 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7487614Z       %200 = tt.dot %184, %199, %182#0, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:52:28.7488111Z       %201 = ttg.local_load %182#3 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7488539Z       %202 = arith.extf %201 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7488833Z       %203 = arith.shli %182#5, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7488998Z       %204 = arith.shrsi %203, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7489237Z       %205 = ttg.convert_layout %204 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7489481Z       %206 = arith.shrsi %182#5, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7489739Z       %207 = ttg.convert_layout %206 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7490071Z       %208 = tt.expand_dims %205 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7490409Z       %209 = tt.expand_dims %207 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7490693Z       %210 = tt.broadcast %208 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7490935Z       %211 = arith.select %18, %210, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7491177Z       %212 = tt.broadcast %209 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7491410Z       %213 = arith.select %20, %212, %211 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7491643Z       %214 = tt.reshape %213 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:52:28.7491861Z       %215 = arith.sitofp %214 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:52:28.7492113Z       %216 = ttg.local_alloc %215 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:52:28.7492435Z       %217 = ttg.local_load %216 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7492914Z       %218 = tt.dot %202, %217, %200, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:52:28.7493295Z       ttg.local_dealloc %165 : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable>
2026-02-21T09:52:28.7493505Z       %219 = arith.truncf %218 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:52:28.7493772Z       %220 = tt.expand_dims %152 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:52:28.7494039Z       %221 = arith.muli %220, %cst_13 : tensor<64x1xi32, #mma>
2026-02-21T09:52:28.7494289Z       %222 = tt.expand_dims %147 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:52:28.7494544Z       %223 = tt.broadcast %221 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7494746Z       %224 = tt.broadcast %222 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7494926Z       %225 = arith.addi %223, %224 : tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7495128Z       %226 = tt.addptr %21, %225 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7495324Z       tt.store %226, %219 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:28.7495461Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:52:28.7495585Z     scf.for %arg3 = %25 to %3 step %c1_i32  : i32 {
2026-02-21T09:52:28.7495718Z       %26 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:52:28.7495846Z       %27 = arith.muli %26, %c2_i32 : i32
2026-02-21T09:52:28.7495972Z       %28 = arith.subi %c128_i32, %27 : i32
2026-02-21T09:52:28.7496089Z       %29 = arith.minsi %28, %c2_i32 : i32
2026-02-21T09:52:28.7496210Z       %30 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:52:28.7496326Z       %31 = arith.remsi %30, %29 : i32
2026-02-21T09:52:28.7496440Z       %32 = arith.addi %27, %31 : i32
2026-02-21T09:52:28.7496548Z       %33 = arith.divsi %30, %29 : i32
2026-02-21T09:52:28.7496661Z       %34 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:52:28.7496817Z       %35 = tt.splat %34 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:28.7497023Z       %36 = arith.addi %35, %7 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:28.7497188Z       %37 = arith.muli %33, %c64_i32 : i32
2026-02-21T09:52:28.7497369Z       %38 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:28.7497578Z       %39 = tt.splat %37 : i32 -> tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:28.7497785Z       %40 = arith.addi %38, %4 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:28.7497993Z       %41 = arith.addi %39, %5 : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:28.7498258Z       %42 = tt.expand_dims %40 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<64x1xi32, #blocked2>
2026-02-21T09:52:28.7498502Z       %43 = arith.muli %42, %cst_2 : tensor<64x1xi32, #blocked2>
2026-02-21T09:52:28.7498692Z       %44 = tt.broadcast %43 : tensor<64x1xi32, #blocked2> -> tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7498863Z       %45 = arith.extsi %34 : i32 to i64
2026-02-21T09:52:28.7499026Z       %46 = tt.splat %45 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:28.7499242Z       %47 = arith.addi %46, %13 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:28.7499505Z       %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x64xi64, #blocked>
2026-02-21T09:52:28.7499774Z       %49 = tt.broadcast %48 : tensor<1x64xi64, #blocked> -> tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7499970Z       %50 = arith.cmpi sge, %48, %cst_4 : tensor<1x64xi64, #blocked>
2026-02-21T09:52:28.7500154Z       %51 = arith.cmpi slt, %48, %cst_3 : tensor<1x64xi64, #blocked>
2026-02-21T09:52:28.7500315Z       %52 = arith.andi %50, %51 : tensor<1x64xi1, #blocked>
2026-02-21T09:52:28.7500494Z       %53 = tt.broadcast %52 : tensor<1x64xi1, #blocked> -> tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7500710Z       %54 = ttg.local_alloc : () -> !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable>
2026-02-21T09:52:28.7500978Z       %55 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:52:28.7501247Z       %56 = tt.broadcast %55 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7501452Z       %57 = arith.addi %44, %56 : tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7501646Z       %58 = tt.addptr %9, %57 : tensor<64x8x!tt.ptr<bf16>, #blocked2>, tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7509723Z       %59 = tt.load %58 : tensor<64x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:28.7509992Z       %60 = tt.expand_dims %12 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7510234Z       %61 = arith.muli %60, %cst_7 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7510463Z       %62 = tt.broadcast %61 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7510651Z       %63 = arith.addi %62, %49 : tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7510841Z       %64 = tt.addptr %10, %63 : tensor<4x64x!tt.ptr<i8>, #blocked>, tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7511044Z       %65 = arith.cmpi sge, %60, %cst_6 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7511211Z       %66 = arith.cmpi slt, %60, %cst_5 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7511369Z       %67 = arith.andi %65, %66 : tensor<4x1xi1, #blocked>
2026-02-21T09:52:28.7511541Z       %68 = tt.broadcast %67 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7511724Z       %69 = arith.andi %68, %53 : tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7511932Z       %70 = tt.load %64, %69, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:28.7512266Z       %71 = ttg.memdesc_index %54[%c0_i32] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7512620Z       ttg.local_store %59, %71 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7512911Z       %72 = arith.addi %8, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:28.7513188Z       %73 = tt.expand_dims %72 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:52:28.7513460Z       %74 = tt.broadcast %73 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7513647Z       %75 = arith.addi %44, %74 : tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7513842Z       %76 = tt.addptr %9, %75 : tensor<64x8x!tt.ptr<bf16>, #blocked2>, tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7514038Z       %77 = tt.load %76 : tensor<64x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:28.7514227Z       %78 = arith.addi %12, %cst_8 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:28.7514495Z       %79 = tt.expand_dims %78 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7514730Z       %80 = arith.muli %79, %cst_7 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7514911Z       %81 = tt.broadcast %80 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7515094Z       %82 = arith.addi %81, %49 : tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7515280Z       %83 = tt.addptr %10, %82 : tensor<4x64x!tt.ptr<i8>, #blocked>, tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7515481Z       %84 = arith.cmpi sge, %79, %cst_6 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7515658Z       %85 = arith.cmpi slt, %79, %cst_5 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7515814Z       %86 = arith.andi %84, %85 : tensor<4x1xi1, #blocked>
2026-02-21T09:52:28.7515987Z       %87 = tt.broadcast %86 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7516168Z       %88 = arith.andi %87, %53 : tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7516369Z       %89 = tt.load %83, %88, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<4x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:28.7516704Z       %90 = ttg.memdesc_index %54[%c1_i32] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7517070Z       ttg.local_store %77, %90 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7517680Z       %91:6 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %71, %arg8 = %90, %arg9 = %70, %arg10 = %89) -> (tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked>)  : i32 {
2026-02-21T09:52:28.7518209Z         %136 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:52:28.7518339Z         %137 = arith.muli %136, %c2_i32 : i32
2026-02-21T09:52:28.7518513Z         %138 = tt.splat %137 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:28.7518739Z         %139 = arith.addi %138, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:28.7519015Z         %140 = tt.expand_dims %139 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:52:28.7519294Z         %141 = tt.broadcast %140 : tensor<1x8xi32, #blocked2> -> tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7519489Z         %142 = arith.addi %44, %141 : tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7519690Z         %143 = tt.addptr %9, %142 : tensor<64x8x!tt.ptr<bf16>, #blocked2>, tensor<64x8xi32, #blocked2>
2026-02-21T09:52:28.7519895Z         %144 = tt.load %143 : tensor<64x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:28.7520196Z         %145 = ttg.local_load %arg7 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7520650Z         %146 = arith.extf %145 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7520932Z         %147 = arith.extsi %136 : i32 to i64
2026-02-21T09:52:28.7521103Z         %148 = tt.splat %147 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:28.7521325Z         %149 = arith.addi %148, %12 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:28.7521596Z         %150 = tt.expand_dims %149 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7521845Z         %151 = arith.muli %150, %cst_7 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7522037Z         %152 = tt.broadcast %151 : tensor<4x1xi64, #blocked> -> tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7522225Z         %153 = arith.addi %152, %49 : tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7522422Z         %154 = tt.addptr %10, %153 : tensor<4x64x!tt.ptr<i8>, #blocked>, tensor<4x64xi64, #blocked>
2026-02-21T09:52:28.7522682Z         %155 = arith.cmpi sge, %150, %cst_6 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7522855Z         %156 = arith.cmpi slt, %150, %cst_5 : tensor<4x1xi64, #blocked>
2026-02-21T09:52:28.7523023Z         %157 = arith.andi %155, %156 : tensor<4x1xi1, #blocked>
2026-02-21T09:52:28.7523207Z         %158 = tt.broadcast %157 : tensor<4x1xi1, #blocked> -> tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7523396Z         %159 = arith.andi %158, %53 : tensor<4x64xi1, #blocked>
2026-02-21T09:52:28.7523580Z         %160 = tt.load %154, %159, %cst_1 : tensor<4x64x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:28.7523754Z         %161 = arith.shli %arg9, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7523913Z         %162 = arith.shrsi %161, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7524156Z         %163 = ttg.convert_layout %162 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7524405Z         %164 = arith.shrsi %arg9, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7524644Z         %165 = ttg.convert_layout %164 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7525002Z         %166 = tt.expand_dims %163 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7525341Z         %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7525629Z         %168 = tt.broadcast %166 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7525898Z         %169 = arith.select %18, %168, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7526138Z         %170 = tt.broadcast %167 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7526378Z         %171 = arith.select %20, %170, %169 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7526613Z         %172 = tt.reshape %171 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:52:28.7526838Z         %173 = arith.sitofp %172 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:52:28.7527089Z         %174 = ttg.local_alloc %173 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:52:28.7527413Z         %175 = ttg.local_load %174 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7527888Z         %176 = tt.dot %146, %175, %arg5, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:52:28.7528238Z         %177 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:52:28.7528365Z         %178 = arith.cmpi slt, %177, %c2_i32 : i32
2026-02-21T09:52:28.7528520Z         %179 = arith.select %178, %177, %c0_i32 : i32
2026-02-21T09:52:28.7528780Z         %180 = ttg.memdesc_index %54[%179] : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7529133Z         ttg.local_store %144, %180 : tensor<64x8xbf16, #blocked2> -> !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>
2026-02-21T09:52:28.7529606Z         scf.yield %176, %179, %arg8, %180, %arg10, %160 : tensor<64x64xf32, #mma>, i32, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8>, tensor<4x64xi8, #blocked>, tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7530026Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:52:28.7530336Z       %92 = ttg.local_load %91#2 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7530758Z       %93 = arith.extf %92 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7531050Z       %94 = arith.shli %91#4, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7531208Z       %95 = arith.shrsi %94, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7531442Z       %96 = ttg.convert_layout %95 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7531698Z       %97 = arith.shrsi %91#4, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7531931Z       %98 = ttg.convert_layout %97 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7532258Z       %99 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7532596Z       %100 = tt.expand_dims %98 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7532877Z       %101 = tt.broadcast %99 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7533135Z       %102 = arith.select %18, %101, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7533376Z       %103 = tt.broadcast %100 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7533610Z       %104 = arith.select %20, %103, %102 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7533843Z       %105 = tt.reshape %104 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:52:28.7534080Z       %106 = arith.sitofp %105 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:52:28.7534329Z       %107 = ttg.local_alloc %106 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:52:28.7534653Z       %108 = ttg.local_load %107 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7535118Z       %109 = tt.dot %93, %108, %91#0, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:52:28.7535605Z       %110 = ttg.local_load %91#3 : !ttg.memdesc<64x8xbf16, #shared, #smem, mutable, 2x64x8> -> tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7536032Z       %111 = arith.extf %110 : tensor<64x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7536327Z       %112 = arith.shli %91#5, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7536487Z       %113 = arith.shrsi %112, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7536742Z       %114 = ttg.convert_layout %113 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7536987Z       %115 = arith.shrsi %91#5, %cst : tensor<4x64xi8, #blocked>
2026-02-21T09:52:28.7537229Z       %116 = ttg.convert_layout %115 : tensor<4x64xi8, #blocked> -> tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:28.7537560Z       %117 = tt.expand_dims %114 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7537898Z       %118 = tt.expand_dims %116 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1x64xi8, #blocked1>
2026-02-21T09:52:28.7538180Z       %119 = tt.broadcast %117 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7538423Z       %120 = arith.select %18, %119, %cst_0 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7538666Z       %121 = tt.broadcast %118 : tensor<4x1x64xi8, #blocked1> -> tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7538899Z       %122 = arith.select %20, %121, %120 : tensor<4x2x64xi1, #blocked1>, tensor<4x2x64xi8, #blocked1>
2026-02-21T09:52:28.7539132Z       %123 = tt.reshape %122 : tensor<4x2x64xi8, #blocked1> -> tensor<8x64xi8, #blocked3>
2026-02-21T09:52:28.7539349Z       %124 = arith.sitofp %123 : tensor<8x64xi8, #blocked3> to tensor<8x64xf32, #blocked3>
2026-02-21T09:52:28.7539599Z       %125 = ttg.local_alloc %124 : (tensor<8x64xf32, #blocked3>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:52:28.7539942Z       %126 = ttg.local_load %125 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:28.7540403Z       %127 = tt.dot %111, %126, %109, inputPrecision = tf32 : tensor<64x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<64x64xf32, #mma>
2026-02-21T09:52:28.7540787Z       ttg.local_dealloc %54 : !ttg.memdesc<2x64x8xbf16, #shared, #smem, mutable>
2026-02-21T09:52:28.7540998Z       %128 = arith.truncf %127 : tensor<64x64xf32, #mma> to tensor<64x64xbf16, #mma>
2026-02-21T09:52:28.7541276Z       %129 = tt.expand_dims %41 {axis = 1 : i32} : tensor<64xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<64x1xi32, #mma>
2026-02-21T09:52:28.7541511Z       %130 = arith.muli %129, %cst_13 : tensor<64x1xi32, #mma>
2026-02-21T09:52:28.7541740Z       %131 = tt.expand_dims %36 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:52:28.7541994Z       %132 = tt.broadcast %130 : tensor<64x1xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7542207Z       %133 = tt.broadcast %131 : tensor<1x64xi32, #mma> -> tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7542385Z       %134 = arith.addi %132, %133 : tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7542574Z       %135 = tt.addptr %21, %134 : tensor<64x64x!tt.ptr<bf16>, #mma>, tensor<64x64xi32, #mma>
2026-02-21T09:52:28.7542768Z       tt.store %135, %128 : tensor<64x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:28.7542908Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:52:28.7543013Z     tt.return
2026-02-21T09:52:28.7543098Z   }
2026-02-21T09:52:28.7543172Z }
2026-02-21T09:52:28.7543220Z 
2026-02-21T09:52:28.7543251Z {-#
2026-02-21T09:52:28.7543333Z   external_resources: {
2026-02-21T09:52:28.7543430Z     mlir_reproducer: {
2026-02-21T09:52:28.7544460Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:52:28.7545449Z       disable_threading: false,
2026-02-21T09:52:28.7545554Z       verify_each: true
2026-02-21T09:52:28.7545646Z     }
2026-02-21T09:52:28.7545717Z   }
2026-02-21T09:52:28.7545787Z #-}
2026-02-21T09:52:28.7546063Z /tmp/torchinductor_root/uu/cuut2pp4drnpd2em2agvv3vob4ubf3hxd3hqoqovoqkudileffxn.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:52:28.7546746Z /tmp/torchinductor_root/uu/cuut2pp4drnpd2em2agvv3vob4ubf3hxd3hqoqovoqkudileffxn.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:52:28.7547298Z [479s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:52:28.7548078Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 64, 64], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:52:28.7548777Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:52:28.7548964Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:52:30.4939615Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:52:30.4942679Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:52:30.4943044Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:52:30.4943354Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:52:30.4944031Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:52:30.4944306Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:52:30.4944545Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:52:30.4944722Z #smem = #ttg.shared_memory
2026-02-21T09:52:30.4945023Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:52:30.4945490Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:52:30.4946017Z     %cst = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:52:30.4946189Z     %cst_0 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:52:30.4946363Z     %cst_1 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:52:30.4946530Z     %cst_2 = arith.constant dense<8192> : tensor<1x128xi64, #mma>
2026-02-21T09:52:30.4946698Z     %cst_3 = arith.constant dense<0> : tensor<1x128xi64, #mma>
2026-02-21T09:52:30.4946870Z     %cst_4 = arith.constant dense<512> : tensor<2x1xi64, #blocked>
2026-02-21T09:52:30.4947037Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked>
2026-02-21T09:52:30.4947208Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi64, #blocked>
2026-02-21T09:52:30.4947379Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:30.4947549Z     %cst_8 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:30.4947828Z     %cst_9 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:30.4948012Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:52:30.4948178Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:52:30.4948294Z     %c510_i32 = arith.constant 510 : i32
2026-02-21T09:52:30.4948474Z     %cst_11 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:30.4948663Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:52:30.4948789Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:52:30.4948930Z     %cst_12 = arith.constant dense<0> : tensor<2x128xi8, #blocked>
2026-02-21T09:52:30.4949109Z     %cst_13 = arith.constant dense<8192> : tensor<1x128xi64, #blocked>
2026-02-21T09:52:30.4949281Z     %cst_14 = arith.constant dense<0> : tensor<1x128xi64, #blocked>
2026-02-21T09:52:30.4949452Z     %cst_15 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked1>
2026-02-21T09:52:30.4949599Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:52:30.4949780Z     %cst_16 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:30.4949974Z     %0 = tt.get_program_id x : i32
2026-02-21T09:52:30.4950085Z     %1 = arith.divsi %0, %c128_i32 : i32
2026-02-21T09:52:30.4950202Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:52:30.4950313Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:52:30.4950425Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:52:30.4950591Z     %5 = arith.remsi %0, %c128_i32 : i32
2026-02-21T09:52:30.4950702Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:52:30.4950811Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:52:30.4950914Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:52:30.4951022Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:52:30.4951222Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:30.4951500Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:30.4951764Z     %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:30.4952047Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:30.4952304Z     %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:30.4952528Z     %15 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:52:30.4952696Z     %16 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:52:30.4984010Z     %17 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:30.4984347Z     %18 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:52:30.4984600Z     %19 = arith.muli %18, %cst_9 : tensor<128x1xi32, #blocked2>
2026-02-21T09:52:30.4984793Z     %20 = tt.broadcast %19 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:30.4985019Z     %21 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:30.4985185Z     %22 = arith.extsi %16 : i32 to i64
2026-02-21T09:52:30.4985335Z     %23 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:30.4985567Z     %24 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:30.4985879Z     %25 = arith.extsi %24 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:30.4986166Z     %26 = tt.splat %22 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:30.4986599Z     %27 = arith.extsi %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:30.4986884Z     %28 = arith.addi %26, %27 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:52:30.4987152Z     %29 = tt.expand_dims %28 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked>
2026-02-21T09:52:30.4987423Z     %30 = tt.broadcast %29 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:52:30.4987622Z     %31 = arith.cmpi sge, %29, %cst_14 : tensor<1x128xi64, #blocked>
2026-02-21T09:52:30.4987790Z     %32 = arith.cmpi slt, %29, %cst_13 : tensor<1x128xi64, #blocked>
2026-02-21T09:52:30.4987952Z     %33 = arith.andi %31, %32 : tensor<1x128xi1, #blocked>
2026-02-21T09:52:30.4988127Z     %34 = tt.broadcast %33 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:52:30.4988413Z     %35 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:52:30.4988826Z     %36 = tt.expand_dims %35 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:52:30.4989223Z     %37 = tt.expand_dims %36 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:30.4989477Z     %38 = arith.cmpi eq, %37, %cst_8 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:30.4989708Z     %39 = tt.broadcast %38 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1>
2026-02-21T09:52:30.4989904Z     %40 = arith.cmpi eq, %37, %cst_7 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:52:30.4990097Z     %41 = tt.broadcast %40 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1>
2026-02-21T09:52:30.4990310Z     %42 = ttg.local_alloc : () -> !ttg.memdesc<1x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:30.4990579Z     %43 = tt.expand_dims %17 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:30.4990843Z     %44 = tt.broadcast %43 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:30.4991057Z     %45 = arith.addi %20, %44 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:30.4991253Z     %46 = tt.addptr %21, %45 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:30.4991452Z     %47 = tt.load %46 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:30.4991736Z     %48 = ttg.memdesc_index %42[%c0_i32] : !ttg.memdesc<1x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4>
2026-02-21T09:52:30.4992110Z     ttg.local_store %47, %48 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4>
2026-02-21T09:52:30.4992541Z     %49:3 = scf.for %arg3 = %c0_i32 to %c510_i32 step %c2_i32 iter_args(%arg4 = %cst_10, %arg5 = %c0_i32, %arg6 = %48) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4>)  : i32 {
2026-02-21T09:52:30.4992873Z       %104 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:52:30.4992997Z       %105 = arith.muli %104, %c2_i32 : i32
2026-02-21T09:52:30.4993168Z       %106 = tt.splat %105 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:30.4993391Z       %107 = arith.addi %106, %17 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:52:30.4993668Z       %108 = tt.expand_dims %107 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:52:30.4993946Z       %109 = tt.broadcast %108 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:52:30.4994140Z       %110 = arith.addi %20, %109 : tensor<128x4xi32, #blocked2>
2026-02-21T09:52:30.4994343Z       %111 = tt.addptr %21, %110 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:52:30.4994568Z       %112 = tt.load %111 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:52:30.4994872Z       %113 = ttg.local_load %arg6 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:30.4995318Z       %114 = arith.extf %113 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:30.4995594Z       %115 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:52:30.4995763Z       %116 = tt.splat %115 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:30.4995982Z       %117 = arith.addi %116, %25 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:30.4996253Z       %118 = tt.expand_dims %117 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:30.4996497Z       %119 = arith.muli %118, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:30.4996684Z       %120 = tt.broadcast %119 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:52:30.4996875Z       %121 = arith.addi %120, %30 : tensor<2x128xi64, #blocked>
2026-02-21T09:52:30.4997069Z       %122 = tt.addptr %23, %121 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:52:30.4997277Z       %123 = arith.cmpi sge, %118, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:30.4997465Z       %124 = arith.cmpi slt, %118, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:30.4997626Z       %125 = arith.andi %123, %124 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:30.4997813Z       %126 = tt.broadcast %125 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:52:30.4997999Z       %127 = arith.andi %126, %34 : tensor<2x128xi1, #blocked>
2026-02-21T09:52:30.4998169Z       %128 = tt.load %122, %127, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:30.4998431Z       %129 = ttg.convert_layout %128 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:30.4998718Z       %130 = arith.shli %129, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:30.4998979Z       %131 = arith.shrsi %130, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:30.4999220Z       %132 = arith.shrsi %129, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:30.4999518Z       %133 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:52:30.4999877Z       %134 = tt.expand_dims %132 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:52:30.5000163Z       %135 = tt.broadcast %133 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:52:30.5000417Z       %136 = arith.select %39, %135, %cst_15 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:52:30.5000660Z       %137 = tt.broadcast %134 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:52:30.5000905Z       %138 = arith.select %41, %137, %136 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:52:30.5001137Z       %139 = tt.reshape %138 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked>
2026-02-21T09:52:30.5001356Z       %140 = arith.sitofp %139 : tensor<4x128xi8, #blocked> to tensor<4x128xf32, #blocked>
2026-02-21T09:52:30.5001604Z       %141 = ttg.local_alloc %140 : (tensor<4x128xf32, #blocked>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:52:30.5001928Z       %142 = ttg.local_load %141 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:30.5002423Z       %143 = tt.dot %114, %142, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:52:30.5002798Z       %144 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T09:52:30.5002923Z       %145 = arith.cmpi slt, %144, %c1_i32 : i32
2026-02-21T09:52:30.5003053Z       %146 = arith.select %145, %144, %c0_i32 : i32
2026-02-21T09:52:30.5003315Z       %147 = ttg.memdesc_index %42[%146] : !ttg.memdesc<1x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4>
2026-02-21T09:52:30.5003678Z       ttg.local_store %112, %147 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4>
2026-02-21T09:52:30.5003992Z       scf.yield %143, %146, %147 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4>
2026-02-21T09:52:30.5004270Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:52:30.5004610Z     %50 = ttg.local_load %49#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 1x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:30.5005034Z     %51 = arith.extf %50 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:30.5005359Z     %52 = arith.addi %25, %cst_11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:52:30.5005649Z     %53 = tt.expand_dims %52 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:52:30.5005883Z     %54 = arith.muli %53, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:30.5006066Z     %55 = tt.broadcast %54 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:52:30.5006248Z     %56 = arith.addi %55, %30 : tensor<2x128xi64, #blocked>
2026-02-21T09:52:30.5006436Z     %57 = tt.addptr %23, %56 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:52:30.5006639Z     %58 = arith.cmpi sge, %53, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:30.5006822Z     %59 = arith.cmpi slt, %53, %cst_4 : tensor<2x1xi64, #blocked>
2026-02-21T09:52:30.5006978Z     %60 = arith.andi %58, %59 : tensor<2x1xi1, #blocked>
2026-02-21T09:52:30.5007152Z     %61 = tt.broadcast %60 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:52:30.5007332Z     %62 = arith.andi %61, %34 : tensor<2x128xi1, #blocked>
2026-02-21T09:52:30.5007491Z     %63 = tt.load %57, %62, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:52:30.5007756Z     %64 = ttg.convert_layout %63 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:30.5008034Z     %65 = arith.shli %64, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:30.5008262Z     %66 = arith.shrsi %65, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:30.5008500Z     %67 = arith.shrsi %64, %cst_16 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:52:30.5008785Z     %68 = tt.expand_dims %66 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:52:30.5009120Z     %69 = tt.expand_dims %67 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:52:30.5009402Z     %70 = tt.broadcast %68 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:52:30.5009638Z     %71 = arith.select %39, %70, %cst_15 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:52:30.5009875Z     %72 = tt.broadcast %69 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:52:30.5010106Z     %73 = arith.select %41, %72, %71 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:52:30.5010346Z     %74 = tt.reshape %73 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked>
2026-02-21T09:52:30.5010562Z     %75 = arith.sitofp %74 : tensor<4x128xi8, #blocked> to tensor<4x128xf32, #blocked>
2026-02-21T09:52:30.5010802Z     %76 = ttg.local_alloc %75 : (tensor<4x128xf32, #blocked>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:52:30.5011117Z     %77 = ttg.local_load %76 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:52:30.5011583Z     %78 = tt.dot %51, %77, %49#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:52:30.5011959Z     ttg.local_dealloc %42 : !ttg.memdesc<1x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:52:30.5012168Z     %79 = arith.truncf %78 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:52:30.5012334Z     %80 = arith.extsi %9 : i32 to i64
2026-02-21T09:52:30.5012492Z     %81 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:30.5012693Z     %82 = tt.splat %80 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:30.5012964Z     %83 = arith.extsi %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:30.5013296Z     %84 = arith.extsi %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:30.5013578Z     %85 = arith.addi %82, %83 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:52:30.5013841Z     %86 = tt.expand_dims %85 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:52:30.5014076Z     %87 = arith.muli %86, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:52:30.5014247Z     %88 = tt.broadcast %87 : tensor<128x1xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:52:30.5014449Z     %89 = tt.splat %22 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:30.5014666Z     %90 = arith.addi %89, %84 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:52:30.5014922Z     %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi64, #mma>
2026-02-21T09:52:30.5015175Z     %92 = tt.broadcast %91 : tensor<1x128xi64, #mma> -> tensor<128x128xi64, #mma>
2026-02-21T09:52:30.5015349Z     %93 = arith.addi %88, %92 : tensor<128x128xi64, #mma>
2026-02-21T09:52:30.5015536Z     %94 = tt.addptr %81, %93 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi64, #mma>
2026-02-21T09:52:30.5015750Z     %95 = arith.cmpi sge, %86, %cst_0 : tensor<128x1xi64, #mma>
2026-02-21T09:52:30.5015909Z     %96 = arith.cmpi slt, %86, %cst : tensor<128x1xi64, #mma>
2026-02-21T09:52:30.5016055Z     %97 = arith.andi %95, %96 : tensor<128x1xi1, #mma>
2026-02-21T09:52:30.5016222Z     %98 = tt.broadcast %97 : tensor<128x1xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:52:30.5016402Z     %99 = arith.cmpi sge, %91, %cst_3 : tensor<1x128xi64, #mma>
2026-02-21T09:52:30.5016564Z     %100 = arith.cmpi slt, %91, %cst_2 : tensor<1x128xi64, #mma>
2026-02-21T09:52:30.5016721Z     %101 = arith.andi %99, %100 : tensor<1x128xi1, #mma>
2026-02-21T09:52:30.5016890Z     %102 = tt.broadcast %101 : tensor<1x128xi1, #mma> -> tensor<128x128xi1, #mma>
2026-02-21T09:52:30.5017068Z     %103 = arith.andi %98, %102 : tensor<128x128xi1, #mma>
2026-02-21T09:52:30.5017226Z     tt.store %94, %79, %103 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:52:30.5017361Z     tt.return
2026-02-21T09:52:30.5017442Z   }
2026-02-21T09:52:30.5017517Z }
2026-02-21T09:52:30.5017559Z 
2026-02-21T09:52:30.5017595Z {-#
2026-02-21T09:52:30.5017673Z   external_resources: {
2026-02-21T09:52:30.5017773Z     mlir_reproducer: {
2026-02-21T09:52:30.5018775Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:52:30.5019775Z       disable_threading: false,
2026-02-21T09:52:30.5019880Z       verify_each: true
2026-02-21T09:52:30.5019970Z     }
2026-02-21T09:52:30.5020039Z   }
2026-02-21T09:52:30.5020109Z #-}
2026-02-21T09:52:30.5020393Z /tmp/torchinductor_root/lo/cloq6ns7tap3eo3pmp6ejokhmdqe7g4cg5xwcrxxbveenhy2w4e2.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:52:30.5021090Z /tmp/torchinductor_root/lo/cloq6ns7tap3eo3pmp6ejokhmdqe7g4cg5xwcrxxbveenhy2w4e2.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:52:30.5021640Z [481s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:52:30.5022372Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:52:30.5023040Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:52:30.5023208Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:52:31.4821137Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 82/82 7.8 configs/s
2026-02-21T09:52:38.4043166Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 194/194 19.5 configs/s
2026-02-21T09:52:41.4864505Z [492s] Generation 6 complete: 
2026-02-21T09:52:41.4864918Z error=8
2026-02-21T09:52:41.4865116Z ok=77
2026-02-21T09:52:41.4865332Z min=1.0169
2026-02-21T09:52:41.4865537Z mid=1.6862
2026-02-21T09:52:41.4865732Z max=62.3544
2026-02-21T09:52:41.4866014Z best={'block_sizes': [8, 128, 128],
2026-02-21T09:52:41.4866385Z  'indexing': ['block_ptr', 'block_ptr', 'pointer'],
2026-02-21T09:52:41.4866743Z  'l2_groupings': [2],
2026-02-21T09:52:41.4867009Z  'load_eviction_policies': ['', ''],
2026-02-21T09:52:41.4867777Z  'loop_orders': [[0, 1]],
2026-02-21T09:52:41.4868054Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:52:41.4868322Z  'num_stages': 1,
2026-02-21T09:52:41.4868547Z  'num_warps': 4,
2026-02-21T09:52:41.4868675Z  'pid_type': 'flat',
2026-02-21T09:52:41.4868774Z  'range_flattens': [None, None],
2026-02-21T09:52:41.4868896Z  'range_multi_buffers': [None, False],
2026-02-21T09:52:41.4869011Z  'range_num_stages': [0, 2],
2026-02-21T09:52:41.4869127Z  'range_unroll_factors': [0, 0],
2026-02-21T09:52:41.4869236Z  'range_warp_specializes': [],
2026-02-21T09:52:41.4869335Z  'waves_per_eu': 2}
2026-02-21T09:52:41.4979498Z [492s] Fitting surrogate: 675 points, 675 targets
2026-02-21T09:52:42.4037147Z [493s] Generation 7 starting: 81 neighbors, 4 active search path(s)
2026-02-21T09:53:05.9831108Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 0.6 configs/s
2026-02-21T09:53:09.3261353Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:53:09.3271463Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}>
2026-02-21T09:53:09.3274363Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:53:09.3274716Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:53:09.3275136Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}>
2026-02-21T09:53:09.3275607Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:53:09.3276009Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:53:09.3276354Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:53:09.3276627Z #smem = #ttg.shared_memory
2026-02-21T09:53:09.3276977Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:53:09.3277699Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:53:09.3278199Z     %cst = arith.constant dense<8192> : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3278376Z     %cst_0 = arith.constant dense<0> : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3278546Z     %cst_1 = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3278796Z     %cst_2 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3278961Z     %cst_3 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3279136Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:53:09.3279305Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:53:09.3279490Z     %cst_6 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3279663Z     %c13823_i32 = arith.constant 13823 : i32
2026-02-21T09:53:09.3279787Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:53:09.3279909Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:53:09.3280172Z     %cst_7 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3280428Z     %cst_8 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3280681Z     %cst_9 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3280900Z     %cst_10 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3281143Z     %cst_11 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:09.3281294Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:53:09.3281418Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:53:09.3281536Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:53:09.3281656Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:53:09.3281774Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:53:09.3281897Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:53:09.3282022Z     %c38912_i32 = arith.constant 38912 : i32
2026-02-21T09:53:09.3282143Z     %c19456_i32 = arith.constant 19456 : i32
2026-02-21T09:53:09.3282268Z     %c29184_i32 = arith.constant 29184 : i32
2026-02-21T09:53:09.3282415Z     %cst_12 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3282659Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:53:09.3282777Z     %c9728_i32 = arith.constant 9728 : i32
2026-02-21T09:53:09.3282966Z     %cst_13 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3283162Z     %0 = tt.get_program_id x : i32
2026-02-21T09:53:09.3283360Z     %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:09.3283677Z     %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3283954Z     %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:09.3284256Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3284526Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3284792Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3285037Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3285240Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3285510Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:53:09.3285931Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:53:09.3286337Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:53:09.3286589Z     %12 = arith.cmpi eq, %11, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:53:09.3286809Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:53:09.3287005Z     %14 = arith.cmpi eq, %11, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:53:09.3287200Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:53:09.3287408Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:09.3287683Z     %17 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3288022Z     %18 = arith.extsi %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3288275Z     %19 = arith.subi %c13823_i32, %0 : i32
2026-02-21T09:53:09.3288400Z     %20 = arith.divui %19, %c9728_i32 : i32
2026-02-21T09:53:09.3288519Z     %21 = arith.remsi %20, %c4_i32 : i32
2026-02-21T09:53:09.3288641Z     %22 = arith.subi %20, %21 : i32
2026-02-21T09:53:09.3288761Z     %23 = arith.muli %22, %c9728_i32 : i32
2026-02-21T09:53:09.3288876Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:53:09.3289029Z     scf.for %arg3 = %0 to %24 step %c38912_i32  : i32 {
2026-02-21T09:53:09.3289170Z       %25 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:53:09.3289297Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:53:09.3289416Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:53:09.3289538Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:53:09.3289659Z       %29 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:53:09.3289784Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:53:09.3289904Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:53:09.3290017Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:53:09.3290134Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:53:09.3290304Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:09.3290535Z       %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:09.3290804Z       %36 = arith.muli %32, %c256_i32 : i32
2026-02-21T09:53:09.3291068Z       %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:09.3291422Z       %38 = arith.addi %37, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:09.3291787Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:53:09.3292045Z       %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:09.3292242Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3292524Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:53:09.3292807Z       %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3293028Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:09.3293301Z       %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:09.3293569Z       %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3293768Z       %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3293970Z       %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3294174Z       %49 = tt.load %48 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3294462Z       %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3294847Z       ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3295126Z       %51 = arith.addi %6, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3295404Z       %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:09.3295676Z       %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3295880Z       %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3296093Z       %55 = tt.addptr %7, %54 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3296332Z       %56 = tt.load %55 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3296618Z       %57 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3296976Z       ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3297525Z       %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:53:09.3298006Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3298239Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3298426Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:53:09.3298551Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:53:09.3298727Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3298955Z         %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3299232Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:09.3299515Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3299710Z         %417 = arith.addi %41, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3300104Z         %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3300314Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3300626Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3301075Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3301464Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3301719Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3301922Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3302118Z         %425 = arith.addi %424, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3302340Z         %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3302571Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3302821Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3303128Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3303374Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3303616Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3303914Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3304262Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3304569Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3304820Z         %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3305067Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3305310Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3305568Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3305796Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3306075Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3306408Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3306898Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3307279Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:53:09.3307428Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:53:09.3307598Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:53:09.3307904Z         %446 = ttg.memdesc_index %44[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3308316Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3308776Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3309116Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:53:09.3309318Z       %59 = arith.addi %5, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3309688Z       %60 = ttg.local_load %58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3310143Z       %61 = arith.extf %60 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3310579Z       %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3310900Z       %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3311189Z       %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3311433Z       %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3311722Z       %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3312020Z       %67 = tt.load %66 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3312352Z       %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3312761Z       %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3313116Z       %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3313473Z       %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3313913Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3314445Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3314872Z       %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3315239Z       %75 = arith.select %13, %74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3315616Z       %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3315966Z       %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3316312Z       %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3316645Z       %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3317030Z       %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3317523Z       %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3318249Z       %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3318860Z       %83 = arith.addi %5, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3319361Z       %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3320050Z       %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3320638Z       %86 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3321008Z       %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3321297Z       %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3321571Z       %89 = arith.addi %88, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3321864Z       %90 = tt.addptr %8, %89 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3322159Z       %91 = tt.load %90 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3322517Z       %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3323003Z       %93 = arith.shli %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3323356Z       %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3323711Z       %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3324147Z       %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3324685Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3325113Z       %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3325471Z       %99 = arith.select %13, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3325835Z       %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3326193Z       %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3326571Z       %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3326915Z       %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3327302Z       %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3327832Z       %105 = ttg.local_load %104 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3328562Z       %106 = tt.dot %85, %105, %82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3329151Z       ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:09.3329481Z       %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:53:09.3329730Z       %108 = arith.extsi %33 : i32 to i64
2026-02-21T09:53:09.3329907Z       %109 = arith.extsi %36 : i32 to i64
2026-02-21T09:53:09.3330153Z       %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3330474Z       %111 = arith.addi %110, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3330888Z       %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3331251Z       %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3331522Z       %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3331867Z       %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3332187Z       %116 = arith.addi %115, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3332603Z       %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3333004Z       %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3333284Z       %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3333576Z       %120 = tt.addptr %16, %119 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3333882Z       %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3334115Z       %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3334351Z       %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma>
2026-02-21T09:53:09.3334625Z       %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3334907Z       %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3335157Z       %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3335392Z       %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma>
2026-02-21T09:53:09.3335655Z       %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3335951Z       %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3336191Z       tt.store %120, %107, %129 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:09.3336424Z       %130 = arith.addi %arg3, %c9728_i32 : i32
2026-02-21T09:53:09.3336611Z       %131 = arith.divsi %130, %c256_i32 : i32
2026-02-21T09:53:09.3336791Z       %132 = arith.muli %131, %c8_i32 : i32
2026-02-21T09:53:09.3336972Z       %133 = arith.subi %c128_i32, %132 : i32
2026-02-21T09:53:09.3337148Z       %134 = arith.minsi %133, %c8_i32 : i32
2026-02-21T09:53:09.3337332Z       %135 = arith.remsi %130, %c256_i32 : i32
2026-02-21T09:53:09.3337507Z       %136 = arith.remsi %135, %134 : i32
2026-02-21T09:53:09.3337706Z       %137 = arith.addi %132, %136 : i32
2026-02-21T09:53:09.3337875Z       %138 = arith.divsi %135, %134 : i32
2026-02-21T09:53:09.3338051Z       %139 = arith.muli %137, %c128_i32 : i32
2026-02-21T09:53:09.3338312Z       %140 = tt.splat %139 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:09.3338656Z       %141 = arith.addi %140, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:09.3338886Z       %142 = arith.muli %138, %c256_i32 : i32
2026-02-21T09:53:09.3339177Z       %143 = tt.splat %142 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:09.3339519Z       %144 = arith.addi %143, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:09.3339928Z       %145 = tt.expand_dims %141 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:53:09.3340321Z       %146 = arith.muli %145, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:09.3340626Z       %147 = tt.broadcast %146 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3341062Z       %148 = tt.expand_dims %144 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:53:09.3341503Z       %149 = tt.broadcast %148 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3341841Z       %150 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:09.3342127Z       %151 = arith.addi %147, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3342432Z       %152 = tt.addptr %7, %151 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3342763Z       %153 = tt.load %152 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3343205Z       %154 = ttg.memdesc_index %150[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3343770Z       ttg.local_store %153, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3344139Z       %155 = arith.addi %147, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3344445Z       %156 = tt.addptr %7, %155 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3344755Z       %157 = tt.load %156 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3345192Z       %158 = ttg.memdesc_index %150[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3345754Z       ttg.local_store %157, %158 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3346570Z       %159:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %154, %arg8 = %158) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:53:09.3347313Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3347679Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3347913Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:53:09.3348097Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:53:09.3348349Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3348688Z         %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3349107Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:09.3349554Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3349855Z         %417 = arith.addi %147, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3350165Z         %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3350483Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3350969Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3351649Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3352249Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3352628Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3352923Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3353218Z         %425 = arith.addi %424, %149 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3353528Z         %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3353841Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3354219Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3354656Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3355043Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3355413Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3355866Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3356389Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3356833Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3357207Z         %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3357579Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3357948Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3358346Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3358696Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3359089Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3359473Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3359951Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3360300Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:53:09.3360432Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:53:09.3360567Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:53:09.3360856Z         %446 = ttg.memdesc_index %150[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3361219Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3361751Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3362072Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:53:09.3362352Z       %160 = ttg.local_load %159#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3362958Z       %161 = arith.extf %160 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3363275Z       %162 = arith.addi %64, %149 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3363478Z       %163 = tt.addptr %8, %162 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3363678Z       %164 = tt.load %163 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3363925Z       %165 = ttg.convert_layout %164 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3364208Z       %166 = arith.shli %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3364446Z       %167 = arith.shrsi %166, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3364685Z       %168 = arith.shrsi %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3364999Z       %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3365339Z       %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3365705Z       %171 = tt.broadcast %169 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3365977Z       %172 = arith.select %13, %171, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3366218Z       %173 = tt.broadcast %170 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3366449Z       %174 = arith.select %15, %173, %172 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3366682Z       %175 = tt.reshape %174 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3366906Z       %176 = arith.sitofp %175 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3367160Z       %177 = ttg.local_alloc %176 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3367485Z       %178 = ttg.local_load %177 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3367953Z       %179 = tt.dot %161, %178, %159#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3368469Z       %180 = ttg.local_load %159#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3368901Z       %181 = arith.extf %180 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3369196Z       %182 = arith.addi %88, %149 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3369416Z       %183 = tt.addptr %8, %182 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3369614Z       %184 = tt.load %183 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3369859Z       %185 = ttg.convert_layout %184 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3370144Z       %186 = arith.shli %185, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3370395Z       %187 = arith.shrsi %186, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3370632Z       %188 = arith.shrsi %185, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3370922Z       %189 = tt.expand_dims %187 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3371258Z       %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3371554Z       %191 = tt.broadcast %189 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3371795Z       %192 = arith.select %13, %191, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3372037Z       %193 = tt.broadcast %190 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3372268Z       %194 = arith.select %15, %193, %192 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3372501Z       %195 = tt.reshape %194 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3372729Z       %196 = arith.sitofp %195 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3372999Z       %197 = ttg.local_alloc %196 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3373322Z       %198 = ttg.local_load %197 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3373794Z       %199 = tt.dot %181, %198, %179, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3374177Z       ttg.local_dealloc %150 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:09.3374396Z       %200 = arith.truncf %199 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:53:09.3374568Z       %201 = arith.extsi %139 : i32 to i64
2026-02-21T09:53:09.3374687Z       %202 = arith.extsi %142 : i32 to i64
2026-02-21T09:53:09.3374849Z       %203 = tt.splat %201 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3375066Z       %204 = arith.addi %203, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3375333Z       %205 = tt.expand_dims %204 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3375574Z       %206 = arith.muli %205, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3375757Z       %207 = tt.broadcast %206 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3375985Z       %208 = tt.splat %202 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3376196Z       %209 = arith.addi %208, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3376461Z       %210 = tt.expand_dims %209 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3376720Z       %211 = tt.broadcast %210 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3384843Z       %212 = arith.addi %207, %211 : tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3385065Z       %213 = tt.addptr %16, %212 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3385317Z       %214 = arith.cmpi sge, %205, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3385480Z       %215 = arith.cmpi slt, %205, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3385638Z       %216 = arith.andi %214, %215 : tensor<128x1xi1, #mma>
2026-02-21T09:53:09.3385815Z       %217 = tt.broadcast %216 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3386001Z       %218 = arith.cmpi sge, %210, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3386183Z       %219 = arith.cmpi slt, %210, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3386337Z       %220 = arith.andi %218, %219 : tensor<1x256xi1, #mma>
2026-02-21T09:53:09.3386510Z       %221 = tt.broadcast %220 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3386691Z       %222 = arith.andi %217, %221 : tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3386854Z       tt.store %213, %200, %222 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:09.3387008Z       %223 = arith.addi %arg3, %c19456_i32 : i32
2026-02-21T09:53:09.3387138Z       %224 = arith.divsi %223, %c256_i32 : i32
2026-02-21T09:53:09.3387259Z       %225 = arith.muli %224, %c8_i32 : i32
2026-02-21T09:53:09.3387378Z       %226 = arith.subi %c128_i32, %225 : i32
2026-02-21T09:53:09.3387499Z       %227 = arith.minsi %226, %c8_i32 : i32
2026-02-21T09:53:09.3387615Z       %228 = arith.remsi %223, %c256_i32 : i32
2026-02-21T09:53:09.3387733Z       %229 = arith.remsi %228, %227 : i32
2026-02-21T09:53:09.3387849Z       %230 = arith.addi %225, %229 : i32
2026-02-21T09:53:09.3387964Z       %231 = arith.divsi %228, %227 : i32
2026-02-21T09:53:09.3388079Z       %232 = arith.muli %230, %c128_i32 : i32
2026-02-21T09:53:09.3388253Z       %233 = tt.splat %232 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:09.3388499Z       %234 = arith.addi %233, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:09.3388671Z       %235 = arith.muli %231, %c256_i32 : i32
2026-02-21T09:53:09.3388842Z       %236 = tt.splat %235 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:09.3389062Z       %237 = arith.addi %236, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:09.3389346Z       %238 = tt.expand_dims %234 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:53:09.3389602Z       %239 = arith.muli %238, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:09.3389801Z       %240 = tt.broadcast %239 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3390084Z       %241 = tt.expand_dims %237 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:53:09.3390369Z       %242 = tt.broadcast %241 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3390591Z       %243 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:09.3390781Z       %244 = arith.addi %240, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3390979Z       %245 = tt.addptr %7, %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3391204Z       %246 = tt.load %245 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3391488Z       %247 = ttg.memdesc_index %243[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3391853Z       ttg.local_store %246, %247 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3392096Z       %248 = arith.addi %240, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3392294Z       %249 = tt.addptr %7, %248 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3392498Z       %250 = tt.load %249 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3392795Z       %251 = ttg.memdesc_index %243[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3393154Z       ttg.local_store %250, %251 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3393697Z       %252:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %247, %arg8 = %251) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:53:09.3394174Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3394403Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3394579Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:53:09.3394705Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:53:09.3394874Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3395090Z         %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3395366Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:09.3395642Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3395840Z         %417 = arith.addi %240, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3396042Z         %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3396260Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3396563Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3397002Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3397383Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3397633Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3397827Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3398021Z         %425 = arith.addi %424, %242 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3398220Z         %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3398421Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3398668Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3398949Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3399203Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3399441Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3399733Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3400078Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3400363Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3400625Z         %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3400869Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3401103Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3401334Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3401572Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3401828Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3402156Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3402705Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3403065Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:53:09.3403194Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:53:09.3403329Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:53:09.3403605Z         %446 = ttg.memdesc_index %243[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3403969Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3404403Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3404711Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:53:09.3404989Z       %253 = ttg.local_load %252#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3405423Z       %254 = arith.extf %253 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3405722Z       %255 = arith.addi %64, %242 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3405922Z       %256 = tt.addptr %8, %255 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3406125Z       %257 = tt.load %256 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3406371Z       %258 = ttg.convert_layout %257 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3406653Z       %259 = arith.shli %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3406889Z       %260 = arith.shrsi %259, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3407131Z       %261 = arith.shrsi %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3407449Z       %262 = tt.expand_dims %260 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3407784Z       %263 = tt.expand_dims %261 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3408070Z       %264 = tt.broadcast %262 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3408312Z       %265 = arith.select %13, %264, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3408557Z       %266 = tt.broadcast %263 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3408813Z       %267 = arith.select %15, %266, %265 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3409044Z       %268 = tt.reshape %267 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3409269Z       %269 = arith.sitofp %268 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3409520Z       %270 = ttg.local_alloc %269 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3409862Z       %271 = ttg.local_load %270 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3410339Z       %272 = tt.dot %254, %271, %252#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3410837Z       %273 = ttg.local_load %252#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3411270Z       %274 = arith.extf %273 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3411573Z       %275 = arith.addi %88, %242 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3411771Z       %276 = tt.addptr %8, %275 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3411971Z       %277 = tt.load %276 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3412211Z       %278 = ttg.convert_layout %277 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3412516Z       %279 = arith.shli %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3412753Z       %280 = arith.shrsi %279, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3412989Z       %281 = arith.shrsi %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3413278Z       %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3413613Z       %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3413900Z       %284 = tt.broadcast %282 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3414140Z       %285 = arith.select %13, %284, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3414378Z       %286 = tt.broadcast %283 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3414611Z       %287 = arith.select %15, %286, %285 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3414844Z       %288 = tt.reshape %287 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3415070Z       %289 = arith.sitofp %288 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3415340Z       %290 = ttg.local_alloc %289 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3415663Z       %291 = ttg.local_load %290 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3416137Z       %292 = tt.dot %274, %291, %272, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3416528Z       ttg.local_dealloc %243 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:09.3416761Z       %293 = arith.truncf %292 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:53:09.3416936Z       %294 = arith.extsi %232 : i32 to i64
2026-02-21T09:53:09.3417054Z       %295 = arith.extsi %235 : i32 to i64
2026-02-21T09:53:09.3417222Z       %296 = tt.splat %294 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3417435Z       %297 = arith.addi %296, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3417717Z       %298 = tt.expand_dims %297 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3417958Z       %299 = arith.muli %298, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3418137Z       %300 = tt.broadcast %299 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3418350Z       %301 = tt.splat %295 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3418558Z       %302 = arith.addi %301, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3418829Z       %303 = tt.expand_dims %302 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3419093Z       %304 = tt.broadcast %303 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3419275Z       %305 = arith.addi %300, %304 : tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3419468Z       %306 = tt.addptr %16, %305 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3419671Z       %307 = arith.cmpi sge, %298, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3419839Z       %308 = arith.cmpi slt, %298, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3419999Z       %309 = arith.andi %307, %308 : tensor<128x1xi1, #mma>
2026-02-21T09:53:09.3420188Z       %310 = tt.broadcast %309 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3420374Z       %311 = arith.cmpi sge, %303, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3420535Z       %312 = arith.cmpi slt, %303, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3420691Z       %313 = arith.andi %311, %312 : tensor<1x256xi1, #mma>
2026-02-21T09:53:09.3420862Z       %314 = tt.broadcast %313 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3421042Z       %315 = arith.andi %310, %314 : tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3421204Z       tt.store %306, %293, %315 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:09.3421355Z       %316 = arith.addi %arg3, %c29184_i32 : i32
2026-02-21T09:53:09.3421485Z       %317 = arith.divsi %316, %c256_i32 : i32
2026-02-21T09:53:09.3421605Z       %318 = arith.muli %317, %c8_i32 : i32
2026-02-21T09:53:09.3421724Z       %319 = arith.subi %c128_i32, %318 : i32
2026-02-21T09:53:09.3421841Z       %320 = arith.minsi %319, %c8_i32 : i32
2026-02-21T09:53:09.3421964Z       %321 = arith.remsi %316, %c256_i32 : i32
2026-02-21T09:53:09.3422081Z       %322 = arith.remsi %321, %320 : i32
2026-02-21T09:53:09.3422195Z       %323 = arith.addi %318, %322 : i32
2026-02-21T09:53:09.3422311Z       %324 = arith.divsi %321, %320 : i32
2026-02-21T09:53:09.3422425Z       %325 = arith.muli %323, %c128_i32 : i32
2026-02-21T09:53:09.3422598Z       %326 = tt.splat %325 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:09.3422838Z       %327 = arith.addi %326, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:09.3423012Z       %328 = arith.muli %324, %c256_i32 : i32
2026-02-21T09:53:09.3423180Z       %329 = tt.splat %328 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:09.3423400Z       %330 = arith.addi %329, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:09.3423681Z       %331 = tt.expand_dims %327 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:53:09.3423933Z       %332 = arith.muli %331, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:09.3424146Z       %333 = tt.broadcast %332 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3424428Z       %334 = tt.expand_dims %330 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:53:09.3424708Z       %335 = tt.broadcast %334 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3424928Z       %336 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:09.3425127Z       %337 = arith.addi %333, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3425327Z       %338 = tt.addptr %7, %337 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3425536Z       %339 = tt.load %338 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3425822Z       %340 = ttg.memdesc_index %336[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3426187Z       ttg.local_store %339, %340 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3426427Z       %341 = arith.addi %333, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3426624Z       %342 = tt.addptr %7, %341 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3426824Z       %343 = tt.load %342 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3427103Z       %344 = ttg.memdesc_index %336[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3427460Z       ttg.local_store %343, %344 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3428010Z       %345:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %340, %arg8 = %344) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:53:09.3428487Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3428711Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3428884Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:53:09.3429006Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:53:09.3429173Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3429388Z         %414 = arith.addi %413, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3429661Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:09.3429937Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3430127Z         %417 = arith.addi %333, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3430326Z         %418 = tt.addptr %7, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3430542Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3430841Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3431277Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3431658Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3431917Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3432108Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3432301Z         %425 = arith.addi %424, %335 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3432499Z         %426 = tt.addptr %8, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3432697Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3432956Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3433237Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3433471Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3433705Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3433995Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3434334Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3434618Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3434860Z         %435 = arith.select %13, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3435098Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3435347Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3435576Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3435802Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3436055Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3436385Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3436859Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3437205Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:53:09.3437331Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:53:09.3437462Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:53:09.3437727Z         %446 = ttg.memdesc_index %336[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3438090Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3438501Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3438804Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:53:09.3439078Z       %346 = ttg.local_load %345#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3439514Z       %347 = arith.extf %346 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3439968Z       %348 = arith.addi %64, %335 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3440162Z       %349 = tt.addptr %8, %348 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3440359Z       %350 = tt.load %349 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3440600Z       %351 = ttg.convert_layout %350 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3440897Z       %352 = arith.shli %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3441131Z       %353 = arith.shrsi %352, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3441365Z       %354 = arith.shrsi %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3441656Z       %355 = tt.expand_dims %353 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3441991Z       %356 = tt.expand_dims %354 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3442273Z       %357 = tt.broadcast %355 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3442512Z       %358 = arith.select %13, %357, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3442810Z       %359 = tt.broadcast %356 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3443045Z       %360 = arith.select %15, %359, %358 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3443272Z       %361 = tt.reshape %360 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3443517Z       %362 = arith.sitofp %361 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3443770Z       %363 = ttg.local_alloc %362 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3444095Z       %364 = ttg.local_load %363 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3444564Z       %365 = tt.dot %347, %364, %345#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3445063Z       %366 = ttg.local_load %345#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3445491Z       %367 = arith.extf %366 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3445787Z       %368 = arith.addi %88, %335 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3445982Z       %369 = tt.addptr %8, %368 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3446179Z       %370 = tt.load %369 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3446420Z       %371 = ttg.convert_layout %370 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3446714Z       %372 = arith.shli %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3446947Z       %373 = arith.shrsi %372, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3447181Z       %374 = arith.shrsi %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3447468Z       %375 = tt.expand_dims %373 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3447804Z       %376 = tt.expand_dims %374 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3448104Z       %377 = tt.broadcast %375 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3448344Z       %378 = arith.select %13, %377, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3448580Z       %379 = tt.broadcast %376 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3448808Z       %380 = arith.select %15, %379, %378 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3449055Z       %381 = tt.reshape %380 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3449276Z       %382 = arith.sitofp %381 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3449527Z       %383 = ttg.local_alloc %382 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3449851Z       %384 = ttg.local_load %383 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3450319Z       %385 = tt.dot %367, %384, %365, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3450701Z       ttg.local_dealloc %336 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:09.3450916Z       %386 = arith.truncf %385 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:53:09.3451085Z       %387 = arith.extsi %325 : i32 to i64
2026-02-21T09:53:09.3451201Z       %388 = arith.extsi %328 : i32 to i64
2026-02-21T09:53:09.3451360Z       %389 = tt.splat %387 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3451587Z       %390 = arith.addi %389, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3451849Z       %391 = tt.expand_dims %390 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3452093Z       %392 = arith.muli %391, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3452272Z       %393 = tt.broadcast %392 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3452477Z       %394 = tt.splat %388 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3452682Z       %395 = arith.addi %394, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3452944Z       %396 = tt.expand_dims %395 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3453204Z       %397 = tt.broadcast %396 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3453386Z       %398 = arith.addi %393, %397 : tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3453576Z       %399 = tt.addptr %16, %398 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3453778Z       %400 = arith.cmpi sge, %391, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3453938Z       %401 = arith.cmpi slt, %391, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3454096Z       %402 = arith.andi %400, %401 : tensor<128x1xi1, #mma>
2026-02-21T09:53:09.3454284Z       %403 = tt.broadcast %402 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3454465Z       %404 = arith.cmpi sge, %396, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3454624Z       %405 = arith.cmpi slt, %396, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3454774Z       %406 = arith.andi %404, %405 : tensor<1x256xi1, #mma>
2026-02-21T09:53:09.3454944Z       %407 = tt.broadcast %406 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3455120Z       %408 = arith.andi %403, %407 : tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3455280Z       tt.store %399, %386, %408 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:09.3455443Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:53:09.3455571Z     scf.for %arg3 = %24 to %c4096_i32 step %c9728_i32  : i32 {
2026-02-21T09:53:09.3455713Z       %25 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:53:09.3455832Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:53:09.3455950Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:53:09.3456063Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:53:09.3456180Z       %29 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:53:09.3456294Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:53:09.3456420Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:53:09.3456528Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:53:09.3456637Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:53:09.3456801Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:09.3457021Z       %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:09.3457190Z       %36 = arith.muli %32, %c256_i32 : i32
2026-02-21T09:53:09.3457350Z       %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:09.3457565Z       %38 = arith.addi %37, %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:09.3457840Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:53:09.3458087Z       %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:09.3458278Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3458549Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:53:09.3458838Z       %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3459055Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:09.3459317Z       %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:09.3459579Z       %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3459765Z       %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3459959Z       %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3460158Z       %49 = tt.load %48 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3460435Z       %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3460793Z       ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3461063Z       %51 = arith.addi %6, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3461333Z       %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:09.3461613Z       %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3461795Z       %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3461986Z       %55 = tt.addptr %7, %54 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3462181Z       %56 = tt.load %55 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3462458Z       %57 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3462812Z       ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3463346Z       %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:53:09.3463818Z         %130 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3464061Z         %131 = arith.addi %130, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3464233Z         %132 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:53:09.3464354Z         %133 = arith.muli %132, %c2_i32 : i32
2026-02-21T09:53:09.3464518Z         %134 = tt.splat %133 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3464736Z         %135 = arith.addi %134, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:09.3465009Z         %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:09.3465283Z         %137 = tt.broadcast %136 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3465476Z         %138 = arith.addi %41, %137 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3465672Z         %139 = tt.addptr %7, %138 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:09.3465877Z         %140 = tt.load %139 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:09.3466176Z         %141 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3466633Z         %142 = arith.extf %141 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3467013Z         %143 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3467257Z         %144 = arith.muli %143, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3467449Z         %145 = tt.broadcast %144 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3467639Z         %146 = arith.addi %145, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3467834Z         %147 = tt.addptr %8, %146 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3468034Z         %148 = tt.load %147 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3468275Z         %149 = ttg.convert_layout %148 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3468559Z         %150 = arith.shli %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3468793Z         %151 = arith.shrsi %150, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3469026Z         %152 = arith.shrsi %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3469319Z         %153 = tt.expand_dims %151 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3469674Z         %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3469957Z         %155 = tt.broadcast %153 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3470200Z         %156 = arith.select %13, %155, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3470439Z         %157 = tt.broadcast %154 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3470675Z         %158 = arith.select %15, %157, %156 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3470930Z         %159 = tt.reshape %158 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3471153Z         %160 = arith.sitofp %159 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3471405Z         %161 = ttg.local_alloc %160 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3471744Z         %162 = ttg.local_load %161 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3472220Z         %163 = tt.dot %142, %162, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3472565Z         %164 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:53:09.3472691Z         %165 = arith.cmpi slt, %164, %c2_i32 : i32
2026-02-21T09:53:09.3472822Z         %166 = arith.select %165, %164, %c0_i32 : i32
2026-02-21T09:53:09.3473083Z         %167 = ttg.memdesc_index %44[%166] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3473442Z         ttg.local_store %140, %167 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3473839Z         scf.yield %163, %166, %arg8, %167 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:09.3474143Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:53:09.3474334Z       %59 = arith.addi %5, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3474655Z       %60 = ttg.local_load %58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3475082Z       %61 = arith.extf %60 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3475460Z       %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3475700Z       %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3475891Z       %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3476078Z       %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3476268Z       %66 = tt.addptr %8, %65 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3476461Z       %67 = tt.load %66 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3476693Z       %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3476968Z       %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3477194Z       %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3477441Z       %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3477723Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3478054Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3478335Z       %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3478569Z       %75 = arith.select %13, %74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3478821Z       %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3479046Z       %77 = arith.select %15, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3479269Z       %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3479488Z       %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3479754Z       %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3480075Z       %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3480539Z       %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3480927Z       %83 = arith.addi %5, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:09.3481250Z       %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3481676Z       %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3482058Z       %86 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3482302Z       %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:53:09.3482504Z       %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3482754Z       %89 = arith.addi %88, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3482945Z       %90 = tt.addptr %8, %89 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:53:09.3483135Z       %91 = tt.load %90 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:09.3483372Z       %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3483644Z       %93 = arith.shli %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3483878Z       %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3484110Z       %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:09.3484395Z       %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3484732Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:53:09.3485012Z       %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3485253Z       %99 = arith.select %13, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3485511Z       %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3485739Z       %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:53:09.3485971Z       %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:09.3486195Z       %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:09.3486455Z       %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:53:09.3486806Z       %105 = ttg.local_load %104 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:09.3487272Z       %106 = tt.dot %85, %105, %82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:09.3487659Z       ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:09.3487906Z       %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:53:09.3488080Z       %108 = arith.extsi %33 : i32 to i64
2026-02-21T09:53:09.3488204Z       %109 = arith.extsi %36 : i32 to i64
2026-02-21T09:53:09.3488370Z       %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3488587Z       %111 = arith.addi %110, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:09.3488857Z       %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3489103Z       %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3489290Z       %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3489499Z       %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3489713Z       %116 = arith.addi %115, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:09.3489979Z       %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3490262Z       %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3490446Z       %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3490639Z       %120 = tt.addptr %16, %119 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:53:09.3490844Z       %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3491009Z       %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:53:09.3491171Z       %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma>
2026-02-21T09:53:09.3491349Z       %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3491540Z       %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3491705Z       %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:53:09.3491858Z       %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma>
2026-02-21T09:53:09.3492033Z       %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3492214Z       %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma>
2026-02-21T09:53:09.3492376Z       tt.store %120, %107, %129 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:09.3492526Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:53:09.3492629Z     tt.return
2026-02-21T09:53:09.3492712Z   }
2026-02-21T09:53:09.3492788Z }
2026-02-21T09:53:09.3492836Z 
2026-02-21T09:53:09.3492867Z {-#
2026-02-21T09:53:09.3492967Z   external_resources: {
2026-02-21T09:53:09.3493073Z     mlir_reproducer: {
2026-02-21T09:53:09.3494067Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:53:09.3495071Z       disable_threading: false,
2026-02-21T09:53:09.3495179Z       verify_each: true
2026-02-21T09:53:09.3495274Z     }
2026-02-21T09:53:09.3495345Z   }
2026-02-21T09:53:09.3495418Z #-}
2026-02-21T09:53:09.3495700Z /tmp/torchinductor_root/xb/cxbuks2owjyp7fxrahwxzjlcklwajtufzfy3vea6edykq37ttp3p.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:53:09.3496415Z /tmp/torchinductor_root/xb/cxbuks2owjyp7fxrahwxzjlcklwajtufzfy3vea6edykq37ttp3p.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:53:09.3496962Z [520s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:53:09.3497742Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:53:09.3498459Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:53:09.3498628Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:53:10.1224450Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:53:10.1235607Z #blocked = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:53:10.1236312Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:53:10.1237021Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:53:10.1237887Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:53:10.1238645Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:53:10.1239338Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:53:10.1239834Z #smem = #ttg.shared_memory
2026-02-21T09:53:10.1240438Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:53:10.1241775Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:53:10.1242501Z     %cst = arith.constant dense<4> : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1242939Z     %c19456_i32 = arith.constant 19456 : i32
2026-02-21T09:53:10.1243184Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:53:10.1243466Z     %cst_0 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1243810Z     %c58368_i32 = arith.constant 58368 : i32
2026-02-21T09:53:10.1244027Z     %c38912_i32 = arith.constant 38912 : i32
2026-02-21T09:53:10.1244233Z     %c77824_i32 = arith.constant 77824 : i32
2026-02-21T09:53:10.1244443Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:53:10.1244644Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:53:10.1244845Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:53:10.1245063Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:53:10.1245264Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:53:10.1245458Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:53:10.1245758Z     %cst_1 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:10.1246073Z     %cst_2 = arith.constant dense<8192> : tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1246429Z     %cst_3 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1246703Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:53:10.1246960Z     %cst_4 = arith.constant dense<2> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1247227Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:53:10.1247422Z     %c23551_i32 = arith.constant 23551 : i32
2026-02-21T09:53:10.1247641Z     %cst_5 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1247913Z     %cst_6 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:53:10.1248153Z     %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:53:10.1248393Z     %cst_8 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1248626Z     %cst_9 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1248851Z     %cst_10 = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1249082Z     %cst_11 = arith.constant dense<0> : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1249309Z     %cst_12 = arith.constant dense<8192> : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1249515Z     %0 = tt.get_program_id x : i32
2026-02-21T09:53:10.1249791Z     %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:10.1250179Z     %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1250544Z     %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1250943Z     %4 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:53:10.1251318Z     %5 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1251679Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1252008Z     %7 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1252277Z     %8 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1252647Z     %9 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:53:10.1253229Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:53:10.1253783Z     %11 = tt.expand_dims %10 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:53:10.1254131Z     %12 = arith.cmpi eq, %11, %cst_6 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:53:10.1254428Z     %13 = tt.broadcast %12 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1>
2026-02-21T09:53:10.1254704Z     %14 = arith.cmpi eq, %11, %cst_7 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:53:10.1254993Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1>
2026-02-21T09:53:10.1255280Z     %16 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:10.1255654Z     %17 = arith.extsi %2 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1256119Z     %18 = arith.extsi %3 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1256437Z     %19 = arith.subi %c23551_i32, %0 : i32
2026-02-21T09:53:10.1256630Z     %20 = arith.divui %19, %c19456_i32 : i32
2026-02-21T09:53:10.1256792Z     %21 = arith.remsi %20, %c4_i32 : i32
2026-02-21T09:53:10.1256954Z     %22 = arith.subi %20, %21 : i32
2026-02-21T09:53:10.1257106Z     %23 = arith.muli %22, %c19456_i32 : i32
2026-02-21T09:53:10.1257264Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:53:10.1257418Z     scf.for %arg3 = %0 to %24 step %c77824_i32  : i32 {
2026-02-21T09:53:10.1257572Z       %25 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:53:10.1257704Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:53:10.1257852Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:53:10.1257983Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:53:10.1258112Z       %29 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:53:10.1258243Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:53:10.1258366Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:53:10.1258488Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:53:10.1258609Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:53:10.1258795Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:10.1259037Z       %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:10.1259221Z       %36 = arith.muli %32, %c256_i32 : i32
2026-02-21T09:53:10.1259404Z       %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:53:10.1259635Z       %38 = arith.addi %37, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:53:10.1259935Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:53:10.1260207Z       %40 = arith.muli %39, %cst_1 : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:10.1260438Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1260741Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi32, #blocked>
2026-02-21T09:53:10.1261036Z       %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1261287Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:10.1261613Z       %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:10.1261908Z       %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1262113Z       %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1262323Z       %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1262543Z       %49 = tt.load %48 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1262798Z       %50 = tt.expand_dims %5 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1263055Z       %51 = arith.muli %50, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1263252Z       %52 = tt.broadcast %51 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1263469Z       %53 = arith.addi %52, %43 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1263674Z       %54 = tt.addptr %8, %53 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1263881Z       %55 = tt.load %54 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1264184Z       %56 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1264573Z       ttg.local_store %49, %56 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1264863Z       %57 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1265119Z       %58 = arith.addi %6, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1265412Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:10.1265706Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1265911Z       %61 = arith.addi %41, %60 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1266138Z       %62 = tt.addptr %7, %61 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1266357Z       %63 = tt.load %62 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1266593Z       %64 = tt.expand_dims %57 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1266830Z       %65 = arith.muli %64, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1267014Z       %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1267195Z       %67 = arith.addi %66, %43 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1267382Z       %68 = tt.addptr %8, %67 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1267569Z       %69 = tt.load %68 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1267845Z       %70 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1268199Z       ttg.local_store %63, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1268839Z       %71:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %56, %arg8 = %70, %arg9 = %55, %arg10 = %69) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:53:10.1269360Z         %408 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:53:10.1269535Z         %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1269754Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1269929Z         %411 = arith.muli %408, %c2_i32 : i32
2026-02-21T09:53:10.1270097Z         %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1270317Z         %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1270590Z         %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:10.1270868Z         %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1271072Z         %416 = arith.addi %41, %415 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1271274Z         %417 = tt.addptr %7, %416 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1271497Z         %418 = tt.load %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1271802Z         %419 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1272361Z         %420 = arith.extf %419 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1272747Z         %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1272992Z         %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1273214Z         %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1273408Z         %424 = arith.addi %423, %43 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1273601Z         %425 = tt.addptr %8, %424 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1273804Z         %426 = tt.load %425 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1273962Z         %427 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1274139Z         %428 = arith.shrsi %427, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1274384Z         %429 = ttg.convert_layout %428 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1274637Z         %430 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1274882Z         %431 = ttg.convert_layout %430 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1275228Z         %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1275579Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1275871Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1276124Z         %435 = arith.select %13, %434, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1276373Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1276631Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1276867Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1277097Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1277397Z         %440 = ttg.convert_layout %439 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1277876Z         %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1278225Z         %442 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:53:10.1278357Z         %443 = arith.cmpi slt, %442, %c2_i32 : i32
2026-02-21T09:53:10.1278493Z         %444 = arith.select %443, %442, %c0_i32 : i32
2026-02-21T09:53:10.1278762Z         %445 = ttg.memdesc_index %44[%444] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1279125Z         ttg.local_store %418, %445 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1279612Z         scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1280019Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:53:10.1280299Z       %72 = ttg.local_load %71#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1280725Z       %73 = arith.extf %72 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1281023Z       %74 = arith.shli %71#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1281202Z       %75 = arith.shrsi %74, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1281438Z       %76 = ttg.convert_layout %75 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1281681Z       %77 = arith.shrsi %71#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1281918Z       %78 = ttg.convert_layout %77 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1282265Z       %79 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1282655Z       %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1282941Z       %81 = tt.broadcast %79 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1283182Z       %82 = arith.select %13, %81, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1283421Z       %83 = tt.broadcast %80 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1283655Z       %84 = arith.select %15, %83, %82 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1283886Z       %85 = tt.reshape %84 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1284105Z       %86 = arith.sitofp %85 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1284397Z       %87 = ttg.convert_layout %86 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1284877Z       %88 = tt.dot %73, %87, %71#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1285368Z       %89 = ttg.local_load %71#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1285793Z       %90 = arith.extf %89 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1286089Z       %91 = arith.shli %71#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1286251Z       %92 = arith.shrsi %91, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1286494Z       %93 = ttg.convert_layout %92 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1286734Z       %94 = arith.shrsi %71#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1286978Z       %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1287307Z       %96 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1287647Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1287946Z       %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1288183Z       %99 = arith.select %13, %98, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1288425Z       %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1288662Z       %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1288904Z       %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1289133Z       %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1289448Z       %104 = ttg.convert_layout %103 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1289911Z       %105 = tt.dot %90, %104, %88, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1290293Z       ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:10.1290528Z       %106 = arith.truncf %105 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:53:10.1290703Z       %107 = arith.extsi %33 : i32 to i64
2026-02-21T09:53:10.1290822Z       %108 = arith.extsi %36 : i32 to i64
2026-02-21T09:53:10.1290988Z       %109 = tt.splat %107 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1291199Z       %110 = arith.addi %109, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1291469Z       %111 = tt.expand_dims %110 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1291711Z       %112 = arith.muli %111, %cst_8 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1291890Z       %113 = tt.broadcast %112 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1292102Z       %114 = tt.splat %108 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1292311Z       %115 = arith.addi %114, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1292579Z       %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1292857Z       %117 = tt.broadcast %116 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1293039Z       %118 = arith.addi %113, %117 : tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1293263Z       %119 = tt.addptr %16, %118 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1293465Z       %120 = arith.cmpi sge, %111, %cst_9 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1293634Z       %121 = arith.cmpi slt, %111, %cst_10 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1293798Z       %122 = arith.andi %120, %121 : tensor<128x1xi1, #mma>
2026-02-21T09:53:10.1293973Z       %123 = tt.broadcast %122 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1294163Z       %124 = arith.cmpi sge, %116, %cst_11 : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1294327Z       %125 = arith.cmpi slt, %116, %cst_12 : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1294486Z       %126 = arith.andi %124, %125 : tensor<1x256xi1, #mma>
2026-02-21T09:53:10.1294659Z       %127 = tt.broadcast %126 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1294840Z       %128 = arith.andi %123, %127 : tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1295007Z       tt.store %119, %106, %128 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:10.1295160Z       %129 = arith.addi %arg3, %c19456_i32 : i32
2026-02-21T09:53:10.1295291Z       %130 = arith.divsi %129, %c256_i32 : i32
2026-02-21T09:53:10.1295412Z       %131 = arith.muli %130, %c8_i32 : i32
2026-02-21T09:53:10.1295548Z       %132 = arith.subi %c128_i32, %131 : i32
2026-02-21T09:53:10.1295667Z       %133 = arith.minsi %132, %c8_i32 : i32
2026-02-21T09:53:10.1295787Z       %134 = arith.remsi %129, %c256_i32 : i32
2026-02-21T09:53:10.1295906Z       %135 = arith.remsi %134, %133 : i32
2026-02-21T09:53:10.1296020Z       %136 = arith.addi %131, %135 : i32
2026-02-21T09:53:10.1296136Z       %137 = arith.divsi %134, %133 : i32
2026-02-21T09:53:10.1296251Z       %138 = arith.muli %136, %c128_i32 : i32
2026-02-21T09:53:10.1296425Z       %139 = tt.splat %138 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:10.1296652Z       %140 = arith.addi %139, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:10.1296846Z       %141 = arith.muli %137, %c256_i32 : i32
2026-02-21T09:53:10.1297016Z       %142 = tt.splat %141 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:53:10.1297235Z       %143 = arith.addi %142, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:53:10.1297517Z       %144 = tt.expand_dims %140 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:53:10.1297785Z       %145 = arith.muli %144, %cst_1 : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:10.1297983Z       %146 = tt.broadcast %145 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1298266Z       %147 = tt.expand_dims %143 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi32, #blocked>
2026-02-21T09:53:10.1298544Z       %148 = tt.broadcast %147 : tensor<1x256xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1298768Z       %149 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:10.1298956Z       %150 = arith.addi %146, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1299159Z       %151 = tt.addptr %7, %150 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1299368Z       %152 = tt.load %151 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1299528Z       %153 = arith.addi %52, %148 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1299725Z       %154 = tt.addptr %8, %153 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1299923Z       %155 = tt.load %154 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1300226Z       %156 = ttg.memdesc_index %149[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1300591Z       ttg.local_store %152, %156 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1300834Z       %157 = arith.addi %146, %60 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1301037Z       %158 = tt.addptr %7, %157 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1301244Z       %159 = tt.load %158 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1301405Z       %160 = arith.addi %66, %148 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1301597Z       %161 = tt.addptr %8, %160 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1301795Z       %162 = tt.load %161 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1302078Z       %163 = ttg.memdesc_index %149[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1302436Z       ttg.local_store %159, %163 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1303073Z       %164:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %156, %arg8 = %163, %arg9 = %155, %arg10 = %162) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:53:10.1303621Z         %408 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:53:10.1303795Z         %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1304017Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1304190Z         %411 = arith.muli %408, %c2_i32 : i32
2026-02-21T09:53:10.1304362Z         %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1304601Z         %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1304877Z         %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:10.1305156Z         %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1305353Z         %416 = arith.addi %146, %415 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1305587Z         %417 = tt.addptr %7, %416 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1305796Z         %418 = tt.load %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1306100Z         %419 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1306540Z         %420 = arith.extf %419 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1306922Z         %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1307172Z         %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1307366Z         %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1307558Z         %424 = arith.addi %423, %148 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1307758Z         %425 = tt.addptr %8, %424 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1307958Z         %426 = tt.load %425 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1308138Z         %427 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1308300Z         %428 = arith.shrsi %427, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1308545Z         %429 = ttg.convert_layout %428 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1308796Z         %430 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1309039Z         %431 = ttg.convert_layout %430 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1309381Z         %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1309728Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1310017Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1310269Z         %435 = arith.select %13, %434, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1310514Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1310757Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1310996Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1311344Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1311644Z         %440 = ttg.convert_layout %439 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1312117Z         %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1312464Z         %442 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:53:10.1312613Z         %443 = arith.cmpi slt, %442, %c2_i32 : i32
2026-02-21T09:53:10.1312748Z         %444 = arith.select %443, %442, %c0_i32 : i32
2026-02-21T09:53:10.1313018Z         %445 = ttg.memdesc_index %149[%444] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1313381Z         ttg.local_store %418, %445 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1313879Z         scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1314271Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:53:10.1314551Z       %165 = ttg.local_load %164#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1321217Z       %166 = arith.extf %165 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1321555Z       %167 = arith.shli %164#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1321725Z       %168 = arith.shrsi %167, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1321970Z       %169 = ttg.convert_layout %168 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1322221Z       %170 = arith.shrsi %164#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1322462Z       %171 = ttg.convert_layout %170 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1322923Z       %172 = tt.expand_dims %169 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1323270Z       %173 = tt.expand_dims %171 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1323560Z       %174 = tt.broadcast %172 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1323812Z       %175 = arith.select %13, %174, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1324057Z       %176 = tt.broadcast %173 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1324299Z       %177 = arith.select %15, %176, %175 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1324534Z       %178 = tt.reshape %177 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1324762Z       %179 = arith.sitofp %178 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1325059Z       %180 = ttg.convert_layout %179 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1325534Z       %181 = tt.dot %166, %180, %164#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1326053Z       %182 = ttg.local_load %164#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1326488Z       %183 = arith.extf %182 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1326790Z       %184 = arith.shli %164#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1326952Z       %185 = arith.shrsi %184, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1327196Z       %186 = ttg.convert_layout %185 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1327465Z       %187 = arith.shrsi %164#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1327708Z       %188 = ttg.convert_layout %187 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1328044Z       %189 = tt.expand_dims %186 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1328401Z       %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1328691Z       %191 = tt.broadcast %189 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1328937Z       %192 = arith.select %13, %191, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1329181Z       %193 = tt.broadcast %190 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1329424Z       %194 = arith.select %15, %193, %192 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1329661Z       %195 = tt.reshape %194 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1329890Z       %196 = arith.sitofp %195 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1330188Z       %197 = ttg.convert_layout %196 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1330655Z       %198 = tt.dot %183, %197, %181, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1331059Z       ttg.local_dealloc %149 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:10.1331277Z       %199 = arith.truncf %198 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:53:10.1331454Z       %200 = arith.extsi %138 : i32 to i64
2026-02-21T09:53:10.1331571Z       %201 = arith.extsi %141 : i32 to i64
2026-02-21T09:53:10.1331735Z       %202 = tt.splat %200 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1331949Z       %203 = arith.addi %202, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1332219Z       %204 = tt.expand_dims %203 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1332462Z       %205 = arith.muli %204, %cst_8 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1332642Z       %206 = tt.broadcast %205 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1332856Z       %207 = tt.splat %201 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1333065Z       %208 = arith.addi %207, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1333331Z       %209 = tt.expand_dims %208 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1333595Z       %210 = tt.broadcast %209 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1333796Z       %211 = arith.addi %206, %210 : tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1333990Z       %212 = tt.addptr %16, %211 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1334201Z       %213 = arith.cmpi sge, %204, %cst_9 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1334369Z       %214 = arith.cmpi slt, %204, %cst_10 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1334529Z       %215 = arith.andi %213, %214 : tensor<128x1xi1, #mma>
2026-02-21T09:53:10.1334706Z       %216 = tt.broadcast %215 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1334894Z       %217 = arith.cmpi sge, %209, %cst_11 : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1335078Z       %218 = arith.cmpi slt, %209, %cst_12 : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1335235Z       %219 = arith.andi %217, %218 : tensor<1x256xi1, #mma>
2026-02-21T09:53:10.1335410Z       %220 = tt.broadcast %219 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1335589Z       %221 = arith.andi %216, %220 : tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1335752Z       tt.store %212, %199, %221 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:10.1335905Z       %222 = arith.addi %arg3, %c38912_i32 : i32
2026-02-21T09:53:10.1336052Z       %223 = arith.divsi %222, %c256_i32 : i32
2026-02-21T09:53:10.1336174Z       %224 = arith.muli %223, %c8_i32 : i32
2026-02-21T09:53:10.1336299Z       %225 = arith.subi %c128_i32, %224 : i32
2026-02-21T09:53:10.1336419Z       %226 = arith.minsi %225, %c8_i32 : i32
2026-02-21T09:53:10.1336540Z       %227 = arith.remsi %222, %c256_i32 : i32
2026-02-21T09:53:10.1336659Z       %228 = arith.remsi %227, %226 : i32
2026-02-21T09:53:10.1336775Z       %229 = arith.addi %224, %228 : i32
2026-02-21T09:53:10.1336888Z       %230 = arith.divsi %227, %226 : i32
2026-02-21T09:53:10.1337004Z       %231 = arith.muli %229, %c128_i32 : i32
2026-02-21T09:53:10.1337178Z       %232 = tt.splat %231 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:10.1337405Z       %233 = arith.addi %232, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:10.1337580Z       %234 = arith.muli %230, %c256_i32 : i32
2026-02-21T09:53:10.1337752Z       %235 = tt.splat %234 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:53:10.1337971Z       %236 = arith.addi %235, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:53:10.1338271Z       %237 = tt.expand_dims %233 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:53:10.1338526Z       %238 = arith.muli %237, %cst_1 : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:10.1338726Z       %239 = tt.broadcast %238 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1339005Z       %240 = tt.expand_dims %236 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi32, #blocked>
2026-02-21T09:53:10.1339285Z       %241 = tt.broadcast %240 : tensor<1x256xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1339506Z       %242 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:10.1339694Z       %243 = arith.addi %239, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1339895Z       %244 = tt.addptr %7, %243 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1340102Z       %245 = tt.load %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1340263Z       %246 = arith.addi %52, %241 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1340457Z       %247 = tt.addptr %8, %246 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1340654Z       %248 = tt.load %247 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1340937Z       %249 = ttg.memdesc_index %242[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1341320Z       ttg.local_store %245, %249 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1341563Z       %250 = arith.addi %239, %60 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1341761Z       %251 = tt.addptr %7, %250 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1341964Z       %252 = tt.load %251 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1342125Z       %253 = arith.addi %66, %241 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1342317Z       %254 = tt.addptr %8, %253 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1342531Z       %255 = tt.load %254 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1342809Z       %256 = ttg.memdesc_index %242[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1343168Z       ttg.local_store %252, %256 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1343819Z       %257:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %249, %arg8 = %256, %arg9 = %248, %arg10 = %255) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:53:10.1344347Z         %408 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:53:10.1344519Z         %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1344743Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1344916Z         %411 = arith.muli %408, %c2_i32 : i32
2026-02-21T09:53:10.1345082Z         %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1345304Z         %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1345581Z         %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:10.1345862Z         %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1346058Z         %416 = arith.addi %239, %415 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1346280Z         %417 = tt.addptr %7, %416 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1346490Z         %418 = tt.load %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1346791Z         %419 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1347232Z         %420 = arith.extf %419 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1347618Z         %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1347863Z         %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1348054Z         %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1348245Z         %424 = arith.addi %423, %241 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1348439Z         %425 = tt.addptr %8, %424 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1348640Z         %426 = tt.load %425 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1348798Z         %427 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1348971Z         %428 = arith.shrsi %427, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1349214Z         %429 = ttg.convert_layout %428 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1349466Z         %430 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1349709Z         %431 = ttg.convert_layout %430 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1350053Z         %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1350403Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1350715Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1350964Z         %435 = arith.select %13, %434, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1351213Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1351473Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1351711Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1351936Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1352237Z         %440 = ttg.convert_layout %439 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1352707Z         %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1353058Z         %442 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:53:10.1353186Z         %443 = arith.cmpi slt, %442, %c2_i32 : i32
2026-02-21T09:53:10.1353319Z         %444 = arith.select %443, %442, %c0_i32 : i32
2026-02-21T09:53:10.1353591Z         %445 = ttg.memdesc_index %242[%444] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1353953Z         ttg.local_store %418, %445 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1354463Z         scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1354856Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:53:10.1355135Z       %258 = ttg.local_load %257#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1355569Z       %259 = arith.extf %258 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1355870Z       %260 = arith.shli %257#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1356030Z       %261 = arith.shrsi %260, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1356275Z       %262 = ttg.convert_layout %261 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1356523Z       %263 = arith.shrsi %257#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1356763Z       %264 = ttg.convert_layout %263 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1357100Z       %265 = tt.expand_dims %262 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1357455Z       %266 = tt.expand_dims %264 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1357751Z       %267 = tt.broadcast %265 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1358000Z       %268 = arith.select %13, %267, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1358244Z       %269 = tt.broadcast %266 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1358484Z       %270 = arith.select %15, %269, %268 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1358736Z       %271 = tt.reshape %270 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1358966Z       %272 = arith.sitofp %271 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1359266Z       %273 = ttg.convert_layout %272 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1359747Z       %274 = tt.dot %259, %273, %257#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1360246Z       %275 = ttg.local_load %257#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1360675Z       %276 = arith.extf %275 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1360978Z       %277 = arith.shli %257#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1361141Z       %278 = arith.shrsi %277, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1361384Z       %279 = ttg.convert_layout %278 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1361633Z       %280 = arith.shrsi %257#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1361877Z       %281 = ttg.convert_layout %280 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1362227Z       %282 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1362619Z       %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1362963Z       %284 = tt.broadcast %282 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1363208Z       %285 = arith.select %13, %284, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1363449Z       %286 = tt.broadcast %283 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1363689Z       %287 = arith.select %15, %286, %285 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1363925Z       %288 = tt.reshape %287 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1364150Z       %289 = arith.sitofp %288 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1364448Z       %290 = ttg.convert_layout %289 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1364908Z       %291 = tt.dot %276, %290, %274, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1365294Z       ttg.local_dealloc %242 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:10.1365530Z       %292 = arith.truncf %291 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:53:10.1365701Z       %293 = arith.extsi %231 : i32 to i64
2026-02-21T09:53:10.1365820Z       %294 = arith.extsi %234 : i32 to i64
2026-02-21T09:53:10.1365981Z       %295 = tt.splat %293 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1366195Z       %296 = arith.addi %295, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1366463Z       %297 = tt.expand_dims %296 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1366720Z       %298 = arith.muli %297, %cst_8 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1366900Z       %299 = tt.broadcast %298 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1367105Z       %300 = tt.splat %294 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1367313Z       %301 = arith.addi %300, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1367596Z       %302 = tt.expand_dims %301 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1367856Z       %303 = tt.broadcast %302 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1368039Z       %304 = arith.addi %299, %303 : tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1368230Z       %305 = tt.addptr %16, %304 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1368432Z       %306 = arith.cmpi sge, %297, %cst_9 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1368600Z       %307 = arith.cmpi slt, %297, %cst_10 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1368760Z       %308 = arith.andi %306, %307 : tensor<128x1xi1, #mma>
2026-02-21T09:53:10.1368935Z       %309 = tt.broadcast %308 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1369120Z       %310 = arith.cmpi sge, %302, %cst_11 : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1369284Z       %311 = arith.cmpi slt, %302, %cst_12 : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1369439Z       %312 = arith.andi %310, %311 : tensor<1x256xi1, #mma>
2026-02-21T09:53:10.1369612Z       %313 = tt.broadcast %312 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1369790Z       %314 = arith.andi %309, %313 : tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1369948Z       tt.store %305, %292, %314 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:10.1370119Z       %315 = arith.addi %arg3, %c58368_i32 : i32
2026-02-21T09:53:10.1370243Z       %316 = arith.divsi %315, %c256_i32 : i32
2026-02-21T09:53:10.1370366Z       %317 = arith.muli %316, %c8_i32 : i32
2026-02-21T09:53:10.1370484Z       %318 = arith.subi %c128_i32, %317 : i32
2026-02-21T09:53:10.1370601Z       %319 = arith.minsi %318, %c8_i32 : i32
2026-02-21T09:53:10.1370719Z       %320 = arith.remsi %315, %c256_i32 : i32
2026-02-21T09:53:10.1370836Z       %321 = arith.remsi %320, %319 : i32
2026-02-21T09:53:10.1370949Z       %322 = arith.addi %317, %321 : i32
2026-02-21T09:53:10.1371059Z       %323 = arith.divsi %320, %319 : i32
2026-02-21T09:53:10.1371176Z       %324 = arith.muli %322, %c128_i32 : i32
2026-02-21T09:53:10.1371345Z       %325 = tt.splat %324 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:10.1371572Z       %326 = arith.addi %325, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:10.1371746Z       %327 = arith.muli %323, %c256_i32 : i32
2026-02-21T09:53:10.1371912Z       %328 = tt.splat %327 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:53:10.1372132Z       %329 = arith.addi %328, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:53:10.1372408Z       %330 = tt.expand_dims %326 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:53:10.1372678Z       %331 = arith.muli %330, %cst_1 : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:10.1372872Z       %332 = tt.broadcast %331 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1373151Z       %333 = tt.expand_dims %329 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi32, #blocked>
2026-02-21T09:53:10.1373509Z       %334 = tt.broadcast %333 : tensor<1x256xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1373775Z       %335 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:10.1373963Z       %336 = arith.addi %332, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1374186Z       %337 = tt.addptr %7, %336 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1374391Z       %338 = tt.load %337 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1374549Z       %339 = arith.addi %52, %334 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1374743Z       %340 = tt.addptr %8, %339 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1374940Z       %341 = tt.load %340 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1375236Z       %342 = ttg.memdesc_index %335[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1375602Z       ttg.local_store %338, %342 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1375842Z       %343 = arith.addi %332, %60 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1376042Z       %344 = tt.addptr %7, %343 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1376245Z       %345 = tt.load %344 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1376399Z       %346 = arith.addi %66, %334 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1376590Z       %347 = tt.addptr %8, %346 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1376784Z       %348 = tt.load %347 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1377062Z       %349 = ttg.memdesc_index %335[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1377420Z       ttg.local_store %345, %349 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1378067Z       %350:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %342, %arg8 = %349, %arg9 = %341, %arg10 = %348) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:53:10.1378594Z         %408 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:53:10.1378768Z         %409 = tt.splat %408 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1378987Z         %410 = arith.addi %409, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1379160Z         %411 = arith.muli %408, %c2_i32 : i32
2026-02-21T09:53:10.1379327Z         %412 = tt.splat %411 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1379545Z         %413 = arith.addi %412, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1379823Z         %414 = tt.expand_dims %413 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:10.1380101Z         %415 = tt.broadcast %414 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1380296Z         %416 = arith.addi %332, %415 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1380497Z         %417 = tt.addptr %7, %416 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1380725Z         %418 = tt.load %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1381027Z         %419 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1381464Z         %420 = arith.extf %419 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1381842Z         %421 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1382100Z         %422 = arith.muli %421, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1382288Z         %423 = tt.broadcast %422 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1382479Z         %424 = arith.addi %423, %334 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1382670Z         %425 = tt.addptr %8, %424 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1382884Z         %426 = tt.load %425 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1383040Z         %427 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1383196Z         %428 = arith.shrsi %427, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1383441Z         %429 = ttg.convert_layout %428 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1383686Z         %430 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1383932Z         %431 = ttg.convert_layout %430 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1384268Z         %432 = tt.expand_dims %429 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1384610Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1384903Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1385148Z         %435 = arith.select %13, %434, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1385429Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1385670Z         %437 = arith.select %15, %436, %435 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1385907Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1386134Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1386429Z         %440 = ttg.convert_layout %439 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1386902Z         %441 = tt.dot %420, %440, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1387248Z         %442 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:53:10.1387374Z         %443 = arith.cmpi slt, %442, %c2_i32 : i32
2026-02-21T09:53:10.1387507Z         %444 = arith.select %443, %442, %c0_i32 : i32
2026-02-21T09:53:10.1387774Z         %445 = ttg.memdesc_index %335[%444] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1388135Z         ttg.local_store %418, %445 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1388620Z         scf.yield %441, %444, %arg8, %445, %arg10, %426 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1389022Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:53:10.1389300Z       %351 = ttg.local_load %350#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1389737Z       %352 = arith.extf %351 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1390051Z       %353 = arith.shli %350#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1390209Z       %354 = arith.shrsi %353, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1390454Z       %355 = ttg.convert_layout %354 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1390700Z       %356 = arith.shrsi %350#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1390955Z       %357 = ttg.convert_layout %356 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1391290Z       %358 = tt.expand_dims %355 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1391631Z       %359 = tt.expand_dims %357 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1391921Z       %360 = tt.broadcast %358 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1392168Z       %361 = arith.select %13, %360, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1392408Z       %362 = tt.broadcast %359 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1392648Z       %363 = arith.select %15, %362, %361 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1392881Z       %364 = tt.reshape %363 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1393107Z       %365 = arith.sitofp %364 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1393405Z       %366 = ttg.convert_layout %365 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1393883Z       %367 = tt.dot %352, %366, %350#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1394379Z       %368 = ttg.local_load %350#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1394806Z       %369 = arith.extf %368 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1395105Z       %370 = arith.shli %350#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1395264Z       %371 = arith.shrsi %370, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1395502Z       %372 = ttg.convert_layout %371 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1395747Z       %373 = arith.shrsi %350#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1395986Z       %374 = ttg.convert_layout %373 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1396320Z       %375 = tt.expand_dims %372 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1396676Z       %376 = tt.expand_dims %374 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1396965Z       %377 = tt.broadcast %375 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1397208Z       %378 = arith.select %13, %377, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1397450Z       %379 = tt.broadcast %376 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1397689Z       %380 = arith.select %15, %379, %378 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1397920Z       %381 = tt.reshape %380 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1398159Z       %382 = arith.sitofp %381 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1398454Z       %383 = ttg.convert_layout %382 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1398936Z       %384 = tt.dot %369, %383, %367, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1399319Z       ttg.local_dealloc %335 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:10.1399534Z       %385 = arith.truncf %384 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:53:10.1399704Z       %386 = arith.extsi %324 : i32 to i64
2026-02-21T09:53:10.1399820Z       %387 = arith.extsi %327 : i32 to i64
2026-02-21T09:53:10.1399988Z       %388 = tt.splat %386 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1400197Z       %389 = arith.addi %388, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1400466Z       %390 = tt.expand_dims %389 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1400706Z       %391 = arith.muli %390, %cst_8 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1400886Z       %392 = tt.broadcast %391 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1401093Z       %393 = tt.splat %387 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1401299Z       %394 = arith.addi %393, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1401577Z       %395 = tt.expand_dims %394 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1401836Z       %396 = tt.broadcast %395 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1402018Z       %397 = arith.addi %392, %396 : tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1402209Z       %398 = tt.addptr %16, %397 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1402411Z       %399 = arith.cmpi sge, %390, %cst_9 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1402616Z       %400 = arith.cmpi slt, %390, %cst_10 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1402777Z       %401 = arith.andi %399, %400 : tensor<128x1xi1, #mma>
2026-02-21T09:53:10.1402952Z       %402 = tt.broadcast %401 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1403137Z       %403 = arith.cmpi sge, %395, %cst_11 : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1403300Z       %404 = arith.cmpi slt, %395, %cst_12 : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1403458Z       %405 = arith.andi %403, %404 : tensor<1x256xi1, #mma>
2026-02-21T09:53:10.1403627Z       %406 = tt.broadcast %405 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1403808Z       %407 = arith.andi %402, %406 : tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1403965Z       tt.store %398, %385, %407 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:10.1404111Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:53:10.1404259Z     scf.for %arg3 = %24 to %c4096_i32 step %c19456_i32  : i32 {
2026-02-21T09:53:10.1404404Z       %25 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:53:10.1404524Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:53:10.1404640Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:53:10.1404755Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:53:10.1404872Z       %29 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:53:10.1404988Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:53:10.1405098Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:53:10.1405206Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:53:10.1405315Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:53:10.1405500Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:10.1405720Z       %35 = arith.addi %34, %1 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:10.1405887Z       %36 = arith.muli %32, %c256_i32 : i32
2026-02-21T09:53:10.1406047Z       %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:53:10.1406274Z       %38 = arith.addi %37, %4 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:53:10.1406547Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:53:10.1406794Z       %40 = arith.muli %39, %cst_1 : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:10.1406983Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1407254Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi32, #blocked>
2026-02-21T09:53:10.1407525Z       %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1407736Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:10.1408003Z       %45 = tt.expand_dims %6 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:10.1408269Z       %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1408456Z       %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1408651Z       %48 = tt.addptr %7, %47 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1408876Z       %49 = tt.load %48 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1409110Z       %50 = tt.expand_dims %5 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1409347Z       %51 = arith.muli %50, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1409529Z       %52 = tt.broadcast %51 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1409712Z       %53 = arith.addi %52, %43 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1409900Z       %54 = tt.addptr %8, %53 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1410090Z       %55 = tt.load %54 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1410368Z       %56 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1410732Z       ttg.local_store %49, %56 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1411000Z       %57 = arith.addi %5, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1411223Z       %58 = arith.addi %6, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1411495Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:10.1411774Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1411961Z       %61 = arith.addi %41, %60 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1412152Z       %62 = tt.addptr %7, %61 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1412351Z       %63 = tt.load %62 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1412585Z       %64 = tt.expand_dims %57 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1412819Z       %65 = arith.muli %64, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1413014Z       %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1413195Z       %67 = arith.addi %66, %43 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1413385Z       %68 = tt.addptr %8, %67 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1413573Z       %69 = tt.load %68 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1413864Z       %70 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1414219Z       ttg.local_store %63, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1414839Z       %71:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_5, %arg6 = %c1_i32, %arg7 = %56, %arg8 = %70, %arg9 = %55, %arg10 = %69) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:53:10.1415359Z         %129 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:53:10.1415533Z         %130 = tt.splat %129 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1415755Z         %131 = arith.addi %130, %5 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:10.1415932Z         %132 = arith.muli %129, %c2_i32 : i32
2026-02-21T09:53:10.1416101Z         %133 = tt.splat %132 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1416321Z         %134 = arith.addi %133, %6 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:10.1416611Z         %135 = tt.expand_dims %134 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:53:10.1416888Z         %136 = tt.broadcast %135 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1417084Z         %137 = arith.addi %41, %136 : tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1417284Z         %138 = tt.addptr %7, %137 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:53:10.1417491Z         %139 = tt.load %138 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:10.1417796Z         %140 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1418233Z         %141 = arith.extf %140 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1418618Z         %142 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1418861Z         %143 = arith.muli %142, %cst_2 : tensor<2x1xi32, #blocked>
2026-02-21T09:53:10.1419053Z         %144 = tt.broadcast %143 : tensor<2x1xi32, #blocked> -> tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1419248Z         %145 = arith.addi %144, %43 : tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1419444Z         %146 = tt.addptr %8, %145 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi32, #blocked>
2026-02-21T09:53:10.1419661Z         %147 = tt.load %146 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:53:10.1419822Z         %148 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1419985Z         %149 = arith.shrsi %148, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1420240Z         %150 = ttg.convert_layout %149 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1420491Z         %151 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1420738Z         %152 = ttg.convert_layout %151 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1421092Z         %153 = tt.expand_dims %150 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1421443Z         %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1421740Z         %155 = tt.broadcast %153 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1422010Z         %156 = arith.select %13, %155, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1422258Z         %157 = tt.broadcast %154 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1422503Z         %158 = arith.select %15, %157, %156 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1422744Z         %159 = tt.reshape %158 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1422974Z         %160 = arith.sitofp %159 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1423271Z         %161 = ttg.convert_layout %160 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1423748Z         %162 = tt.dot %141, %161, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1424103Z         %163 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:53:10.1424232Z         %164 = arith.cmpi slt, %163, %c2_i32 : i32
2026-02-21T09:53:10.1424370Z         %165 = arith.select %164, %163, %c0_i32 : i32
2026-02-21T09:53:10.1424655Z         %166 = ttg.memdesc_index %44[%165] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1425018Z         ttg.local_store %139, %166 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:53:10.1425503Z         scf.yield %162, %165, %arg8, %166, %arg10, %147 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1425894Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:53:10.1426172Z       %72 = ttg.local_load %71#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1426605Z       %73 = arith.extf %72 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1426901Z       %74 = arith.shli %71#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1427063Z       %75 = arith.shrsi %74, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1427298Z       %76 = ttg.convert_layout %75 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1427541Z       %77 = arith.shrsi %71#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1427695Z       %78 = ttg.convert_layout %77 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1427850Z       %79 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1427999Z       %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1428096Z       %81 = tt.broadcast %79 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1428203Z       %82 = arith.select %13, %81, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1428311Z       %83 = tt.broadcast %80 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1428409Z       %84 = arith.select %15, %83, %82 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1428498Z       %85 = tt.reshape %84 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1428605Z       %86 = arith.sitofp %85 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1428764Z       %87 = ttg.convert_layout %86 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1429026Z       %88 = tt.dot %73, %87, %71#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1429222Z       %89 = ttg.local_load %71#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1429415Z       %90 = arith.extf %89 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1429480Z       %91 = arith.shli %71#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1429539Z       %92 = arith.shrsi %91, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1429681Z       %93 = ttg.convert_layout %92 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1429745Z       %94 = arith.shrsi %71#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:53:10.1429900Z       %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:10.1430050Z       %96 = tt.expand_dims %93 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1430200Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:53:10.1430298Z       %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1430402Z       %99 = arith.select %13, %98, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1430500Z       %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1430603Z       %101 = arith.select %15, %100, %99 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:53:10.1430696Z       %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:53:10.1430789Z       %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:53:10.1430953Z       %104 = ttg.convert_layout %103 : tensor<4x256xf32, #blocked3> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:10.1431210Z       %105 = tt.dot %90, %104, %88, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:53:10.1431316Z       ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:53:10.1431406Z       %106 = arith.truncf %105 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:53:10.1431451Z       %107 = arith.extsi %33 : i32 to i64
2026-02-21T09:53:10.1431493Z       %108 = arith.extsi %36 : i32 to i64
2026-02-21T09:53:10.1431583Z       %109 = tt.splat %107 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1431667Z       %110 = arith.addi %109, %17 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:10.1431823Z       %111 = tt.expand_dims %110 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1431882Z       %112 = arith.muli %111, %cst_8 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1431967Z       %113 = tt.broadcast %112 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1432051Z       %114 = tt.splat %108 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1432155Z       %115 = arith.addi %114, %18 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:10.1432293Z       %116 = tt.expand_dims %115 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1432377Z       %117 = tt.broadcast %116 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1432438Z       %118 = arith.addi %113, %117 : tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1432536Z       %119 = tt.addptr %16, %118 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:53:10.1432602Z       %120 = arith.cmpi sge, %111, %cst_9 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1432672Z       %121 = arith.cmpi slt, %111, %cst_10 : tensor<128x1xi64, #mma>
2026-02-21T09:53:10.1432732Z       %122 = arith.andi %120, %121 : tensor<128x1xi1, #mma>
2026-02-21T09:53:10.1432814Z       %123 = tt.broadcast %122 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1432881Z       %124 = arith.cmpi sge, %116, %cst_11 : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1432946Z       %125 = arith.cmpi slt, %116, %cst_12 : tensor<1x256xi64, #mma>
2026-02-21T09:53:10.1433001Z       %126 = arith.andi %124, %125 : tensor<1x256xi1, #mma>
2026-02-21T09:53:10.1433209Z       %127 = tt.broadcast %126 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1433269Z       %128 = arith.andi %123, %127 : tensor<128x256xi1, #mma>
2026-02-21T09:53:10.1433337Z       tt.store %119, %106, %128 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:10.1433380Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:53:10.1433417Z     tt.return
2026-02-21T09:53:10.1433450Z   }
2026-02-21T09:53:10.1433483Z }
2026-02-21T09:53:10.1433488Z 
2026-02-21T09:53:10.1433523Z {-#
2026-02-21T09:53:10.1433565Z   external_resources: {
2026-02-21T09:53:10.1433606Z     mlir_reproducer: {
2026-02-21T09:53:10.1434554Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:53:10.1434599Z       disable_threading: false,
2026-02-21T09:53:10.1434637Z       verify_each: true
2026-02-21T09:53:10.1434674Z     }
2026-02-21T09:53:10.1434705Z   }
2026-02-21T09:53:10.1434736Z #-}
2026-02-21T09:53:10.1434990Z /tmp/torchinductor_root/tt/cttq32vntkim2n6zppmuy3waxfbua3zj3tkcf4h746awu2lpjb4e.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:53:10.1435403Z /tmp/torchinductor_root/tt/cttq32vntkim2n6zppmuy3waxfbua3zj3tkcf4h746awu2lpjb4e.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:53:10.1435516Z [520s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:53:10.1436152Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:53:10.1436223Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:53:10.1436303Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:53:14.5808140Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:53:14.5810114Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:53:14.5810438Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:53:14.5810789Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:53:14.5811114Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:53:14.5811400Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:53:14.5811672Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:53:14.5811910Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:53:14.5812106Z #smem = #ttg.shared_memory
2026-02-21T09:53:14.5812460Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:53:14.5812949Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:53:14.5813359Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:53:14.5813544Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:53:14.5813739Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:53:14.5813925Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:53:14.5814093Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:53:14.5814228Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:53:14.5814417Z     %cst_3 = arith.constant dense<508> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:14.5814644Z     %cst_4 = arith.constant dense<8192> : tensor<4x1xi64, #blocked1>
2026-02-21T09:53:14.5814837Z     %cst_5 = arith.constant dense<0> : tensor<4x1xi64, #blocked1>
2026-02-21T09:53:14.5815017Z     %cst_6 = arith.constant dense<512> : tensor<4x1xi64, #blocked1>
2026-02-21T09:53:14.5815193Z     %cst_7 = arith.constant dense<0> : tensor<1x128xi64, #blocked1>
2026-02-21T09:53:14.5815385Z     %cst_8 = arith.constant dense<8192> : tensor<1x128xi64, #blocked1>
2026-02-21T09:53:14.5815631Z     %cst_9 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:14.5815792Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:53:14.5815919Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:53:14.5816066Z     %cst_10 = arith.constant dense<0> : tensor<4x128xi8, #blocked1>
2026-02-21T09:53:14.5816256Z     %cst_11 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked>
2026-02-21T09:53:14.5816410Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:53:14.5816528Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:53:14.5816714Z     %cst_12 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:14.5817007Z     %0 = tt.get_program_id x : i32
2026-02-21T09:53:14.5817136Z     %1 = arith.divsi %0, %c128_i32 : i32
2026-02-21T09:53:14.5817253Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:53:14.5817385Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:53:14.5817505Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:53:14.5817621Z     %5 = arith.remsi %0, %c128_i32 : i32
2026-02-21T09:53:14.5817740Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:53:14.5817862Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:53:14.5817975Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:53:14.5818103Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:53:14.5818310Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:14.5818611Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:14.5818924Z     %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:14.5819196Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:14.5819456Z     %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:14.5819667Z     %15 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:14.5819879Z     %16 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:53:14.5820109Z     %17 = arith.addi %15, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:53:14.5820269Z     %18 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:53:14.5820429Z     %19 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:14.5820662Z     %20 = arith.addi %19, %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:53:14.5820899Z     %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:14.5821221Z     %22 = tt.expand_dims %16 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:53:14.5821468Z     %23 = arith.muli %22, %cst_9 : tensor<128x1xi32, #blocked2>
2026-02-21T09:53:14.5821678Z     %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:53:14.5821891Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:14.5822066Z     %26 = arith.extsi %18 : i32 to i64
2026-02-21T09:53:14.5822216Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:14.5822448Z     %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:14.5822783Z     %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:14.5823073Z     %30 = tt.splat %26 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:14.5823380Z     %31 = arith.extsi %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:14.5823706Z     %32 = arith.addi %30, %31 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:53:14.5823983Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi64, #blocked1>
2026-02-21T09:53:14.5824273Z     %34 = tt.broadcast %33 : tensor<1x128xi64, #blocked1> -> tensor<4x128xi64, #blocked1>
2026-02-21T09:53:14.5824472Z     %35 = arith.cmpi sge, %33, %cst_7 : tensor<1x128xi64, #blocked1>
2026-02-21T09:53:14.5824646Z     %36 = arith.cmpi slt, %33, %cst_8 : tensor<1x128xi64, #blocked1>
2026-02-21T09:53:14.5824821Z     %37 = arith.andi %35, %36 : tensor<1x128xi1, #blocked1>
2026-02-21T09:53:14.5825025Z     %38 = tt.broadcast %37 : tensor<1x128xi1, #blocked1> -> tensor<4x128xi1, #blocked1>
2026-02-21T09:53:14.5825324Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:53:14.5825745Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:53:14.5826160Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:53:14.5826425Z     %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:53:14.5826618Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:53:14.5826829Z     %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:53:14.5827019Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:53:14.5827236Z     %46 = ttg.local_alloc : () -> !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:53:14.5827521Z     %47 = tt.expand_dims %21 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:53:14.5827788Z     %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:53:14.5827996Z     %49 = arith.addi %24, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:53:14.5828189Z     %50 = tt.addptr %25, %49 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:53:14.5828406Z     %51 = tt.load %50 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:14.5828706Z     %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:53:14.5829079Z     ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:53:14.5829524Z     %53:3 = scf.for %arg3 = %c0_i32 to %c508_i32 step %c4_i32 iter_args(%arg4 = %cst_2, %arg5 = %c0_i32, %arg6 = %52) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>)  : i32 {
2026-02-21T09:53:14.5829853Z       %92 = arith.addi %arg3, %c4_i32 : i32
2026-02-21T09:53:14.5829989Z       %93 = arith.muli %92, %c2_i32 : i32
2026-02-21T09:53:14.5830156Z       %94 = tt.splat %93 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:14.5830374Z       %95 = arith.addi %94, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:53:14.5830660Z       %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:53:14.5830929Z       %97 = tt.broadcast %96 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:53:14.5831141Z       %98 = arith.addi %24, %97 : tensor<128x8xi32, #blocked2>
2026-02-21T09:53:14.5831350Z       %99 = tt.addptr %25, %98 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:53:14.5831566Z       %100 = tt.load %99 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:53:14.5831887Z       %101 = ttg.local_load %arg6 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:14.5832345Z       %102 = arith.extf %101 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:14.5832635Z       %103 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:53:14.5832810Z       %104 = tt.splat %103 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:14.5833031Z       %105 = arith.addi %104, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:14.5833344Z       %106 = tt.expand_dims %105 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1>
2026-02-21T09:53:14.5833594Z       %107 = arith.muli %106, %cst_4 : tensor<4x1xi64, #blocked1>
2026-02-21T09:53:14.5833801Z       %108 = tt.broadcast %107 : tensor<4x1xi64, #blocked1> -> tensor<4x128xi64, #blocked1>
2026-02-21T09:53:14.5833998Z       %109 = arith.addi %108, %34 : tensor<4x128xi64, #blocked1>
2026-02-21T09:53:14.5834254Z       %110 = tt.addptr %27, %109 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi64, #blocked1>
2026-02-21T09:53:14.5834469Z       %111 = arith.cmpi sge, %106, %cst_5 : tensor<4x1xi64, #blocked1>
2026-02-21T09:53:14.5834639Z       %112 = arith.cmpi slt, %106, %cst_6 : tensor<4x1xi64, #blocked1>
2026-02-21T09:53:14.5834818Z       %113 = arith.andi %111, %112 : tensor<4x1xi1, #blocked1>
2026-02-21T09:53:14.5835005Z       %114 = tt.broadcast %113 : tensor<4x1xi1, #blocked1> -> tensor<4x128xi1, #blocked1>
2026-02-21T09:53:14.5835209Z       %115 = arith.andi %114, %38 : tensor<4x128xi1, #blocked1>
2026-02-21T09:53:14.5835381Z       %116 = tt.load %110, %115, %cst_10 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:14.5835640Z       %117 = ttg.convert_layout %116 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:14.5835941Z       %118 = arith.shli %117, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:14.5836182Z       %119 = arith.shrsi %118, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:14.5836434Z       %120 = arith.shrsi %117, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:14.5836755Z       %121 = tt.expand_dims %119 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:53:14.5837100Z       %122 = tt.expand_dims %120 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:53:14.5837402Z       %123 = tt.broadcast %121 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:53:14.5837644Z       %124 = arith.select %43, %123, %cst_11 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:53:14.5837898Z       %125 = tt.broadcast %122 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:53:14.5838135Z       %126 = arith.select %45, %125, %124 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:53:14.5838378Z       %127 = tt.reshape %126 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:53:14.5838603Z       %128 = arith.sitofp %127 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:53:14.5838869Z       %129 = ttg.local_alloc %128 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:53:14.5839198Z       %130 = ttg.local_load %129 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:14.5839687Z       %131 = tt.dot %102, %130, %arg4, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:53:14.5840073Z       %132 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T09:53:14.5840200Z       %133 = arith.cmpi slt, %132, %c1_i32 : i32
2026-02-21T09:53:14.5840334Z       %134 = arith.select %133, %132, %c0_i32 : i32
2026-02-21T09:53:14.5840613Z       %135 = ttg.memdesc_index %46[%134] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:53:14.5840990Z       ttg.local_store %100, %135 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:53:14.5841307Z       scf.yield %131, %134, %135 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:53:14.5841587Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 2 : i32}
2026-02-21T09:53:14.5841892Z     %54 = ttg.local_load %53#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:14.5842352Z     %55 = arith.extf %54 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:14.5842764Z     %56 = arith.addi %29, %cst_3 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:53:14.5843045Z     %57 = tt.expand_dims %56 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi64, #blocked1>
2026-02-21T09:53:14.5843302Z     %58 = arith.muli %57, %cst_4 : tensor<4x1xi64, #blocked1>
2026-02-21T09:53:14.5843488Z     %59 = tt.broadcast %58 : tensor<4x1xi64, #blocked1> -> tensor<4x128xi64, #blocked1>
2026-02-21T09:53:14.5843688Z     %60 = arith.addi %59, %34 : tensor<4x128xi64, #blocked1>
2026-02-21T09:53:14.5843891Z     %61 = tt.addptr %27, %60 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi64, #blocked1>
2026-02-21T09:53:14.5844096Z     %62 = arith.cmpi sge, %57, %cst_5 : tensor<4x1xi64, #blocked1>
2026-02-21T09:53:14.5844282Z     %63 = arith.cmpi slt, %57, %cst_6 : tensor<4x1xi64, #blocked1>
2026-02-21T09:53:14.5844439Z     %64 = arith.andi %62, %63 : tensor<4x1xi1, #blocked1>
2026-02-21T09:53:14.5844617Z     %65 = tt.broadcast %64 : tensor<4x1xi1, #blocked1> -> tensor<4x128xi1, #blocked1>
2026-02-21T09:53:14.5844816Z     %66 = arith.andi %65, %38 : tensor<4x128xi1, #blocked1>
2026-02-21T09:53:14.5844978Z     %67 = tt.load %61, %66, %cst_10 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:53:14.5845261Z     %68 = ttg.convert_layout %67 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:14.5845542Z     %69 = arith.shli %68, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:14.5845780Z     %70 = arith.shrsi %69, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:14.5846012Z     %71 = arith.shrsi %68, %cst_12 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:53:14.5846302Z     %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:53:14.5846637Z     %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:53:14.5846926Z     %74 = tt.broadcast %72 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:53:14.5847161Z     %75 = arith.select %43, %74, %cst_11 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:53:14.5847409Z     %76 = tt.broadcast %73 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:53:14.5847640Z     %77 = arith.select %45, %76, %75 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:53:14.5847861Z     %78 = tt.reshape %77 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:53:14.5848109Z     %79 = arith.sitofp %78 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:53:14.5848354Z     %80 = ttg.local_alloc %79 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:53:14.5848689Z     %81 = ttg.local_load %80 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:53:14.5849170Z     %82 = tt.dot %55, %81, %53#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:53:14.5849584Z     ttg.local_dealloc %46 : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:53:14.5849798Z     %83 = arith.truncf %82 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:53:14.5850074Z     %84 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:53:14.5850315Z     %85 = arith.muli %84, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:53:14.5850573Z     %86 = tt.expand_dims %20 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:53:14.5850826Z     %87 = tt.broadcast %85 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:53:14.5851038Z     %88 = tt.broadcast %86 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:53:14.5851213Z     %89 = arith.addi %87, %88 : tensor<128x128xi32, #mma>
2026-02-21T09:53:14.5851386Z     %90 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:14.5851617Z     %91 = tt.addptr %90, %89 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:53:14.5851810Z     tt.store %91, %83 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:53:14.5851943Z     tt.return
2026-02-21T09:53:14.5852022Z   }
2026-02-21T09:53:14.5852111Z }
2026-02-21T09:53:14.5852156Z 
2026-02-21T09:53:14.5852187Z {-#
2026-02-21T09:53:14.5852269Z   external_resources: {
2026-02-21T09:53:14.5852366Z     mlir_reproducer: {
2026-02-21T09:53:14.5853408Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:53:14.5854438Z       disable_threading: false,
2026-02-21T09:53:14.5854545Z       verify_each: true
2026-02-21T09:53:14.5854636Z     }
2026-02-21T09:53:14.5854720Z   }
2026-02-21T09:53:14.5854791Z #-}
2026-02-21T09:53:14.5855067Z /tmp/torchinductor_root/ge/cgeahcd57aauizttec2barrz3cst5noydv6rsrfcsbxtir4b7wj6.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:53:14.5855771Z /tmp/torchinductor_root/ge/cgeahcd57aauizttec2barrz3cst5noydv6rsrfcsbxtir4b7wj6.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:53:14.5856339Z [525s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:53:14.5857081Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:53:14.5857769Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:53:14.5857950Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:53:15.1467403Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 82/82 9.0 configs/s
2026-02-21T09:53:20.1346316Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 196/196 25.2 configs/s
2026-02-21T09:53:23.5344696Z [534s] Generation 7 complete: 
2026-02-21T09:53:23.5345145Z error=13
2026-02-21T09:53:23.5345398Z ok=72
2026-02-21T09:53:23.5345606Z min=1.0151
2026-02-21T09:53:23.5345807Z mid=1.6335
2026-02-21T09:53:23.5346540Z max=59.9127
2026-02-21T09:53:23.5346770Z best={'block_sizes': [8, 128, 256],
2026-02-21T09:53:23.5347152Z  'indexing': ['block_ptr', 'pointer', 'block_ptr'],
2026-02-21T09:53:23.5347512Z  'l2_groupings': [8],
2026-02-21T09:53:23.5347786Z  'load_eviction_policies': ['', ''],
2026-02-21T09:53:23.5348108Z  'loop_orders': [[0, 1]],
2026-02-21T09:53:23.5348384Z  'matrix_instr_nonkdim': 32,
2026-02-21T09:53:23.5348664Z  'num_sm_multiplier': 64,
2026-02-21T09:53:23.5348918Z  'num_stages': 4,
2026-02-21T09:53:23.5349150Z  'num_warps': 4,
2026-02-21T09:53:23.5349538Z  'pid_type': 'persistent_interleaved',
2026-02-21T09:53:23.5349861Z  'range_flattens': [False, True],
2026-02-21T09:53:23.5350157Z  'range_multi_buffers': [True, None],
2026-02-21T09:53:23.5350460Z  'range_num_stages': [2, 3],
2026-02-21T09:53:23.5350737Z  'range_unroll_factors': [4, 0],
2026-02-21T09:53:23.5351035Z  'range_warp_specializes': [],
2026-02-21T09:53:23.5351308Z  'waves_per_eu': 2}
2026-02-21T09:53:23.5423397Z [534s] Fitting surrogate: 760 points, 760 targets
2026-02-21T09:53:24.4541320Z [535s] Generation 8 starting: 80 neighbors, 4 active search path(s)
2026-02-21T09:54:01.5870815Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 0.2 configs/s
2026-02-21T09:54:03.9142341Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:54:03.9154732Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}>
2026-02-21T09:54:03.9155662Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:54:03.9156865Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:54:03.9157804Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}>
2026-02-21T09:54:03.9158689Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:54:03.9159451Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:54:03.9160099Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:54:03.9160570Z #smem = #ttg.shared_memory
2026-02-21T09:54:03.9161184Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:54:03.9162476Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:54:03.9163602Z     %cst = arith.constant dense<8192> : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9164046Z     %cst_0 = arith.constant dense<0> : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9164491Z     %cst_1 = arith.constant dense<16384> : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9164931Z     %cst_2 = arith.constant dense<0> : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9165363Z     %cst_3 = arith.constant dense<8192> : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9165919Z     %cst_4 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:03.9166373Z     %cst_5 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:03.9166851Z     %cst_6 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9167273Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:54:03.9176453Z     %cst_7 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9176977Z     %cst_8 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9177486Z     %cst_9 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9178031Z     %cst_10 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9178458Z     %cst_11 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:03.9178761Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:54:03.9178987Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:54:03.9179222Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:54:03.9179447Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:54:03.9179772Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:54:03.9179962Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:54:03.9180138Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:54:03.9180372Z     %cst_12 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9180614Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:54:03.9180792Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:54:03.9181084Z     %cst_13 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9181404Z     %0 = tt.get_program_id x : i32
2026-02-21T09:54:03.9181581Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:54:03.9181771Z     %2 = arith.minsi %1, %c4096_i32 : i32
2026-02-21T09:54:03.9182103Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:03.9182574Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9183031Z     %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:03.9183378Z     %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9183858Z     %7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9184309Z     %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9184710Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9185031Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9185480Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:54:03.9186174Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:54:03.9186851Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:03.9187267Z     %14 = arith.cmpi eq, %13, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:03.9187594Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:54:03.9187910Z     %16 = arith.cmpi eq, %13, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:03.9188217Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:54:03.9188581Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:03.9189018Z     %19 = arith.extsi %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> to tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9189581Z     %20 = arith.extsi %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9189952Z     %21 = arith.subi %2, %0 : i32
2026-02-21T09:54:03.9190122Z     %22 = arith.remsi %21, %c4_i32 : i32
2026-02-21T09:54:03.9190308Z     %23 = arith.subi %21, %22 : i32
2026-02-21T09:54:03.9190476Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:54:03.9190695Z     scf.for %arg3 = %0 to %24 step %c4_i32  : i32 {
2026-02-21T09:54:03.9190907Z       %25 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:54:03.9191102Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:54:03.9191281Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:54:03.9191471Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:54:03.9191659Z       %29 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:54:03.9191850Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:54:03.9192031Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:54:03.9192222Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:54:03.9192402Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:54:03.9192671Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:03.9193036Z       %35 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:03.9193309Z       %36 = arith.muli %32, %c256_i32 : i32
2026-02-21T09:54:03.9193574Z       %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:03.9193931Z       %38 = arith.addi %37, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:03.9194381Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:54:03.9194795Z       %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:03.9195117Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9195566Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:54:03.9196038Z       %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9196390Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:03.9196832Z       %45 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:03.9197288Z       %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9197599Z       %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9197918Z       %48 = tt.addptr %9, %47 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9198241Z       %49 = tt.load %48 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9198706Z       %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9199306Z       ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9199747Z       %51 = arith.addi %8, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9200207Z       %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:03.9200661Z       %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9200986Z       %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9201313Z       %55 = tt.addptr %9, %54 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9201561Z       %56 = tt.load %55 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9201841Z       %57 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9202197Z       ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9202890Z       %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:54:03.9203368Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9203596Z         %410 = arith.addi %409, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9203793Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:54:03.9203914Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:54:03.9204087Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9204309Z         %414 = arith.addi %413, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9204580Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:03.9204863Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9205057Z         %417 = arith.addi %41, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9205301Z         %418 = tt.addptr %9, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9205555Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9205856Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9206316Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9206700Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9206948Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9207142Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9207335Z         %425 = arith.addi %424, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9207537Z         %426 = tt.addptr %10, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9207740Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9207985Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9208274Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9208508Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9208748Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9209038Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9209406Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9209692Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9209933Z         %435 = arith.select %15, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9210175Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9210411Z         %437 = arith.select %17, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9210662Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9210887Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9211142Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9211474Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9211972Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9212321Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:03.9212452Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:54:03.9212588Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:54:03.9212857Z         %446 = ttg.memdesc_index %44[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9213217Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9213621Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9213929Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:54:03.9214103Z       %59 = arith.addi %7, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9214447Z       %60 = ttg.local_load %58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9214875Z       %61 = arith.extf %60 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9215247Z       %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9215489Z       %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9215684Z       %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9215876Z       %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9216069Z       %66 = tt.addptr %10, %65 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9216266Z       %67 = tt.load %66 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9216509Z       %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9216785Z       %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9217014Z       %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9217260Z       %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9217541Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9217871Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9218148Z       %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9218385Z       %75 = arith.select %15, %74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9218640Z       %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9218867Z       %77 = arith.select %17, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9219089Z       %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9219308Z       %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9219569Z       %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9219890Z       %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9220351Z       %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9220739Z       %83 = arith.addi %7, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9221065Z       %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9221489Z       %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9221863Z       %86 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9222104Z       %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9222308Z       %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9222500Z       %89 = arith.addi %88, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9222690Z       %90 = tt.addptr %10, %89 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9222885Z       %91 = tt.load %90 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9223122Z       %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9223399Z       %93 = arith.shli %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9223633Z       %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9223861Z       %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9224144Z       %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9224473Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9224749Z       %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9224984Z       %99 = arith.select %15, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9225232Z       %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9225464Z       %101 = arith.select %17, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9225697Z       %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9225919Z       %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9226173Z       %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9226497Z       %105 = ttg.local_load %104 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9226980Z       %106 = tt.dot %85, %105, %82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9227364Z       ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:03.9227594Z       %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:54:03.9227770Z       %108 = arith.extsi %33 : i32 to i64
2026-02-21T09:54:03.9227887Z       %109 = arith.extsi %36 : i32 to i64
2026-02-21T09:54:03.9228051Z       %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9228265Z       %111 = arith.addi %110, %19 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9228530Z       %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9228770Z       %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9228948Z       %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9229157Z       %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9229373Z       %116 = arith.addi %115, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9229635Z       %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9229899Z       %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9230103Z       %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9230292Z       %120 = tt.addptr %18, %119 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9230497Z       %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9230661Z       %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9230818Z       %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma>
2026-02-21T09:54:03.9230991Z       %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9231178Z       %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9231343Z       %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9231497Z       %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma>
2026-02-21T09:54:03.9231672Z       %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9231852Z       %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9232016Z       tt.store %120, %107, %129 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:03.9232166Z       %130 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:54:03.9232292Z       %131 = arith.divsi %130, %c256_i32 : i32
2026-02-21T09:54:03.9232415Z       %132 = arith.muli %131, %c8_i32 : i32
2026-02-21T09:54:03.9232533Z       %133 = arith.subi %c128_i32, %132 : i32
2026-02-21T09:54:03.9232667Z       %134 = arith.minsi %133, %c8_i32 : i32
2026-02-21T09:54:03.9232785Z       %135 = arith.remsi %130, %c256_i32 : i32
2026-02-21T09:54:03.9232903Z       %136 = arith.remsi %135, %134 : i32
2026-02-21T09:54:03.9233017Z       %137 = arith.addi %132, %136 : i32
2026-02-21T09:54:03.9233133Z       %138 = arith.divsi %135, %134 : i32
2026-02-21T09:54:03.9233249Z       %139 = arith.muli %137, %c128_i32 : i32
2026-02-21T09:54:03.9233422Z       %140 = tt.splat %139 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:03.9233649Z       %141 = arith.addi %140, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:03.9233823Z       %142 = arith.muli %138, %c256_i32 : i32
2026-02-21T09:54:03.9234008Z       %143 = tt.splat %142 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:03.9234229Z       %144 = arith.addi %143, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:03.9234508Z       %145 = tt.expand_dims %141 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:54:03.9234764Z       %146 = arith.muli %145, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:03.9234975Z       %147 = tt.broadcast %146 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9235257Z       %148 = tt.expand_dims %144 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:54:03.9235538Z       %149 = tt.broadcast %148 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9235763Z       %150 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:03.9235953Z       %151 = arith.addi %147, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9236151Z       %152 = tt.addptr %9, %151 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9236360Z       %153 = tt.load %152 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9236649Z       %154 = ttg.memdesc_index %150[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9237017Z       ttg.local_store %153, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9237261Z       %155 = arith.addi %147, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9237480Z       %156 = tt.addptr %9, %155 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9237687Z       %157 = tt.load %156 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9237967Z       %158 = ttg.memdesc_index %150[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9238329Z       ttg.local_store %157, %158 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9238859Z       %159:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %154, %arg8 = %158) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:54:03.9239425Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9239656Z         %410 = arith.addi %409, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9239835Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:54:03.9239958Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:54:03.9240129Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9240351Z         %414 = arith.addi %413, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9240645Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:03.9240924Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9241119Z         %417 = arith.addi %147, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9241325Z         %418 = tt.addptr %9, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9241532Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9241835Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9246196Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9246585Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9246865Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9247059Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9247259Z         %425 = arith.addi %424, %149 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9247462Z         %426 = tt.addptr %10, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9247664Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9247912Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9248197Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9248437Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9248675Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9248965Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9249326Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9249613Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9249862Z         %435 = arith.select %15, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9250105Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9250343Z         %437 = arith.select %17, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9250577Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9250836Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9251142Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9251471Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9251946Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9252296Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:03.9252441Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:54:03.9252575Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:54:03.9252848Z         %446 = ttg.memdesc_index %150[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9253207Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9253613Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9253936Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:54:03.9254212Z       %160 = ttg.local_load %159#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9254647Z       %161 = arith.extf %160 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9254956Z       %162 = arith.addi %64, %149 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9255159Z       %163 = tt.addptr %10, %162 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9255362Z       %164 = tt.load %163 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9255605Z       %165 = ttg.convert_layout %164 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9255889Z       %166 = arith.shli %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9256126Z       %167 = arith.shrsi %166, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9256362Z       %168 = arith.shrsi %165, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9256655Z       %169 = tt.expand_dims %167 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9256991Z       %170 = tt.expand_dims %168 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9257277Z       %171 = tt.broadcast %169 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9257535Z       %172 = arith.select %15, %171, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9257776Z       %173 = tt.broadcast %170 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9258014Z       %174 = arith.select %17, %173, %172 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9258244Z       %175 = tt.reshape %174 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9258471Z       %176 = arith.sitofp %175 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9258724Z       %177 = ttg.local_alloc %176 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9259059Z       %178 = ttg.local_load %177 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9259561Z       %179 = tt.dot %161, %178, %159#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9260058Z       %180 = ttg.local_load %159#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9260489Z       %181 = arith.extf %180 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9260801Z       %182 = arith.addi %88, %149 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9261002Z       %183 = tt.addptr %10, %182 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9261206Z       %184 = tt.load %183 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9261449Z       %185 = ttg.convert_layout %184 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9261737Z       %186 = arith.shli %185, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9261976Z       %187 = arith.shrsi %186, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9262228Z       %188 = arith.shrsi %185, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9262517Z       %189 = tt.expand_dims %187 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9262851Z       %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9263150Z       %191 = tt.broadcast %189 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9263393Z       %192 = arith.select %15, %191, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9263632Z       %193 = tt.broadcast %190 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9263863Z       %194 = arith.select %17, %193, %192 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9264098Z       %195 = tt.reshape %194 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9264324Z       %196 = arith.sitofp %195 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9264598Z       %197 = ttg.local_alloc %196 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9264922Z       %198 = ttg.local_load %197 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9265392Z       %199 = tt.dot %181, %198, %179, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9265794Z       ttg.local_dealloc %150 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:03.9266009Z       %200 = arith.truncf %199 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:54:03.9266185Z       %201 = arith.extsi %139 : i32 to i64
2026-02-21T09:54:03.9266302Z       %202 = arith.extsi %142 : i32 to i64
2026-02-21T09:54:03.9266478Z       %203 = tt.splat %201 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9266694Z       %204 = arith.addi %203, %19 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9266963Z       %205 = tt.expand_dims %204 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9267204Z       %206 = arith.muli %205, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9267383Z       %207 = tt.broadcast %206 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9267595Z       %208 = tt.splat %202 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9267801Z       %209 = arith.addi %208, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9268067Z       %210 = tt.expand_dims %209 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9268330Z       %211 = tt.broadcast %210 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9268526Z       %212 = arith.addi %207, %211 : tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9268720Z       %213 = tt.addptr %18, %212 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9268923Z       %214 = arith.cmpi sge, %205, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9269091Z       %215 = arith.cmpi slt, %205, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9269248Z       %216 = arith.andi %214, %215 : tensor<128x1xi1, #mma>
2026-02-21T09:54:03.9269424Z       %217 = tt.broadcast %216 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9269609Z       %218 = arith.cmpi sge, %210, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9269798Z       %219 = arith.cmpi slt, %210, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9269960Z       %220 = arith.andi %218, %219 : tensor<1x256xi1, #mma>
2026-02-21T09:54:03.9270133Z       %221 = tt.broadcast %220 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9270320Z       %222 = arith.andi %217, %221 : tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9270482Z       tt.store %213, %200, %222 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:03.9270629Z       %223 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:54:03.9270786Z       %224 = arith.divsi %223, %c256_i32 : i32
2026-02-21T09:54:03.9270906Z       %225 = arith.muli %224, %c8_i32 : i32
2026-02-21T09:54:03.9271030Z       %226 = arith.subi %c128_i32, %225 : i32
2026-02-21T09:54:03.9271147Z       %227 = arith.minsi %226, %c8_i32 : i32
2026-02-21T09:54:03.9271276Z       %228 = arith.remsi %223, %c256_i32 : i32
2026-02-21T09:54:03.9271416Z       %229 = arith.remsi %228, %227 : i32
2026-02-21T09:54:03.9271532Z       %230 = arith.addi %225, %229 : i32
2026-02-21T09:54:03.9271648Z       %231 = arith.divsi %228, %227 : i32
2026-02-21T09:54:03.9271778Z       %232 = arith.muli %230, %c128_i32 : i32
2026-02-21T09:54:03.9271966Z       %233 = tt.splat %232 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:03.9272193Z       %234 = arith.addi %233, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:03.9272368Z       %235 = arith.muli %231, %c256_i32 : i32
2026-02-21T09:54:03.9272539Z       %236 = tt.splat %235 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:03.9272758Z       %237 = arith.addi %236, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:03.9273055Z       %238 = tt.expand_dims %234 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:54:03.9273309Z       %239 = arith.muli %238, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:03.9273509Z       %240 = tt.broadcast %239 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9273793Z       %241 = tt.expand_dims %237 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:54:03.9274073Z       %242 = tt.broadcast %241 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9274296Z       %243 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:03.9274486Z       %244 = arith.addi %240, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9274686Z       %245 = tt.addptr %9, %244 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9274893Z       %246 = tt.load %245 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9275179Z       %247 = ttg.memdesc_index %243[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9275543Z       ttg.local_store %246, %247 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9275786Z       %248 = arith.addi %240, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9276000Z       %249 = tt.addptr %9, %248 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9276204Z       %250 = tt.load %249 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9276484Z       %251 = ttg.memdesc_index %243[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9276844Z       ttg.local_store %250, %251 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9277370Z       %252:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %247, %arg8 = %251) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:54:03.9277863Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9278093Z         %410 = arith.addi %409, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9278268Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:54:03.9278392Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:54:03.9278574Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9278795Z         %414 = arith.addi %413, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9279071Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:03.9279349Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9279568Z         %417 = arith.addi %240, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9279827Z         %418 = tt.addptr %9, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9280036Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9280341Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9280775Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9281174Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9281425Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9281616Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9281813Z         %425 = arith.addi %424, %242 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9282012Z         %426 = tt.addptr %10, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9282290Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9282545Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9282883Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9283126Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9283360Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9283653Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9283993Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9284295Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9284541Z         %435 = arith.select %15, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9284780Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9285021Z         %437 = arith.select %17, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9285256Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9285500Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9285754Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9286081Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9286606Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9286957Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:03.9287147Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:54:03.9287283Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:54:03.9287551Z         %446 = ttg.memdesc_index %243[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9287916Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9288317Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9288622Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:54:03.9288913Z       %253 = ttg.local_load %252#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9289366Z       %254 = arith.extf %253 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9289665Z       %255 = arith.addi %64, %242 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9289865Z       %256 = tt.addptr %10, %255 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9290064Z       %257 = tt.load %256 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9290311Z       %258 = ttg.convert_layout %257 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9290593Z       %259 = arith.shli %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9290830Z       %260 = arith.shrsi %259, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9291069Z       %261 = arith.shrsi %258, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9291360Z       %262 = tt.expand_dims %260 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9291706Z       %263 = tt.expand_dims %261 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9291994Z       %264 = tt.broadcast %262 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9292299Z       %265 = arith.select %15, %264, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9292544Z       %266 = tt.broadcast %263 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9292787Z       %267 = arith.select %17, %266, %265 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9293025Z       %268 = tt.reshape %267 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9293255Z       %269 = arith.sitofp %268 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9293510Z       %270 = ttg.local_alloc %269 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9294027Z       %271 = ttg.local_load %270 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9294513Z       %272 = tt.dot %254, %271, %252#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9295026Z       %273 = ttg.local_load %252#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9295462Z       %274 = arith.extf %273 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9295769Z       %275 = arith.addi %88, %242 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9295970Z       %276 = tt.addptr %10, %275 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9296214Z       %277 = tt.load %276 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9296463Z       %278 = ttg.convert_layout %277 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9296804Z       %279 = arith.shli %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9297074Z       %280 = arith.shrsi %279, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9297320Z       %281 = arith.shrsi %278, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9297618Z       %282 = tt.expand_dims %280 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9297975Z       %283 = tt.expand_dims %281 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9298266Z       %284 = tt.broadcast %282 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9298512Z       %285 = arith.select %15, %284, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9298753Z       %286 = tt.broadcast %283 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9298995Z       %287 = arith.select %17, %286, %285 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9299230Z       %288 = tt.reshape %287 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9299460Z       %289 = arith.sitofp %288 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9299719Z       %290 = ttg.local_alloc %289 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9300068Z       %291 = ttg.local_load %290 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9300611Z       %292 = tt.dot %274, %291, %272, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9301017Z       ttg.local_dealloc %243 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:03.9301245Z       %293 = arith.truncf %292 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:54:03.9301426Z       %294 = arith.extsi %232 : i32 to i64
2026-02-21T09:54:03.9301548Z       %295 = arith.extsi %235 : i32 to i64
2026-02-21T09:54:03.9301747Z       %296 = tt.splat %294 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9301979Z       %297 = arith.addi %296, %19 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9302257Z       %298 = tt.expand_dims %297 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9302536Z       %299 = arith.muli %298, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9302719Z       %300 = tt.broadcast %299 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9302986Z       %301 = tt.splat %295 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9303197Z       %302 = arith.addi %301, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9303482Z       %303 = tt.expand_dims %302 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9303749Z       %304 = tt.broadcast %303 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9303935Z       %305 = arith.addi %300, %304 : tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9304133Z       %306 = tt.addptr %18, %305 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9304340Z       %307 = arith.cmpi sge, %298, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9304511Z       %308 = arith.cmpi slt, %298, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9304671Z       %309 = arith.andi %307, %308 : tensor<128x1xi1, #mma>
2026-02-21T09:54:03.9304854Z       %310 = tt.broadcast %309 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9305096Z       %311 = arith.cmpi sge, %303, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9305262Z       %312 = arith.cmpi slt, %303, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9305437Z       %313 = arith.andi %311, %312 : tensor<1x256xi1, #mma>
2026-02-21T09:54:03.9305658Z       %314 = tt.broadcast %313 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9305843Z       %315 = arith.andi %310, %314 : tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9306028Z       tt.store %306, %293, %315 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:03.9306223Z       %316 = arith.addi %arg3, %c3_i32 : i32
2026-02-21T09:54:03.9306353Z       %317 = arith.divsi %316, %c256_i32 : i32
2026-02-21T09:54:03.9306477Z       %318 = arith.muli %317, %c8_i32 : i32
2026-02-21T09:54:03.9306607Z       %319 = arith.subi %c128_i32, %318 : i32
2026-02-21T09:54:03.9306727Z       %320 = arith.minsi %319, %c8_i32 : i32
2026-02-21T09:54:03.9306853Z       %321 = arith.remsi %316, %c256_i32 : i32
2026-02-21T09:54:03.9306974Z       %322 = arith.remsi %321, %320 : i32
2026-02-21T09:54:03.9307098Z       %323 = arith.addi %318, %322 : i32
2026-02-21T09:54:03.9307220Z       %324 = arith.divsi %321, %320 : i32
2026-02-21T09:54:03.9307359Z       %325 = arith.muli %323, %c128_i32 : i32
2026-02-21T09:54:03.9307537Z       %326 = tt.splat %325 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:03.9307767Z       %327 = arith.addi %326, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:03.9307948Z       %328 = arith.muli %324, %c256_i32 : i32
2026-02-21T09:54:03.9308188Z       %329 = tt.splat %328 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:03.9308426Z       %330 = arith.addi %329, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:03.9308723Z       %331 = tt.expand_dims %327 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:54:03.9308995Z       %332 = arith.muli %331, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:03.9309207Z       %333 = tt.broadcast %332 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9309489Z       %334 = tt.expand_dims %330 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:54:03.9309776Z       %335 = tt.broadcast %334 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9310003Z       %336 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:03.9310276Z       %337 = arith.addi %333, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9310487Z       %338 = tt.addptr %9, %337 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9310768Z       %339 = tt.load %338 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9311060Z       %340 = ttg.memdesc_index %336[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9311477Z       ttg.local_store %339, %340 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9311723Z       %341 = arith.addi %333, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9311932Z       %342 = tt.addptr %9, %341 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9312141Z       %343 = tt.load %342 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9312455Z       %344 = ttg.memdesc_index %336[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9312829Z       ttg.local_store %343, %344 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9313356Z       %345:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %340, %arg8 = %344) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:54:03.9313875Z         %409 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9314127Z         %410 = arith.addi %409, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9314306Z         %411 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:54:03.9314437Z         %412 = arith.muli %411, %c2_i32 : i32
2026-02-21T09:54:03.9314609Z         %413 = tt.splat %412 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9314834Z         %414 = arith.addi %413, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9315114Z         %415 = tt.expand_dims %414 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:03.9315443Z         %416 = tt.broadcast %415 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9315652Z         %417 = arith.addi %333, %416 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9315879Z         %418 = tt.addptr %9, %417 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9316091Z         %419 = tt.load %418 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9316395Z         %420 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9316838Z         %421 = arith.extf %420 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9317253Z         %422 = tt.expand_dims %410 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9317501Z         %423 = arith.muli %422, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9317701Z         %424 = tt.broadcast %423 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9317905Z         %425 = arith.addi %424, %335 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9318109Z         %426 = tt.addptr %10, %425 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9318330Z         %427 = tt.load %426 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9318593Z         %428 = ttg.convert_layout %427 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9318883Z         %429 = arith.shli %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9319129Z         %430 = arith.shrsi %429, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9319374Z         %431 = arith.shrsi %428, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9319687Z         %432 = tt.expand_dims %430 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9320025Z         %433 = tt.expand_dims %431 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9320318Z         %434 = tt.broadcast %432 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9320570Z         %435 = arith.select %15, %434, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9320813Z         %436 = tt.broadcast %433 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9321057Z         %437 = arith.select %17, %436, %435 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9321293Z         %438 = tt.reshape %437 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9321528Z         %439 = arith.sitofp %438 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9321798Z         %440 = ttg.local_alloc %439 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9322161Z         %441 = ttg.local_load %440 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9322692Z         %442 = tt.dot %421, %441, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9323049Z         %443 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:03.9323181Z         %444 = arith.cmpi slt, %443, %c2_i32 : i32
2026-02-21T09:54:03.9323321Z         %445 = arith.select %444, %443, %c0_i32 : i32
2026-02-21T09:54:03.9323592Z         %446 = ttg.memdesc_index %336[%445] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9323958Z         ttg.local_store %419, %446 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9324364Z         scf.yield %442, %445, %arg8, %446 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9324673Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:54:03.9324957Z       %346 = ttg.local_load %345#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9325390Z       %347 = arith.extf %346 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9325711Z       %348 = arith.addi %64, %335 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9325919Z       %349 = tt.addptr %10, %348 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9326121Z       %350 = tt.load %349 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9326374Z       %351 = ttg.convert_layout %350 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9326657Z       %352 = arith.shli %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9326921Z       %353 = arith.shrsi %352, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9327166Z       %354 = arith.shrsi %351, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9327463Z       %355 = tt.expand_dims %353 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9327824Z       %356 = tt.expand_dims %354 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9328113Z       %357 = tt.broadcast %355 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9328363Z       %358 = arith.select %15, %357, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9328611Z       %359 = tt.broadcast %356 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9328852Z       %360 = arith.select %17, %359, %358 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9329091Z       %361 = tt.reshape %360 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9329319Z       %362 = arith.sitofp %361 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9329581Z       %363 = ttg.local_alloc %362 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9329914Z       %364 = ttg.local_load %363 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9330408Z       %365 = tt.dot %347, %364, %345#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9330907Z       %366 = ttg.local_load %345#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9331342Z       %367 = arith.extf %366 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9331640Z       %368 = arith.addi %88, %335 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9331846Z       %369 = tt.addptr %10, %368 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9332047Z       %370 = tt.load %369 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9332294Z       %371 = ttg.convert_layout %370 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9332584Z       %372 = arith.shli %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9332822Z       %373 = arith.shrsi %372, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9333064Z       %374 = arith.shrsi %371, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9333352Z       %375 = tt.expand_dims %373 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9333708Z       %376 = tt.expand_dims %374 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9333997Z       %377 = tt.broadcast %375 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9334235Z       %378 = arith.select %15, %377, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9334475Z       %379 = tt.broadcast %376 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9334708Z       %380 = arith.select %17, %379, %378 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9334955Z       %381 = tt.reshape %380 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9335179Z       %382 = arith.sitofp %381 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9335433Z       %383 = ttg.local_alloc %382 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9335774Z       %384 = ttg.local_load %383 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9336241Z       %385 = tt.dot %367, %384, %365, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9336628Z       ttg.local_dealloc %336 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:03.9336847Z       %386 = arith.truncf %385 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:54:03.9337021Z       %387 = arith.extsi %325 : i32 to i64
2026-02-21T09:54:03.9337141Z       %388 = arith.extsi %328 : i32 to i64
2026-02-21T09:54:03.9337308Z       %389 = tt.splat %387 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9337520Z       %390 = arith.addi %389, %19 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9337788Z       %391 = tt.expand_dims %390 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9338028Z       %392 = arith.muli %391, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9338210Z       %393 = tt.broadcast %392 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9338434Z       %394 = tt.splat %388 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9338643Z       %395 = arith.addi %394, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9338908Z       %396 = tt.expand_dims %395 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9339168Z       %397 = tt.broadcast %396 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9339355Z       %398 = arith.addi %393, %397 : tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9339545Z       %399 = tt.addptr %18, %398 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9339750Z       %400 = arith.cmpi sge, %391, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9339917Z       %401 = arith.cmpi slt, %391, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9340072Z       %402 = arith.andi %400, %401 : tensor<128x1xi1, #mma>
2026-02-21T09:54:03.9340251Z       %403 = tt.broadcast %402 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9340433Z       %404 = arith.cmpi sge, %396, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9340599Z       %405 = arith.cmpi slt, %396, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9340755Z       %406 = arith.andi %404, %405 : tensor<1x256xi1, #mma>
2026-02-21T09:54:03.9340926Z       %407 = tt.broadcast %406 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9341121Z       %408 = arith.andi %403, %407 : tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9341281Z       tt.store %399, %386, %408 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:03.9341429Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:54:03.9341550Z     scf.for %arg3 = %24 to %2 step %c1_i32  : i32 {
2026-02-21T09:54:03.9341685Z       %25 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:54:03.9341804Z       %26 = arith.muli %25, %c8_i32 : i32
2026-02-21T09:54:03.9341922Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:54:03.9342042Z       %28 = arith.minsi %27, %c8_i32 : i32
2026-02-21T09:54:03.9342159Z       %29 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:54:03.9342279Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:54:03.9342409Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:54:03.9342524Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:54:03.9342635Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:54:03.9342803Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:03.9343028Z       %35 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:03.9343196Z       %36 = arith.muli %32, %c256_i32 : i32
2026-02-21T09:54:03.9343373Z       %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:03.9343588Z       %38 = arith.addi %37, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:03.9343864Z       %39 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:54:03.9344113Z       %40 = arith.muli %39, %cst_11 : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:03.9344307Z       %41 = tt.broadcast %40 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9344582Z       %42 = tt.expand_dims %38 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:54:03.9344857Z       %43 = tt.broadcast %42 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9345076Z       %44 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:03.9345342Z       %45 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:03.9345609Z       %46 = tt.broadcast %45 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9345814Z       %47 = arith.addi %41, %46 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9346006Z       %48 = tt.addptr %9, %47 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9346208Z       %49 = tt.load %48 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9346487Z       %50 = ttg.memdesc_index %44[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9346848Z       ttg.local_store %49, %50 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9347121Z       %51 = arith.addi %8, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9347391Z       %52 = tt.expand_dims %51 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:03.9347659Z       %53 = tt.broadcast %52 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9347850Z       %54 = arith.addi %41, %53 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9348041Z       %55 = tt.addptr %9, %54 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9348244Z       %56 = tt.load %55 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9348520Z       %57 = ttg.memdesc_index %44[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9348890Z       ttg.local_store %56, %57 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9349408Z       %58:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_6, %arg6 = %c1_i32, %arg7 = %50, %arg8 = %57) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:54:03.9349879Z         %130 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9350107Z         %131 = arith.addi %130, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9350301Z         %132 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:54:03.9350425Z         %133 = arith.muli %132, %c2_i32 : i32
2026-02-21T09:54:03.9350594Z         %134 = tt.splat %133 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9350816Z         %135 = arith.addi %134, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:03.9351105Z         %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:03.9351382Z         %137 = tt.broadcast %136 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9351577Z         %138 = arith.addi %41, %137 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9351780Z         %139 = tt.addptr %9, %138 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:03.9351990Z         %140 = tt.load %139 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:03.9352293Z         %141 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9352728Z         %142 = arith.extf %141 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9353112Z         %143 = tt.expand_dims %131 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9353365Z         %144 = arith.muli %143, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9353557Z         %145 = tt.broadcast %144 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9353768Z         %146 = arith.addi %145, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9353967Z         %147 = tt.addptr %10, %146 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9354172Z         %148 = tt.load %147 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9354417Z         %149 = ttg.convert_layout %148 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9354697Z         %150 = arith.shli %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9354935Z         %151 = arith.shrsi %150, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9355171Z         %152 = arith.shrsi %149, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9355463Z         %153 = tt.expand_dims %151 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9355802Z         %154 = tt.expand_dims %152 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9356086Z         %155 = tt.broadcast %153 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9356332Z         %156 = arith.select %15, %155, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9356590Z         %157 = tt.broadcast %154 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9356825Z         %158 = arith.select %17, %157, %156 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9357061Z         %159 = tt.reshape %158 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9357284Z         %160 = arith.sitofp %159 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9357540Z         %161 = ttg.local_alloc %160 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9357868Z         %162 = ttg.local_load %161 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9358364Z         %163 = tt.dot %142, %162, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9358713Z         %164 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:03.9358840Z         %165 = arith.cmpi slt, %164, %c2_i32 : i32
2026-02-21T09:54:03.9358997Z         %166 = arith.select %165, %164, %c0_i32 : i32
2026-02-21T09:54:03.9359265Z         %167 = ttg.memdesc_index %44[%166] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9359628Z         ttg.local_store %140, %167 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9360031Z         scf.yield %163, %166, %arg8, %167 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:03.9360335Z       } {tt.flatten, tt.num_stages = 3 : i32}
2026-02-21T09:54:03.9360509Z       %59 = arith.addi %7, %cst_7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9360837Z       %60 = ttg.local_load %58#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9361261Z       %61 = arith.extf %60 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9361660Z       %62 = tt.expand_dims %59 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9361903Z       %63 = arith.muli %62, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9362092Z       %64 = tt.broadcast %63 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9362282Z       %65 = arith.addi %64, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9362472Z       %66 = tt.addptr %10, %65 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9362737Z       %67 = tt.load %66 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9362978Z       %68 = ttg.convert_layout %67 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9363254Z       %69 = arith.shli %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9363488Z       %70 = arith.shrsi %69, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9363723Z       %71 = arith.shrsi %68, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9364007Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9364339Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9364634Z       %74 = tt.broadcast %72 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9364873Z       %75 = arith.select %15, %74, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9365105Z       %76 = tt.broadcast %73 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9365330Z       %77 = arith.select %17, %76, %75 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9365553Z       %78 = tt.reshape %77 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9365769Z       %79 = arith.sitofp %78 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9366037Z       %80 = ttg.local_alloc %79 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9366355Z       %81 = ttg.local_load %80 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9366851Z       %82 = tt.dot %61, %81, %58#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9367240Z       %83 = arith.addi %7, %cst_8 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:03.9367566Z       %84 = ttg.local_load %58#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9367991Z       %85 = arith.extf %84 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9368366Z       %86 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9368613Z       %87 = arith.muli %86, %cst_10 : tensor<2x1xi32, #blocked1>
2026-02-21T09:54:03.9368803Z       %88 = tt.broadcast %87 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9368989Z       %89 = arith.addi %88, %43 : tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9369185Z       %90 = tt.addptr %10, %89 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:54:03.9369379Z       %91 = tt.load %90 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:03.9369638Z       %92 = ttg.convert_layout %91 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9369915Z       %93 = arith.shli %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9370145Z       %94 = arith.shrsi %93, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9370377Z       %95 = arith.shrsi %92, %cst_13 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:03.9370659Z       %96 = tt.expand_dims %94 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9370992Z       %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:54:03.9371273Z       %98 = tt.broadcast %96 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9371506Z       %99 = arith.select %15, %98, %cst_12 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9371742Z       %100 = tt.broadcast %97 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9371975Z       %101 = arith.select %17, %100, %99 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:54:03.9372204Z       %102 = tt.reshape %101 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:54:03.9372432Z       %103 = arith.sitofp %102 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:54:03.9372698Z       %104 = ttg.local_alloc %103 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:54:03.9373027Z       %105 = ttg.local_load %104 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:03.9373495Z       %106 = tt.dot %85, %105, %82, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:54:03.9373873Z       ttg.local_dealloc %44 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:03.9374109Z       %107 = arith.truncf %106 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:54:03.9374279Z       %108 = arith.extsi %33 : i32 to i64
2026-02-21T09:54:03.9374399Z       %109 = arith.extsi %36 : i32 to i64
2026-02-21T09:54:03.9374564Z       %110 = tt.splat %108 : i64 -> tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9374775Z       %111 = arith.addi %110, %19 : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:03.9375060Z       %112 = tt.expand_dims %111 {axis = 1 : i32} : tensor<128xi64, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9375298Z       %113 = arith.muli %112, %cst_3 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9375478Z       %114 = tt.broadcast %113 : tensor<128x1xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9375688Z       %115 = tt.splat %109 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9375896Z       %116 = arith.addi %115, %20 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:03.9376166Z       %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9376426Z       %118 = tt.broadcast %117 : tensor<1x256xi64, #mma> -> tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9376612Z       %119 = arith.addi %114, %118 : tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9376804Z       %120 = tt.addptr %18, %119 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi64, #mma>
2026-02-21T09:54:03.9377008Z       %121 = arith.cmpi sge, %112, %cst_2 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9377175Z       %122 = arith.cmpi slt, %112, %cst_1 : tensor<128x1xi64, #mma>
2026-02-21T09:54:03.9377330Z       %123 = arith.andi %121, %122 : tensor<128x1xi1, #mma>
2026-02-21T09:54:03.9377521Z       %124 = tt.broadcast %123 : tensor<128x1xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9377707Z       %125 = arith.cmpi sge, %117, %cst_0 : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9377871Z       %126 = arith.cmpi slt, %117, %cst : tensor<1x256xi64, #mma>
2026-02-21T09:54:03.9378026Z       %127 = arith.andi %125, %126 : tensor<1x256xi1, #mma>
2026-02-21T09:54:03.9378197Z       %128 = tt.broadcast %127 : tensor<1x256xi1, #mma> -> tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9378379Z       %129 = arith.andi %124, %128 : tensor<128x256xi1, #mma>
2026-02-21T09:54:03.9378539Z       tt.store %120, %107, %129 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:03.9378687Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:54:03.9378791Z     tt.return
2026-02-21T09:54:03.9378872Z   }
2026-02-21T09:54:03.9378947Z }
2026-02-21T09:54:03.9378994Z 
2026-02-21T09:54:03.9379025Z {-#
2026-02-21T09:54:03.9379108Z   external_resources: {
2026-02-21T09:54:03.9379206Z     mlir_reproducer: {
2026-02-21T09:54:03.9380209Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:54:03.9381211Z       disable_threading: false,
2026-02-21T09:54:03.9381316Z       verify_each: true
2026-02-21T09:54:03.9381407Z     }
2026-02-21T09:54:03.9381478Z   }
2026-02-21T09:54:03.9381549Z #-}
2026-02-21T09:54:03.9381822Z /tmp/torchinductor_root/s4/cs4axgay44a357gkb6qe3krrz4yax7524ixw3db4slq63gerq545.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:54:03.9382508Z /tmp/torchinductor_root/s4/cs4axgay44a357gkb6qe3krrz4yax7524ixw3db4slq63gerq545.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:54:03.9383063Z [574s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:54:03.9383860Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:54:03.9384562Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:54:03.9384732Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:54:05.4115233Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:54:05.4124955Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:54:05.4126212Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T09:54:05.4127371Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:54:05.4128543Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T09:54:05.4129938Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 1], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:54:05.4130763Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:54:05.4131455Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:54:05.4131809Z #smem = #ttg.shared_memory
2026-02-21T09:54:05.4132541Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:54:05.4134091Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:54:05.4135347Z     %cst = arith.constant dense<4> : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4135809Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:54:05.4136168Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:54:05.4136524Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:54:05.4136987Z     %cst_0 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4137547Z     %cst_1 = arith.constant dense<0> : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4138003Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:54:05.4138370Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:54:05.4138740Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:54:05.4139206Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:54:05.4139412Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:54:05.4139575Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:54:05.4139778Z     %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:05.4140035Z     %cst_3 = arith.constant dense<8192> : tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4140284Z     %cst_4 = arith.constant dense<0> : tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4140639Z     %cst_5 = arith.constant dense<512> : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4141038Z     %cst_6 = arith.constant dense<0> : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4141541Z     %cst_7 = arith.constant dense<8192> : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4142028Z     %cst_8 = arith.constant dense<2> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4142619Z     %cst_9 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4143041Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:54:05.4143399Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4143901Z     %cst_11 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:54:05.4144312Z     %cst_12 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:54:05.4144705Z     %cst_13 = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:54:05.4145039Z     %0 = tt.get_program_id x : i32
2026-02-21T09:54:05.4145289Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:54:05.4145543Z     %2 = arith.minsi %1, %c8192_i32 : i32
2026-02-21T09:54:05.4146011Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:05.4146667Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:05.4147296Z     %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:54:05.4147928Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:05.4148555Z     %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4149122Z     %8 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4149593Z     %9 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4150125Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4150719Z     %11 = arith.extsi %10 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4151384Z     %12 = arith.extsi %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:54:05.4152046Z     %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:54:05.4152749Z     %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:54:05.4153233Z     %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:54:05.4153697Z     %16 = arith.cmpi eq, %15, %cst_11 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:54:05.4154058Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1>
2026-02-21T09:54:05.4154408Z     %18 = arith.cmpi eq, %15, %cst_12 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:54:05.4154755Z     %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x128xi1, #blocked1>
2026-02-21T09:54:05.4155011Z     %20 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:05.4155184Z     %21 = arith.subi %2, %0 : i32
2026-02-21T09:54:05.4155340Z     %22 = arith.remsi %21, %c3_i32 : i32
2026-02-21T09:54:05.4155546Z     %23 = arith.subi %21, %22 : i32
2026-02-21T09:54:05.4155740Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:54:05.4155957Z     scf.for %arg3 = %0 to %24 step %c3_i32  : i32 {
2026-02-21T09:54:05.4156196Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:54:05.4156414Z       %26 = arith.muli %25, %c4_i32 : i32
2026-02-21T09:54:05.4156619Z       %27 = arith.subi %c64_i32, %26 : i32
2026-02-21T09:54:05.4156856Z       %28 = arith.minsi %27, %c4_i32 : i32
2026-02-21T09:54:05.4157063Z       %29 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:54:05.4157277Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:54:05.4157476Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:54:05.4157675Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:54:05.4157870Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:54:05.4158149Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:05.4158555Z       %35 = arith.addi %34, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:05.4158838Z       %36 = arith.muli %32, %c128_i32 : i32
2026-02-21T09:54:05.4159133Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:05.4159489Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:05.4159828Z       %39 = arith.addi %37, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:05.4160170Z       %40 = arith.addi %38, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:05.4160648Z       %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:54:05.4161054Z       %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:05.4161367Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4161648Z       %44 = arith.extsi %33 : i32 to i64
2026-02-21T09:54:05.4161908Z       %45 = tt.splat %44 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:54:05.4162260Z       %46 = arith.addi %45, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:54:05.4162803Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4163255Z       %48 = tt.broadcast %47 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4163478Z       %49 = arith.cmpi sge, %47, %cst_4 : tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4163646Z       %50 = arith.cmpi slt, %47, %cst_3 : tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4163805Z       %51 = arith.andi %49, %50 : tensor<1x128xi1, #blocked>
2026-02-21T09:54:05.4164094Z       %52 = tt.broadcast %51 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4164443Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:05.4164887Z       %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:05.4165332Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4165584Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4165782Z       %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4165980Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4166214Z       %59 = tt.expand_dims %11 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4166473Z       %60 = arith.muli %59, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4166653Z       %61 = tt.broadcast %60 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4166837Z       %62 = arith.addi %61, %48 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4167025Z       %63 = tt.addptr %9, %62 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4167229Z       %64 = arith.cmpi sge, %59, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4167395Z       %65 = arith.cmpi slt, %59, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4167568Z       %66 = arith.andi %64, %65 : tensor<2x1xi1, #blocked>
2026-02-21T09:54:05.4167744Z       %67 = tt.broadcast %66 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4167927Z       %68 = arith.andi %67, %52 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4168133Z       %69 = tt.load %63, %68, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4168491Z       %70 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4168850Z       ttg.local_store %58, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4169124Z       %71 = arith.addi %7, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4169400Z       %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:05.4169678Z       %73 = tt.broadcast %72 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4169869Z       %74 = arith.addi %43, %73 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4170064Z       %75 = tt.addptr %8, %74 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4170266Z       %76 = tt.load %75 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4170451Z       %77 = arith.addi %11, %cst_8 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4170722Z       %78 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4170956Z       %79 = arith.muli %78, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4171151Z       %80 = tt.broadcast %79 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4171335Z       %81 = arith.addi %80, %48 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4171520Z       %82 = tt.addptr %9, %81 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4171722Z       %83 = arith.cmpi sge, %78, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4171885Z       %84 = arith.cmpi slt, %78, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4172044Z       %85 = arith.andi %83, %84 : tensor<2x1xi1, #blocked>
2026-02-21T09:54:05.4172219Z       %86 = tt.broadcast %85 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4172400Z       %87 = arith.andi %86, %52 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4172607Z       %88 = tt.load %82, %87, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4172942Z       %89 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4173302Z       ttg.local_store %76, %89 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4173926Z       %90:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %70, %arg8 = %89, %arg9 = %69, %arg10 = %88) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:54:05.4174465Z         %317 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:54:05.4174592Z         %318 = arith.muli %317, %c2_i32 : i32
2026-02-21T09:54:05.4174766Z         %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4174990Z         %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4175269Z         %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:05.4175574Z         %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4175789Z         %323 = arith.addi %43, %322 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4175996Z         %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4176204Z         %325 = tt.load %324 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4176524Z         %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4176965Z         %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4177251Z         %328 = arith.extsi %317 : i32 to i64
2026-02-21T09:54:05.4177421Z         %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4177642Z         %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4177915Z         %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4178157Z         %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4178356Z         %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4178548Z         %334 = arith.addi %333, %48 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4178747Z         %335 = tt.addptr %9, %334 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4178977Z         %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4179145Z         %337 = arith.cmpi slt, %331, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4179313Z         %338 = arith.andi %336, %337 : tensor<2x1xi1, #blocked>
2026-02-21T09:54:05.4179500Z         %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4179689Z         %340 = arith.andi %339, %52 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4179855Z         %341 = tt.load %335, %340, %cst_1 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4180031Z         %342 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4180191Z         %343 = arith.shrsi %342, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4180436Z         %344 = ttg.convert_layout %343 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4180688Z         %345 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4180935Z         %346 = ttg.convert_layout %345 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4181273Z         %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4181618Z         %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4181922Z         %349 = tt.broadcast %347 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4182172Z         %350 = arith.select %17, %349, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4182418Z         %351 = tt.broadcast %348 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4182662Z         %352 = arith.select %19, %351, %350 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4182900Z         %353 = tt.reshape %352 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4183145Z         %354 = arith.sitofp %353 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4183402Z         %355 = ttg.local_alloc %354 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4183731Z         %356 = ttg.local_load %355 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4184225Z         %357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4184575Z         %358 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:05.4184704Z         %359 = arith.cmpi slt, %358, %c2_i32 : i32
2026-02-21T09:54:05.4184840Z         %360 = arith.select %359, %358, %c0_i32 : i32
2026-02-21T09:54:05.4185110Z         %361 = ttg.memdesc_index %53[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4185475Z         ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4185969Z         scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4186398Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:05.4186716Z       %91 = ttg.local_load %90#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4187162Z       %92 = arith.extf %91 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4187459Z       %93 = arith.shli %90#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4187620Z       %94 = arith.shrsi %93, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4187859Z       %95 = ttg.convert_layout %94 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4188099Z       %96 = arith.shrsi %90#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4188339Z       %97 = ttg.convert_layout %96 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4188670Z       %98 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4189012Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4189299Z       %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4189544Z       %101 = arith.select %17, %100, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4189788Z       %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4190038Z       %103 = arith.select %19, %102, %101 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4190280Z       %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4190506Z       %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4190758Z       %106 = ttg.local_alloc %105 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4191086Z       %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4191570Z       %108 = tt.dot %92, %107, %90#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4192061Z       %109 = ttg.local_load %90#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4192508Z       %110 = arith.extf %109 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4192807Z       %111 = arith.shli %90#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4192968Z       %112 = arith.shrsi %111, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4193216Z       %113 = ttg.convert_layout %112 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4193471Z       %114 = arith.shrsi %90#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4193716Z       %115 = ttg.convert_layout %114 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4194054Z       %116 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4194400Z       %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4194689Z       %118 = tt.broadcast %116 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4194936Z       %119 = arith.select %17, %118, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4195205Z       %120 = tt.broadcast %117 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4195444Z       %121 = arith.select %19, %120, %119 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4195683Z       %122 = tt.reshape %121 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4195911Z       %123 = arith.sitofp %122 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4196164Z       %124 = ttg.local_alloc %123 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4196491Z       %125 = ttg.local_load %124 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4196966Z       %126 = tt.dot %110, %125, %108, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4197350Z       ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:05.4197568Z       %127 = arith.truncf %126 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:54:05.4197839Z       %128 = tt.expand_dims %40 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:05.4198095Z       %129 = arith.muli %128, %cst_13 : tensor<128x1xi32, #mma>
2026-02-21T09:54:05.4198331Z       %130 = tt.expand_dims %35 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:54:05.4198591Z       %131 = tt.broadcast %129 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4198801Z       %132 = tt.broadcast %130 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4198986Z       %133 = arith.addi %131, %132 : tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4199184Z       %134 = tt.addptr %20, %133 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4199407Z       tt.store %134, %127 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:05.4199550Z       %135 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:54:05.4199675Z       %136 = arith.divsi %135, %c512_i32 : i32
2026-02-21T09:54:05.4199797Z       %137 = arith.muli %136, %c4_i32 : i32
2026-02-21T09:54:05.4199920Z       %138 = arith.subi %c64_i32, %137 : i32
2026-02-21T09:54:05.4200035Z       %139 = arith.minsi %138, %c4_i32 : i32
2026-02-21T09:54:05.4200153Z       %140 = arith.remsi %135, %c512_i32 : i32
2026-02-21T09:54:05.4200269Z       %141 = arith.remsi %140, %139 : i32
2026-02-21T09:54:05.4200397Z       %142 = arith.addi %137, %141 : i32
2026-02-21T09:54:05.4200514Z       %143 = arith.divsi %140, %139 : i32
2026-02-21T09:54:05.4200630Z       %144 = arith.muli %142, %c128_i32 : i32
2026-02-21T09:54:05.4200799Z       %145 = tt.splat %144 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:05.4201009Z       %146 = arith.addi %145, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:05.4201182Z       %147 = arith.muli %143, %c128_i32 : i32
2026-02-21T09:54:05.4201350Z       %148 = tt.splat %147 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:05.4201571Z       %149 = tt.splat %147 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:05.4201789Z       %150 = arith.addi %148, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:05.4202002Z       %151 = arith.addi %149, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:05.4202277Z       %152 = tt.expand_dims %150 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:54:05.4202532Z       %153 = arith.muli %152, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:05.4202808Z       %154 = tt.broadcast %153 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4202995Z       %155 = arith.extsi %144 : i32 to i64
2026-02-21T09:54:05.4203161Z       %156 = tt.splat %155 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:54:05.4203384Z       %157 = arith.addi %156, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:54:05.4203658Z       %158 = tt.expand_dims %157 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4203939Z       %159 = tt.broadcast %158 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4204143Z       %160 = arith.cmpi sge, %158, %cst_4 : tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4204316Z       %161 = arith.cmpi slt, %158, %cst_3 : tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4204483Z       %162 = arith.andi %160, %161 : tensor<1x128xi1, #blocked>
2026-02-21T09:54:05.4204669Z       %163 = tt.broadcast %162 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4204884Z       %164 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:05.4205070Z       %165 = arith.addi %154, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4205271Z       %166 = tt.addptr %8, %165 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4205507Z       %167 = tt.load %166 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4205664Z       %168 = arith.addi %61, %159 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4205858Z       %169 = tt.addptr %9, %168 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4206054Z       %170 = arith.andi %67, %163 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4206270Z       %171 = tt.load %169, %170, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4206614Z       %172 = ttg.memdesc_index %164[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4206997Z       ttg.local_store %167, %172 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4207239Z       %173 = arith.addi %154, %73 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4207440Z       %174 = tt.addptr %8, %173 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4207647Z       %175 = tt.load %174 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4207805Z       %176 = arith.addi %80, %159 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4208012Z       %177 = tt.addptr %9, %176 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4208208Z       %178 = arith.andi %86, %163 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4208418Z       %179 = tt.load %177, %178, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4208756Z       %180 = ttg.memdesc_index %164[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4209120Z       ttg.local_store %175, %180 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4209762Z       %181:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %172, %arg8 = %180, %arg9 = %171, %arg10 = %179) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:54:05.4210292Z         %317 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:54:05.4210418Z         %318 = arith.muli %317, %c2_i32 : i32
2026-02-21T09:54:05.4210606Z         %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4210834Z         %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4211109Z         %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:05.4211388Z         %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4211586Z         %323 = arith.addi %154, %322 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4211788Z         %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4211999Z         %325 = tt.load %324 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4212298Z         %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4212738Z         %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4213020Z         %328 = arith.extsi %317 : i32 to i64
2026-02-21T09:54:05.4213188Z         %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4213407Z         %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4213694Z         %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4213941Z         %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4214132Z         %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4214322Z         %334 = arith.addi %333, %159 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4214520Z         %335 = tt.addptr %9, %334 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4214723Z         %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4214913Z         %337 = arith.cmpi slt, %331, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4215074Z         %338 = arith.andi %336, %337 : tensor<2x1xi1, #blocked>
2026-02-21T09:54:05.4215261Z         %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4215453Z         %340 = arith.andi %339, %163 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4215621Z         %341 = tt.load %335, %340, %cst_1 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4215811Z         %342 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4215971Z         %343 = arith.shrsi %342, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4216218Z         %344 = ttg.convert_layout %343 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4216469Z         %345 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4216714Z         %346 = ttg.convert_layout %345 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4217054Z         %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4217398Z         %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4217695Z         %349 = tt.broadcast %347 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4217947Z         %350 = arith.select %17, %349, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4218190Z         %351 = tt.broadcast %348 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4218451Z         %352 = arith.select %19, %351, %350 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4218686Z         %353 = tt.reshape %352 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4218914Z         %354 = arith.sitofp %353 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4219171Z         %355 = ttg.local_alloc %354 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4219503Z         %356 = ttg.local_load %355 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4219989Z         %357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4220343Z         %358 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:05.4220471Z         %359 = arith.cmpi slt, %358, %c2_i32 : i32
2026-02-21T09:54:05.4220608Z         %360 = arith.select %359, %358, %c0_i32 : i32
2026-02-21T09:54:05.4220876Z         %361 = ttg.memdesc_index %164[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4221238Z         ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4221742Z         scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4222165Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:05.4222486Z       %182 = ttg.local_load %181#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4222939Z       %183 = arith.extf %182 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4223239Z       %184 = arith.shli %181#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4223405Z       %185 = arith.shrsi %184, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4223646Z       %186 = ttg.convert_layout %185 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4223906Z       %187 = arith.shrsi %181#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4224150Z       %188 = ttg.convert_layout %187 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4224488Z       %189 = tt.expand_dims %186 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4224838Z       %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4225127Z       %191 = tt.broadcast %189 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4225377Z       %192 = arith.select %17, %191, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4225626Z       %193 = tt.broadcast %190 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4225867Z       %194 = arith.select %19, %193, %192 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4226106Z       %195 = tt.reshape %194 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4226329Z       %196 = arith.sitofp %195 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4226600Z       %197 = ttg.local_alloc %196 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4226929Z       %198 = ttg.local_load %197 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4227401Z       %199 = tt.dot %183, %198, %181#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4227899Z       %200 = ttg.local_load %181#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4228330Z       %201 = arith.extf %200 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4228632Z       %202 = arith.shli %181#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4228796Z       %203 = arith.shrsi %202, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4229041Z       %204 = ttg.convert_layout %203 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4229291Z       %205 = arith.shrsi %181#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4229534Z       %206 = ttg.convert_layout %205 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4229882Z       %207 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4230224Z       %208 = tt.expand_dims %206 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4230511Z       %209 = tt.broadcast %207 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4230758Z       %210 = arith.select %17, %209, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4231022Z       %211 = tt.broadcast %208 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4231261Z       %212 = arith.select %19, %211, %210 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4231497Z       %213 = tt.reshape %212 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4231728Z       %214 = arith.sitofp %213 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4232001Z       %215 = ttg.local_alloc %214 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4232327Z       %216 = ttg.local_load %215 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4232798Z       %217 = tt.dot %201, %216, %199, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4233183Z       ttg.local_dealloc %164 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:05.4233401Z       %218 = arith.truncf %217 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:54:05.4233672Z       %219 = tt.expand_dims %151 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:05.4233916Z       %220 = arith.muli %219, %cst_13 : tensor<128x1xi32, #mma>
2026-02-21T09:54:05.4234151Z       %221 = tt.expand_dims %146 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:54:05.4234414Z       %222 = tt.broadcast %220 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4234647Z       %223 = tt.broadcast %221 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4234828Z       %224 = arith.addi %222, %223 : tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4235023Z       %225 = tt.addptr %20, %224 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4235222Z       tt.store %225, %218 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:05.4235366Z       %226 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:54:05.4235489Z       %227 = arith.divsi %226, %c512_i32 : i32
2026-02-21T09:54:05.4235612Z       %228 = arith.muli %227, %c4_i32 : i32
2026-02-21T09:54:05.4235733Z       %229 = arith.subi %c64_i32, %228 : i32
2026-02-21T09:54:05.4235852Z       %230 = arith.minsi %229, %c4_i32 : i32
2026-02-21T09:54:05.4235972Z       %231 = arith.remsi %226, %c512_i32 : i32
2026-02-21T09:54:05.4236089Z       %232 = arith.remsi %231, %230 : i32
2026-02-21T09:54:05.4236210Z       %233 = arith.addi %228, %232 : i32
2026-02-21T09:54:05.4236326Z       %234 = arith.divsi %231, %230 : i32
2026-02-21T09:54:05.4236451Z       %235 = arith.muli %233, %c128_i32 : i32
2026-02-21T09:54:05.4236618Z       %236 = tt.splat %235 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:05.4236838Z       %237 = arith.addi %236, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:05.4237015Z       %238 = arith.muli %234, %c128_i32 : i32
2026-02-21T09:54:05.4237189Z       %239 = tt.splat %238 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:05.4237427Z       %240 = tt.splat %238 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:05.4237646Z       %241 = arith.addi %239, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:05.4237866Z       %242 = arith.addi %240, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:05.4238148Z       %243 = tt.expand_dims %241 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:54:05.4238406Z       %244 = arith.muli %243, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:05.4238630Z       %245 = tt.broadcast %244 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4238810Z       %246 = arith.extsi %235 : i32 to i64
2026-02-21T09:54:05.4238985Z       %247 = tt.splat %246 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:54:05.4239213Z       %248 = arith.addi %247, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:54:05.4239506Z       %249 = tt.expand_dims %248 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4239791Z       %250 = tt.broadcast %249 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4239996Z       %251 = arith.cmpi sge, %249, %cst_4 : tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4240180Z       %252 = arith.cmpi slt, %249, %cst_3 : tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4240353Z       %253 = arith.andi %251, %252 : tensor<1x128xi1, #blocked>
2026-02-21T09:54:05.4240543Z       %254 = tt.broadcast %253 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4240765Z       %255 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:05.4240955Z       %256 = arith.addi %245, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4241166Z       %257 = tt.addptr %8, %256 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4241375Z       %258 = tt.load %257 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4241541Z       %259 = arith.addi %61, %250 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4241741Z       %260 = tt.addptr %9, %259 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4241940Z       %261 = arith.andi %67, %254 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4242176Z       %262 = tt.load %260, %261, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4242520Z       %263 = ttg.memdesc_index %255[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4242916Z       ttg.local_store %258, %263 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4243168Z       %264 = arith.addi %245, %73 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4243371Z       %265 = tt.addptr %8, %264 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4243586Z       %266 = tt.load %265 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4243747Z       %267 = arith.addi %80, %250 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4243945Z       %268 = tt.addptr %9, %267 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4244150Z       %269 = arith.andi %86, %254 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4244363Z       %270 = tt.load %268, %269, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4244712Z       %271 = ttg.memdesc_index %255[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4245080Z       ttg.local_store %266, %271 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4245744Z       %272:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %263, %arg8 = %271, %arg9 = %262, %arg10 = %270) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:54:05.4246284Z         %317 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:54:05.4246411Z         %318 = arith.muli %317, %c2_i32 : i32
2026-02-21T09:54:05.4246611Z         %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4246841Z         %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4247121Z         %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:05.4247409Z         %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4247625Z         %323 = arith.addi %245, %322 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4247834Z         %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4248046Z         %325 = tt.load %324 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4248353Z         %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4248803Z         %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4249088Z         %328 = arith.extsi %317 : i32 to i64
2026-02-21T09:54:05.4249268Z         %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4249494Z         %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4249774Z         %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4250029Z         %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4250243Z         %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4250444Z         %334 = arith.addi %333, %250 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4250646Z         %335 = tt.addptr %9, %334 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4250855Z         %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4251032Z         %337 = arith.cmpi slt, %331, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4251198Z         %338 = arith.andi %336, %337 : tensor<2x1xi1, #blocked>
2026-02-21T09:54:05.4251388Z         %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4251582Z         %340 = arith.andi %339, %254 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4251756Z         %341 = tt.load %335, %340, %cst_1 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4251938Z         %342 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4252100Z         %343 = arith.shrsi %342, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4252351Z         %344 = ttg.convert_layout %343 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4252606Z         %345 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4252861Z         %346 = ttg.convert_layout %345 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4253222Z         %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4253570Z         %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4253871Z         %349 = tt.broadcast %347 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4254130Z         %350 = arith.select %17, %349, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4254383Z         %351 = tt.broadcast %348 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4254651Z         %352 = arith.select %19, %351, %350 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4254892Z         %353 = tt.reshape %352 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4255128Z         %354 = arith.sitofp %353 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4255387Z         %355 = ttg.local_alloc %354 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4255743Z         %356 = ttg.local_load %355 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4256228Z         %357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4256579Z         %358 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:05.4256714Z         %359 = arith.cmpi slt, %358, %c2_i32 : i32
2026-02-21T09:54:05.4256854Z         %360 = arith.select %359, %358, %c0_i32 : i32
2026-02-21T09:54:05.4257126Z         %361 = ttg.memdesc_index %255[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4257496Z         ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4257986Z         scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4258431Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:05.4258756Z       %273 = ttg.local_load %272#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4259192Z       %274 = arith.extf %273 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4259502Z       %275 = arith.shli %272#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4259670Z       %276 = arith.shrsi %275, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4259918Z       %277 = ttg.convert_layout %276 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4260172Z       %278 = arith.shrsi %272#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4260417Z       %279 = ttg.convert_layout %278 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4260760Z       %280 = tt.expand_dims %277 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4261113Z       %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4261420Z       %282 = tt.broadcast %280 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4261675Z       %283 = arith.select %17, %282, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4261924Z       %284 = tt.broadcast %281 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4262171Z       %285 = arith.select %19, %284, %283 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4262418Z       %286 = tt.reshape %285 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4262650Z       %287 = arith.sitofp %286 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4262924Z       %288 = ttg.local_alloc %287 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4263250Z       %289 = ttg.local_load %288 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4263751Z       %290 = tt.dot %274, %289, %272#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4264254Z       %291 = ttg.local_load %272#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4264689Z       %292 = arith.extf %291 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4264998Z       %293 = arith.shli %272#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4265168Z       %294 = arith.shrsi %293, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4265413Z       %295 = ttg.convert_layout %294 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4265668Z       %296 = arith.shrsi %272#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4265914Z       %297 = ttg.convert_layout %296 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4266259Z       %298 = tt.expand_dims %295 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4266627Z       %299 = tt.expand_dims %297 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4266918Z       %300 = tt.broadcast %298 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4267175Z       %301 = arith.select %17, %300, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4267423Z       %302 = tt.broadcast %299 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4267670Z       %303 = arith.select %19, %302, %301 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4267911Z       %304 = tt.reshape %303 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4268141Z       %305 = arith.sitofp %304 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4268403Z       %306 = ttg.local_alloc %305 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4268736Z       %307 = ttg.local_load %306 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4269214Z       %308 = tt.dot %292, %307, %290, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4269603Z       ttg.local_dealloc %255 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:05.4269837Z       %309 = arith.truncf %308 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:54:05.4270115Z       %310 = tt.expand_dims %242 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:05.4270363Z       %311 = arith.muli %310, %cst_13 : tensor<128x1xi32, #mma>
2026-02-21T09:54:05.4270599Z       %312 = tt.expand_dims %237 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:54:05.4270869Z       %313 = tt.broadcast %311 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4271101Z       %314 = tt.broadcast %312 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4271292Z       %315 = arith.addi %313, %314 : tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4271494Z       %316 = tt.addptr %20, %315 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4271701Z       tt.store %316, %309 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:05.4271848Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:54:05.4271971Z     scf.for %arg3 = %24 to %2 step %c1_i32  : i32 {
2026-02-21T09:54:05.4272128Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:54:05.4272252Z       %26 = arith.muli %25, %c4_i32 : i32
2026-02-21T09:54:05.4272379Z       %27 = arith.subi %c64_i32, %26 : i32
2026-02-21T09:54:05.4272498Z       %28 = arith.minsi %27, %c4_i32 : i32
2026-02-21T09:54:05.4272624Z       %29 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:54:05.4272748Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:54:05.4272865Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:54:05.4272982Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:54:05.4273096Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:54:05.4273262Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:05.4273473Z       %35 = arith.addi %34, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:05.4273643Z       %36 = arith.muli %32, %c128_i32 : i32
2026-02-21T09:54:05.4273816Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:05.4274028Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:05.4274246Z       %39 = arith.addi %37, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:05.4274476Z       %40 = arith.addi %38, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:05.4274751Z       %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:54:05.4275006Z       %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:05.4275196Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4275373Z       %44 = arith.extsi %33 : i32 to i64
2026-02-21T09:54:05.4275537Z       %45 = tt.splat %44 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:54:05.4275754Z       %46 = arith.addi %45, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:54:05.4276023Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4276297Z       %48 = tt.broadcast %47 : tensor<1x128xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4276493Z       %49 = arith.cmpi sge, %47, %cst_4 : tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4276661Z       %50 = arith.cmpi slt, %47, %cst_3 : tensor<1x128xi64, #blocked>
2026-02-21T09:54:05.4276824Z       %51 = arith.andi %49, %50 : tensor<1x128xi1, #blocked>
2026-02-21T09:54:05.4277000Z       %52 = tt.broadcast %51 : tensor<1x128xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4277229Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:05.4277495Z       %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:05.4277761Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4277950Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4278144Z       %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4278345Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4278601Z       %59 = tt.expand_dims %11 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4278836Z       %60 = arith.muli %59, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4279018Z       %61 = tt.broadcast %60 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4279203Z       %62 = arith.addi %61, %48 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4279392Z       %63 = tt.addptr %9, %62 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4279606Z       %64 = arith.cmpi sge, %59, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4279771Z       %65 = arith.cmpi slt, %59, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4279928Z       %66 = arith.andi %64, %65 : tensor<2x1xi1, #blocked>
2026-02-21T09:54:05.4280102Z       %67 = tt.broadcast %66 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4280286Z       %68 = arith.andi %67, %52 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4280493Z       %69 = tt.load %63, %68, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4280831Z       %70 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4281195Z       ttg.local_store %58, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4281468Z       %71 = arith.addi %7, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4281745Z       %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:05.4282030Z       %73 = tt.broadcast %72 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4282219Z       %74 = arith.addi %43, %73 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4282417Z       %75 = tt.addptr %8, %74 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4282658Z       %76 = tt.load %75 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4282847Z       %77 = arith.addi %11, %cst_8 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4283115Z       %78 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4283352Z       %79 = arith.muli %78, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4283534Z       %80 = tt.broadcast %79 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4283718Z       %81 = arith.addi %80, %48 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4283909Z       %82 = tt.addptr %9, %81 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4284107Z       %83 = arith.cmpi sge, %78, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4284273Z       %84 = arith.cmpi slt, %78, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4284429Z       %85 = arith.andi %83, %84 : tensor<2x1xi1, #blocked>
2026-02-21T09:54:05.4284604Z       %86 = tt.broadcast %85 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4284806Z       %87 = arith.andi %86, %52 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4285008Z       %88 = tt.load %82, %87, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4285341Z       %89 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4285699Z       ttg.local_store %76, %89 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4286324Z       %90:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %70, %arg8 = %89, %arg9 = %69, %arg10 = %88) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>)  : i32 {
2026-02-21T09:54:05.4286861Z         %135 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:54:05.4286987Z         %136 = arith.muli %135, %c2_i32 : i32
2026-02-21T09:54:05.4287161Z         %137 = tt.splat %136 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4287405Z         %138 = arith.addi %137, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:05.4287679Z         %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:05.4287957Z         %140 = tt.broadcast %139 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4288154Z         %141 = arith.addi %43, %140 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4288360Z         %142 = tt.addptr %8, %141 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:05.4288569Z         %143 = tt.load %142 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:05.4288869Z         %144 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4289309Z         %145 = arith.extf %144 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4289590Z         %146 = arith.extsi %135 : i32 to i64
2026-02-21T09:54:05.4289762Z         %147 = tt.splat %146 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4290002Z         %148 = arith.addi %147, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:05.4290273Z         %149 = tt.expand_dims %148 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4290521Z         %150 = arith.muli %149, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4290710Z         %151 = tt.broadcast %150 : tensor<2x1xi64, #blocked> -> tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4290904Z         %152 = arith.addi %151, %48 : tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4291100Z         %153 = tt.addptr %9, %152 : tensor<2x128x!tt.ptr<i8>, #blocked>, tensor<2x128xi64, #blocked>
2026-02-21T09:54:05.4291307Z         %154 = arith.cmpi sge, %149, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4291480Z         %155 = arith.cmpi slt, %149, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:54:05.4291642Z         %156 = arith.andi %154, %155 : tensor<2x1xi1, #blocked>
2026-02-21T09:54:05.4291833Z         %157 = tt.broadcast %156 : tensor<2x1xi1, #blocked> -> tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4292025Z         %158 = arith.andi %157, %52 : tensor<2x128xi1, #blocked>
2026-02-21T09:54:05.4292193Z         %159 = tt.load %153, %158, %cst_1 : tensor<2x128x!tt.ptr<i8>, #blocked>
2026-02-21T09:54:05.4292370Z         %160 = arith.shli %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4292531Z         %161 = arith.shrsi %160, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4292794Z         %162 = ttg.convert_layout %161 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4293043Z         %163 = arith.shrsi %arg9, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4293293Z         %164 = ttg.convert_layout %163 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4293635Z         %165 = tt.expand_dims %162 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4293980Z         %166 = tt.expand_dims %164 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4294297Z         %167 = tt.broadcast %165 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4294550Z         %168 = arith.select %17, %167, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4294796Z         %169 = tt.broadcast %166 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4295058Z         %170 = arith.select %19, %169, %168 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4295293Z         %171 = tt.reshape %170 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4295533Z         %172 = arith.sitofp %171 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4295788Z         %173 = ttg.local_alloc %172 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4296120Z         %174 = ttg.local_load %173 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4296598Z         %175 = tt.dot %145, %174, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4296949Z         %176 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:05.4297077Z         %177 = arith.cmpi slt, %176, %c2_i32 : i32
2026-02-21T09:54:05.4297211Z         %178 = arith.select %177, %176, %c0_i32 : i32
2026-02-21T09:54:05.4297477Z         %179 = ttg.memdesc_index %53[%178] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4297858Z         ttg.local_store %143, %179 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:54:05.4298350Z         scf.yield %175, %178, %arg8, %179, %arg10, %159 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x128xi8, #blocked>, tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4298774Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:05.4299092Z       %91 = ttg.local_load %90#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4299521Z       %92 = arith.extf %91 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4299817Z       %93 = arith.shli %90#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4299976Z       %94 = arith.shrsi %93, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4300214Z       %95 = ttg.convert_layout %94 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4300454Z       %96 = arith.shrsi %90#4, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4300691Z       %97 = ttg.convert_layout %96 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4301036Z       %98 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4301376Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4301661Z       %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4301904Z       %101 = arith.select %17, %100, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4302147Z       %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4302402Z       %103 = arith.select %19, %102, %101 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4302640Z       %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4302869Z       %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4303134Z       %106 = ttg.local_alloc %105 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4303462Z       %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4303933Z       %108 = tt.dot %92, %107, %90#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4304429Z       %109 = ttg.local_load %90#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4304856Z       %110 = arith.extf %109 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4305154Z       %111 = arith.shli %90#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4305316Z       %112 = arith.shrsi %111, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4305561Z       %113 = ttg.convert_layout %112 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4305807Z       %114 = arith.shrsi %90#5, %cst : tensor<2x128xi8, #blocked>
2026-02-21T09:54:05.4306065Z       %115 = ttg.convert_layout %114 : tensor<2x128xi8, #blocked> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:05.4306401Z       %116 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4306745Z       %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x128xi8, #blocked1>
2026-02-21T09:54:05.4307036Z       %118 = tt.broadcast %116 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4307281Z       %119 = arith.select %17, %118, %cst_0 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4307529Z       %120 = tt.broadcast %117 : tensor<2x1x128xi8, #blocked1> -> tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4307769Z       %121 = arith.select %19, %120, %119 : tensor<2x2x128xi1, #blocked1>, tensor<2x2x128xi8, #blocked1>
2026-02-21T09:54:05.4308006Z       %122 = tt.reshape %121 : tensor<2x2x128xi8, #blocked1> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:05.4308233Z       %123 = arith.sitofp %122 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:05.4308487Z       %124 = ttg.local_alloc %123 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:05.4308813Z       %125 = ttg.local_load %124 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:05.4309295Z       %126 = tt.dot %110, %125, %108, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:05.4309682Z       ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:54:05.4309901Z       %127 = arith.truncf %126 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:54:05.4310174Z       %128 = tt.expand_dims %40 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:05.4310429Z       %129 = arith.muli %128, %cst_13 : tensor<128x1xi32, #mma>
2026-02-21T09:54:05.4310667Z       %130 = tt.expand_dims %35 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:54:05.4310932Z       %131 = tt.broadcast %129 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4311140Z       %132 = tt.broadcast %130 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4311336Z       %133 = arith.addi %131, %132 : tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4311532Z       %134 = tt.addptr %20, %133 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:54:05.4311731Z       tt.store %134, %127 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:05.4311871Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:54:05.4311975Z     tt.return
2026-02-21T09:54:05.4312053Z   }
2026-02-21T09:54:05.4312130Z }
2026-02-21T09:54:05.4312174Z 
2026-02-21T09:54:05.4312206Z {-#
2026-02-21T09:54:05.4312287Z   external_resources: {
2026-02-21T09:54:05.4312385Z     mlir_reproducer: {
2026-02-21T09:54:05.4313378Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:54:05.4314362Z       disable_threading: false,
2026-02-21T09:54:05.4314503Z       verify_each: true
2026-02-21T09:54:05.4314597Z     }
2026-02-21T09:54:05.4314667Z   }
2026-02-21T09:54:05.4314738Z #-}
2026-02-21T09:54:05.4315016Z /tmp/torchinductor_root/sm/csm3u5qpmzobx27atwavnzywqgcr3ileu4vkrdd2sfjqwnv35b46.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:54:05.4315702Z /tmp/torchinductor_root/sm/csm3u5qpmzobx27atwavnzywqgcr3ileu4vkrdd2sfjqwnv35b46.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:54:05.4316250Z [576s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:54:05.4317024Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:54:05.4317738Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:54:05.4317908Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:54:07.7872981Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:54:07.7876742Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:54:07.7877433Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:54:07.7878098Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:54:07.7878688Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:54:07.7879322Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:54:07.7879798Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:54:07.7880155Z #smem = #ttg.shared_memory
2026-02-21T09:54:07.7880610Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:54:07.7881657Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:54:07.7882427Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:54:07.7882846Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:07.7883188Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:07.7883539Z     %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:07.7883915Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:54:07.7884284Z     %cst_4 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2>
2026-02-21T09:54:07.7884632Z     %cst_5 = arith.constant dense<0> : tensor<2x1xi64, #blocked2>
2026-02-21T09:54:07.7884974Z     %cst_6 = arith.constant dense<512> : tensor<2x1xi64, #blocked2>
2026-02-21T09:54:07.7885318Z     %cst_7 = arith.constant dense<0> : tensor<1x128xi64, #blocked2>
2026-02-21T09:54:07.7885670Z     %cst_8 = arith.constant dense<8192> : tensor<1x128xi64, #blocked2>
2026-02-21T09:54:07.7895224Z     %cst_9 = arith.constant dense<0> : tensor<2x128xi8, #blocked2>
2026-02-21T09:54:07.7895537Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:54:07.7895924Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:54:07.7896206Z     %cst_10 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked>
2026-02-21T09:54:07.7896511Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:54:07.7896737Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:54:07.7897093Z     %cst_11 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:07.7897479Z     %0 = tt.get_program_id x : i32
2026-02-21T09:54:07.7897692Z     %1 = arith.divsi %0, %c128_i32 : i32
2026-02-21T09:54:07.7897916Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:54:07.7898133Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:54:07.7898359Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:54:07.7898576Z     %5 = arith.remsi %0, %c128_i32 : i32
2026-02-21T09:54:07.7898798Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:54:07.7899007Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:54:07.7899208Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:54:07.7899418Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:54:07.7899816Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:07.7900386Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:07.7900924Z     %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:07.7901513Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:07.7902035Z     %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:07.7902461Z     %15 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:07.7902887Z     %16 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:07.7903315Z     %17 = arith.addi %15, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:07.7903693Z     %18 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:54:07.7904002Z     %19 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:07.7904405Z     %20 = arith.addi %19, %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:07.7904891Z     %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:07.7905522Z     %22 = tt.expand_dims %16 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:54:07.7906055Z     %23 = arith.muli %22, %cst_2 : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:07.7906443Z     %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:54:07.7906874Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:07.7907193Z     %26 = arith.extsi %18 : i32 to i64
2026-02-21T09:54:07.7907490Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:54:07.7907961Z     %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:07.7908616Z     %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:07.7909204Z     %30 = tt.splat %26 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:07.7909809Z     %31 = arith.extsi %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:07.7910411Z     %32 = arith.addi %30, %31 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:07.7911001Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2>
2026-02-21T09:54:07.7911565Z     %34 = tt.broadcast %33 : tensor<1x128xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:54:07.7911968Z     %35 = arith.cmpi sge, %33, %cst_7 : tensor<1x128xi64, #blocked2>
2026-02-21T09:54:07.7912307Z     %36 = arith.cmpi slt, %33, %cst_8 : tensor<1x128xi64, #blocked2>
2026-02-21T09:54:07.7912627Z     %37 = arith.andi %35, %36 : tensor<1x128xi1, #blocked2>
2026-02-21T09:54:07.7912990Z     %38 = tt.broadcast %37 : tensor<1x128xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:54:07.7913576Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:54:07.7914427Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:54:07.7915259Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:07.7915770Z     %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:07.7916150Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:54:07.7916541Z     %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:07.7916942Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:54:07.7917472Z     %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst_3) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:54:07.7917905Z       %56 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:54:07.7918233Z       %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:07.7918669Z       %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:07.7919218Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:54:07.7919796Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:54:07.7920176Z       %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked1>
2026-02-21T09:54:07.7920564Z       %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:54:07.7920969Z       %63 = tt.load %62 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:07.7921423Z       %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:54:07.7922102Z       %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:07.7923001Z       %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:07.7923582Z       %67 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:54:07.7923911Z       %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:07.7924341Z       %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:07.7924890Z       %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:54:07.7925384Z       %71 = arith.muli %70, %cst_4 : tensor<2x1xi64, #blocked2>
2026-02-21T09:54:07.7925757Z       %72 = tt.broadcast %71 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:54:07.7926134Z       %73 = arith.addi %72, %34 : tensor<2x128xi64, #blocked2>
2026-02-21T09:54:07.7926518Z       %74 = tt.addptr %27, %73 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:54:07.7926964Z       %75 = arith.cmpi sge, %70, %cst_5 : tensor<2x1xi64, #blocked2>
2026-02-21T09:54:07.7927299Z       %76 = arith.cmpi slt, %70, %cst_6 : tensor<2x1xi64, #blocked2>
2026-02-21T09:54:07.7927616Z       %77 = arith.andi %75, %76 : tensor<2x1xi1, #blocked2>
2026-02-21T09:54:07.7927977Z       %78 = tt.broadcast %77 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:54:07.7928346Z       %79 = arith.andi %78, %38 : tensor<2x128xi1, #blocked2>
2026-02-21T09:54:07.7928671Z       %80 = tt.load %74, %79, %cst_9 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:54:07.7929180Z       %81 = ttg.convert_layout %80 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:07.7929747Z       %82 = arith.shli %81, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:07.7930225Z       %83 = arith.shrsi %82, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:07.7930700Z       %84 = arith.shrsi %81, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:07.7931294Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:54:07.7931976Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:54:07.7932583Z       %87 = tt.broadcast %85 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:54:07.7933066Z       %88 = arith.select %43, %87, %cst_10 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:54:07.7933543Z       %89 = tt.broadcast %86 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:54:07.7934009Z       %90 = arith.select %45, %89, %88 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:54:07.7934464Z       %91 = tt.reshape %90 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:54:07.7934909Z       %92 = arith.sitofp %91 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:54:07.7935451Z       %93 = ttg.local_alloc %92 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:07.7936120Z       %94 = ttg.local_load %93 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:07.7937141Z       %95 = tt.dot %66, %94, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:07.7937872Z       scf.yield %95 : tensor<128x128xf32, #mma>
2026-02-21T09:54:07.7938249Z     } {tt.disallow_acc_multi_buffer, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32}
2026-02-21T09:54:07.7938718Z     %47 = arith.truncf %46 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:54:07.7939252Z     %48 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:07.7939736Z     %49 = arith.muli %48, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:54:07.7940198Z     %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:54:07.7940725Z     %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:07.7941137Z     %52 = tt.broadcast %50 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:07.7941491Z     %53 = arith.addi %51, %52 : tensor<128x128xi32, #mma>
2026-02-21T09:54:07.7941839Z     %54 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:07.7942271Z     %55 = tt.addptr %54, %53 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:54:07.7942684Z     tt.store %55, %47 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:07.7942944Z     tt.return
2026-02-21T09:54:07.7943096Z   }
2026-02-21T09:54:07.7943246Z }
2026-02-21T09:54:07.7943328Z 
2026-02-21T09:54:07.7943386Z {-#
2026-02-21T09:54:07.7943540Z   external_resources: {
2026-02-21T09:54:07.7943726Z     mlir_reproducer: {
2026-02-21T09:54:07.7945860Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:54:07.7948007Z       disable_threading: false,
2026-02-21T09:54:07.7948220Z       verify_each: true
2026-02-21T09:54:07.7948391Z     }
2026-02-21T09:54:07.7948529Z   }
2026-02-21T09:54:07.7948660Z #-}
2026-02-21T09:54:07.7949246Z /tmp/torchinductor_root/nf/cnf5yzghtklydrgpfrfik7a4ewtelw3rmct3ccelbte66osh63kw.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:54:07.7950775Z /tmp/torchinductor_root/nf/cnf5yzghtklydrgpfrfik7a4ewtelw3rmct3ccelbte66osh63kw.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:54:07.7952036Z [578s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:54:07.7953635Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:54:07.7955119Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:54:07.7955462Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:54:08.4809691Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:54:08.4811602Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:54:08.4812303Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:54:08.4812973Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:54:08.4813633Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:54:08.4814257Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:54:08.4814817Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:54:08.4815330Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:54:08.4815726Z #smem = #ttg.shared_memory
2026-02-21T09:54:08.4816237Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:54:08.4817415Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:54:08.4818239Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:54:08.4818627Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:08.4818993Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:08.4819298Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:54:08.4819612Z     %cst_3 = arith.constant dense<8192> : tensor<2x1xi64, #blocked1>
2026-02-21T09:54:08.4819900Z     %cst_4 = arith.constant dense<0> : tensor<2x1xi64, #blocked1>
2026-02-21T09:54:08.4820180Z     %cst_5 = arith.constant dense<512> : tensor<2x1xi64, #blocked1>
2026-02-21T09:54:08.4820470Z     %cst_6 = arith.constant dense<0> : tensor<1x128xi64, #blocked1>
2026-02-21T09:54:08.4820763Z     %cst_7 = arith.constant dense<8192> : tensor<1x128xi64, #blocked1>
2026-02-21T09:54:08.4821054Z     %cst_8 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:08.4821306Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:54:08.4821507Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:54:08.4821745Z     %cst_9 = arith.constant dense<0> : tensor<2x128xi8, #blocked1>
2026-02-21T09:54:08.4822033Z     %cst_10 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked>
2026-02-21T09:54:08.4822277Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:54:08.4822471Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:54:08.4822819Z     %cst_11 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:08.4823135Z     %0 = tt.get_program_id x : i32
2026-02-21T09:54:08.4823315Z     %1 = arith.divsi %0, %c128_i32 : i32
2026-02-21T09:54:08.4823504Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:54:08.4823692Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:54:08.4823872Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:54:08.4824061Z     %5 = arith.remsi %0, %c128_i32 : i32
2026-02-21T09:54:08.4824239Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:54:08.4824422Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:54:08.4824588Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:54:08.4824847Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:54:08.4825189Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:08.4825652Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:08.4826113Z     %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:08.4826587Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:08.4827003Z     %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:08.4827356Z     %15 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:08.4827710Z     %16 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:54:08.4828064Z     %17 = arith.addi %15, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:08.4828329Z     %18 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:54:08.4828591Z     %19 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:08.4828929Z     %20 = arith.addi %19, %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:08.4829324Z     %21 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:08.4829852Z     %22 = tt.expand_dims %16 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:54:08.4830165Z     %23 = arith.muli %22, %cst_8 : tensor<128x1xi32, #blocked2>
2026-02-21T09:54:08.4830429Z     %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:08.4830698Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:08.4830928Z     %26 = arith.extsi %18 : i32 to i64
2026-02-21T09:54:08.4831117Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:08.4831406Z     %28 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:08.4831810Z     %29 = arith.extsi %28 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:08.4832176Z     %30 = tt.splat %26 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:08.4832550Z     %31 = arith.extsi %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:08.4832921Z     %32 = arith.addi %30, %31 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:08.4833268Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi64, #blocked1>
2026-02-21T09:54:08.4833613Z     %34 = tt.broadcast %33 : tensor<1x128xi64, #blocked1> -> tensor<2x128xi64, #blocked1>
2026-02-21T09:54:08.4833865Z     %35 = arith.cmpi sge, %33, %cst_6 : tensor<1x128xi64, #blocked1>
2026-02-21T09:54:08.4834101Z     %36 = arith.cmpi slt, %33, %cst_7 : tensor<1x128xi64, #blocked1>
2026-02-21T09:54:08.4834302Z     %37 = arith.andi %35, %36 : tensor<1x128xi1, #blocked1>
2026-02-21T09:54:08.4834528Z     %38 = tt.broadcast %37 : tensor<1x128xi1, #blocked1> -> tensor<2x128xi1, #blocked1>
2026-02-21T09:54:08.4834883Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:54:08.4835404Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:54:08.4835935Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:08.4836253Z     %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:08.4836498Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:54:08.4836739Z     %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:08.4836976Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:54:08.4837327Z     %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg4 = %cst_2) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:54:08.4837594Z       %56 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:54:08.4837810Z       %57 = tt.splat %56 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:08.4838085Z       %58 = arith.addi %57, %21 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:54:08.4838422Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:54:08.4838768Z       %60 = tt.broadcast %59 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:54:08.4839059Z       %61 = arith.addi %24, %60 : tensor<128x4xi32, #blocked2>
2026-02-21T09:54:08.4839260Z       %62 = tt.addptr %25, %61 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:54:08.4839468Z       %63 = tt.load %62 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:54:08.4839690Z       %64 = ttg.local_alloc %63 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:54:08.4840041Z       %65 = ttg.local_load %64 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:08.4840451Z       %66 = arith.extf %65 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:08.4840740Z       %67 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:54:08.4840910Z       %68 = tt.splat %67 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:08.4841125Z       %69 = arith.addi %68, %29 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:08.4841399Z       %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi64, #blocked1>
2026-02-21T09:54:08.4841642Z       %71 = arith.muli %70, %cst_3 : tensor<2x1xi64, #blocked1>
2026-02-21T09:54:08.4841831Z       %72 = tt.broadcast %71 : tensor<2x1xi64, #blocked1> -> tensor<2x128xi64, #blocked1>
2026-02-21T09:54:08.4842019Z       %73 = arith.addi %72, %34 : tensor<2x128xi64, #blocked1>
2026-02-21T09:54:08.4842219Z       %74 = tt.addptr %27, %73 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi64, #blocked1>
2026-02-21T09:54:08.4842430Z       %75 = arith.cmpi sge, %70, %cst_4 : tensor<2x1xi64, #blocked1>
2026-02-21T09:54:08.4842700Z       %76 = arith.cmpi slt, %70, %cst_5 : tensor<2x1xi64, #blocked1>
2026-02-21T09:54:08.4842865Z       %77 = arith.andi %75, %76 : tensor<2x1xi1, #blocked1>
2026-02-21T09:54:08.4843069Z       %78 = tt.broadcast %77 : tensor<2x1xi1, #blocked1> -> tensor<2x128xi1, #blocked1>
2026-02-21T09:54:08.4843257Z       %79 = arith.andi %78, %38 : tensor<2x128xi1, #blocked1>
2026-02-21T09:54:08.4843426Z       %80 = tt.load %74, %79, %cst_9 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:54:08.4843681Z       %81 = ttg.convert_layout %80 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:08.4843968Z       %82 = arith.shli %81, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:08.4844205Z       %83 = arith.shrsi %82, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:08.4844469Z       %84 = arith.shrsi %81, %cst_11 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:08.4844761Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:54:08.4845100Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:54:08.4845388Z       %87 = tt.broadcast %85 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:54:08.4845649Z       %88 = arith.select %43, %87, %cst_10 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:54:08.4845891Z       %89 = tt.broadcast %86 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:54:08.4846124Z       %90 = arith.select %45, %89, %88 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:54:08.4846350Z       %91 = tt.reshape %90 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:54:08.4846575Z       %92 = arith.sitofp %91 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:54:08.4846824Z       %93 = ttg.local_alloc %92 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:54:08.4847157Z       %94 = ttg.local_load %93 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:08.4847635Z       %95 = tt.dot %66, %94, %arg4, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:54:08.4847990Z       scf.yield %95 : tensor<128x128xf32, #mma>
2026-02-21T09:54:08.4848119Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:54:08.4857705Z     %47 = arith.truncf %46 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:54:08.4857997Z     %48 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:08.4858233Z     %49 = arith.muli %48, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:54:08.4858458Z     %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:54:08.4858715Z     %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:08.4858916Z     %52 = tt.broadcast %50 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:54:08.4859093Z     %53 = arith.addi %51, %52 : tensor<128x128xi32, #mma>
2026-02-21T09:54:08.4859267Z     %54 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:08.4859482Z     %55 = tt.addptr %54, %53 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:54:08.4859679Z     tt.store %55, %47 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:08.4859809Z     tt.return
2026-02-21T09:54:08.4859890Z   }
2026-02-21T09:54:08.4859963Z }
2026-02-21T09:54:08.4860008Z 
2026-02-21T09:54:08.4860040Z {-#
2026-02-21T09:54:08.4860117Z   external_resources: {
2026-02-21T09:54:08.4860216Z     mlir_reproducer: {
2026-02-21T09:54:08.4861224Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:54:08.4862263Z       disable_threading: false,
2026-02-21T09:54:08.4862367Z       verify_each: true
2026-02-21T09:54:08.4862478Z     }
2026-02-21T09:54:08.4862548Z   }
2026-02-21T09:54:08.4862618Z #-}
2026-02-21T09:54:08.4862899Z /tmp/torchinductor_root/qy/cqyq56n6uafmgokjpwcvf26cpfbfufu46mxhrzhs4gjynwx5cdlg.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:54:08.4863602Z /tmp/torchinductor_root/qy/cqyq56n6uafmgokjpwcvf26cpfbfufu46mxhrzhs4gjynwx5cdlg.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:54:08.4864154Z [579s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:54:08.4864877Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:54:08.4865533Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:54:08.4865702Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:54:08.8706784Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 11.2 configs/s
2026-02-21T09:54:14.4341207Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 211/211 27.4 configs/s
2026-02-21T09:54:17.5181455Z [588s] Generation 8 complete: 
2026-02-21T09:54:17.5181826Z error=10
2026-02-21T09:54:17.5182010Z ok=74
2026-02-21T09:54:17.5182181Z min=0.9360
2026-02-21T09:54:17.5182354Z mid=1.4499
2026-02-21T09:54:17.5182524Z max=34.4558
2026-02-21T09:54:17.5182722Z best={'block_sizes': [8, 128, 128],
2026-02-21T09:54:17.5183565Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:54:17.5183864Z  'l2_groupings': [2],
2026-02-21T09:54:17.5184116Z  'load_eviction_policies': ['', ''],
2026-02-21T09:54:17.5184379Z  'loop_orders': [[0, 1]],
2026-02-21T09:54:17.5184618Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:54:17.5184839Z  'num_stages': 1,
2026-02-21T09:54:17.5185034Z  'num_warps': 4,
2026-02-21T09:54:17.5185236Z  'pid_type': 'flat',
2026-02-21T09:54:17.5185465Z  'range_flattens': [None, None],
2026-02-21T09:54:17.5185728Z  'range_multi_buffers': [None, False],
2026-02-21T09:54:17.5185987Z  'range_num_stages': [0, 1],
2026-02-21T09:54:17.5186215Z  'range_unroll_factors': [0, 0],
2026-02-21T09:54:17.5186468Z  'range_warp_specializes': [],
2026-02-21T09:54:17.5186707Z  'waves_per_eu': 2}
2026-02-21T09:54:17.5258585Z [588s] Fitting surrogate: 844 points, 844 targets
2026-02-21T09:54:18.2054323Z [589s] Generation 9 starting: 63 neighbors, 3 active search path(s)
2026-02-21T09:54:42.0969797Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 0.9 configs/s
2026-02-21T09:54:46.8506249Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:54:46.8510344Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [8, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:54:46.8513616Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:54:46.8515211Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:54:46.8516334Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:54:46.8517390Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:54:46.8517890Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:54:46.8518172Z #smem = #ttg.shared_memory
2026-02-21T09:54:46.8518757Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:54:46.8519518Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:54:46.8520158Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma>
2026-02-21T09:54:46.8520492Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:54:46.8520670Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:54:46.8520854Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T09:54:46.8521040Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:54:46.8521220Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:54:46.8521403Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:54:46.8521629Z     %cst_0 = arith.constant dense<0> : tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8521861Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:54:46.8522036Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:54:46.8522204Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:54:46.8522424Z     %cst_1 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:46.8522819Z     %cst_2 = arith.constant dense<8192> : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8523076Z     %cst_3 = arith.constant dense<4> : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8523291Z     %cst_4 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:46.8523464Z     %cst_5 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:46.8523633Z     %cst_6 = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:54:46.8523876Z     %0 = tt.get_program_id x : i32
2026-02-21T09:54:46.8523988Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:54:46.8524107Z     %2 = arith.minsi %1, %c16384_i32 : i32
2026-02-21T09:54:46.8524308Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:46.8524582Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:46.8524890Z     %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8525194Z     %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:46.8525494Z     %7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8525799Z     %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:46.8526038Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:46.8526282Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8526581Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:54:46.8527015Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:54:46.8527411Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:46.8527658Z     %14 = arith.cmpi eq, %13, %cst_4 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:46.8527853Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:54:46.8528049Z     %16 = arith.cmpi eq, %13, %cst_5 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:46.8528254Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x64xi1, #blocked>
2026-02-21T09:54:46.8528459Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:46.8528615Z     %19 = arith.subi %2, %0 : i32
2026-02-21T09:54:46.8528729Z     %20 = arith.remsi %19, %c3_i32 : i32
2026-02-21T09:54:46.8528845Z     %21 = arith.subi %19, %20 : i32
2026-02-21T09:54:46.8528956Z     %22 = arith.addi %0, %21 : i32
2026-02-21T09:54:46.8529123Z     scf.for %arg3 = %0 to %22 step %c3_i32  : i32 {
2026-02-21T09:54:46.8529263Z       %23 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:54:46.8529384Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:54:46.8529503Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:54:46.8529621Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:54:46.8529736Z       %27 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:54:46.8529854Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:54:46.8529965Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:54:46.8530074Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:54:46.8530183Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:54:46.8530351Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:46.8530568Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:46.8530780Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:46.8530990Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:46.8531150Z       %36 = arith.muli %30, %c64_i32 : i32
2026-02-21T09:54:46.8531376Z       %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8531622Z       %38 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:46.8531865Z       %39 = arith.addi %37, %5 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8532108Z       %40 = arith.addi %38, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:46.8532375Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:54:46.8532625Z       %42 = arith.muli %41, %cst_1 : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:46.8532821Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8533168Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8533590Z       %45 = tt.broadcast %44 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8533923Z       %46 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<128x64xf32, #mma>)  : i32 {
2026-02-21T09:54:46.8534221Z         %121 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8534538Z         %122 = arith.addi %121, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8534753Z         %123 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:54:46.8534924Z         %124 = tt.splat %123 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:46.8535143Z         %125 = arith.addi %124, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:46.8535419Z         %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:54:46.8535716Z         %127 = tt.broadcast %126 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8535909Z         %128 = arith.addi %43, %127 : tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8536114Z         %129 = tt.addptr %9, %128 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8536322Z         %130 = tt.load %129 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:46.8536563Z         %131 = ttg.local_alloc %130 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:54:46.8536898Z         %132 = ttg.local_load %131 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8537307Z         %133 = arith.extf %132 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8537767Z         %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8538127Z         %135 = arith.muli %134, %cst_2 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8538434Z         %136 = tt.broadcast %135 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8538738Z         %137 = arith.addi %136, %45 : tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8539047Z         %138 = tt.addptr %10, %137 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8539379Z         %139 = tt.load %138 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8539615Z         %140 = arith.shli %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8539851Z         %141 = arith.shrsi %140, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8540091Z         %142 = arith.shrsi %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8540381Z         %143 = tt.expand_dims %141 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:54:46.8540722Z         %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:54:46.8541009Z         %145 = tt.broadcast %143 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8541249Z         %146 = arith.select %15, %145, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8541491Z         %147 = tt.broadcast %144 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8541731Z         %148 = arith.select %17, %147, %146 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8541963Z         %149 = tt.reshape %148 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:54:46.8542189Z         %150 = arith.sitofp %149 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:54:46.8542458Z         %151 = ttg.local_alloc %150 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:54:46.8542787Z         %152 = ttg.local_load %151 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8543272Z         %153 = tt.dot %133, %152, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:46.8543632Z         scf.yield %153 : tensor<128x64xf32, #mma>
2026-02-21T09:54:46.8543825Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:46.8544032Z       %47 = arith.truncf %46 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:54:46.8544302Z       %48 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:46.8544550Z       %49 = arith.muli %48, %cst_6 : tensor<128x1xi32, #mma>
2026-02-21T09:54:46.8544793Z       %50 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:54:46.8545049Z       %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8545249Z       %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8545429Z       %53 = arith.addi %51, %52 : tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8545618Z       %54 = tt.addptr %18, %53 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8545811Z       tt.store %54, %47 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:46.8545955Z       %55 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:54:46.8546078Z       %56 = arith.divsi %55, %c512_i32 : i32
2026-02-21T09:54:46.8546203Z       %57 = arith.muli %56, %c4_i32 : i32
2026-02-21T09:54:46.8546321Z       %58 = arith.subi %c128_i32, %57 : i32
2026-02-21T09:54:46.8546443Z       %59 = arith.minsi %58, %c4_i32 : i32
2026-02-21T09:54:46.8546563Z       %60 = arith.remsi %55, %c512_i32 : i32
2026-02-21T09:54:46.8546686Z       %61 = arith.remsi %60, %59 : i32
2026-02-21T09:54:46.8546804Z       %62 = arith.addi %57, %61 : i32
2026-02-21T09:54:46.8546917Z       %63 = arith.divsi %60, %59 : i32
2026-02-21T09:54:46.8547036Z       %64 = arith.muli %62, %c128_i32 : i32
2026-02-21T09:54:46.8547235Z       %65 = tt.splat %64 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:46.8547454Z       %66 = tt.splat %64 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:46.8547669Z       %67 = arith.addi %65, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:46.8547910Z       %68 = arith.addi %66, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:46.8548075Z       %69 = arith.muli %63, %c64_i32 : i32
2026-02-21T09:54:46.8548282Z       %70 = tt.splat %69 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8548533Z       %71 = tt.splat %69 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:46.8548783Z       %72 = arith.addi %70, %5 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8549032Z       %73 = arith.addi %71, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:46.8549300Z       %74 = tt.expand_dims %67 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:54:46.8549556Z       %75 = arith.muli %74, %cst_1 : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:46.8549748Z       %76 = tt.broadcast %75 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8550119Z       %77 = tt.expand_dims %72 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8550545Z       %78 = tt.broadcast %77 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8550880Z       %79 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<128x64xf32, #mma>)  : i32 {
2026-02-21T09:54:46.8551185Z         %121 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8551504Z         %122 = arith.addi %121, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8551725Z         %123 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:54:46.8551903Z         %124 = tt.splat %123 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:46.8552128Z         %125 = arith.addi %124, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:46.8552422Z         %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:54:46.8552701Z         %127 = tt.broadcast %126 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8552904Z         %128 = arith.addi %76, %127 : tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8553116Z         %129 = tt.addptr %9, %128 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8553328Z         %130 = tt.load %129 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:46.8553561Z         %131 = ttg.local_alloc %130 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:54:46.8553895Z         %132 = ttg.local_load %131 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8554312Z         %133 = arith.extf %132 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8554777Z         %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8555151Z         %135 = arith.muli %134, %cst_2 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8555465Z         %136 = tt.broadcast %135 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8555774Z         %137 = arith.addi %136, %78 : tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8556083Z         %138 = tt.addptr %10, %137 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8556401Z         %139 = tt.load %138 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8556635Z         %140 = arith.shli %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8556876Z         %141 = arith.shrsi %140, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8557120Z         %142 = arith.shrsi %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8557408Z         %143 = tt.expand_dims %141 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:54:46.8557748Z         %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:54:46.8558034Z         %145 = tt.broadcast %143 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8558382Z         %146 = arith.select %15, %145, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8558625Z         %147 = tt.broadcast %144 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8558857Z         %148 = arith.select %17, %147, %146 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8559095Z         %149 = tt.reshape %148 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:54:46.8559326Z         %150 = arith.sitofp %149 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:54:46.8559597Z         %151 = ttg.local_alloc %150 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:54:46.8559927Z         %152 = ttg.local_load %151 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8560401Z         %153 = tt.dot %133, %152, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:46.8560780Z         scf.yield %153 : tensor<128x64xf32, #mma>
2026-02-21T09:54:46.8560953Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:46.8561160Z       %80 = arith.truncf %79 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:54:46.8561431Z       %81 = tt.expand_dims %68 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:46.8561670Z       %82 = arith.muli %81, %cst_6 : tensor<128x1xi32, #mma>
2026-02-21T09:54:46.8561901Z       %83 = tt.expand_dims %73 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:54:46.8562157Z       %84 = tt.broadcast %82 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8562356Z       %85 = tt.broadcast %83 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8562535Z       %86 = arith.addi %84, %85 : tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8562761Z       %87 = tt.addptr %18, %86 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8562956Z       tt.store %87, %80 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:46.8563097Z       %88 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:54:46.8563241Z       %89 = arith.divsi %88, %c512_i32 : i32
2026-02-21T09:54:46.8563364Z       %90 = arith.muli %89, %c4_i32 : i32
2026-02-21T09:54:46.8563481Z       %91 = arith.subi %c128_i32, %90 : i32
2026-02-21T09:54:46.8563603Z       %92 = arith.minsi %91, %c4_i32 : i32
2026-02-21T09:54:46.8563722Z       %93 = arith.remsi %88, %c512_i32 : i32
2026-02-21T09:54:46.8563844Z       %94 = arith.remsi %93, %92 : i32
2026-02-21T09:54:46.8563959Z       %95 = arith.addi %90, %94 : i32
2026-02-21T09:54:46.8564077Z       %96 = arith.divsi %93, %92 : i32
2026-02-21T09:54:46.8564196Z       %97 = arith.muli %95, %c128_i32 : i32
2026-02-21T09:54:46.8564364Z       %98 = tt.splat %97 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:46.8564581Z       %99 = tt.splat %97 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:46.8564797Z       %100 = arith.addi %98, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:46.8565015Z       %101 = arith.addi %99, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:46.8565182Z       %102 = arith.muli %96, %c64_i32 : i32
2026-02-21T09:54:46.8565397Z       %103 = tt.splat %102 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8565657Z       %104 = tt.splat %102 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:46.8565934Z       %105 = arith.addi %103, %5 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8566190Z       %106 = arith.addi %104, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:46.8566466Z       %107 = tt.expand_dims %100 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:54:46.8566728Z       %108 = arith.muli %107, %cst_1 : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:46.8566937Z       %109 = tt.broadcast %108 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8567293Z       %110 = tt.expand_dims %105 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8567744Z       %111 = tt.broadcast %110 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8568088Z       %112 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<128x64xf32, #mma>)  : i32 {
2026-02-21T09:54:46.8568407Z         %121 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8568711Z         %122 = arith.addi %121, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8568928Z         %123 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:54:46.8569108Z         %124 = tt.splat %123 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:46.8569338Z         %125 = arith.addi %124, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:46.8569616Z         %126 = tt.expand_dims %125 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:54:46.8569898Z         %127 = tt.broadcast %126 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8570100Z         %128 = arith.addi %109, %127 : tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8570310Z         %129 = tt.addptr %9, %128 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8570525Z         %130 = tt.load %129 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:46.8570753Z         %131 = ttg.local_alloc %130 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:54:46.8574086Z         %132 = ttg.local_load %131 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8574508Z         %133 = arith.extf %132 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8574976Z         %134 = tt.expand_dims %122 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8575340Z         %135 = arith.muli %134, %cst_2 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8575649Z         %136 = tt.broadcast %135 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8575959Z         %137 = arith.addi %136, %111 : tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8576277Z         %138 = tt.addptr %10, %137 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8576589Z         %139 = tt.load %138 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8576825Z         %140 = arith.shli %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8577084Z         %141 = arith.shrsi %140, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8577327Z         %142 = arith.shrsi %139, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8577619Z         %143 = tt.expand_dims %141 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:54:46.8577957Z         %144 = tt.expand_dims %142 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:54:46.8578243Z         %145 = tt.broadcast %143 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8578499Z         %146 = arith.select %15, %145, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8578740Z         %147 = tt.broadcast %144 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8578978Z         %148 = arith.select %17, %147, %146 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8579208Z         %149 = tt.reshape %148 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:54:46.8579458Z         %150 = arith.sitofp %149 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:54:46.8579709Z         %151 = ttg.local_alloc %150 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:54:46.8580038Z         %152 = ttg.local_load %151 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8580517Z         %153 = tt.dot %133, %152, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:46.8580873Z         scf.yield %153 : tensor<128x64xf32, #mma>
2026-02-21T09:54:46.8581040Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:46.8581250Z       %113 = arith.truncf %112 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:54:46.8581517Z       %114 = tt.expand_dims %101 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:46.8581759Z       %115 = arith.muli %114, %cst_6 : tensor<128x1xi32, #mma>
2026-02-21T09:54:46.8582003Z       %116 = tt.expand_dims %106 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:54:46.8582260Z       %117 = tt.broadcast %115 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8582466Z       %118 = tt.broadcast %116 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8582648Z       %119 = arith.addi %117, %118 : tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8582840Z       %120 = tt.addptr %18, %119 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8583036Z       tt.store %120, %113 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:46.8583176Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:54:46.8583295Z     scf.for %arg3 = %22 to %2 step %c1_i32  : i32 {
2026-02-21T09:54:46.8583431Z       %23 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:54:46.8583554Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:54:46.8583671Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:54:46.8583788Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:54:46.8583908Z       %27 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:54:46.8584027Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:54:46.8584138Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:54:46.8584250Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:54:46.8584359Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:54:46.8584526Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:46.8584756Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:46.8584967Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:46.8585179Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:46.8585340Z       %36 = arith.muli %30, %c64_i32 : i32
2026-02-21T09:54:46.8585544Z       %37 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8585792Z       %38 = tt.splat %36 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:46.8586051Z       %39 = arith.addi %37, %5 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8586297Z       %40 = arith.addi %38, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:46.8586564Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:54:46.8586817Z       %42 = arith.muli %41, %cst_1 : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:46.8587023Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8587370Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8587788Z       %45 = tt.broadcast %44 : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8588124Z       %46 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c2_i32 iter_args(%arg5 = %cst) -> (tensor<128x64xf32, #mma>)  : i32 {
2026-02-21T09:54:46.8588426Z         %55 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8588725Z         %56 = arith.addi %55, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:46.8588934Z         %57 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:54:46.8589103Z         %58 = tt.splat %57 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:46.8589318Z         %59 = arith.addi %58, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:46.8589604Z         %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:54:46.8589878Z         %61 = tt.broadcast %60 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8590071Z         %62 = arith.addi %43, %61 : tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8590267Z         %63 = tt.addptr %9, %62 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:54:46.8590468Z         %64 = tt.load %63 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:46.8590689Z         %65 = ttg.local_alloc %64 : (tensor<128x4xbf16, #blocked1>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:54:46.8591018Z         %66 = ttg.local_load %65 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8591422Z         %67 = arith.extf %66 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8591876Z         %68 = tt.expand_dims %56 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8592227Z         %69 = arith.muli %68, %cst_2 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8592523Z         %70 = tt.broadcast %69 : tensor<2x1xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8592832Z         %71 = arith.addi %70, %45 : tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8593131Z         %72 = tt.addptr %10, %71 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<2x64xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8593433Z         %73 = tt.load %72 : tensor<2x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8593657Z         %74 = arith.shli %73, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8593901Z         %75 = arith.shrsi %74, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8594129Z         %76 = arith.shrsi %73, %cst_3 : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:46.8594406Z         %77 = tt.expand_dims %75 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:54:46.8594735Z         %78 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x64xi8, #blocked>
2026-02-21T09:54:46.8595024Z         %79 = tt.broadcast %77 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8595254Z         %80 = arith.select %15, %79, %cst_0 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8595484Z         %81 = tt.broadcast %78 : tensor<2x1x64xi8, #blocked> -> tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8595705Z         %82 = arith.select %17, %81, %80 : tensor<2x2x64xi1, #blocked>, tensor<2x2x64xi8, #blocked>
2026-02-21T09:54:46.8595929Z         %83 = tt.reshape %82 : tensor<2x2x64xi8, #blocked> -> tensor<4x64xi8, #blocked2>
2026-02-21T09:54:46.8596145Z         %84 = arith.sitofp %83 : tensor<4x64xi8, #blocked2> to tensor<4x64xf32, #blocked2>
2026-02-21T09:54:46.8596390Z         %85 = ttg.local_alloc %84 : (tensor<4x64xf32, #blocked2>) -> !ttg.memdesc<4x64xf32, #shared1, #smem>
2026-02-21T09:54:46.8596713Z         %86 = ttg.local_load %85 : !ttg.memdesc<4x64xf32, #shared1, #smem> -> tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:46.8597181Z         %87 = tt.dot %67, %86, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:46.8597543Z         scf.yield %87 : tensor<128x64xf32, #mma>
2026-02-21T09:54:46.8597708Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:46.8597913Z       %47 = arith.truncf %46 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:54:46.8598177Z       %48 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:46.8598412Z       %49 = arith.muli %48, %cst_6 : tensor<128x1xi32, #mma>
2026-02-21T09:54:46.8598637Z       %50 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:54:46.8598892Z       %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8599087Z       %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8599264Z       %53 = arith.addi %51, %52 : tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8599447Z       %54 = tt.addptr %18, %53 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:54:46.8599636Z       tt.store %54, %47 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:46.8599774Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:54:46.8599878Z     tt.return
2026-02-21T09:54:46.8599960Z   }
2026-02-21T09:54:46.8600034Z }
2026-02-21T09:54:46.8600080Z 
2026-02-21T09:54:46.8600110Z {-#
2026-02-21T09:54:46.8600204Z   external_resources: {
2026-02-21T09:54:46.8600305Z     mlir_reproducer: {
2026-02-21T09:54:46.8601305Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:54:46.8602313Z       disable_threading: false,
2026-02-21T09:54:46.8602418Z       verify_each: true
2026-02-21T09:54:46.8602508Z     }
2026-02-21T09:54:46.8602614Z   }
2026-02-21T09:54:46.8602685Z #-}
2026-02-21T09:54:46.8602956Z /tmp/torchinductor_root/js/cjs4wbyfs5ng5fgjz2j2ouuubwzm3iruptmd2fosj7bk57htgu72.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:54:46.8603650Z /tmp/torchinductor_root/js/cjs4wbyfs5ng5fgjz2j2ouuubwzm3iruptmd2fosj7bk57htgu72.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:54:46.8604194Z [617s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:54:46.8604959Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 64], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:54:46.8605662Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:54:46.8605830Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:54:47.1671617Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:54:47.1689795Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:54:47.1690237Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:54:47.1690648Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:54:47.1691022Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:54:47.1691368Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:54:47.1691676Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:54:47.1691921Z #smem = #ttg.shared_memory
2026-02-21T09:54:47.1692226Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:54:47.1692849Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:54:47.1693359Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1693567Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:54:47.1693721Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:54:47.1693867Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:54:47.1694048Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:54:47.1694243Z     %cst_0 = arith.constant dense<0> : tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1694436Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:54:47.1694594Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T09:54:47.1694747Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:54:47.1694896Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:54:47.1695049Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:54:47.1695297Z     %cst_1 = arith.constant dense<0> : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1695655Z     %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1696029Z     %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1696423Z     %cst_4 = arith.constant dense<512> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1696736Z     %cst_5 = arith.constant dense<0> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1697046Z     %cst_6 = arith.constant dense<8192> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1697445Z     %cst_7 = arith.constant dense<508> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1697853Z     %cst_8 = arith.constant dense<504> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1698218Z     %cst_9 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1698449Z     %c504_i32 = arith.constant 504 : i32
2026-02-21T09:54:47.1698599Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:54:47.1698790Z     %cst_10 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:47.1699060Z     %cst_11 = arith.constant dense<4> : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1699339Z     %cst_12 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:47.1699553Z     %cst_13 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:47.1699773Z     %cst_14 = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:54:47.1699962Z     %0 = tt.get_program_id x : i32
2026-02-21T09:54:47.1700105Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:54:47.1700248Z     %2 = arith.minsi %1, %c16384_i32 : i32
2026-02-21T09:54:47.1700531Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:47.1700882Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:47.1701268Z     %5 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1701654Z     %6 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:47.1701981Z     %7 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1702288Z     %8 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1702589Z     %9 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1702982Z     %10 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1703519Z     %11 = arith.extsi %10 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1704155Z     %12 = arith.extsi %5 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1704695Z     %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:54:47.1705208Z     %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:54:47.1705711Z     %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:47.1706014Z     %16 = arith.cmpi eq, %15, %cst_12 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:47.1706353Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked>
2026-02-21T09:54:47.1706552Z     %18 = arith.cmpi eq, %15, %cst_13 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:47.1706744Z     %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked>
2026-02-21T09:54:47.1706955Z     %20 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:47.1707118Z     %21 = arith.subi %2, %0 : i32
2026-02-21T09:54:47.1707250Z     %22 = arith.remsi %21, %c4_i32 : i32
2026-02-21T09:54:47.1707368Z     %23 = arith.subi %21, %22 : i32
2026-02-21T09:54:47.1707480Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:54:47.1707607Z     scf.for %arg3 = %0 to %24 step %c4_i32  : i32 {
2026-02-21T09:54:47.1707747Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:54:47.1707872Z       %26 = arith.muli %25, %c4_i32 : i32
2026-02-21T09:54:47.1707992Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:54:47.1708115Z       %28 = arith.minsi %27, %c4_i32 : i32
2026-02-21T09:54:47.1708236Z       %29 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:54:47.1708357Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:54:47.1708471Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:54:47.1708583Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:54:47.1708699Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:54:47.1708872Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:47.1709093Z       %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:47.1709311Z       %36 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:47.1709546Z       %37 = arith.addi %35, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:47.1709713Z       %38 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:54:47.1709872Z       %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:47.1710077Z       %40 = arith.addi %39, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:47.1710352Z       %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:54:47.1710610Z       %42 = arith.muli %41, %cst_10 : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:47.1710810Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1710985Z       %44 = arith.extsi %38 : i32 to i64
2026-02-21T09:54:47.1711194Z       %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1711496Z       %46 = arith.addi %45, %12 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1711889Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1712327Z       %48 = tt.broadcast %47 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1712654Z       %49 = arith.cmpi sge, %47, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1712899Z       %50 = arith.cmpi slt, %47, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1713143Z       %51 = arith.andi %49, %50 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1713443Z       %52 = tt.broadcast %51 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1713740Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:54:47.1714030Z       %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:54:47.1714306Z       %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1714502Z       %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1714705Z       %57 = tt.addptr %8, %56 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1714922Z       %58 = tt.load %57 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1715205Z       %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1723701Z       ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1724012Z       %60 = arith.addi %7, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1724295Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:54:47.1724567Z       %62 = tt.broadcast %61 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1724762Z       %63 = arith.addi %43, %62 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1724960Z       %64 = tt.addptr %8, %63 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1725160Z       %65 = tt.load %64 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1725446Z       %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1725860Z       ttg.local_store %65, %66 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1726381Z       %67:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>)  : i32 {
2026-02-21T09:54:47.1726801Z         %393 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:54:47.1726925Z         %394 = arith.muli %393, %c2_i32 : i32
2026-02-21T09:54:47.1727100Z         %395 = tt.splat %394 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1727333Z         %396 = arith.addi %395, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1727611Z         %397 = tt.expand_dims %396 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:54:47.1727894Z         %398 = tt.broadcast %397 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1728091Z         %399 = arith.addi %43, %398 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1728297Z         %400 = tt.addptr %8, %399 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1728506Z         %401 = tt.load %400 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1728827Z         %402 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1729272Z         %403 = arith.extf %402 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1729556Z         %404 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:54:47.1729769Z         %405 = tt.splat %404 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1730066Z         %406 = arith.addi %405, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1730476Z         %407 = tt.expand_dims %406 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1730834Z         %408 = arith.muli %407, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1731159Z         %409 = tt.broadcast %408 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1731461Z         %410 = arith.addi %409, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1731771Z         %411 = tt.addptr %9, %410 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1732086Z         %412 = arith.cmpi sge, %407, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1732333Z         %413 = arith.cmpi slt, %407, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1732569Z         %414 = arith.andi %412, %413 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1732868Z         %415 = tt.broadcast %414 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1733166Z         %416 = arith.andi %415, %52 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1733411Z         %417 = tt.load %411, %416, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1733657Z         %418 = arith.shli %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1733909Z         %419 = arith.shrsi %418, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1734147Z         %420 = arith.shrsi %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1734434Z         %421 = tt.expand_dims %419 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1734770Z         %422 = tt.expand_dims %420 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1735055Z         %423 = tt.broadcast %421 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1735348Z         %424 = arith.select %17, %423, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1735581Z         %425 = tt.broadcast %422 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1735814Z         %426 = arith.select %19, %425, %424 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1736043Z         %427 = tt.reshape %426 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1736264Z         %428 = arith.sitofp %427 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1736517Z         %429 = ttg.local_alloc %428 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1736853Z         %430 = ttg.local_load %429 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1737333Z         %431 = tt.dot %403, %430, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1737687Z         %432 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:47.1737816Z         %433 = arith.cmpi slt, %432, %c2_i32 : i32
2026-02-21T09:54:47.1737952Z         %434 = arith.select %433, %432, %c0_i32 : i32
2026-02-21T09:54:47.1738241Z         %435 = ttg.memdesc_index %53[%434] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1738603Z         ttg.local_store %401, %435 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1739004Z         scf.yield %431, %434, %arg8, %435 : tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1739355Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:47.1739672Z       %68 = ttg.local_load %67#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1740104Z       %69 = arith.extf %68 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1740471Z       %70 = arith.addi %11, %cst_8 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1740860Z       %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1741207Z       %72 = arith.muli %71, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1741505Z       %73 = tt.broadcast %72 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1741798Z       %74 = arith.addi %73, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1742116Z       %75 = tt.addptr %9, %74 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1742424Z       %76 = arith.cmpi sge, %71, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1742659Z       %77 = arith.cmpi slt, %71, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1742881Z       %78 = arith.andi %76, %77 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1743171Z       %79 = tt.broadcast %78 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1743459Z       %80 = arith.andi %79, %52 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1743696Z       %81 = tt.load %75, %80, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1743936Z       %82 = arith.shli %81, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1744161Z       %83 = arith.shrsi %82, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1744394Z       %84 = arith.shrsi %81, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1744671Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1745010Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1745283Z       %87 = tt.broadcast %85 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1745511Z       %88 = arith.select %17, %87, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1745739Z       %89 = tt.broadcast %86 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1745963Z       %90 = arith.select %19, %89, %88 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1746212Z       %91 = tt.reshape %90 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1746426Z       %92 = arith.sitofp %91 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1746665Z       %93 = ttg.local_alloc %92 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1746981Z       %94 = ttg.local_load %93 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1747457Z       %95 = tt.dot %69, %94, %67#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1747943Z       %96 = ttg.local_load %67#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1748369Z       %97 = arith.extf %96 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1748737Z       %98 = arith.addi %11, %cst_7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1749123Z       %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1749476Z       %100 = arith.muli %99, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1749780Z       %101 = tt.broadcast %100 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1750100Z       %102 = arith.addi %101, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1750404Z       %103 = tt.addptr %9, %102 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1750716Z       %104 = arith.cmpi sge, %99, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1750957Z       %105 = arith.cmpi slt, %99, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1751187Z       %106 = arith.andi %104, %105 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1751485Z       %107 = tt.broadcast %106 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1751782Z       %108 = arith.andi %107, %52 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1752024Z       %109 = tt.load %103, %108, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1752270Z       %110 = arith.shli %109, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1752503Z       %111 = arith.shrsi %110, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1752737Z       %112 = arith.shrsi %109, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1753036Z       %113 = tt.expand_dims %111 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1753368Z       %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1753649Z       %115 = tt.broadcast %113 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1753888Z       %116 = arith.select %17, %115, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1754121Z       %117 = tt.broadcast %114 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1754375Z       %118 = arith.select %19, %117, %116 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1754597Z       %119 = tt.reshape %118 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1754817Z       %120 = arith.sitofp %119 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1755069Z       %121 = ttg.local_alloc %120 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1755406Z       %122 = ttg.local_load %121 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1755872Z       %123 = tt.dot %97, %122, %95, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1756252Z       ttg.local_dealloc %53 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:54:47.1756468Z       %124 = arith.truncf %123 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:54:47.1756736Z       %125 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:47.1756976Z       %126 = arith.muli %125, %cst_14 : tensor<128x1xi32, #mma>
2026-02-21T09:54:47.1757209Z       %127 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:54:47.1757465Z       %128 = tt.broadcast %126 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1757671Z       %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1757852Z       %130 = arith.addi %128, %129 : tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1758057Z       %131 = tt.addptr %20, %130 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1758257Z       tt.store %131, %124 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:47.1758398Z       %132 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:54:47.1758521Z       %133 = arith.divsi %132, %c512_i32 : i32
2026-02-21T09:54:47.1758640Z       %134 = arith.muli %133, %c4_i32 : i32
2026-02-21T09:54:47.1758761Z       %135 = arith.subi %c128_i32, %134 : i32
2026-02-21T09:54:47.1758881Z       %136 = arith.minsi %135, %c4_i32 : i32
2026-02-21T09:54:47.1758999Z       %137 = arith.remsi %132, %c512_i32 : i32
2026-02-21T09:54:47.1759120Z       %138 = arith.remsi %137, %136 : i32
2026-02-21T09:54:47.1759235Z       %139 = arith.addi %134, %138 : i32
2026-02-21T09:54:47.1759352Z       %140 = arith.divsi %137, %136 : i32
2026-02-21T09:54:47.1759467Z       %141 = arith.muli %139, %c128_i32 : i32
2026-02-21T09:54:47.1759643Z       %142 = tt.splat %141 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:47.1759865Z       %143 = tt.splat %141 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:47.1760086Z       %144 = arith.addi %142, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:47.1760304Z       %145 = arith.addi %143, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:47.1760486Z       %146 = arith.muli %140, %c64_i32 : i32
2026-02-21T09:54:47.1760645Z       %147 = tt.splat %146 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:47.1760850Z       %148 = arith.addi %147, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:47.1761125Z       %149 = tt.expand_dims %144 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:54:47.1761382Z       %150 = arith.muli %149, %cst_10 : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:47.1761582Z       %151 = tt.broadcast %150 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1761760Z       %152 = arith.extsi %146 : i32 to i64
2026-02-21T09:54:47.1761984Z       %153 = tt.splat %152 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1762281Z       %154 = arith.addi %153, %12 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1762728Z       %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1763172Z       %156 = tt.broadcast %155 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1763484Z       %157 = arith.cmpi sge, %155, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1763728Z       %158 = arith.cmpi slt, %155, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1763967Z       %159 = arith.andi %157, %158 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1764269Z       %160 = tt.broadcast %159 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1764565Z       %161 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:54:47.1764756Z       %162 = arith.addi %151, %55 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1764959Z       %163 = tt.addptr %8, %162 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1765165Z       %164 = tt.load %163 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1765477Z       %165 = ttg.memdesc_index %161[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1765837Z       ttg.local_store %164, %165 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1766078Z       %166 = arith.addi %151, %62 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1766279Z       %167 = tt.addptr %8, %166 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1766481Z       %168 = tt.load %167 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1766763Z       %169 = ttg.memdesc_index %161[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1767119Z       ttg.local_store %168, %169 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1767641Z       %170:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %165, %arg8 = %169) -> (tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>)  : i32 {
2026-02-21T09:54:47.1768060Z         %393 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:54:47.1768184Z         %394 = arith.muli %393, %c2_i32 : i32
2026-02-21T09:54:47.1768357Z         %395 = tt.splat %394 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1768600Z         %396 = arith.addi %395, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1768877Z         %397 = tt.expand_dims %396 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:54:47.1769157Z         %398 = tt.broadcast %397 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1769353Z         %399 = arith.addi %151, %398 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1769556Z         %400 = tt.addptr %8, %399 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1769768Z         %401 = tt.load %400 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1770092Z         %402 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1770528Z         %403 = arith.extf %402 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1770813Z         %404 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:54:47.1771039Z         %405 = tt.splat %404 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1771335Z         %406 = arith.addi %405, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1771724Z         %407 = tt.expand_dims %406 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1772077Z         %408 = arith.muli %407, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1772384Z         %409 = tt.broadcast %408 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1772689Z         %410 = arith.addi %409, %156 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1772997Z         %411 = tt.addptr %9, %410 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1773311Z         %412 = arith.cmpi sge, %407, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1773573Z         %413 = arith.cmpi slt, %407, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1773809Z         %414 = arith.andi %412, %413 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1774108Z         %415 = tt.broadcast %414 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1774402Z         %416 = arith.andi %415, %160 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1774648Z         %417 = tt.load %411, %416, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1774894Z         %418 = arith.shli %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1775127Z         %419 = arith.shrsi %418, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1775361Z         %420 = arith.shrsi %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1775646Z         %421 = tt.expand_dims %419 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1775984Z         %422 = tt.expand_dims %420 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1776265Z         %423 = tt.broadcast %421 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1776519Z         %424 = arith.select %17, %423, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1776756Z         %425 = tt.broadcast %422 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1776985Z         %426 = arith.select %19, %425, %424 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1777214Z         %427 = tt.reshape %426 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1777437Z         %428 = arith.sitofp %427 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1777687Z         %429 = ttg.local_alloc %428 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1778027Z         %430 = ttg.local_load %429 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1778497Z         %431 = tt.dot %403, %430, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1778847Z         %432 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:47.1779089Z         %433 = arith.cmpi slt, %432, %c2_i32 : i32
2026-02-21T09:54:47.1779221Z         %434 = arith.select %433, %432, %c0_i32 : i32
2026-02-21T09:54:47.1779490Z         %435 = ttg.memdesc_index %161[%434] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1779849Z         ttg.local_store %401, %435 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1780251Z         scf.yield %431, %434, %arg8, %435 : tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1780592Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:47.1780907Z       %171 = ttg.local_load %170#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1781343Z       %172 = arith.extf %171 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1781697Z       %173 = arith.addi %73, %156 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1782001Z       %174 = tt.addptr %9, %173 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1782302Z       %175 = arith.andi %79, %160 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1782540Z       %176 = tt.load %174, %175, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1782786Z       %177 = arith.shli %176, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1783022Z       %178 = arith.shrsi %177, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1783255Z       %179 = arith.shrsi %176, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1783543Z       %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1783873Z       %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1784152Z       %182 = tt.broadcast %180 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1784391Z       %183 = arith.select %17, %182, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1784645Z       %184 = tt.broadcast %181 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1784875Z       %185 = arith.select %19, %184, %183 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1785101Z       %186 = tt.reshape %185 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1785319Z       %187 = arith.sitofp %186 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1785567Z       %188 = ttg.local_alloc %187 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1785883Z       %189 = ttg.local_load %188 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1786369Z       %190 = tt.dot %172, %189, %170#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1786865Z       %191 = ttg.local_load %170#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1787304Z       %192 = arith.extf %191 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1787637Z       %193 = arith.addi %101, %156 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1787945Z       %194 = tt.addptr %9, %193 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1788247Z       %195 = arith.andi %107, %160 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1788487Z       %196 = tt.load %194, %195, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1788732Z       %197 = arith.shli %196, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1788966Z       %198 = arith.shrsi %197, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1789200Z       %199 = arith.shrsi %196, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1789482Z       %200 = tt.expand_dims %198 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1789827Z       %201 = tt.expand_dims %199 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1790102Z       %202 = tt.broadcast %200 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1790337Z       %203 = arith.select %17, %202, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1790571Z       %204 = tt.broadcast %201 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1790799Z       %205 = arith.select %19, %204, %203 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1791025Z       %206 = tt.reshape %205 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1791242Z       %207 = arith.sitofp %206 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1791490Z       %208 = ttg.local_alloc %207 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1791810Z       %209 = ttg.local_load %208 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1792276Z       %210 = tt.dot %192, %209, %190, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1792680Z       ttg.local_dealloc %161 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:54:47.1792893Z       %211 = arith.truncf %210 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:54:47.1793164Z       %212 = tt.expand_dims %145 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:47.1793406Z       %213 = arith.muli %212, %cst_14 : tensor<128x1xi32, #mma>
2026-02-21T09:54:47.1793634Z       %214 = tt.expand_dims %148 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:54:47.1793890Z       %215 = tt.broadcast %213 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1794143Z       %216 = tt.broadcast %214 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1794320Z       %217 = arith.addi %215, %216 : tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1794509Z       %218 = tt.addptr %20, %217 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1794704Z       tt.store %218, %211 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:47.1794845Z       %219 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:54:47.1794977Z       %220 = arith.divsi %219, %c512_i32 : i32
2026-02-21T09:54:47.1795097Z       %221 = arith.muli %220, %c4_i32 : i32
2026-02-21T09:54:47.1795216Z       %222 = arith.subi %c128_i32, %221 : i32
2026-02-21T09:54:47.1795333Z       %223 = arith.minsi %222, %c4_i32 : i32
2026-02-21T09:54:47.1795450Z       %224 = arith.remsi %219, %c512_i32 : i32
2026-02-21T09:54:47.1795565Z       %225 = arith.remsi %224, %223 : i32
2026-02-21T09:54:47.1795680Z       %226 = arith.addi %221, %225 : i32
2026-02-21T09:54:47.1795794Z       %227 = arith.divsi %224, %223 : i32
2026-02-21T09:54:47.1795912Z       %228 = arith.muli %226, %c128_i32 : i32
2026-02-21T09:54:47.1796082Z       %229 = tt.splat %228 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:47.1796303Z       %230 = tt.splat %228 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:47.1796518Z       %231 = arith.addi %229, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:47.1796730Z       %232 = arith.addi %230, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:47.1796895Z       %233 = arith.muli %227, %c64_i32 : i32
2026-02-21T09:54:47.1797051Z       %234 = tt.splat %233 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:47.1797268Z       %235 = arith.addi %234, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:47.1797541Z       %236 = tt.expand_dims %231 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:54:47.1797794Z       %237 = arith.muli %236, %cst_10 : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:47.1797991Z       %238 = tt.broadcast %237 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1798166Z       %239 = arith.extsi %233 : i32 to i64
2026-02-21T09:54:47.1798370Z       %240 = tt.splat %239 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1798669Z       %241 = arith.addi %240, %12 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1799056Z       %242 = tt.expand_dims %241 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1799481Z       %243 = tt.broadcast %242 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1799789Z       %244 = arith.cmpi sge, %242, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1800029Z       %245 = arith.cmpi slt, %242, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1800276Z       %246 = arith.andi %244, %245 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1800573Z       %247 = tt.broadcast %246 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1800867Z       %248 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:54:47.1801053Z       %249 = arith.addi %238, %55 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1801255Z       %250 = tt.addptr %8, %249 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1801477Z       %251 = tt.load %250 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1801758Z       %252 = ttg.memdesc_index %248[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1802116Z       ttg.local_store %251, %252 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1802355Z       %253 = arith.addi %238, %62 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1802602Z       %254 = tt.addptr %8, %253 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1802807Z       %255 = tt.load %254 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1803083Z       %256 = ttg.memdesc_index %248[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1803437Z       ttg.local_store %255, %256 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1803954Z       %257:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %252, %arg8 = %256) -> (tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>)  : i32 {
2026-02-21T09:54:47.1804371Z         %393 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:54:47.1804493Z         %394 = arith.muli %393, %c2_i32 : i32
2026-02-21T09:54:47.1804662Z         %395 = tt.splat %394 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1804882Z         %396 = arith.addi %395, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1805176Z         %397 = tt.expand_dims %396 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:54:47.1805450Z         %398 = tt.broadcast %397 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1805646Z         %399 = arith.addi %238, %398 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1805845Z         %400 = tt.addptr %8, %399 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1806051Z         %401 = tt.load %400 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1806352Z         %402 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1806787Z         %403 = arith.extf %402 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1807066Z         %404 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:54:47.1807273Z         %405 = tt.splat %404 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1807568Z         %406 = arith.addi %405, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1807952Z         %407 = tt.expand_dims %406 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1808317Z         %408 = arith.muli %407, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1808621Z         %409 = tt.broadcast %408 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1808923Z         %410 = arith.addi %409, %243 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1809230Z         %411 = tt.addptr %9, %410 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1809561Z         %412 = arith.cmpi sge, %407, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1809800Z         %413 = arith.cmpi slt, %407, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1810033Z         %414 = arith.andi %412, %413 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1810331Z         %415 = tt.broadcast %414 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1810640Z         %416 = arith.andi %415, %247 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1810883Z         %417 = tt.load %411, %416, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1811126Z         %418 = arith.shli %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1811360Z         %419 = arith.shrsi %418, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1811598Z         %420 = arith.shrsi %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1811881Z         %421 = tt.expand_dims %419 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1812214Z         %422 = tt.expand_dims %420 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1812493Z         %423 = tt.broadcast %421 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1812727Z         %424 = arith.select %17, %423, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1812979Z         %425 = tt.broadcast %422 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1813207Z         %426 = arith.select %19, %425, %424 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1813433Z         %427 = tt.reshape %426 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1813651Z         %428 = arith.sitofp %427 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1813902Z         %429 = ttg.local_alloc %428 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1814226Z         %430 = ttg.local_load %429 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1814695Z         %431 = tt.dot %403, %430, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1815042Z         %432 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:47.1815168Z         %433 = arith.cmpi slt, %432, %c2_i32 : i32
2026-02-21T09:54:47.1815299Z         %434 = arith.select %433, %432, %c0_i32 : i32
2026-02-21T09:54:47.1815566Z         %435 = ttg.memdesc_index %248[%434] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1815922Z         ttg.local_store %401, %435 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1816334Z         scf.yield %431, %434, %arg8, %435 : tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1816669Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:47.1816982Z       %258 = ttg.local_load %257#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1817410Z       %259 = arith.extf %258 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1817753Z       %260 = arith.addi %73, %243 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1818054Z       %261 = tt.addptr %9, %260 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1818357Z       %262 = arith.andi %79, %247 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1818608Z       %263 = tt.load %261, %262, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1818850Z       %264 = arith.shli %263, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1819082Z       %265 = arith.shrsi %264, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1819313Z       %266 = arith.shrsi %263, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1819598Z       %267 = tt.expand_dims %265 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1819924Z       %268 = tt.expand_dims %266 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1820201Z       %269 = tt.broadcast %267 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1820438Z       %270 = arith.select %17, %269, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1820667Z       %271 = tt.broadcast %268 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1820911Z       %272 = arith.select %19, %271, %270 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1821134Z       %273 = tt.reshape %272 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1821352Z       %274 = arith.sitofp %273 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1821599Z       %275 = ttg.local_alloc %274 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1821913Z       %276 = ttg.local_load %275 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1822386Z       %277 = tt.dot %259, %276, %257#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1822883Z       %278 = ttg.local_load %257#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1823311Z       %279 = arith.extf %278 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1823645Z       %280 = arith.addi %101, %243 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1823950Z       %281 = tt.addptr %9, %280 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1824266Z       %282 = arith.andi %107, %247 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1824508Z       %283 = tt.load %281, %282, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1824755Z       %284 = arith.shli %283, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1824989Z       %285 = arith.shrsi %284, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1825222Z       %286 = arith.shrsi %283, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1825523Z       %287 = tt.expand_dims %285 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1825853Z       %288 = tt.expand_dims %286 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1826128Z       %289 = tt.broadcast %287 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1826375Z       %290 = arith.select %17, %289, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1826608Z       %291 = tt.broadcast %288 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1826835Z       %292 = arith.select %19, %291, %290 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1827061Z       %293 = tt.reshape %292 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1827278Z       %294 = arith.sitofp %293 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1827529Z       %295 = ttg.local_alloc %294 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1827849Z       %296 = ttg.local_load %295 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1828313Z       %297 = tt.dot %279, %296, %277, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1828699Z       ttg.local_dealloc %248 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:54:47.1828912Z       %298 = arith.truncf %297 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:54:47.1829194Z       %299 = tt.expand_dims %232 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:47.1829437Z       %300 = arith.muli %299, %cst_14 : tensor<128x1xi32, #mma>
2026-02-21T09:54:47.1829668Z       %301 = tt.expand_dims %235 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:54:47.1829925Z       %302 = tt.broadcast %300 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1830128Z       %303 = tt.broadcast %301 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1830310Z       %304 = arith.addi %302, %303 : tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1830501Z       %305 = tt.addptr %20, %304 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1830696Z       tt.store %305, %298 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:47.1830838Z       %306 = arith.addi %arg3, %c3_i32 : i32
2026-02-21T09:54:47.1830961Z       %307 = arith.divsi %306, %c512_i32 : i32
2026-02-21T09:54:47.1831081Z       %308 = arith.muli %307, %c4_i32 : i32
2026-02-21T09:54:47.1831199Z       %309 = arith.subi %c128_i32, %308 : i32
2026-02-21T09:54:47.1831318Z       %310 = arith.minsi %309, %c4_i32 : i32
2026-02-21T09:54:47.1831436Z       %311 = arith.remsi %306, %c512_i32 : i32
2026-02-21T09:54:47.1831550Z       %312 = arith.remsi %311, %310 : i32
2026-02-21T09:54:47.1831679Z       %313 = arith.addi %308, %312 : i32
2026-02-21T09:54:47.1831793Z       %314 = arith.divsi %311, %310 : i32
2026-02-21T09:54:47.1831910Z       %315 = arith.muli %313, %c128_i32 : i32
2026-02-21T09:54:47.1832080Z       %316 = tt.splat %315 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:47.1832299Z       %317 = tt.splat %315 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:47.1832517Z       %318 = arith.addi %316, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:47.1832733Z       %319 = arith.addi %317, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:47.1832920Z       %320 = arith.muli %314, %c64_i32 : i32
2026-02-21T09:54:47.1833080Z       %321 = tt.splat %320 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:47.1833291Z       %322 = arith.addi %321, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:47.1833567Z       %323 = tt.expand_dims %318 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:54:47.1833830Z       %324 = arith.muli %323, %cst_10 : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:47.1834048Z       %325 = tt.broadcast %324 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1834226Z       %326 = arith.extsi %320 : i32 to i64
2026-02-21T09:54:47.1834438Z       %327 = tt.splat %326 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1834736Z       %328 = arith.addi %327, %12 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1835133Z       %329 = tt.expand_dims %328 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1835562Z       %330 = tt.broadcast %329 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1835876Z       %331 = arith.cmpi sge, %329, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1836125Z       %332 = arith.cmpi slt, %329, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1836364Z       %333 = arith.andi %331, %332 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1836682Z       %334 = tt.broadcast %333 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1836983Z       %335 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:54:47.1837171Z       %336 = arith.addi %325, %55 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1837377Z       %337 = tt.addptr %8, %336 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1837591Z       %338 = tt.load %337 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1837879Z       %339 = ttg.memdesc_index %335[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1838242Z       ttg.local_store %338, %339 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1838486Z       %340 = arith.addi %325, %62 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1838688Z       %341 = tt.addptr %8, %340 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1838897Z       %342 = tt.load %341 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1839180Z       %343 = ttg.memdesc_index %335[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1839557Z       ttg.local_store %342, %343 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1840085Z       %344:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %339, %arg8 = %343) -> (tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>)  : i32 {
2026-02-21T09:54:47.1840509Z         %393 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:54:47.1840641Z         %394 = arith.muli %393, %c2_i32 : i32
2026-02-21T09:54:47.1840813Z         %395 = tt.splat %394 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1841059Z         %396 = arith.addi %395, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1841337Z         %397 = tt.expand_dims %396 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:54:47.1841622Z         %398 = tt.broadcast %397 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1841827Z         %399 = arith.addi %325, %398 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1842053Z         %400 = tt.addptr %8, %399 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1842263Z         %401 = tt.load %400 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1842613Z         %402 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1843056Z         %403 = arith.extf %402 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1843347Z         %404 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:54:47.1843559Z         %405 = tt.splat %404 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1843866Z         %406 = arith.addi %405, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1844265Z         %407 = tt.expand_dims %406 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1844624Z         %408 = arith.muli %407, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1844957Z         %409 = tt.broadcast %408 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1845263Z         %410 = arith.addi %409, %330 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1845575Z         %411 = tt.addptr %9, %410 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1845894Z         %412 = arith.cmpi sge, %407, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1846140Z         %413 = arith.cmpi slt, %407, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1846380Z         %414 = arith.andi %412, %413 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1846682Z         %415 = tt.broadcast %414 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1846980Z         %416 = arith.andi %415, %334 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1847233Z         %417 = tt.load %411, %416, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1847483Z         %418 = arith.shli %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1847741Z         %419 = arith.shrsi %418, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1847982Z         %420 = arith.shrsi %417, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1848273Z         %421 = tt.expand_dims %419 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1848611Z         %422 = tt.expand_dims %420 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1848894Z         %423 = tt.broadcast %421 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1849157Z         %424 = arith.select %17, %423, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1849396Z         %425 = tt.broadcast %422 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1849630Z         %426 = arith.select %19, %425, %424 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1849864Z         %427 = tt.reshape %426 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1850104Z         %428 = arith.sitofp %427 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1850363Z         %429 = ttg.local_alloc %428 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1850693Z         %430 = ttg.local_load %429 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1851164Z         %431 = tt.dot %403, %430, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1851524Z         %432 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:47.1851654Z         %433 = arith.cmpi slt, %432, %c2_i32 : i32
2026-02-21T09:54:47.1851794Z         %434 = arith.select %433, %432, %c0_i32 : i32
2026-02-21T09:54:47.1852071Z         %435 = ttg.memdesc_index %335[%434] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1852433Z         ttg.local_store %401, %435 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1852852Z         scf.yield %431, %434, %arg8, %435 : tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1853195Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:47.1853514Z       %345 = ttg.local_load %344#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1853955Z       %346 = arith.extf %345 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1854288Z       %347 = arith.addi %73, %330 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1854599Z       %348 = tt.addptr %9, %347 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1854906Z       %349 = arith.andi %79, %334 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1855147Z       %350 = tt.load %348, %349, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1855397Z       %351 = arith.shli %350, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1855635Z       %352 = arith.shrsi %351, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1855889Z       %353 = arith.shrsi %350, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1856179Z       %354 = tt.expand_dims %352 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1856512Z       %355 = tt.expand_dims %353 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1856797Z       %356 = tt.broadcast %354 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1857037Z       %357 = arith.select %17, %356, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1857287Z       %358 = tt.broadcast %355 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1857523Z       %359 = arith.select %19, %358, %357 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1857751Z       %360 = tt.reshape %359 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1857974Z       %361 = arith.sitofp %360 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1858241Z       %362 = ttg.local_alloc %361 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1858562Z       %363 = ttg.local_load %362 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1859033Z       %364 = tt.dot %346, %363, %344#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1859535Z       %365 = ttg.local_load %344#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1859969Z       %366 = arith.extf %365 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1860303Z       %367 = arith.addi %101, %330 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1860612Z       %368 = tt.addptr %9, %367 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1860919Z       %369 = arith.andi %107, %334 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1861185Z       %370 = tt.load %368, %369, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1861431Z       %371 = arith.shli %370, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1861670Z       %372 = arith.shrsi %371, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1861906Z       %373 = arith.shrsi %370, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1862197Z       %374 = tt.expand_dims %372 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1862533Z       %375 = tt.expand_dims %373 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1862812Z       %376 = tt.broadcast %374 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1863054Z       %377 = arith.select %17, %376, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1863288Z       %378 = tt.broadcast %375 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1863524Z       %379 = arith.select %19, %378, %377 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1863754Z       %380 = tt.reshape %379 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1863988Z       %381 = arith.sitofp %380 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1864243Z       %382 = ttg.local_alloc %381 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1864564Z       %383 = ttg.local_load %382 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1865034Z       %384 = tt.dot %366, %383, %364, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1865443Z       ttg.local_dealloc %335 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:54:47.1865657Z       %385 = arith.truncf %384 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:54:47.1865931Z       %386 = tt.expand_dims %319 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:47.1866178Z       %387 = arith.muli %386, %cst_14 : tensor<128x1xi32, #mma>
2026-02-21T09:54:47.1866429Z       %388 = tt.expand_dims %322 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:54:47.1866692Z       %389 = tt.broadcast %387 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1866898Z       %390 = tt.broadcast %388 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1867084Z       %391 = arith.addi %389, %390 : tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1867275Z       %392 = tt.addptr %20, %391 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1867481Z       tt.store %392, %385 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:47.1867626Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:54:47.1867750Z     scf.for %arg3 = %24 to %2 step %c1_i32  : i32 {
2026-02-21T09:54:47.1867893Z       %25 = arith.divsi %arg3, %c512_i32 : i32
2026-02-21T09:54:47.1868019Z       %26 = arith.muli %25, %c4_i32 : i32
2026-02-21T09:54:47.1868143Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:54:47.1868263Z       %28 = arith.minsi %27, %c4_i32 : i32
2026-02-21T09:54:47.1868388Z       %29 = arith.remsi %arg3, %c512_i32 : i32
2026-02-21T09:54:47.1868514Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:54:47.1868629Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:54:47.1868746Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:54:47.1868879Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:54:47.1869051Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:47.1869266Z       %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:47.1869483Z       %36 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:47.1869700Z       %37 = arith.addi %35, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:47.1869867Z       %38 = arith.muli %32, %c64_i32 : i32
2026-02-21T09:54:47.1870030Z       %39 = tt.splat %38 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:47.1870233Z       %40 = arith.addi %39, %6 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:47.1870510Z       %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:54:47.1870768Z       %42 = arith.muli %41, %cst_10 : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:47.1870962Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1871142Z       %44 = arith.extsi %38 : i32 to i64
2026-02-21T09:54:47.1871346Z       %45 = tt.splat %44 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1871659Z       %46 = arith.addi %45, %12 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1872043Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1872467Z       %48 = tt.broadcast %47 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1872780Z       %49 = arith.cmpi sge, %47, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1873019Z       %50 = arith.cmpi slt, %47, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1873274Z       %51 = arith.andi %49, %50 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1873571Z       %52 = tt.broadcast %51 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1873858Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:54:47.1874141Z       %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:54:47.1874409Z       %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1874599Z       %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1874796Z       %57 = tt.addptr %8, %56 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1874995Z       %58 = tt.load %57 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1875277Z       %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1875632Z       ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1875904Z       %60 = arith.addi %7, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1876178Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:54:47.1876445Z       %62 = tt.broadcast %61 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1876650Z       %63 = arith.addi %43, %62 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1876841Z       %64 = tt.addptr %8, %63 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1877041Z       %65 = tt.load %64 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1877318Z       %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1877670Z       ttg.local_store %65, %66 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1878185Z       %67:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>)  : i32 {
2026-02-21T09:54:47.1878600Z         %132 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:54:47.1878645Z         %133 = arith.muli %132, %c2_i32 : i32
2026-02-21T09:54:47.1878739Z         %134 = tt.splat %133 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1878833Z         %135 = arith.addi %134, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:47.1878980Z         %136 = tt.expand_dims %135 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:54:47.1879085Z         %137 = tt.broadcast %136 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1879148Z         %138 = arith.addi %43, %137 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1879255Z         %139 = tt.addptr %8, %138 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:47.1879319Z         %140 = tt.load %139 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:47.1879522Z         %141 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1879724Z         %142 = arith.extf %141 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1879787Z         %143 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:54:47.1879918Z         %144 = tt.splat %143 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1880049Z         %145 = arith.addi %144, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1880280Z         %146 = tt.expand_dims %145 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1880379Z         %147 = arith.muli %146, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1880551Z         %148 = tt.broadcast %147 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1880646Z         %149 = arith.addi %148, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1880821Z         %150 = tt.addptr %9, %149 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1880926Z         %151 = arith.cmpi sge, %146, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1881027Z         %152 = arith.cmpi slt, %146, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1881122Z         %153 = arith.andi %151, %152 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1881300Z         %154 = tt.broadcast %153 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1881392Z         %155 = arith.andi %154, %52 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1881502Z         %156 = tt.load %150, %155, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1881601Z         %157 = arith.shli %156, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1881698Z         %158 = arith.shrsi %157, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1881795Z         %159 = arith.shrsi %156, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1881945Z         %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1882092Z         %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1882185Z         %162 = tt.broadcast %160 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1882292Z         %163 = arith.select %17, %162, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1882381Z         %164 = tt.broadcast %161 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1882490Z         %165 = arith.select %19, %164, %163 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1882622Z         %166 = tt.reshape %165 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1882714Z         %167 = arith.sitofp %166 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1882831Z         %168 = ttg.local_alloc %167 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1883002Z         %169 = ttg.local_load %168 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1883268Z         %170 = tt.dot %142, %169, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1883334Z         %171 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:54:47.1883385Z         %172 = arith.cmpi slt, %171, %c2_i32 : i32
2026-02-21T09:54:47.1883436Z         %173 = arith.select %172, %171, %c0_i32 : i32
2026-02-21T09:54:47.1883631Z         %174 = ttg.memdesc_index %53[%173] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1883776Z         ttg.local_store %140, %174 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1883993Z         scf.yield %170, %173, %arg8, %174 : tensor<128x64xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:54:47.1884074Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:47.1884271Z       %68 = ttg.local_load %67#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1884467Z       %69 = arith.extf %68 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1884600Z       %70 = arith.addi %11, %cst_8 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1884818Z       %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1884927Z       %72 = arith.muli %71, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1885096Z       %73 = tt.broadcast %72 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1885183Z       %74 = arith.addi %73, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1885352Z       %75 = tt.addptr %9, %74 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1885453Z       %76 = arith.cmpi sge, %71, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1885548Z       %77 = arith.cmpi slt, %71, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1885633Z       %78 = arith.andi %76, %77 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1885798Z       %79 = tt.broadcast %78 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1885886Z       %80 = arith.andi %79, %52 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1885989Z       %81 = tt.load %75, %80, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1886085Z       %82 = arith.shli %81, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1886194Z       %83 = arith.shrsi %82, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1886289Z       %84 = arith.shrsi %81, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1886431Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1886576Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1886666Z       %87 = tt.broadcast %85 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1886778Z       %88 = arith.select %17, %87, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1886868Z       %89 = tt.broadcast %86 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1886962Z       %90 = arith.select %19, %89, %88 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1887045Z       %91 = tt.reshape %90 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1887147Z       %92 = arith.sitofp %91 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1887259Z       %93 = ttg.local_alloc %92 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1887422Z       %94 = ttg.local_load %93 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1887683Z       %95 = tt.dot %69, %94, %67#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1887876Z       %96 = ttg.local_load %67#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1888072Z       %97 = arith.extf %96 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1888202Z       %98 = arith.addi %11, %cst_7 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:47.1888426Z       %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1888525Z       %100 = arith.muli %99, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1888691Z       %101 = tt.broadcast %100 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1888785Z       %102 = arith.addi %101, %48 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1888964Z       %103 = tt.addptr %9, %102 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1889062Z       %104 = arith.cmpi sge, %99, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1889159Z       %105 = arith.cmpi slt, %99, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1889253Z       %106 = arith.andi %104, %105 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1889415Z       %107 = tt.broadcast %106 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1889506Z       %108 = arith.andi %107, %52 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1889635Z       %109 = tt.load %103, %108, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1889732Z       %110 = arith.shli %109, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1889830Z       %111 = arith.shrsi %110, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1889928Z       %112 = arith.shrsi %109, %cst_11 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:47.1890076Z       %113 = tt.expand_dims %111 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1890220Z       %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:47.1890438Z       %115 = tt.broadcast %113 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1890539Z       %116 = arith.select %17, %115, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1890631Z       %117 = tt.broadcast %114 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1890741Z       %118 = arith.select %19, %117, %116 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:47.1890827Z       %119 = tt.reshape %118 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:47.1890915Z       %120 = arith.sitofp %119 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:47.1891030Z       %121 = ttg.local_alloc %120 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:47.1891197Z       %122 = ttg.local_load %121 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:47.1891452Z       %123 = tt.dot %97, %122, %95, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:47.1891541Z       ttg.local_dealloc %53 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:54:47.1891628Z       %124 = arith.truncf %123 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:54:47.1891768Z       %125 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:47.1891830Z       %126 = arith.muli %125, %cst_14 : tensor<128x1xi32, #mma>
2026-02-21T09:54:47.1891977Z       %127 = tt.expand_dims %40 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:54:47.1892060Z       %128 = tt.broadcast %126 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1892141Z       %129 = tt.broadcast %127 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1892202Z       %130 = arith.addi %128, %129 : tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1892298Z       %131 = tt.addptr %20, %130 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:54:47.1892360Z       tt.store %131, %124 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:47.1892405Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:54:47.1892440Z     tt.return
2026-02-21T09:54:47.1892471Z   }
2026-02-21T09:54:47.1892509Z }
2026-02-21T09:54:47.1892514Z 
2026-02-21T09:54:47.1892544Z {-#
2026-02-21T09:54:47.1892585Z   external_resources: {
2026-02-21T09:54:47.1892623Z     mlir_reproducer: {
2026-02-21T09:54:47.1893558Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:54:47.1893616Z       disable_threading: false,
2026-02-21T09:54:47.1893654Z       verify_each: true
2026-02-21T09:54:47.1893686Z     }
2026-02-21T09:54:47.1893716Z   }
2026-02-21T09:54:47.1893746Z #-}
2026-02-21T09:54:47.1893987Z /tmp/torchinductor_root/lv/clvm7l6fs2dqjoun2ulv7bbmeucmix4j7eppuxy3r2di4tghe3tn.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:54:47.1894398Z /tmp/torchinductor_root/lv/clvm7l6fs2dqjoun2ulv7bbmeucmix4j7eppuxy3r2di4tghe3tn.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:54:47.1894529Z [618s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:54:47.1895162Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[2, 3], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:54:47.1895219Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:54:47.1895304Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:54:48.2720478Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:54:48.2722878Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [8, 2, 1], order = [2, 1, 0]}>
2026-02-21T09:54:48.2723536Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:54:48.2724150Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [16, 1], order = [1, 0]}>
2026-02-21T09:54:48.2724716Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [4, 4], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:54:48.2725487Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:54:48.2725924Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:54:48.2726267Z #smem = #ttg.shared_memory
2026-02-21T09:54:48.2726710Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:54:48.2727594Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:54:48.2728315Z     %cst = arith.constant dense<0.000000e+00> : tensor<128x64xf32, #mma>
2026-02-21T09:54:48.2728617Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:54:48.2728827Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:54:48.2729042Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:54:48.2729352Z     %cst_0 = arith.constant dense<0> : tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:48.2729619Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:54:48.2729831Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:54:48.2730050Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T09:54:48.2730384Z     %cst_1 = arith.constant dense<0> : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2730864Z     %cst_2 = arith.constant dense<8192> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2731438Z     %cst_3 = arith.constant dense<0> : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2731908Z     %cst_4 = arith.constant dense<512> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2732369Z     %cst_5 = arith.constant dense<0> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2732830Z     %cst_6 = arith.constant dense<8192> : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2733249Z     %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:48.2733639Z     %cst_8 = arith.constant dense<4> : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2734122Z     %cst_9 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:48.2734442Z     %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:48.2734768Z     %cst_11 = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:54:48.2734978Z     %0 = tt.get_program_id x : i32
2026-02-21T09:54:48.2735131Z     %1 = arith.divsi %0, %c512_i32 : i32
2026-02-21T09:54:48.2735289Z     %2 = arith.muli %1, %c4_i32 : i32
2026-02-21T09:54:48.2735523Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:54:48.2735681Z     %4 = arith.minsi %3, %c4_i32 : i32
2026-02-21T09:54:48.2735835Z     %5 = arith.remsi %0, %c512_i32 : i32
2026-02-21T09:54:48.2735988Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:54:48.2736135Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:54:48.2736279Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:54:48.2736424Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:54:48.2736709Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:48.2737095Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:48.2737440Z     %12 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:48.2737736Z     %13 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:48.2738030Z     %14 = arith.addi %12, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:54:48.2738317Z     %15 = arith.addi %13, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:54:48.2738550Z     %16 = arith.muli %8, %c64_i32 : i32
2026-02-21T09:54:48.2738897Z     %17 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:48.2739325Z     %18 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:48.2739640Z     %19 = tt.splat %16 : i32 -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:48.2739914Z     %20 = arith.addi %19, %18 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:54:48.2740246Z     %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:48.2740667Z     %22 = tt.expand_dims %14 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:54:48.2741022Z     %23 = arith.muli %22, %cst_7 : tensor<128x1xi32, #blocked1>
2026-02-21T09:54:48.2741292Z     %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:48.2741591Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:48.2741821Z     %26 = arith.extsi %16 : i32 to i64
2026-02-21T09:54:48.2742082Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2742501Z     %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:48.2743116Z     %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:48.2743664Z     %30 = tt.splat %26 : i64 -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:48.2744209Z     %31 = arith.extsi %17 : tensor<64xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> to tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:48.2744736Z     %32 = arith.addi %30, %31 : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:48.2745176Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2745636Z     %34 = tt.broadcast %33 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2745992Z     %35 = arith.cmpi sge, %33, %cst_3 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2746253Z     %36 = arith.cmpi slt, %33, %cst_2 : tensor<1x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2746501Z     %37 = arith.andi %35, %36 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2746817Z     %38 = tt.broadcast %37 : tensor<1x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2747208Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:54:48.2747674Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:54:48.2748105Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:48.2748382Z     %42 = arith.cmpi eq, %41, %cst_9 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:48.2748590Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked>
2026-02-21T09:54:48.2748804Z     %44 = arith.cmpi eq, %41, %cst_10 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:54:48.2749029Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x64xi1, #blocked>
2026-02-21T09:54:48.2749315Z     %46 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst) -> (tensor<128x64xf32, #mma>)  : i32 {
2026-02-21T09:54:48.2749546Z       %56 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:54:48.2749729Z       %57 = tt.splat %56 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:48.2749966Z       %58 = arith.addi %57, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:54:48.2750257Z       %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:54:48.2750549Z       %60 = tt.broadcast %59 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:54:48.2750758Z       %61 = arith.addi %24, %60 : tensor<128x8xi32, #blocked1>
2026-02-21T09:54:48.2750973Z       %62 = tt.addptr %25, %61 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:54:48.2751200Z       %63 = tt.load %62 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:54:48.2751437Z       %64 = ttg.local_alloc %63 : (tensor<128x8xbf16, #blocked1>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:54:48.2751793Z       %65 = ttg.local_load %64 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:48.2752250Z       %66 = arith.extf %65 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:48.2752555Z       %67 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:54:48.2760303Z       %68 = tt.splat %67 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:48.2760621Z       %69 = arith.addi %68, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>>
2026-02-21T09:54:48.2761001Z       %70 = tt.expand_dims %69 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #ttg.slice<{dim = 1, parent = #blocked}>}>> -> tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2761391Z       %71 = arith.muli %70, %cst_6 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2761691Z       %72 = tt.broadcast %71 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2761986Z       %73 = arith.addi %72, %34 : tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2762309Z       %74 = tt.addptr %27, %73 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>, tensor<4x64xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2762698Z       %75 = arith.cmpi sge, %70, %cst_5 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2762940Z       %76 = arith.cmpi slt, %70, %cst_4 : tensor<4x1xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2763170Z       %77 = arith.andi %75, %76 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2763457Z       %78 = tt.broadcast %77 : tensor<4x1xi1, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2763744Z       %79 = arith.andi %78, %38 : tensor<4x64xi1, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2763978Z       %80 = tt.load %74, %79, %cst_1 : tensor<4x64x!tt.ptr<i8>, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2764213Z       %81 = arith.shli %80, %cst_8 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2764441Z       %82 = arith.shrsi %81, %cst_8 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2764695Z       %83 = arith.shrsi %80, %cst_8 : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:54:48.2764975Z       %84 = tt.expand_dims %82 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:48.2765306Z       %85 = tt.expand_dims %83 {axis = 1 : i32} : tensor<4x64xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x64xi8, #blocked>
2026-02-21T09:54:48.2765578Z       %86 = tt.broadcast %84 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:48.2765807Z       %87 = arith.select %43, %86, %cst_0 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:48.2766034Z       %88 = tt.broadcast %85 : tensor<4x1x64xi8, #blocked> -> tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:48.2766253Z       %89 = arith.select %45, %88, %87 : tensor<4x2x64xi1, #blocked>, tensor<4x2x64xi8, #blocked>
2026-02-21T09:54:48.2766472Z       %90 = tt.reshape %89 : tensor<4x2x64xi8, #blocked> -> tensor<8x64xi8, #blocked2>
2026-02-21T09:54:48.2766684Z       %91 = arith.sitofp %90 : tensor<8x64xi8, #blocked2> to tensor<8x64xf32, #blocked2>
2026-02-21T09:54:48.2766925Z       %92 = ttg.local_alloc %91 : (tensor<8x64xf32, #blocked2>) -> !ttg.memdesc<8x64xf32, #shared1, #smem>
2026-02-21T09:54:48.2767241Z       %93 = ttg.local_load %92 : !ttg.memdesc<8x64xf32, #shared1, #smem> -> tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:54:48.2767708Z       %94 = tt.dot %66, %93, %arg4, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x64xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x64xf32, #mma>
2026-02-21T09:54:48.2768073Z       scf.yield %94 : tensor<128x64xf32, #mma>
2026-02-21T09:54:48.2768238Z     } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:54:48.2768441Z     %47 = arith.truncf %46 : tensor<128x64xf32, #mma> to tensor<128x64xbf16, #mma>
2026-02-21T09:54:48.2768705Z     %48 = tt.expand_dims %15 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:54:48.2768937Z     %49 = arith.muli %48, %cst_11 : tensor<128x1xi32, #mma>
2026-02-21T09:54:48.2769185Z     %50 = tt.expand_dims %20 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x64xi32, #mma>
2026-02-21T09:54:48.2769435Z     %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:48.2769635Z     %52 = tt.broadcast %50 : tensor<1x64xi32, #mma> -> tensor<128x64xi32, #mma>
2026-02-21T09:54:48.2769808Z     %53 = arith.addi %51, %52 : tensor<128x64xi32, #mma>
2026-02-21T09:54:48.2769975Z     %54 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:48.2770204Z     %55 = tt.addptr %54, %53 : tensor<128x64x!tt.ptr<bf16>, #mma>, tensor<128x64xi32, #mma>
2026-02-21T09:54:48.2770390Z     tt.store %55, %47 : tensor<128x64x!tt.ptr<bf16>, #mma>
2026-02-21T09:54:48.2770519Z     tt.return
2026-02-21T09:54:48.2770601Z   }
2026-02-21T09:54:48.2770677Z }
2026-02-21T09:54:48.2770722Z 
2026-02-21T09:54:48.2770753Z {-#
2026-02-21T09:54:48.2770834Z   external_resources: {
2026-02-21T09:54:48.2770938Z     mlir_reproducer: {
2026-02-21T09:54:48.2771945Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:54:48.2772941Z       disable_threading: false,
2026-02-21T09:54:48.2773051Z       verify_each: true
2026-02-21T09:54:48.2773139Z     }
2026-02-21T09:54:48.2773211Z   }
2026-02-21T09:54:48.2773298Z #-}
2026-02-21T09:54:48.2773574Z /tmp/torchinductor_root/vx/cvxgycmeq7yjydt6ne5477bm5nqyqrzezs7lyi323qohz53sf3fy.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:54:48.2774242Z /tmp/torchinductor_root/vx/cvxgycmeq7yjydt6ne5477bm5nqyqrzezs7lyi323qohz53sf3fy.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:54:48.2774786Z [619s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:54:48.2775504Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:54:48.2776158Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:54:48.2776322Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:54:50.9819875Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 65/65 7.5 configs/s
2026-02-21T09:54:54.1801997Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 213/213 35.6 configs/s
2026-02-21T09:54:57.4422186Z [628s] Generation 9 complete: 
2026-02-21T09:54:57.4422402Z error=4
2026-02-21T09:54:57.4422488Z ok=62
2026-02-21T09:54:57.4422561Z min=0.9309
2026-02-21T09:54:57.4422642Z mid=1.8349
2026-02-21T09:54:57.4422747Z max=77.3627
2026-02-21T09:54:57.4422836Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:54:57.4422979Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:54:57.4423114Z  'l2_groupings': [2],
2026-02-21T09:54:57.4423214Z  'load_eviction_policies': ['', ''],
2026-02-21T09:54:57.4423336Z  'loop_orders': [[0, 1]],
2026-02-21T09:54:57.4423437Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:54:57.4423538Z  'num_stages': 1,
2026-02-21T09:54:57.4423940Z  'num_warps': 4,
2026-02-21T09:54:57.4424029Z  'pid_type': 'flat',
2026-02-21T09:54:57.4424127Z  'range_flattens': [None, None],
2026-02-21T09:54:57.4424240Z  'range_multi_buffers': [None, False],
2026-02-21T09:54:57.4424355Z  'range_num_stages': [0, 1],
2026-02-21T09:54:57.4424463Z  'range_unroll_factors': [0, 0],
2026-02-21T09:54:57.4424571Z  'range_warp_specializes': [],
2026-02-21T09:54:57.4424670Z  'waves_per_eu': 2}
2026-02-21T09:54:57.4525165Z [628s] Fitting surrogate: 910 points, 910 targets
2026-02-21T09:54:58.1345196Z [628s] Generation 10 starting: 60 neighbors, 3 active search path(s)
2026-02-21T09:55:10.5449269Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 20.0 configs/s
2026-02-21T09:55:13.4908896Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:55:13.4934509Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:55:13.4935332Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:55:13.4936171Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:55:13.4936721Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:55:13.4937227Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:55:13.4937685Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:55:13.4938032Z #smem = #ttg.shared_memory
2026-02-21T09:55:13.4939135Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:55:13.4940172Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:55:13.4941384Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.4941751Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.4942065Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.4942362Z     %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.4942614Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:55:13.4942869Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:55:13.4943137Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:55:13.4943440Z     %cst_4 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.4943855Z     %cst_5 = arith.constant dense<508> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.4944269Z     %cst_6 = arith.constant dense<510> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.4944624Z     %cst_7 = arith.constant dense<8192> : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4945028Z     %cst_8 = arith.constant dense<0> : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4945302Z     %cst_9 = arith.constant dense<512> : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4945596Z     %cst_10 = arith.constant dense<0> : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.4945889Z     %cst_11 = arith.constant dense<8192> : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.4946179Z     %cst_12 = arith.constant dense<0> : tensor<2x128xi8, #blocked2>
2026-02-21T09:55:13.4946419Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:55:13.4946615Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:55:13.4946800Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:55:13.4947103Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:55:13.4947343Z     %cst_13 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4947586Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:55:13.4947769Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:55:13.4947950Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:55:13.4948245Z     %cst_14 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4948553Z     %0 = tt.get_program_id x : i32
2026-02-21T09:55:13.4948815Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:55:13.4949001Z     %2 = arith.minsi %1, %c8192_i32 : i32
2026-02-21T09:55:13.4949338Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.4949802Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.4950236Z     %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.4950689Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.4951142Z     %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.4951545Z     %8 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.4951877Z     %9 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.4952202Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.4952623Z     %11 = arith.extsi %10 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.4953077Z     %12 = arith.extsi %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.4953515Z     %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:55:13.4954028Z     %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:55:13.4954526Z     %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.4954836Z     %16 = arith.cmpi eq, %15, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.4955085Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:55:13.4955332Z     %18 = arith.cmpi eq, %15, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.4955558Z     %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:55:13.4955823Z     %20 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.4956022Z     %21 = arith.subi %2, %0 : i32
2026-02-21T09:55:13.4956162Z     %22 = arith.remsi %21, %c3_i32 : i32
2026-02-21T09:55:13.4956306Z     %23 = arith.subi %21, %22 : i32
2026-02-21T09:55:13.4956463Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:55:13.4956614Z     scf.for %arg3 = %0 to %24 step %c3_i32  : i32 {
2026-02-21T09:55:13.4956782Z       %25 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:55:13.4956934Z       %26 = arith.muli %25, %c4_i32 : i32
2026-02-21T09:55:13.4957083Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:55:13.4957229Z       %28 = arith.minsi %27, %c4_i32 : i32
2026-02-21T09:55:13.4957374Z       %29 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:55:13.4957538Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:55:13.4957682Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:55:13.4957813Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:55:13.4957975Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:55:13.4958187Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.4958455Z       %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.4958721Z       %36 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.4959007Z       %37 = arith.addi %35, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.4959231Z       %38 = arith.muli %32, %c128_i32 : i32
2026-02-21T09:55:13.4959425Z       %39 = tt.splat %38 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.4959680Z       %40 = arith.addi %39, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.4960011Z       %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.4960334Z       %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.4960572Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.4960784Z       %44 = arith.extsi %38 : i32 to i64
2026-02-21T09:55:13.4960992Z       %45 = tt.splat %44 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.4961267Z       %46 = arith.addi %45, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.4961605Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.4961995Z       %48 = tt.broadcast %47 : tensor<1x128xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.4962228Z       %49 = arith.cmpi sge, %47, %cst_10 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.4962405Z       %50 = arith.cmpi slt, %47, %cst_11 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.4962663Z       %51 = arith.andi %49, %50 : tensor<1x128xi1, #blocked2>
2026-02-21T09:55:13.4962850Z       %52 = tt.broadcast %51 : tensor<1x128xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.4963066Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.4963339Z       %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:55:13.4963610Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.4963798Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.4963996Z       %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.4964197Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.4964484Z       %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.4964847Z       ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.4965119Z       %60 = arith.addi %7, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.4965422Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:55:13.4965690Z       %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.4965884Z       %63 = arith.addi %43, %62 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.4966084Z       %64 = tt.addptr %8, %63 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.4966285Z       %65 = tt.load %64 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.4966591Z       %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.4966946Z       ttg.local_store %65, %66 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.4967494Z       %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:55:13.4967928Z         %312 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:55:13.4968054Z         %313 = arith.muli %312, %c2_i32 : i32
2026-02-21T09:55:13.4968233Z         %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.4968456Z         %315 = arith.addi %314, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.4968740Z         %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:55:13.4969022Z         %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.4969217Z         %318 = arith.addi %43, %317 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.4969423Z         %319 = tt.addptr %8, %318 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.4969633Z         %320 = tt.load %319 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.4969939Z         %321 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.4970407Z         %322 = arith.extf %321 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.4970694Z         %323 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:55:13.4970868Z         %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.4971095Z         %325 = arith.addi %324, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.4971373Z         %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4971626Z         %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4971820Z         %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.4972016Z         %329 = arith.addi %328, %48 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.4972217Z         %330 = tt.addptr %9, %329 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.4972425Z         %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4972598Z         %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4972760Z         %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2>
2026-02-21T09:55:13.4972950Z         %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.4977520Z         %335 = arith.andi %334, %52 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.4977694Z         %336 = tt.load %330, %335, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.4977957Z         %337 = ttg.convert_layout %336 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4978243Z         %338 = arith.shli %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4978485Z         %339 = arith.shrsi %338, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4978723Z         %340 = arith.shrsi %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4979056Z         %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.4979396Z         %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.4979685Z         %343 = tt.broadcast %341 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4979952Z         %344 = arith.select %17, %343, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4980195Z         %345 = tt.broadcast %342 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4980434Z         %346 = arith.select %19, %345, %344 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4980667Z         %347 = tt.reshape %346 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.4980895Z         %348 = arith.sitofp %347 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.4981153Z         %349 = ttg.local_alloc %348 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.4981483Z         %350 = ttg.local_load %349 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.4981965Z         %351 = tt.dot %322, %350, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.4982318Z         %352 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.4982463Z         %353 = arith.cmpi slt, %352, %c2_i32 : i32
2026-02-21T09:55:13.4982600Z         %354 = arith.select %353, %352, %c0_i32 : i32
2026-02-21T09:55:13.4982872Z         %355 = ttg.memdesc_index %53[%354] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.4983233Z         ttg.local_store %320, %355 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.4983639Z         scf.yield %351, %354, %arg8, %355 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.4983984Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:13.4984302Z       %68 = ttg.local_load %67#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.4984738Z       %69 = arith.extf %68 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.4985070Z       %70 = arith.addi %11, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.4985347Z       %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4985612Z       %72 = arith.muli %71, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4985804Z       %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.4985996Z       %74 = arith.addi %73, %48 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.4986185Z       %75 = tt.addptr %9, %74 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.4986390Z       %76 = arith.cmpi sge, %71, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4986557Z       %77 = arith.cmpi slt, %71, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4986717Z       %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2>
2026-02-21T09:55:13.4986916Z       %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.4987099Z       %80 = arith.andi %79, %52 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.4987264Z       %81 = tt.load %75, %80, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.4987516Z       %82 = ttg.convert_layout %81 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4987816Z       %83 = arith.shli %82, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4988047Z       %84 = arith.shrsi %83, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4988280Z       %85 = arith.shrsi %82, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4988564Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.4988898Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.4989180Z       %88 = tt.broadcast %86 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4989421Z       %89 = arith.select %17, %88, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4989654Z       %90 = tt.broadcast %87 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4989883Z       %91 = arith.select %19, %90, %89 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4990107Z       %92 = tt.reshape %91 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.4990343Z       %93 = arith.sitofp %92 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.4990591Z       %94 = ttg.local_alloc %93 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.4990912Z       %95 = ttg.local_load %94 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.4991378Z       %96 = tt.dot %69, %95, %67#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.4991872Z       %97 = ttg.local_load %67#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.4992297Z       %98 = arith.extf %97 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.4992629Z       %99 = arith.addi %11, %cst_6 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.4992907Z       %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4993157Z       %101 = arith.muli %100, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4993351Z       %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.4993570Z       %103 = arith.addi %102, %48 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.4993770Z       %104 = tt.addptr %9, %103 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.4993979Z       %105 = arith.cmpi sge, %100, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4994152Z       %106 = arith.cmpi slt, %100, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.4994319Z       %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2>
2026-02-21T09:55:13.4994504Z       %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.4994715Z       %109 = arith.andi %108, %52 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.4994882Z       %110 = tt.load %104, %109, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.4995144Z       %111 = ttg.convert_layout %110 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4995432Z       %112 = arith.shli %111, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4995686Z       %113 = arith.shrsi %112, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4995929Z       %114 = arith.shrsi %111, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.4996225Z       %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.4996570Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.4996862Z       %117 = tt.broadcast %115 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4997102Z       %118 = arith.select %17, %117, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4997346Z       %119 = tt.broadcast %116 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4997577Z       %120 = arith.select %19, %119, %118 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.4997809Z       %121 = tt.reshape %120 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.4998033Z       %122 = arith.sitofp %121 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.4998305Z       %123 = ttg.local_alloc %122 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.4998632Z       %124 = ttg.local_load %123 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.4999103Z       %125 = tt.dot %98, %124, %96, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.4999488Z       ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.4999706Z       %126 = arith.truncf %125 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.4999977Z       %127 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.5000220Z       %128 = arith.muli %127, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.5000451Z       %129 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.5000715Z       %130 = tt.broadcast %128 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5000924Z       %131 = tt.broadcast %129 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5001105Z       %132 = arith.addi %130, %131 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5001314Z       %133 = tt.addptr %20, %132 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5001513Z       tt.store %133, %126 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.5001661Z       %134 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:55:13.5001787Z       %135 = arith.divsi %134, %c256_i32 : i32
2026-02-21T09:55:13.5001907Z       %136 = arith.muli %135, %c4_i32 : i32
2026-02-21T09:55:13.5002029Z       %137 = arith.subi %c128_i32, %136 : i32
2026-02-21T09:55:13.5002147Z       %138 = arith.minsi %137, %c4_i32 : i32
2026-02-21T09:55:13.5002269Z       %139 = arith.remsi %134, %c256_i32 : i32
2026-02-21T09:55:13.5002384Z       %140 = arith.remsi %139, %138 : i32
2026-02-21T09:55:13.5002518Z       %141 = arith.addi %136, %140 : i32
2026-02-21T09:55:13.5002672Z       %142 = arith.divsi %139, %138 : i32
2026-02-21T09:55:13.5002792Z       %143 = arith.muli %141, %c128_i32 : i32
2026-02-21T09:55:13.5002967Z       %144 = tt.splat %143 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.5003186Z       %145 = tt.splat %143 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.5003425Z       %146 = arith.addi %144, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.5003641Z       %147 = arith.addi %145, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.5003810Z       %148 = arith.muli %142, %c128_i32 : i32
2026-02-21T09:55:13.5003975Z       %149 = tt.splat %148 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.5004180Z       %150 = arith.addi %149, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.5004461Z       %151 = tt.expand_dims %146 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.5004715Z       %152 = arith.muli %151, %cst_2 : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.5004914Z       %153 = tt.broadcast %152 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5005091Z       %154 = arith.extsi %148 : i32 to i64
2026-02-21T09:55:13.5005261Z       %155 = tt.splat %154 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.5005488Z       %156 = arith.addi %155, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.5005769Z       %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.5006075Z       %158 = tt.broadcast %157 : tensor<1x128xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5006280Z       %159 = arith.cmpi sge, %157, %cst_10 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.5006457Z       %160 = arith.cmpi slt, %157, %cst_11 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.5006625Z       %161 = arith.andi %159, %160 : tensor<1x128xi1, #blocked2>
2026-02-21T09:55:13.5006814Z       %162 = tt.broadcast %161 : tensor<1x128xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5007068Z       %163 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.5007253Z       %164 = arith.addi %153, %55 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5007454Z       %165 = tt.addptr %8, %164 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5007663Z       %166 = tt.load %165 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.5007950Z       %167 = ttg.memdesc_index %163[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5008316Z       ttg.local_store %166, %167 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5008557Z       %168 = arith.addi %153, %62 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5008757Z       %169 = tt.addptr %8, %168 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5008993Z       %170 = tt.load %169 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.5009275Z       %171 = ttg.memdesc_index %163[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5009634Z       ttg.local_store %170, %171 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5010164Z       %172:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %167, %arg8 = %171) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:55:13.5010611Z         %312 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:55:13.5010743Z         %313 = arith.muli %312, %c2_i32 : i32
2026-02-21T09:55:13.5010916Z         %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.5011141Z         %315 = arith.addi %314, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.5011444Z         %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:55:13.5011721Z         %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5011919Z         %318 = arith.addi %153, %317 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5012120Z         %319 = tt.addptr %8, %318 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5012332Z         %320 = tt.load %319 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.5012635Z         %321 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5013074Z         %322 = arith.extf %321 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5013358Z         %323 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:55:13.5013529Z         %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.5013756Z         %325 = arith.addi %324, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.5014052Z         %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5014299Z         %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5014495Z         %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5014689Z         %329 = arith.addi %328, %158 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5014891Z         %330 = tt.addptr %9, %329 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5015102Z         %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5015275Z         %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5015441Z         %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2>
2026-02-21T09:55:13.5015629Z         %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5015825Z         %335 = arith.andi %334, %162 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5015995Z         %336 = tt.load %330, %335, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.5016259Z         %337 = ttg.convert_layout %336 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5016546Z         %338 = arith.shli %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5016800Z         %339 = arith.shrsi %338, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5017043Z         %340 = arith.shrsi %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5017333Z         %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5017675Z         %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5017963Z         %343 = tt.broadcast %341 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5018227Z         %344 = arith.select %17, %343, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5018469Z         %345 = tt.broadcast %342 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5018706Z         %346 = arith.select %19, %345, %344 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5018949Z         %347 = tt.reshape %346 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.5019178Z         %348 = arith.sitofp %347 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.5019430Z         %349 = ttg.local_alloc %348 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.5019759Z         %350 = ttg.local_load %349 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5020240Z         %351 = tt.dot %322, %350, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.5020592Z         %352 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.5020721Z         %353 = arith.cmpi slt, %352, %c2_i32 : i32
2026-02-21T09:55:13.5020854Z         %354 = arith.select %353, %352, %c0_i32 : i32
2026-02-21T09:55:13.5021128Z         %355 = ttg.memdesc_index %163[%354] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5021497Z         ttg.local_store %320, %355 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5021921Z         scf.yield %351, %354, %arg8, %355 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5022267Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:13.5022586Z       %173 = ttg.local_load %172#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5023027Z       %174 = arith.extf %173 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5023331Z       %175 = arith.addi %73, %158 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5023536Z       %176 = tt.addptr %9, %175 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5023744Z       %177 = arith.andi %79, %162 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5023922Z       %178 = tt.load %176, %177, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.5024182Z       %179 = ttg.convert_layout %178 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5024471Z       %180 = arith.shli %179, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5024709Z       %181 = arith.shrsi %180, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5024974Z       %182 = arith.shrsi %179, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5025274Z       %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5025615Z       %184 = tt.expand_dims %182 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5025906Z       %185 = tt.broadcast %183 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5026170Z       %186 = arith.select %17, %185, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5026415Z       %187 = tt.broadcast %184 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5026655Z       %188 = arith.select %19, %187, %186 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5026888Z       %189 = tt.reshape %188 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.5027133Z       %190 = arith.sitofp %189 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.5027389Z       %191 = ttg.local_alloc %190 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.5027722Z       %192 = ttg.local_load %191 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5028203Z       %193 = tt.dot %174, %192, %172#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.5028711Z       %194 = ttg.local_load %172#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5029155Z       %195 = arith.extf %194 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5029467Z       %196 = arith.addi %102, %158 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5029671Z       %197 = tt.addptr %9, %196 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5029880Z       %198 = arith.andi %108, %162 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5030067Z       %199 = tt.load %197, %198, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.5030333Z       %200 = ttg.convert_layout %199 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5030624Z       %201 = arith.shli %200, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5030863Z       %202 = arith.shrsi %201, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5031109Z       %203 = arith.shrsi %200, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5031407Z       %204 = tt.expand_dims %202 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5031755Z       %205 = tt.expand_dims %203 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5032050Z       %206 = tt.broadcast %204 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5032294Z       %207 = arith.select %17, %206, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5032543Z       %208 = tt.broadcast %205 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5032777Z       %209 = arith.select %19, %208, %207 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5033031Z       %210 = tt.reshape %209 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.5033263Z       %211 = arith.sitofp %210 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.5033518Z       %212 = ttg.local_alloc %211 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.5033849Z       %213 = ttg.local_load %212 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5034324Z       %214 = tt.dot %195, %213, %193, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.5034737Z       ttg.local_dealloc %163 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.5034958Z       %215 = arith.truncf %214 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.5035232Z       %216 = tt.expand_dims %147 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.5035491Z       %217 = arith.muli %216, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.5035728Z       %218 = tt.expand_dims %150 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.5035991Z       %219 = tt.broadcast %217 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5036204Z       %220 = tt.broadcast %218 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5036390Z       %221 = arith.addi %219, %220 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5036590Z       %222 = tt.addptr %20, %221 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5036793Z       tt.store %222, %215 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.5036945Z       %223 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:55:13.5037075Z       %224 = arith.divsi %223, %c256_i32 : i32
2026-02-21T09:55:13.5037201Z       %225 = arith.muli %224, %c4_i32 : i32
2026-02-21T09:55:13.5037328Z       %226 = arith.subi %c128_i32, %225 : i32
2026-02-21T09:55:13.5037450Z       %227 = arith.minsi %226, %c4_i32 : i32
2026-02-21T09:55:13.5037574Z       %228 = arith.remsi %223, %c256_i32 : i32
2026-02-21T09:55:13.5037694Z       %229 = arith.remsi %228, %227 : i32
2026-02-21T09:55:13.5037816Z       %230 = arith.addi %225, %229 : i32
2026-02-21T09:55:13.5037957Z       %231 = arith.divsi %228, %227 : i32
2026-02-21T09:55:13.5038076Z       %232 = arith.muli %230, %c128_i32 : i32
2026-02-21T09:55:13.5038257Z       %233 = tt.splat %232 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.5038478Z       %234 = tt.splat %232 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.5038702Z       %235 = arith.addi %233, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.5038921Z       %236 = arith.addi %234, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.5039094Z       %237 = arith.muli %231, %c128_i32 : i32
2026-02-21T09:55:13.5039266Z       %238 = tt.splat %237 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.5039475Z       %239 = arith.addi %238, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.5039755Z       %240 = tt.expand_dims %235 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.5040013Z       %241 = arith.muli %240, %cst_2 : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.5040218Z       %242 = tt.broadcast %241 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5040402Z       %243 = arith.extsi %237 : i32 to i64
2026-02-21T09:55:13.5040573Z       %244 = tt.splat %243 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.5040818Z       %245 = arith.addi %244, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.5041100Z       %246 = tt.expand_dims %245 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.5041388Z       %247 = tt.broadcast %246 : tensor<1x128xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5041599Z       %248 = arith.cmpi sge, %246, %cst_10 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.5041777Z       %249 = arith.cmpi slt, %246, %cst_11 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.5041968Z       %250 = arith.andi %248, %249 : tensor<1x128xi1, #blocked2>
2026-02-21T09:55:13.5042161Z       %251 = tt.broadcast %250 : tensor<1x128xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5042387Z       %252 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.5042616Z       %253 = arith.addi %242, %55 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5042822Z       %254 = tt.addptr %8, %253 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5043058Z       %255 = tt.load %254 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.5043347Z       %256 = ttg.memdesc_index %252[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5043715Z       ttg.local_store %255, %256 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5043958Z       %257 = arith.addi %242, %62 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5044167Z       %258 = tt.addptr %8, %257 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5044740Z       %259 = tt.load %258 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.5045040Z       %260 = ttg.memdesc_index %252[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5045406Z       ttg.local_store %259, %260 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5046046Z       %261:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %256, %arg8 = %260) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:55:13.5046478Z         %312 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:55:13.5046616Z         %313 = arith.muli %312, %c2_i32 : i32
2026-02-21T09:55:13.5046788Z         %314 = tt.splat %313 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.5047023Z         %315 = arith.addi %314, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.5047309Z         %316 = tt.expand_dims %315 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:55:13.5047588Z         %317 = tt.broadcast %316 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5047791Z         %318 = arith.addi %242, %317 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5047997Z         %319 = tt.addptr %8, %318 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5048216Z         %320 = tt.load %319 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.5048525Z         %321 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5048968Z         %322 = arith.extf %321 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5049300Z         %323 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:55:13.5049475Z         %324 = tt.splat %323 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.5049707Z         %325 = arith.addi %324, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.5049991Z         %326 = tt.expand_dims %325 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5050244Z         %327 = arith.muli %326, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5050447Z         %328 = tt.broadcast %327 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5050680Z         %329 = arith.addi %328, %247 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5050890Z         %330 = tt.addptr %9, %329 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5051107Z         %331 = arith.cmpi sge, %326, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5051283Z         %332 = arith.cmpi slt, %326, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5051457Z         %333 = arith.andi %331, %332 : tensor<2x1xi1, #blocked2>
2026-02-21T09:55:13.5051673Z         %334 = tt.broadcast %333 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5051873Z         %335 = arith.andi %334, %251 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5052047Z         %336 = tt.load %330, %335, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.5052315Z         %337 = ttg.convert_layout %336 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5052612Z         %338 = arith.shli %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5052852Z         %339 = arith.shrsi %338, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5053100Z         %340 = arith.shrsi %337, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5053396Z         %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5053744Z         %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5054035Z         %343 = tt.broadcast %341 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5054305Z         %344 = arith.select %17, %343, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5054558Z         %345 = tt.broadcast %342 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5054798Z         %346 = arith.select %19, %345, %344 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5055033Z         %347 = tt.reshape %346 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.5055266Z         %348 = arith.sitofp %347 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.5055525Z         %349 = ttg.local_alloc %348 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.5055861Z         %350 = ttg.local_load %349 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5056346Z         %351 = tt.dot %322, %350, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.5056705Z         %352 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.5056842Z         %353 = arith.cmpi slt, %352, %c2_i32 : i32
2026-02-21T09:55:13.5056979Z         %354 = arith.select %353, %352, %c0_i32 : i32
2026-02-21T09:55:13.5057276Z         %355 = ttg.memdesc_index %252[%354] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5057649Z         ttg.local_store %320, %355 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5058052Z         scf.yield %351, %354, %arg8, %355 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5058404Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:13.5058723Z       %262 = ttg.local_load %261#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5059183Z       %263 = arith.extf %262 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5059491Z       %264 = arith.addi %73, %247 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5059708Z       %265 = tt.addptr %9, %264 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5059915Z       %266 = arith.andi %79, %251 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5060090Z       %267 = tt.load %265, %266, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.5060351Z       %268 = ttg.convert_layout %267 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5060641Z       %269 = arith.shli %268, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5060885Z       %270 = arith.shrsi %269, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5061133Z       %271 = arith.shrsi %268, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5061431Z       %272 = tt.expand_dims %270 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5061769Z       %273 = tt.expand_dims %271 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5062055Z       %274 = tt.broadcast %272 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5062311Z       %275 = arith.select %17, %274, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5062552Z       %276 = tt.broadcast %273 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5062790Z       %277 = arith.select %19, %276, %275 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5063019Z       %278 = tt.reshape %277 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.5063244Z       %279 = arith.sitofp %278 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.5063498Z       %280 = ttg.local_alloc %279 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.5063828Z       %281 = ttg.local_load %280 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5064302Z       %282 = tt.dot %263, %281, %261#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.5064796Z       %283 = ttg.local_load %261#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5065231Z       %284 = arith.extf %283 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5065548Z       %285 = arith.addi %102, %247 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5065747Z       %286 = tt.addptr %9, %285 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5065954Z       %287 = arith.andi %108, %251 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5066125Z       %288 = tt.load %286, %287, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.5066387Z       %289 = ttg.convert_layout %288 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5066674Z       %290 = arith.shli %289, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5066930Z       %291 = arith.shrsi %290, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5067168Z       %292 = arith.shrsi %289, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5067459Z       %293 = tt.expand_dims %291 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5067813Z       %294 = tt.expand_dims %292 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5068100Z       %295 = tt.broadcast %293 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5068340Z       %296 = arith.select %17, %295, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5068582Z       %297 = tt.broadcast %294 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5068815Z       %298 = arith.select %19, %297, %296 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5069046Z       %299 = tt.reshape %298 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.5069271Z       %300 = arith.sitofp %299 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.5069526Z       %301 = ttg.local_alloc %300 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.5069861Z       %302 = ttg.local_load %301 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5070351Z       %303 = tt.dot %284, %302, %282, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.5070738Z       ttg.local_dealloc %252 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.5070956Z       %304 = arith.truncf %303 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.5071229Z       %305 = tt.expand_dims %236 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.5071469Z       %306 = arith.muli %305, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.5071703Z       %307 = tt.expand_dims %239 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.5071964Z       %308 = tt.broadcast %306 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5072170Z       %309 = tt.broadcast %307 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5072353Z       %310 = arith.addi %308, %309 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5072548Z       %311 = tt.addptr %20, %310 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5072752Z       tt.store %311, %304 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.5072896Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:55:13.5073019Z     scf.for %arg3 = %24 to %2 step %c1_i32  : i32 {
2026-02-21T09:55:13.5073155Z       %25 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:55:13.5073295Z       %26 = arith.muli %25, %c4_i32 : i32
2026-02-21T09:55:13.5073414Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:55:13.5073532Z       %28 = arith.minsi %27, %c4_i32 : i32
2026-02-21T09:55:13.5073651Z       %29 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:55:13.5073771Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:55:13.5073885Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:55:13.5073994Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:55:13.5074108Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:55:13.5074275Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.5074517Z       %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.5074732Z       %36 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.5074946Z       %37 = arith.addi %35, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.5075111Z       %38 = arith.muli %32, %c128_i32 : i32
2026-02-21T09:55:13.5075267Z       %39 = tt.splat %38 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.5075486Z       %40 = arith.addi %39, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.5075756Z       %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.5076008Z       %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.5076201Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5076376Z       %44 = arith.extsi %38 : i32 to i64
2026-02-21T09:55:13.5076543Z       %45 = tt.splat %44 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.5076762Z       %46 = arith.addi %45, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.5077042Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.5077319Z       %48 = tt.broadcast %47 : tensor<1x128xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5077521Z       %49 = arith.cmpi sge, %47, %cst_10 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.5077696Z       %50 = arith.cmpi slt, %47, %cst_11 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.5077860Z       %51 = arith.andi %49, %50 : tensor<1x128xi1, #blocked2>
2026-02-21T09:55:13.5078061Z       %52 = tt.broadcast %51 : tensor<1x128xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5078275Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.5078541Z       %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:55:13.5078812Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5078999Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5079198Z       %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5079397Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.5079682Z       %59 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5080045Z       ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5080315Z       %60 = arith.addi %7, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.5080588Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:55:13.5080872Z       %62 = tt.broadcast %61 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5081062Z       %63 = arith.addi %43, %62 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5081261Z       %64 = tt.addptr %8, %63 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5081461Z       %65 = tt.load %64 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.5081744Z       %66 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5082097Z       ttg.local_store %65, %66 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5082692Z       %67:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_3, %arg6 = %c1_i32, %arg7 = %59, %arg8 = %66) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:55:13.5083119Z         %134 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:55:13.5083242Z         %135 = arith.muli %134, %c2_i32 : i32
2026-02-21T09:55:13.5083443Z         %136 = tt.splat %135 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.5083669Z         %137 = arith.addi %136, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.5083947Z         %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x4xi32, #blocked1>
2026-02-21T09:55:13.5084224Z         %139 = tt.broadcast %138 : tensor<1x4xi32, #blocked1> -> tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5084420Z         %140 = arith.addi %43, %139 : tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5084623Z         %141 = tt.addptr %8, %140 : tensor<128x4x!tt.ptr<bf16>, #blocked1>, tensor<128x4xi32, #blocked1>
2026-02-21T09:55:13.5084830Z         %142 = tt.load %141 : tensor<128x4x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.5085135Z         %143 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5085573Z         %144 = arith.extf %143 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5085857Z         %145 = arith.extsi %arg4 : i32 to i64
2026-02-21T09:55:13.5086054Z         %146 = tt.splat %145 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.5086279Z         %147 = arith.addi %146, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.5086551Z         %148 = tt.expand_dims %147 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5086800Z         %149 = arith.muli %148, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5086992Z         %150 = tt.broadcast %149 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5087189Z         %151 = arith.addi %150, %48 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5087387Z         %152 = tt.addptr %9, %151 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5087598Z         %153 = arith.cmpi sge, %148, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5087775Z         %154 = arith.cmpi slt, %148, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5087938Z         %155 = arith.andi %153, %154 : tensor<2x1xi1, #blocked2>
2026-02-21T09:55:13.5088130Z         %156 = tt.broadcast %155 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5088320Z         %157 = arith.andi %156, %52 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5088492Z         %158 = tt.load %152, %157, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.5088773Z         %159 = ttg.convert_layout %158 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5089058Z         %160 = arith.shli %159, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5089299Z         %161 = arith.shrsi %160, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5089538Z         %162 = arith.shrsi %159, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5089836Z         %163 = tt.expand_dims %161 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5090197Z         %164 = tt.expand_dims %162 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5090481Z         %165 = tt.broadcast %163 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5090728Z         %166 = arith.select %17, %165, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5090983Z         %167 = tt.broadcast %164 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5091220Z         %168 = arith.select %19, %167, %166 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5091453Z         %169 = tt.reshape %168 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.5091678Z         %170 = arith.sitofp %169 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.5091934Z         %171 = ttg.local_alloc %170 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.5092261Z         %172 = ttg.local_load %171 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5092740Z         %173 = tt.dot %144, %172, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.5093096Z         %174 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.5093223Z         %175 = arith.cmpi slt, %174, %c2_i32 : i32
2026-02-21T09:55:13.5093357Z         %176 = arith.select %175, %174, %c0_i32 : i32
2026-02-21T09:55:13.5093644Z         %177 = ttg.memdesc_index %53[%176] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5094004Z         ttg.local_store %142, %177 : tensor<128x4xbf16, #blocked1> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5094407Z         scf.yield %173, %176, %arg8, %177 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.5094745Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:13.5095062Z       %68 = ttg.local_load %67#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5095491Z       %69 = arith.extf %68 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5095818Z       %70 = arith.addi %11, %cst_5 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.5096094Z       %71 = tt.expand_dims %70 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5096338Z       %72 = arith.muli %71, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5096525Z       %73 = tt.broadcast %72 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5096735Z       %74 = arith.addi %73, %48 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5096926Z       %75 = tt.addptr %9, %74 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5097132Z       %76 = arith.cmpi sge, %71, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5097298Z       %77 = arith.cmpi slt, %71, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5097458Z       %78 = arith.andi %76, %77 : tensor<2x1xi1, #blocked2>
2026-02-21T09:55:13.5097642Z       %79 = tt.broadcast %78 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5097826Z       %80 = arith.andi %79, %52 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5098008Z       %81 = tt.load %75, %80, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.5098258Z       %82 = ttg.convert_layout %81 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5098536Z       %83 = arith.shli %82, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5098766Z       %84 = arith.shrsi %83, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5099009Z       %85 = arith.shrsi %82, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5099293Z       %86 = tt.expand_dims %84 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5099622Z       %87 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5099905Z       %88 = tt.broadcast %86 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5100145Z       %89 = arith.select %17, %88, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5100378Z       %90 = tt.broadcast %87 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5100609Z       %91 = arith.select %19, %90, %89 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5100831Z       %92 = tt.reshape %91 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.5101050Z       %93 = arith.sitofp %92 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.5101298Z       %94 = ttg.local_alloc %93 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.5101635Z       %95 = ttg.local_load %94 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5102102Z       %96 = tt.dot %69, %95, %67#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.5102591Z       %97 = ttg.local_load %67#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5103017Z       %98 = arith.extf %97 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5103343Z       %99 = arith.addi %11, %cst_6 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.5103620Z       %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5103870Z       %101 = arith.muli %100, %cst_7 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5104067Z       %102 = tt.broadcast %101 : tensor<2x1xi64, #blocked2> -> tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5104258Z       %103 = arith.addi %102, %48 : tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5104456Z       %104 = tt.addptr %9, %103 : tensor<2x128x!tt.ptr<i8>, #blocked2>, tensor<2x128xi64, #blocked2>
2026-02-21T09:55:13.5104687Z       %105 = arith.cmpi sge, %100, %cst_8 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5104859Z       %106 = arith.cmpi slt, %100, %cst_9 : tensor<2x1xi64, #blocked2>
2026-02-21T09:55:13.5105028Z       %107 = arith.andi %105, %106 : tensor<2x1xi1, #blocked2>
2026-02-21T09:55:13.5105211Z       %108 = tt.broadcast %107 : tensor<2x1xi1, #blocked2> -> tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5105403Z       %109 = arith.andi %108, %52 : tensor<2x128xi1, #blocked2>
2026-02-21T09:55:13.5105573Z       %110 = tt.load %104, %109, %cst_12 : tensor<2x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.5105836Z       %111 = ttg.convert_layout %110 : tensor<2x128xi8, #blocked2> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5106137Z       %112 = arith.shli %111, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5106374Z       %113 = arith.shrsi %112, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5106613Z       %114 = arith.shrsi %111, %cst_14 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.5106916Z       %115 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5107255Z       %116 = tt.expand_dims %114 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:55:13.5107540Z       %117 = tt.broadcast %115 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5107779Z       %118 = arith.select %17, %117, %cst_13 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5108022Z       %119 = tt.broadcast %116 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5108256Z       %120 = arith.select %19, %119, %118 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:55:13.5108489Z       %121 = tt.reshape %120 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.5108715Z       %122 = arith.sitofp %121 : tensor<4x128xi8, #blocked2> to tensor<4x128xf32, #blocked2>
2026-02-21T09:55:13.5108969Z       %123 = ttg.local_alloc %122 : (tensor<4x128xf32, #blocked2>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:55:13.5109295Z       %124 = ttg.local_load %123 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.5109779Z       %125 = tt.dot %98, %124, %96, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.5110162Z       ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.5110379Z       %126 = arith.truncf %125 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.5110655Z       %127 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.5110896Z       %128 = arith.muli %127, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.5111124Z       %129 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.5111386Z       %130 = tt.broadcast %128 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5111596Z       %131 = tt.broadcast %129 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5111777Z       %132 = arith.addi %130, %131 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5111972Z       %133 = tt.addptr %20, %132 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.5112170Z       tt.store %133, %126 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.5112327Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:55:13.5112429Z     tt.return
2026-02-21T09:55:13.5112512Z   }
2026-02-21T09:55:13.5112588Z }
2026-02-21T09:55:13.5112632Z 
2026-02-21T09:55:13.5112663Z {-#
2026-02-21T09:55:13.5112746Z   external_resources: {
2026-02-21T09:55:13.5112847Z     mlir_reproducer: {
2026-02-21T09:55:13.5113842Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:55:13.5114855Z       disable_threading: false,
2026-02-21T09:55:13.5114961Z       verify_each: true
2026-02-21T09:55:13.5123724Z     }
2026-02-21T09:55:13.5123818Z   }
2026-02-21T09:55:13.5123891Z #-}
2026-02-21T09:55:13.5124241Z /tmp/torchinductor_root/vt/cvtqkqvhi7lrkmrn7rd2e2qmchoq3bnvg4cimoocko5nbr34ijna.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:55:13.5124955Z /tmp/torchinductor_root/vt/cvtqkqvhi7lrkmrn7rd2e2qmchoq3bnvg4cimoocko5nbr34ijna.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:55:13.5125519Z [644s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:55:13.5126295Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:55:13.5127010Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:55:13.5127181Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:55:13.6235436Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:55:13.6244477Z #blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:55:13.6244832Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:55:13.6245141Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:55:13.6245444Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:55:13.6245729Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:55:13.6245980Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:55:13.6246217Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:55:13.6246395Z #smem = #ttg.shared_memory
2026-02-21T09:55:13.6246630Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:55:13.6247103Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:55:13.6247577Z     %cst = arith.constant dense<4> : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6247731Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:55:13.6247850Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:55:13.6247993Z     %cst_0 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6248172Z     %cst_1 = arith.constant dense<0> : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6248313Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:55:13.6248432Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:55:13.6248546Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:55:13.6248660Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:55:13.6248850Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:55:13.6248964Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:55:13.6249108Z     %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.6249286Z     %cst_3 = arith.constant dense<8192> : tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6249464Z     %cst_4 = arith.constant dense<0> : tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6249633Z     %cst_5 = arith.constant dense<512> : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6249861Z     %cst_6 = arith.constant dense<0> : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6250030Z     %cst_7 = arith.constant dense<8192> : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6250239Z     %cst_8 = arith.constant dense<2> : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6250487Z     %cst_9 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6250672Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:55:13.6250829Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6251014Z     %cst_11 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:55:13.6251189Z     %cst_12 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:55:13.6251361Z     %cst_13 = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.6251507Z     %0 = tt.get_program_id x : i32
2026-02-21T09:55:13.6251622Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:55:13.6251738Z     %2 = arith.minsi %1, %c4096_i32 : i32
2026-02-21T09:55:13.6251941Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.6252234Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.6252503Z     %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:55:13.6252770Z     %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.6253030Z     %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6253272Z     %8 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6253476Z     %9 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6253703Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6254016Z     %11 = arith.extsi %10 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked}>> to tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6254372Z     %12 = arith.extsi %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked}>> to tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:55:13.6254725Z     %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>>
2026-02-21T09:55:13.6255139Z     %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T09:55:13.6255559Z     %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T09:55:13.6255815Z     %16 = arith.cmpi eq, %15, %cst_11 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:55:13.6256019Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1>
2026-02-21T09:55:13.6256221Z     %18 = arith.cmpi eq, %15, %cst_12 : tensor<1x2x1xi32, #blocked1>
2026-02-21T09:55:13.6256418Z     %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked1> -> tensor<2x2x256xi1, #blocked1>
2026-02-21T09:55:13.6256646Z     %20 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.6256806Z     %21 = arith.subi %2, %0 : i32
2026-02-21T09:55:13.6256920Z     %22 = arith.remsi %21, %c3_i32 : i32
2026-02-21T09:55:13.6257034Z     %23 = arith.subi %21, %22 : i32
2026-02-21T09:55:13.6257149Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:55:13.6257271Z     scf.for %arg3 = %0 to %24 step %c3_i32  : i32 {
2026-02-21T09:55:13.6257411Z       %25 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:55:13.6257548Z       %26 = arith.muli %25, %c4_i32 : i32
2026-02-21T09:55:13.6257667Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:55:13.6257784Z       %28 = arith.minsi %27, %c4_i32 : i32
2026-02-21T09:55:13.6257905Z       %29 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:55:13.6258023Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:55:13.6258134Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:55:13.6258246Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:55:13.6258362Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:55:13.6258530Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.6258743Z       %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.6258959Z       %36 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.6259171Z       %37 = arith.addi %35, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.6259333Z       %38 = arith.muli %32, %c256_i32 : i32
2026-02-21T09:55:13.6259491Z       %39 = tt.splat %38 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.6259692Z       %40 = arith.addi %39, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.6259979Z       %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.6260230Z       %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.6260423Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6260597Z       %44 = arith.extsi %38 : i32 to i64
2026-02-21T09:55:13.6260760Z       %45 = tt.splat %44 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:55:13.6260975Z       %46 = arith.addi %45, %12 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:55:13.6261245Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6261521Z       %48 = tt.broadcast %47 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6261718Z       %49 = arith.cmpi sge, %47, %cst_4 : tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6261884Z       %50 = arith.cmpi slt, %47, %cst_3 : tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6262047Z       %51 = arith.andi %49, %50 : tensor<1x256xi1, #blocked>
2026-02-21T09:55:13.6262224Z       %52 = tt.broadcast %51 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6262437Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.6262720Z       %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:55:13.6262988Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6263178Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6263373Z       %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6263579Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6263820Z       %59 = tt.expand_dims %11 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6264076Z       %60 = arith.muli %59, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6264261Z       %61 = tt.broadcast %60 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6264444Z       %62 = arith.addi %61, %48 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6264636Z       %63 = tt.addptr %9, %62 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6264849Z       %64 = arith.cmpi sge, %59, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6265015Z       %65 = arith.cmpi slt, %59, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6265174Z       %66 = arith.andi %64, %65 : tensor<2x1xi1, #blocked>
2026-02-21T09:55:13.6265347Z       %67 = tt.broadcast %66 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6265530Z       %68 = arith.andi %67, %52 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6265746Z       %69 = tt.load %63, %68, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6266089Z       %70 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6266451Z       ttg.local_store %58, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6266721Z       %71 = arith.addi %7, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6266999Z       %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:55:13.6267284Z       %73 = tt.broadcast %72 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6267495Z       %74 = arith.addi %43, %73 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6267687Z       %75 = tt.addptr %8, %74 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6267892Z       %76 = tt.load %75 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6268079Z       %77 = arith.addi %11, %cst_8 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6268351Z       %78 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6268589Z       %79 = arith.muli %78, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6268771Z       %80 = tt.broadcast %79 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6268955Z       %81 = arith.addi %80, %48 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6269142Z       %82 = tt.addptr %9, %81 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6269342Z       %83 = arith.cmpi sge, %78, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6269503Z       %84 = arith.cmpi slt, %78, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6269661Z       %85 = arith.andi %83, %84 : tensor<2x1xi1, #blocked>
2026-02-21T09:55:13.6269835Z       %86 = tt.broadcast %85 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6270016Z       %87 = arith.andi %86, %52 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6270247Z       %88 = tt.load %82, %87, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6270580Z       %89 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6270937Z       ttg.local_store %76, %89 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6271563Z       %90:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %70, %arg8 = %89, %arg9 = %69, %arg10 = %88) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:55:13.6272102Z         %317 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:55:13.6272229Z         %318 = arith.muli %317, %c2_i32 : i32
2026-02-21T09:55:13.6272404Z         %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6272627Z         %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6272923Z         %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:55:13.6273201Z         %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6273398Z         %323 = arith.addi %43, %322 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6273601Z         %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6273809Z         %325 = tt.load %324 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6274113Z         %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6274556Z         %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6274842Z         %328 = arith.extsi %317 : i32 to i64
2026-02-21T09:55:13.6275014Z         %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6275233Z         %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6275523Z         %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6275766Z         %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6275959Z         %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6276151Z         %334 = arith.addi %333, %48 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6276346Z         %335 = tt.addptr %9, %334 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6276555Z         %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6276723Z         %337 = arith.cmpi slt, %331, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6276887Z         %338 = arith.andi %336, %337 : tensor<2x1xi1, #blocked>
2026-02-21T09:55:13.6277072Z         %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6277261Z         %340 = arith.andi %339, %52 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6277431Z         %341 = tt.load %335, %340, %cst_1 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6277603Z         %342 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6277763Z         %343 = arith.shrsi %342, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6278008Z         %344 = ttg.convert_layout %343 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6278275Z         %345 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6278521Z         %346 = ttg.convert_layout %345 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6278858Z         %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6279205Z         %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6279513Z         %349 = tt.broadcast %347 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6279762Z         %350 = arith.select %17, %349, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6280025Z         %351 = tt.broadcast %348 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6280273Z         %352 = arith.select %19, %351, %350 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6280534Z         %353 = tt.reshape %352 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6280762Z         %354 = arith.sitofp %353 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6281020Z         %355 = ttg.local_alloc %354 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6281348Z         %356 = ttg.local_load %355 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6281834Z         %357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6282191Z         %358 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.6282319Z         %359 = arith.cmpi slt, %358, %c2_i32 : i32
2026-02-21T09:55:13.6282456Z         %360 = arith.select %359, %358, %c0_i32 : i32
2026-02-21T09:55:13.6282831Z         %361 = ttg.memdesc_index %53[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6283213Z         ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6283699Z         scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6284123Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:13.6284443Z       %91 = ttg.local_load %90#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6284882Z       %92 = arith.extf %91 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6285182Z       %93 = arith.shli %90#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6285344Z       %94 = arith.shrsi %93, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6285583Z       %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6285822Z       %96 = arith.shrsi %90#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6286059Z       %97 = ttg.convert_layout %96 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6286407Z       %98 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6286747Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6287033Z       %100 = tt.broadcast %98 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6287275Z       %101 = arith.select %17, %100, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6287521Z       %102 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6287775Z       %103 = arith.select %19, %102, %101 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6288013Z       %104 = tt.reshape %103 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6288241Z       %105 = arith.sitofp %104 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6288494Z       %106 = ttg.local_alloc %105 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6288850Z       %107 = ttg.local_load %106 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6289320Z       %108 = tt.dot %92, %107, %90#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6289814Z       %109 = ttg.local_load %90#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6290244Z       %110 = arith.extf %109 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6290542Z       %111 = arith.shli %90#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6290703Z       %112 = arith.shrsi %111, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6290948Z       %113 = ttg.convert_layout %112 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6291198Z       %114 = arith.shrsi %90#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6291457Z       %115 = ttg.convert_layout %114 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6291791Z       %116 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6292136Z       %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6292429Z       %118 = tt.broadcast %116 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6292673Z       %119 = arith.select %17, %118, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6292918Z       %120 = tt.broadcast %117 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6293154Z       %121 = arith.select %19, %120, %119 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6293390Z       %122 = tt.reshape %121 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6293616Z       %123 = arith.sitofp %122 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6293869Z       %124 = ttg.local_alloc %123 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6294197Z       %125 = ttg.local_load %124 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6294685Z       %126 = tt.dot %110, %125, %108, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6295068Z       ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.6295285Z       %127 = arith.truncf %126 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:55:13.6295555Z       %128 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.6295800Z       %129 = arith.muli %128, %cst_13 : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.6296053Z       %130 = tt.expand_dims %40 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:55:13.6296311Z       %131 = tt.broadcast %129 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6296521Z       %132 = tt.broadcast %130 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6296702Z       %133 = arith.addi %131, %132 : tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6296911Z       %134 = tt.addptr %20, %133 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6297113Z       tt.store %134, %127 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.6297256Z       %135 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:55:13.6297381Z       %136 = arith.divsi %135, %c128_i32 : i32
2026-02-21T09:55:13.6297503Z       %137 = arith.muli %136, %c4_i32 : i32
2026-02-21T09:55:13.6297625Z       %138 = arith.subi %c128_i32, %137 : i32
2026-02-21T09:55:13.6297745Z       %139 = arith.minsi %138, %c4_i32 : i32
2026-02-21T09:55:13.6297867Z       %140 = arith.remsi %135, %c128_i32 : i32
2026-02-21T09:55:13.6297982Z       %141 = arith.remsi %140, %139 : i32
2026-02-21T09:55:13.6298098Z       %142 = arith.addi %137, %141 : i32
2026-02-21T09:55:13.6298214Z       %143 = arith.divsi %140, %139 : i32
2026-02-21T09:55:13.6298331Z       %144 = arith.muli %142, %c128_i32 : i32
2026-02-21T09:55:13.6298507Z       %145 = tt.splat %144 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.6298728Z       %146 = tt.splat %144 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.6298948Z       %147 = arith.addi %145, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.6299184Z       %148 = arith.addi %146, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.6299353Z       %149 = arith.muli %143, %c256_i32 : i32
2026-02-21T09:55:13.6299518Z       %150 = tt.splat %149 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.6299751Z       %151 = arith.addi %150, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.6300028Z       %152 = tt.expand_dims %147 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.6300286Z       %153 = arith.muli %152, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.6300489Z       %154 = tt.broadcast %153 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6300668Z       %155 = arith.extsi %149 : i32 to i64
2026-02-21T09:55:13.6300834Z       %156 = tt.splat %155 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:55:13.6301056Z       %157 = arith.addi %156, %12 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:55:13.6301332Z       %158 = tt.expand_dims %157 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6301612Z       %159 = tt.broadcast %158 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6301817Z       %160 = arith.cmpi sge, %158, %cst_4 : tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6302006Z       %161 = arith.cmpi slt, %158, %cst_3 : tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6302172Z       %162 = arith.andi %160, %161 : tensor<1x256xi1, #blocked>
2026-02-21T09:55:13.6302356Z       %163 = tt.broadcast %162 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6302573Z       %164 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.6302760Z       %165 = arith.addi %154, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6302960Z       %166 = tt.addptr %8, %165 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6303170Z       %167 = tt.load %166 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6303345Z       %168 = arith.addi %61, %159 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6303537Z       %169 = tt.addptr %9, %168 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6303733Z       %170 = arith.andi %67, %163 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6303949Z       %171 = tt.load %169, %170, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6304309Z       %172 = ttg.memdesc_index %164[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6304670Z       ttg.local_store %167, %172 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6304913Z       %173 = arith.addi %154, %73 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6305112Z       %174 = tt.addptr %8, %173 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6305321Z       %175 = tt.load %174 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6305480Z       %176 = arith.addi %80, %159 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6305671Z       %177 = tt.addptr %9, %176 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6305868Z       %178 = arith.andi %86, %163 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6306081Z       %179 = tt.load %177, %178, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6306422Z       %180 = ttg.memdesc_index %164[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6306799Z       ttg.local_store %175, %180 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6307428Z       %181:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %172, %arg8 = %180, %arg9 = %171, %arg10 = %179) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:55:13.6307961Z         %317 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:55:13.6308087Z         %318 = arith.muli %317, %c2_i32 : i32
2026-02-21T09:55:13.6308259Z         %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6308485Z         %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6308762Z         %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:55:13.6309045Z         %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6309245Z         %323 = arith.addi %154, %322 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6309448Z         %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6309658Z         %325 = tt.load %324 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6309975Z         %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6310415Z         %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6310699Z         %328 = arith.extsi %317 : i32 to i64
2026-02-21T09:55:13.6310874Z         %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6311097Z         %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6311382Z         %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6311627Z         %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6311818Z         %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6312007Z         %334 = arith.addi %333, %159 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6312219Z         %335 = tt.addptr %9, %334 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6312423Z         %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6312596Z         %337 = arith.cmpi slt, %331, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6312760Z         %338 = arith.andi %336, %337 : tensor<2x1xi1, #blocked>
2026-02-21T09:55:13.6312943Z         %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6313135Z         %340 = arith.andi %339, %163 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6313303Z         %341 = tt.load %335, %340, %cst_1 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6313477Z         %342 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6313637Z         %343 = arith.shrsi %342, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6313886Z         %344 = ttg.convert_layout %343 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6314141Z         %345 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6314387Z         %346 = ttg.convert_layout %345 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6314743Z         %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6315091Z         %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6315380Z         %349 = tt.broadcast %347 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6315632Z         %350 = arith.select %17, %349, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6315876Z         %351 = tt.broadcast %348 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6316120Z         %352 = arith.select %19, %351, %350 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6316360Z         %353 = tt.reshape %352 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6316587Z         %354 = arith.sitofp %353 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6316844Z         %355 = ttg.local_alloc %354 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6317172Z         %356 = ttg.local_load %355 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6317653Z         %357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6318019Z         %358 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.6318148Z         %359 = arith.cmpi slt, %358, %c2_i32 : i32
2026-02-21T09:55:13.6318285Z         %360 = arith.select %359, %358, %c0_i32 : i32
2026-02-21T09:55:13.6318557Z         %361 = ttg.memdesc_index %164[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6318923Z         ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6319426Z         scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6319848Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:13.6320180Z       %182 = ttg.local_load %181#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6320613Z       %183 = arith.extf %182 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6320914Z       %184 = arith.shli %181#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6321077Z       %185 = arith.shrsi %184, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6321323Z       %186 = ttg.convert_layout %185 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6321571Z       %187 = arith.shrsi %181#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6321818Z       %188 = ttg.convert_layout %187 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6322155Z       %189 = tt.expand_dims %186 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6322501Z       %190 = tt.expand_dims %188 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6322854Z       %191 = tt.broadcast %189 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6323100Z       %192 = arith.select %17, %191, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6323347Z       %193 = tt.broadcast %190 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6323586Z       %194 = arith.select %19, %193, %192 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6323825Z       %195 = tt.reshape %194 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6324050Z       %196 = arith.sitofp %195 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6324306Z       %197 = ttg.local_alloc %196 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6324634Z       %198 = ttg.local_load %197 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6325107Z       %199 = tt.dot %183, %198, %181#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6325609Z       %200 = ttg.local_load %181#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6326062Z       %201 = arith.extf %200 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6326362Z       %202 = arith.shli %181#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6326523Z       %203 = arith.shrsi %202, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6326767Z       %204 = ttg.convert_layout %203 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6326835Z       %205 = arith.shrsi %181#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6326979Z       %206 = ttg.convert_layout %205 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6327157Z       %207 = tt.expand_dims %204 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6327309Z       %208 = tt.expand_dims %206 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6327408Z       %209 = tt.broadcast %207 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6327540Z       %210 = arith.select %17, %209, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6327638Z       %211 = tt.broadcast %208 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6327742Z       %212 = arith.select %19, %211, %210 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6327835Z       %213 = tt.reshape %212 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6327933Z       %214 = arith.sitofp %213 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6328052Z       %215 = ttg.local_alloc %214 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6328220Z       %216 = ttg.local_load %215 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6328486Z       %217 = tt.dot %201, %216, %199, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6328575Z       ttg.local_dealloc %164 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.6328685Z       %218 = arith.truncf %217 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:55:13.6328829Z       %219 = tt.expand_dims %148 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.6328893Z       %220 = arith.muli %219, %cst_13 : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.6329031Z       %221 = tt.expand_dims %151 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:55:13.6329119Z       %222 = tt.broadcast %220 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6329201Z       %223 = tt.broadcast %221 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6329261Z       %224 = arith.addi %222, %223 : tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6329364Z       %225 = tt.addptr %20, %224 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6329432Z       tt.store %225, %218 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.6329477Z       %226 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:55:13.6329526Z       %227 = arith.divsi %226, %c128_i32 : i32
2026-02-21T09:55:13.6329571Z       %228 = arith.muli %227, %c4_i32 : i32
2026-02-21T09:55:13.6329616Z       %229 = arith.subi %c128_i32, %228 : i32
2026-02-21T09:55:13.6329657Z       %230 = arith.minsi %229, %c4_i32 : i32
2026-02-21T09:55:13.6329705Z       %231 = arith.remsi %226, %c128_i32 : i32
2026-02-21T09:55:13.6329762Z       %232 = arith.remsi %231, %230 : i32
2026-02-21T09:55:13.6329802Z       %233 = arith.addi %228, %232 : i32
2026-02-21T09:55:13.6329845Z       %234 = arith.divsi %231, %230 : i32
2026-02-21T09:55:13.6329890Z       %235 = arith.muli %233, %c128_i32 : i32
2026-02-21T09:55:13.6329985Z       %236 = tt.splat %235 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.6330074Z       %237 = tt.splat %235 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.6330168Z       %238 = arith.addi %236, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.6330272Z       %239 = arith.addi %237, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.6330319Z       %240 = arith.muli %234, %c256_i32 : i32
2026-02-21T09:55:13.6330403Z       %241 = tt.splat %240 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.6330486Z       %242 = arith.addi %241, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.6330640Z       %243 = tt.expand_dims %238 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.6330721Z       %244 = arith.muli %243, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.6330816Z       %245 = tt.broadcast %244 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6330859Z       %246 = arith.extsi %240 : i32 to i64
2026-02-21T09:55:13.6330952Z       %247 = tt.splat %246 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:55:13.6331043Z       %248 = arith.addi %247, %12 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:55:13.6331191Z       %249 = tt.expand_dims %248 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6331289Z       %250 = tt.broadcast %249 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6331361Z       %251 = arith.cmpi sge, %249, %cst_4 : tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6331431Z       %252 = arith.cmpi slt, %249, %cst_3 : tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6331495Z       %253 = arith.andi %251, %252 : tensor<1x256xi1, #blocked>
2026-02-21T09:55:13.6331585Z       %254 = tt.broadcast %253 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6331688Z       %255 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.6331751Z       %256 = arith.addi %245, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6331855Z       %257 = tt.addptr %8, %256 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6331920Z       %258 = tt.load %257 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6331981Z       %259 = arith.addi %61, %250 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6332079Z       %260 = tt.addptr %9, %259 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6332138Z       %261 = arith.andi %67, %254 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6332261Z       %262 = tt.load %260, %261, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6332448Z       %263 = ttg.memdesc_index %255[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6332592Z       ttg.local_store %258, %263 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6332655Z       %264 = arith.addi %245, %73 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6332757Z       %265 = tt.addptr %8, %264 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6332821Z       %266 = tt.load %265 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6332896Z       %267 = arith.addi %80, %250 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6332996Z       %268 = tt.addptr %9, %267 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6333056Z       %269 = arith.andi %86, %254 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6333174Z       %270 = tt.load %268, %269, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6333361Z       %271 = ttg.memdesc_index %255[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6333504Z       ttg.local_store %266, %271 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6333978Z       %272:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %263, %arg8 = %271, %arg9 = %262, %arg10 = %270) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:55:13.6334026Z         %317 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:55:13.6334086Z         %318 = arith.muli %317, %c2_i32 : i32
2026-02-21T09:55:13.6334184Z         %319 = tt.splat %318 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6334276Z         %320 = arith.addi %319, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6334425Z         %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:55:13.6334520Z         %322 = tt.broadcast %321 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6334586Z         %323 = arith.addi %245, %322 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6334690Z         %324 = tt.addptr %8, %323 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6334756Z         %325 = tt.load %324 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6334964Z         %326 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6335168Z         %327 = arith.extf %326 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6335227Z         %328 = arith.extsi %317 : i32 to i64
2026-02-21T09:55:13.6335321Z         %329 = tt.splat %328 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6335412Z         %330 = arith.addi %329, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6335558Z         %331 = tt.expand_dims %330 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6335625Z         %332 = arith.muli %331, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6335718Z         %333 = tt.broadcast %332 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6335778Z         %334 = arith.addi %333, %250 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6335882Z         %335 = tt.addptr %9, %334 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6335951Z         %336 = arith.cmpi sge, %331, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6336018Z         %337 = arith.cmpi slt, %331, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6336080Z         %338 = arith.andi %336, %337 : tensor<2x1xi1, #blocked>
2026-02-21T09:55:13.6336169Z         %339 = tt.broadcast %338 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6336230Z         %340 = arith.andi %339, %254 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6336320Z         %341 = tt.load %335, %340, %cst_1 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6336384Z         %342 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6336445Z         %343 = arith.shrsi %342, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6336596Z         %344 = ttg.convert_layout %343 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6336661Z         %345 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6336808Z         %346 = ttg.convert_layout %345 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6336984Z         %347 = tt.expand_dims %344 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6337137Z         %348 = tt.expand_dims %346 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6337238Z         %349 = tt.broadcast %347 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6337367Z         %350 = arith.select %17, %349, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6337467Z         %351 = tt.broadcast %348 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6337574Z         %352 = arith.select %19, %351, %350 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6337670Z         %353 = tt.reshape %352 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6337772Z         %354 = arith.sitofp %353 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6337894Z         %355 = ttg.local_alloc %354 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6338069Z         %356 = ttg.local_load %355 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6338351Z         %357 = tt.dot %327, %356, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6338400Z         %358 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.6338454Z         %359 = arith.cmpi slt, %358, %c2_i32 : i32
2026-02-21T09:55:13.6338522Z         %360 = arith.select %359, %358, %c0_i32 : i32
2026-02-21T09:55:13.6338711Z         %361 = ttg.memdesc_index %255[%360] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6338857Z         ttg.local_store %325, %361 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6339172Z         scf.yield %357, %360, %arg8, %361, %arg10, %341 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6339258Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:13.6339464Z       %273 = ttg.local_load %272#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6339667Z       %274 = arith.extf %273 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6339736Z       %275 = arith.shli %272#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6339806Z       %276 = arith.shrsi %275, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6339954Z       %277 = ttg.convert_layout %276 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6340036Z       %278 = arith.shrsi %272#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6340185Z       %279 = ttg.convert_layout %278 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6340342Z       %280 = tt.expand_dims %277 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6340496Z       %281 = tt.expand_dims %279 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6340602Z       %282 = tt.broadcast %280 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6340734Z       %283 = arith.select %17, %282, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6340833Z       %284 = tt.broadcast %281 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6340944Z       %285 = arith.select %19, %284, %283 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6341055Z       %286 = tt.reshape %285 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6341190Z       %287 = arith.sitofp %286 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6341382Z       %288 = ttg.local_alloc %287 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6341679Z       %289 = ttg.local_load %288 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6342171Z       %290 = tt.dot %274, %289, %272#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6342545Z       %291 = ttg.local_load %272#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6342913Z       %292 = arith.extf %291 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6343024Z       %293 = arith.shli %272#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6343136Z       %294 = arith.shrsi %293, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6343422Z       %295 = ttg.convert_layout %294 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6343530Z       %296 = arith.shrsi %272#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6343792Z       %297 = ttg.convert_layout %296 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6344073Z       %298 = tt.expand_dims %295 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6344353Z       %299 = tt.expand_dims %297 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6344533Z       %300 = tt.broadcast %298 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6344728Z       %301 = arith.select %17, %300, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6344904Z       %302 = tt.broadcast %299 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6345099Z       %303 = arith.select %19, %302, %301 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6345265Z       %304 = tt.reshape %303 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6345434Z       %305 = arith.sitofp %304 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6345672Z       %306 = ttg.local_alloc %305 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6345990Z       %307 = ttg.local_load %306 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6346490Z       %308 = tt.dot %292, %307, %290, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6346651Z       ttg.local_dealloc %255 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.6346836Z       %309 = arith.truncf %308 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:55:13.6347094Z       %310 = tt.expand_dims %239 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.6347206Z       %311 = arith.muli %310, %cst_13 : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.6347462Z       %312 = tt.expand_dims %242 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:55:13.6347633Z       %313 = tt.broadcast %311 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6347781Z       %314 = tt.broadcast %312 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6347888Z       %315 = arith.addi %313, %314 : tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6348064Z       %316 = tt.addptr %20, %315 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6348173Z       tt.store %316, %309 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.6348249Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:55:13.6348337Z     scf.for %arg3 = %24 to %2 step %c1_i32  : i32 {
2026-02-21T09:55:13.6348418Z       %25 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:55:13.6348494Z       %26 = arith.muli %25, %c4_i32 : i32
2026-02-21T09:55:13.6348567Z       %27 = arith.subi %c128_i32, %26 : i32
2026-02-21T09:55:13.6348639Z       %28 = arith.minsi %27, %c4_i32 : i32
2026-02-21T09:55:13.6348721Z       %29 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:55:13.6348793Z       %30 = arith.remsi %29, %28 : i32
2026-02-21T09:55:13.6348862Z       %31 = arith.addi %26, %30 : i32
2026-02-21T09:55:13.6348930Z       %32 = arith.divsi %29, %28 : i32
2026-02-21T09:55:13.6349007Z       %33 = arith.muli %31, %c128_i32 : i32
2026-02-21T09:55:13.6349192Z       %34 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.6349340Z       %35 = tt.splat %33 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.6349508Z       %36 = arith.addi %34, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.6349653Z       %37 = arith.addi %35, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.6349725Z       %38 = arith.muli %32, %c256_i32 : i32
2026-02-21T09:55:13.6349875Z       %39 = tt.splat %38 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.6350022Z       %40 = arith.addi %39, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.6350296Z       %41 = tt.expand_dims %36 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.6350408Z       %42 = arith.muli %41, %cst_2 : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.6350577Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6350648Z       %44 = arith.extsi %38 : i32 to i64
2026-02-21T09:55:13.6350807Z       %45 = tt.splat %44 : i64 -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:55:13.6350972Z       %46 = arith.addi %45, %12 : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>>
2026-02-21T09:55:13.6351258Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6351420Z       %48 = tt.broadcast %47 : tensor<1x256xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6351544Z       %49 = arith.cmpi sge, %47, %cst_4 : tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6351660Z       %50 = arith.cmpi slt, %47, %cst_3 : tensor<1x256xi64, #blocked>
2026-02-21T09:55:13.6351760Z       %51 = arith.andi %49, %50 : tensor<1x256xi1, #blocked>
2026-02-21T09:55:13.6351925Z       %52 = tt.broadcast %51 : tensor<1x256xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6352093Z       %53 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.6352418Z       %54 = tt.expand_dims %7 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:55:13.6352621Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6352739Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6352944Z       %57 = tt.addptr %8, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6353087Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6353376Z       %59 = tt.expand_dims %11 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6353492Z       %60 = arith.muli %59, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6353657Z       %61 = tt.broadcast %60 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6353764Z       %62 = arith.addi %61, %48 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6353954Z       %63 = tt.addptr %9, %62 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6354078Z       %64 = arith.cmpi sge, %59, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6354200Z       %65 = arith.cmpi slt, %59, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6354312Z       %66 = arith.andi %64, %65 : tensor<2x1xi1, #blocked>
2026-02-21T09:55:13.6354477Z       %67 = tt.broadcast %66 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6354586Z       %68 = arith.andi %67, %52 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6354813Z       %69 = tt.load %63, %68, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6355218Z       %70 = ttg.memdesc_index %53[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6355503Z       ttg.local_store %58, %70 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6355691Z       %71 = arith.addi %7, %cst_9 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6355980Z       %72 = tt.expand_dims %71 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:55:13.6356155Z       %73 = tt.broadcast %72 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6356270Z       %74 = arith.addi %43, %73 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6356469Z       %75 = tt.addptr %8, %74 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6356586Z       %76 = tt.load %75 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6356771Z       %77 = arith.addi %11, %cst_8 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6357054Z       %78 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6357167Z       %79 = arith.muli %78, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6357341Z       %80 = tt.broadcast %79 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6357472Z       %81 = arith.addi %80, %48 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6357661Z       %82 = tt.addptr %9, %81 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6357785Z       %83 = arith.cmpi sge, %78, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6357913Z       %84 = arith.cmpi slt, %78, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6358017Z       %85 = arith.andi %83, %84 : tensor<2x1xi1, #blocked>
2026-02-21T09:55:13.6358184Z       %86 = tt.broadcast %85 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6358321Z       %87 = arith.andi %86, %52 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6358547Z       %88 = tt.load %82, %87, %cst_1 {amd.pipeliner_part = "prologue"} : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6358928Z       %89 = ttg.memdesc_index %53[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6359214Z       ttg.local_store %76, %89 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6360175Z       %90:6 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_10, %arg6 = %c1_i32, %arg7 = %70, %arg8 = %89, %arg9 = %69, %arg10 = %88) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>)  : i32 {
2026-02-21T09:55:13.6360261Z         %135 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:55:13.6360348Z         %136 = arith.muli %135, %c2_i32 : i32
2026-02-21T09:55:13.6360533Z         %137 = tt.splat %136 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6360715Z         %138 = arith.addi %137, %7 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.6361018Z         %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:55:13.6361208Z         %140 = tt.broadcast %139 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6361328Z         %141 = arith.addi %43, %140 : tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6361541Z         %142 = tt.addptr %8, %141 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:55:13.6361687Z         %143 = tt.load %142 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.6362121Z         %144 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6362615Z         %145 = arith.extf %144 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6362698Z         %146 = arith.extsi %135 : i32 to i64
2026-02-21T09:55:13.6362894Z         %147 = tt.splat %146 : i64 -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6363101Z         %148 = arith.addi %147, %11 : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.6363412Z         %149 = tt.expand_dims %148 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6363540Z         %150 = arith.muli %149, %cst_7 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6363734Z         %151 = tt.broadcast %150 : tensor<2x1xi64, #blocked> -> tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6363860Z         %152 = arith.addi %151, %48 : tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6364071Z         %153 = tt.addptr %9, %152 : tensor<2x256x!tt.ptr<i8>, #blocked>, tensor<2x256xi64, #blocked>
2026-02-21T09:55:13.6364218Z         %154 = arith.cmpi sge, %149, %cst_6 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6364396Z         %155 = arith.cmpi slt, %149, %cst_5 : tensor<2x1xi64, #blocked>
2026-02-21T09:55:13.6364515Z         %156 = arith.andi %154, %155 : tensor<2x1xi1, #blocked>
2026-02-21T09:55:13.6364703Z         %157 = tt.broadcast %156 : tensor<2x1xi1, #blocked> -> tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6364825Z         %158 = arith.andi %157, %52 : tensor<2x256xi1, #blocked>
2026-02-21T09:55:13.6364975Z         %159 = tt.load %153, %158, %cst_1 : tensor<2x256x!tt.ptr<i8>, #blocked>
2026-02-21T09:55:13.6365104Z         %160 = arith.shli %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6365234Z         %161 = arith.shrsi %160, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6365582Z         %162 = ttg.convert_layout %161 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6365714Z         %163 = arith.shrsi %arg9, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6366036Z         %164 = ttg.convert_layout %163 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6366398Z         %165 = tt.expand_dims %162 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6366732Z         %166 = tt.expand_dims %164 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6366950Z         %167 = tt.broadcast %165 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6367185Z         %168 = arith.select %17, %167, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6367396Z         %169 = tt.broadcast %166 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6367623Z         %170 = arith.select %19, %169, %168 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6367821Z         %171 = tt.reshape %170 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6368023Z         %172 = arith.sitofp %171 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6368286Z         %173 = ttg.local_alloc %172 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6368663Z         %174 = ttg.local_load %173 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6369288Z         %175 = tt.dot %145, %174, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6369385Z         %176 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.6369481Z         %177 = arith.cmpi slt, %176, %c2_i32 : i32
2026-02-21T09:55:13.6369580Z         %178 = arith.select %177, %176, %c0_i32 : i32
2026-02-21T09:55:13.6369983Z         %179 = ttg.memdesc_index %53[%178] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6370295Z         ttg.local_store %143, %179 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:55:13.6370988Z         scf.yield %175, %178, %arg8, %179, %arg10, %159 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, tensor<2x256xi8, #blocked>, tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6371160Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:13.6371587Z       %91 = ttg.local_load %90#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6372016Z       %92 = arith.extf %91 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6372168Z       %93 = arith.shli %90#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6372289Z       %94 = arith.shrsi %93, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6372599Z       %95 = ttg.convert_layout %94 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6372730Z       %96 = arith.shrsi %90#4, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6373035Z       %97 = ttg.convert_layout %96 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6373387Z       %98 = tt.expand_dims %95 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6373717Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6373926Z       %100 = tt.broadcast %98 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6374179Z       %101 = arith.select %17, %100, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6374390Z       %102 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6374613Z       %103 = arith.select %19, %102, %101 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6374811Z       %104 = tt.reshape %103 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6375018Z       %105 = arith.sitofp %104 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6375274Z       %106 = ttg.local_alloc %105 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6375650Z       %107 = ttg.local_load %106 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6376251Z       %108 = tt.dot %92, %107, %90#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6376682Z       %109 = ttg.local_load %90#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6377139Z       %110 = arith.extf %109 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6377272Z       %111 = arith.shli %90#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6377395Z       %112 = arith.shrsi %111, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6377713Z       %113 = ttg.convert_layout %112 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6377849Z       %114 = arith.shrsi %90#5, %cst : tensor<2x256xi8, #blocked>
2026-02-21T09:55:13.6378163Z       %115 = ttg.convert_layout %114 : tensor<2x256xi8, #blocked> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.6378501Z       %116 = tt.expand_dims %113 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6378837Z       %117 = tt.expand_dims %115 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1x256xi8, #blocked1>
2026-02-21T09:55:13.6379053Z       %118 = tt.broadcast %116 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6379285Z       %119 = arith.select %17, %118, %cst_0 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6379503Z       %120 = tt.broadcast %117 : tensor<2x1x256xi8, #blocked1> -> tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6379750Z       %121 = arith.select %19, %120, %119 : tensor<2x2x256xi1, #blocked1>, tensor<2x2x256xi8, #blocked1>
2026-02-21T09:55:13.6379946Z       %122 = tt.reshape %121 : tensor<2x2x256xi8, #blocked1> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:55:13.6380152Z       %123 = arith.sitofp %122 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:55:13.6380408Z       %124 = ttg.local_alloc %123 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:55:13.6380777Z       %125 = ttg.local_load %124 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.6381398Z       %126 = tt.dot %110, %125, %108, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:55:13.6381580Z       ttg.local_dealloc %53 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.6381768Z       %127 = arith.truncf %126 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:55:13.6382097Z       %128 = tt.expand_dims %37 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.6382221Z       %129 = arith.muli %128, %cst_13 : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.6382523Z       %130 = tt.expand_dims %40 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:55:13.6382706Z       %131 = tt.broadcast %129 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6382882Z       %132 = tt.broadcast %130 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6383001Z       %133 = arith.addi %131, %132 : tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6383214Z       %134 = tt.addptr %20, %133 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi32, #mma>
2026-02-21T09:55:13.6383341Z       tt.store %134, %127 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.6383422Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:55:13.6383496Z     tt.return
2026-02-21T09:55:13.6383560Z   }
2026-02-21T09:55:13.6383627Z }
2026-02-21T09:55:13.6383634Z 
2026-02-21T09:55:13.6383695Z {-#
2026-02-21T09:55:13.6383781Z   external_resources: {
2026-02-21T09:55:13.6383860Z     mlir_reproducer: {
2026-02-21T09:55:13.6386007Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:55:13.6386099Z       disable_threading: false,
2026-02-21T09:55:13.6386173Z       verify_each: true
2026-02-21T09:55:13.6386238Z     }
2026-02-21T09:55:13.6386305Z   }
2026-02-21T09:55:13.6386367Z #-}
2026-02-21T09:55:13.6386900Z /tmp/torchinductor_root/os/coswwui5nvwsubzopnbmqmwkhbwsnfpi7fvgc4rugf6f6skohdge.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:55:13.6387846Z /tmp/torchinductor_root/os/coswwui5nvwsubzopnbmqmwkhbwsnfpi7fvgc4rugf6f6skohdge.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:55:13.6388084Z [644s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:55:13.6389490Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:55:13.6389621Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:55:13.6389795Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:55:13.7717324Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:55:13.7729725Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:55:13.7729900Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}>
2026-02-21T09:55:13.7730107Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:55:13.7730249Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 4], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:55:13.7730376Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:55:13.7730506Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:55:13.7730562Z #smem = #ttg.shared_memory
2026-02-21T09:55:13.7730773Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:55:13.7731124Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:55:13.7731217Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.7731313Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.7731402Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.7731493Z     %cst_2 = arith.constant dense<8192> : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7731619Z     %cst_3 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.7731682Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:55:13.7731781Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7731838Z     %c500_i32 = arith.constant 500 : i32
2026-02-21T09:55:13.7731895Z     %c12_i32 = arith.constant 12 : i32
2026-02-21T09:55:13.7732032Z     %cst_5 = arith.constant dense<500> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7732168Z     %cst_6 = arith.constant dense<504> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7732304Z     %cst_7 = arith.constant dense<508> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7732436Z     %cst_8 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7732573Z     %cst_9 = arith.constant dense<16> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7732630Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:55:13.7732682Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:55:13.7732735Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:55:13.7732831Z     %cst_10 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7732888Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:55:13.7732939Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:55:13.7732997Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:55:13.7733073Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:55:13.7733208Z     %cst_11 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7733263Z     %0 = tt.get_program_id x : i32
2026-02-21T09:55:13.7733322Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:55:13.7733374Z     %2 = arith.minsi %1, %c8192_i32 : i32
2026-02-21T09:55:13.7733529Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.7733673Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.7733833Z     %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.7733977Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.7734119Z     %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7734254Z     %8 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7734376Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7734475Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7734672Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:55:13.7734943Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:55:13.7735127Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.7735214Z     %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.7735326Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:55:13.7735414Z     %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.7735521Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:55:13.7735619Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.7735691Z     %19 = arith.subi %2, %0 : i32
2026-02-21T09:55:13.7735743Z     %20 = arith.remsi %19, %c3_i32 : i32
2026-02-21T09:55:13.7735794Z     %21 = arith.subi %19, %20 : i32
2026-02-21T09:55:13.7735845Z     %22 = arith.addi %0, %21 : i32
2026-02-21T09:55:13.7735919Z     scf.for %arg3 = %0 to %22 step %c3_i32  : i32 {
2026-02-21T09:55:13.7735977Z       %23 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:55:13.7736027Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:55:13.7736082Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:55:13.7736132Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:55:13.7736187Z       %27 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:55:13.7736237Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:55:13.7736286Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:55:13.7736334Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:55:13.7736383Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:55:13.7736500Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.7736599Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.7736716Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.7736816Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.7736865Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:55:13.7736980Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.7737090Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.7737188Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.7737294Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.7737477Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.7737555Z       %42 = arith.muli %41, %cst_3 : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.7737685Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7737866Z       %44 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:55:13.7737979Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7738110Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.7738287Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:55:13.7738394Z       %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7738468Z       %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7738595Z       %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7738671Z       %51 = tt.load %50 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7738899Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7739073Z       ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7739187Z       %53 = arith.addi %8, %cst_8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7739357Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:55:13.7739485Z       %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7739557Z       %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7739681Z       %57 = tt.addptr %9, %56 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7739755Z       %58 = tt.load %57 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7739986Z       %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7740157Z       ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7740271Z       %60 = arith.addi %8, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7740441Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:55:13.7740549Z       %62 = tt.broadcast %61 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7740622Z       %63 = arith.addi %43, %62 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7740742Z       %64 = tt.addptr %9, %63 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7740814Z       %65 = tt.load %64 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7741033Z       %66 = ttg.memdesc_index %46[%c2_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7741237Z       ttg.local_store %65, %66 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7741767Z       %67:5 = scf.for %arg4 = %c0_i32 to %c500_i32 step %c4_i32 iter_args(%arg5 = %cst_4, %arg6 = %c2_i32, %arg7 = %52, %arg8 = %59, %arg9 = %66) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>)  : i32 {
2026-02-21T09:55:13.7741891Z         %360 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7742023Z         %361 = arith.addi %360, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7742080Z         %362 = arith.addi %arg4, %c12_i32 : i32
2026-02-21T09:55:13.7742138Z         %363 = arith.muli %362, %c2_i32 : i32
2026-02-21T09:55:13.7742250Z         %364 = tt.splat %363 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7742374Z         %365 = arith.addi %364, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7742553Z         %366 = tt.expand_dims %365 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:55:13.7742666Z         %367 = tt.broadcast %366 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7742743Z         %368 = arith.addi %43, %367 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7742874Z         %369 = tt.addptr %9, %368 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7742953Z         %370 = tt.load %369 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7743202Z         %371 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7743452Z         %372 = arith.extf %371 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7743623Z         %373 = tt.expand_dims %361 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7743700Z         %374 = arith.muli %373, %cst_2 : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7743839Z         %375 = tt.broadcast %374 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7743916Z         %376 = arith.addi %375, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7744037Z         %377 = tt.addptr %10, %376 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7744112Z         %378 = tt.load %377 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7744291Z         %379 = ttg.convert_layout %378 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7744410Z         %380 = arith.shli %379, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7744582Z         %381 = arith.shrsi %380, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7744772Z         %382 = arith.shrsi %379, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7745017Z         %383 = tt.expand_dims %381 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7745198Z         %384 = tt.expand_dims %382 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7745314Z         %385 = tt.broadcast %383 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7745464Z         %386 = arith.select %15, %385, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7745578Z         %387 = tt.broadcast %384 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7745698Z         %388 = arith.select %17, %387, %386 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7745805Z         %389 = tt.reshape %388 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7745921Z         %390 = arith.sitofp %389 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7746065Z         %391 = ttg.local_alloc %390 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7746288Z         %392 = ttg.local_load %391 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7746616Z         %393 = tt.dot %372, %392, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7746689Z         %394 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.7746747Z         %395 = arith.cmpi slt, %394, %c3_i32 : i32
2026-02-21T09:55:13.7746809Z         %396 = arith.select %395, %394, %c0_i32 : i32
2026-02-21T09:55:13.7747025Z         %397 = ttg.memdesc_index %46[%396] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7747196Z         ttg.local_store %370, %397 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7747565Z         scf.yield %393, %396, %arg8, %arg9, %397 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7747665Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32}
2026-02-21T09:55:13.7747775Z       %68 = arith.addi %7, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7748010Z       %69 = ttg.local_load %67#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7748259Z       %70 = arith.extf %69 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7748429Z       %71 = tt.expand_dims %68 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7748507Z       %72 = arith.muli %71, %cst_2 : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7748612Z       %73 = tt.broadcast %72 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7748682Z       %74 = arith.addi %73, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7748801Z       %75 = tt.addptr %10, %74 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7748871Z       %76 = tt.load %75 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7749140Z       %77 = ttg.convert_layout %76 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7749277Z       %78 = arith.shli %77, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7749394Z       %79 = arith.shrsi %78, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7749508Z       %80 = arith.shrsi %77, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7749690Z       %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7749940Z       %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7750114Z       %83 = tt.broadcast %81 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7750270Z       %84 = arith.select %15, %83, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7750378Z       %85 = tt.broadcast %82 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7750491Z       %86 = arith.select %17, %85, %84 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7750613Z       %87 = tt.reshape %86 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7750720Z       %88 = arith.sitofp %87 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7750858Z       %89 = ttg.local_alloc %88 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7751061Z       %90 = ttg.local_load %89 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7751390Z       %91 = tt.dot %70, %90, %67#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7751500Z       %92 = arith.addi %7, %cst_6 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7751732Z       %93 = ttg.local_load %67#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7751963Z       %94 = arith.extf %93 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7752130Z       %95 = tt.expand_dims %92 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7752206Z       %96 = arith.muli %95, %cst_2 : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7752310Z       %97 = tt.broadcast %96 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7752377Z       %98 = arith.addi %97, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7752513Z       %99 = tt.addptr %10, %98 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7752585Z       %100 = tt.load %99 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7752761Z       %101 = ttg.convert_layout %100 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7752883Z       %102 = arith.shli %101, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7753001Z       %103 = arith.shrsi %102, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7753115Z       %104 = arith.shrsi %101, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7753303Z       %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7753476Z       %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7753591Z       %107 = tt.broadcast %105 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7753722Z       %108 = arith.select %15, %107, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7753832Z       %109 = tt.broadcast %106 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7753948Z       %110 = arith.select %17, %109, %108 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7754075Z       %111 = tt.reshape %110 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7754186Z       %112 = arith.sitofp %111 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7754326Z       %113 = ttg.local_alloc %112 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7754531Z       %114 = ttg.local_load %113 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7754839Z       %115 = tt.dot %94, %114, %91, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7754965Z       %116 = arith.addi %7, %cst_7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7755200Z       %117 = ttg.local_load %67#4 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7755451Z       %118 = arith.extf %117 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7755621Z       %119 = tt.expand_dims %116 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7755701Z       %120 = arith.muli %119, %cst_2 : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7755808Z       %121 = tt.broadcast %120 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7755881Z       %122 = arith.addi %121, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7756004Z       %123 = tt.addptr %10, %122 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7756075Z       %124 = tt.load %123 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7756251Z       %125 = ttg.convert_layout %124 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7756373Z       %126 = arith.shli %125, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7756490Z       %127 = arith.shrsi %126, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7756623Z       %128 = arith.shrsi %125, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7756803Z       %129 = tt.expand_dims %127 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7756977Z       %130 = tt.expand_dims %128 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7757089Z       %131 = tt.broadcast %129 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7757217Z       %132 = arith.select %15, %131, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7757327Z       %133 = tt.broadcast %130 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7757444Z       %134 = arith.select %17, %133, %132 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7757553Z       %135 = tt.reshape %134 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7757669Z       %136 = arith.sitofp %135 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7757809Z       %137 = ttg.local_alloc %136 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7758015Z       %138 = ttg.local_load %137 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7758340Z       %139 = tt.dot %118, %138, %115, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7758446Z       ttg.local_dealloc %46 : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.7758553Z       %140 = arith.truncf %139 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.7758718Z       %141 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.7758790Z       %142 = arith.muli %141, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.7758971Z       %143 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.7759069Z       %144 = tt.broadcast %142 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7759165Z       %145 = tt.broadcast %143 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7759239Z       %146 = arith.addi %144, %145 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7759377Z       %147 = tt.addptr %18, %146 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7759451Z       tt.store %147, %140 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.7759507Z       %148 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:55:13.7759561Z       %149 = arith.divsi %148, %c256_i32 : i32
2026-02-21T09:55:13.7759611Z       %150 = arith.muli %149, %c4_i32 : i32
2026-02-21T09:55:13.7759667Z       %151 = arith.subi %c128_i32, %150 : i32
2026-02-21T09:55:13.7759719Z       %152 = arith.minsi %151, %c4_i32 : i32
2026-02-21T09:55:13.7759769Z       %153 = arith.remsi %148, %c256_i32 : i32
2026-02-21T09:55:13.7759820Z       %154 = arith.remsi %153, %152 : i32
2026-02-21T09:55:13.7759867Z       %155 = arith.addi %150, %154 : i32
2026-02-21T09:55:13.7759913Z       %156 = arith.divsi %153, %152 : i32
2026-02-21T09:55:13.7759965Z       %157 = arith.muli %155, %c128_i32 : i32
2026-02-21T09:55:13.7760080Z       %158 = tt.splat %157 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.7760181Z       %159 = tt.splat %157 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.7760290Z       %160 = arith.addi %158, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.7760393Z       %161 = arith.addi %159, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.7760459Z       %162 = arith.muli %156, %c128_i32 : i32
2026-02-21T09:55:13.7760559Z       %163 = tt.splat %162 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.7760672Z       %164 = tt.splat %162 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.7760771Z       %165 = arith.addi %163, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.7760878Z       %166 = arith.addi %164, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.7761059Z       %167 = tt.expand_dims %160 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.7761137Z       %168 = arith.muli %167, %cst_3 : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.7761250Z       %169 = tt.broadcast %168 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7761427Z       %170 = tt.expand_dims %166 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:55:13.7761539Z       %171 = tt.broadcast %170 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7761642Z       %172 = ttg.local_alloc : () -> !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.7761717Z       %173 = arith.addi %169, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7761855Z       %174 = tt.addptr %9, %173 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7761930Z       %175 = tt.load %174 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7762156Z       %176 = ttg.memdesc_index %172[%c0_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7762327Z       ttg.local_store %175, %176 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7762398Z       %177 = arith.addi %169, %55 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7762519Z       %178 = tt.addptr %9, %177 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7762670Z       %179 = tt.load %178 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7762885Z       %180 = ttg.memdesc_index %172[%c1_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7763052Z       ttg.local_store %179, %180 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7763145Z       %181 = arith.addi %169, %62 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7763265Z       %182 = tt.addptr %9, %181 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7763338Z       %183 = tt.load %182 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7763563Z       %184 = ttg.memdesc_index %172[%c2_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7763731Z       ttg.local_store %183, %184 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7764256Z       %185:5 = scf.for %arg4 = %c0_i32 to %c500_i32 step %c4_i32 iter_args(%arg5 = %cst_4, %arg6 = %c2_i32, %arg7 = %176, %arg8 = %180, %arg9 = %184) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>)  : i32 {
2026-02-21T09:55:13.7764377Z         %360 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7764486Z         %361 = arith.addi %360, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7764546Z         %362 = arith.addi %arg4, %c12_i32 : i32
2026-02-21T09:55:13.7764619Z         %363 = arith.muli %362, %c2_i32 : i32
2026-02-21T09:55:13.7764729Z         %364 = tt.splat %363 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7764839Z         %365 = arith.addi %364, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7765011Z         %366 = tt.expand_dims %365 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:55:13.7765121Z         %367 = tt.broadcast %366 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7765200Z         %368 = arith.addi %169, %367 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7765322Z         %369 = tt.addptr %9, %368 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7765399Z         %370 = tt.load %369 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7765640Z         %371 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7765879Z         %372 = arith.extf %371 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7766048Z         %373 = tt.expand_dims %361 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7766145Z         %374 = arith.muli %373, %cst_2 : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7766255Z         %375 = tt.broadcast %374 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7766328Z         %376 = arith.addi %375, %171 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7766452Z         %377 = tt.addptr %10, %376 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7766527Z         %378 = tt.load %377 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7766700Z         %379 = ttg.convert_layout %378 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7766839Z         %380 = arith.shli %379, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7766957Z         %381 = arith.shrsi %380, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7767076Z         %382 = arith.shrsi %379, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7767272Z         %383 = tt.expand_dims %381 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7767447Z         %384 = tt.expand_dims %382 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7767562Z         %385 = tt.broadcast %383 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7767692Z         %386 = arith.select %15, %385, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7767805Z         %387 = tt.broadcast %384 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7767924Z         %388 = arith.select %17, %387, %386 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7768036Z         %389 = tt.reshape %388 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7768147Z         %390 = arith.sitofp %389 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7768290Z         %391 = ttg.local_alloc %390 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7768492Z         %392 = ttg.local_load %391 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7768829Z         %393 = tt.dot %372, %392, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7768886Z         %394 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.7768945Z         %395 = arith.cmpi slt, %394, %c3_i32 : i32
2026-02-21T09:55:13.7769004Z         %396 = arith.select %395, %394, %c0_i32 : i32
2026-02-21T09:55:13.7769221Z         %397 = ttg.memdesc_index %172[%396] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7769392Z         ttg.local_store %370, %397 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7769759Z         scf.yield %393, %396, %arg8, %arg9, %397 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7769855Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32}
2026-02-21T09:55:13.7770093Z       %186 = ttg.local_load %185#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7770328Z       %187 = arith.extf %186 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7770415Z       %188 = arith.addi %73, %171 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7770540Z       %189 = tt.addptr %10, %188 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7770611Z       %190 = tt.load %189 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7770780Z       %191 = ttg.convert_layout %190 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7770910Z       %192 = arith.shli %191, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7771043Z       %193 = arith.shrsi %192, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7771157Z       %194 = arith.shrsi %191, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7771340Z       %195 = tt.expand_dims %193 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7771529Z       %196 = tt.expand_dims %194 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7771643Z       %197 = tt.broadcast %195 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7771769Z       %198 = arith.select %15, %197, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7771878Z       %199 = tt.broadcast %196 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7771996Z       %200 = arith.select %17, %199, %198 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7772106Z       %201 = tt.reshape %200 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7772215Z       %202 = arith.sitofp %201 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7772356Z       %203 = ttg.local_alloc %202 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7772561Z       %204 = ttg.local_load %203 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7772891Z       %205 = tt.dot %187, %204, %185#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7773125Z       %206 = ttg.local_load %185#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7773356Z       %207 = arith.extf %206 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7773427Z       %208 = arith.addi %97, %171 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7773550Z       %209 = tt.addptr %10, %208 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7773621Z       %210 = tt.load %209 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7773790Z       %211 = ttg.convert_layout %210 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7773908Z       %212 = arith.shli %211, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7774026Z       %213 = arith.shrsi %212, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7774140Z       %214 = arith.shrsi %211, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7774317Z       %215 = tt.expand_dims %213 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7774507Z       %216 = tt.expand_dims %214 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7774620Z       %217 = tt.broadcast %215 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7774741Z       %218 = arith.select %15, %217, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7774854Z       %219 = tt.broadcast %216 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7774969Z       %220 = arith.select %17, %219, %218 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7775098Z       %221 = tt.reshape %220 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7775208Z       %222 = arith.sitofp %221 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7775346Z       %223 = ttg.local_alloc %222 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7775562Z       %224 = ttg.local_load %223 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7775870Z       %225 = tt.dot %207, %224, %205, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7776099Z       %226 = ttg.local_load %185#4 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7776333Z       %227 = arith.extf %226 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7776407Z       %228 = arith.addi %121, %171 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7776528Z       %229 = tt.addptr %10, %228 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7776599Z       %230 = tt.load %229 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7776771Z       %231 = ttg.convert_layout %230 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7776886Z       %232 = arith.shli %231, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7777018Z       %233 = arith.shrsi %232, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7777134Z       %234 = arith.shrsi %231, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7777310Z       %235 = tt.expand_dims %233 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7777481Z       %236 = tt.expand_dims %234 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7777598Z       %237 = tt.broadcast %235 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7777720Z       %238 = arith.select %15, %237, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7777829Z       %239 = tt.broadcast %236 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7777950Z       %240 = arith.select %17, %239, %238 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7778055Z       %241 = tt.reshape %240 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7778165Z       %242 = arith.sitofp %241 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7778305Z       %243 = ttg.local_alloc %242 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7778517Z       %244 = ttg.local_load %243 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7778822Z       %245 = tt.dot %227, %244, %225, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7778928Z       ttg.local_dealloc %172 : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.7779033Z       %246 = arith.truncf %245 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.7779196Z       %247 = tt.expand_dims %161 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.7779285Z       %248 = arith.muli %247, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.7779445Z       %249 = tt.expand_dims %165 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.7779543Z       %250 = tt.broadcast %248 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7779641Z       %251 = tt.broadcast %249 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7779725Z       %252 = arith.addi %250, %251 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7779840Z       %253 = tt.addptr %18, %252 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7779917Z       tt.store %253, %246 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.7779970Z       %254 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:55:13.7780022Z       %255 = arith.divsi %254, %c256_i32 : i32
2026-02-21T09:55:13.7780075Z       %256 = arith.muli %255, %c4_i32 : i32
2026-02-21T09:55:13.7780126Z       %257 = arith.subi %c128_i32, %256 : i32
2026-02-21T09:55:13.7780175Z       %258 = arith.minsi %257, %c4_i32 : i32
2026-02-21T09:55:13.7780238Z       %259 = arith.remsi %254, %c256_i32 : i32
2026-02-21T09:55:13.7780287Z       %260 = arith.remsi %259, %258 : i32
2026-02-21T09:55:13.7780334Z       %261 = arith.addi %256, %260 : i32
2026-02-21T09:55:13.7780384Z       %262 = arith.divsi %259, %258 : i32
2026-02-21T09:55:13.7780435Z       %263 = arith.muli %261, %c128_i32 : i32
2026-02-21T09:55:13.7780546Z       %264 = tt.splat %263 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.7780649Z       %265 = tt.splat %263 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.7780772Z       %266 = arith.addi %264, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.7780868Z       %267 = arith.addi %265, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.7780923Z       %268 = arith.muli %262, %c128_i32 : i32
2026-02-21T09:55:13.7781022Z       %269 = tt.splat %268 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.7781128Z       %270 = tt.splat %268 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.7781229Z       %271 = arith.addi %269, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.7781334Z       %272 = arith.addi %270, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.7781509Z       %273 = tt.expand_dims %266 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.7781588Z       %274 = arith.muli %273, %cst_3 : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.7781701Z       %275 = tt.broadcast %274 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7781873Z       %276 = tt.expand_dims %272 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:55:13.7781982Z       %277 = tt.broadcast %276 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7782101Z       %278 = ttg.local_alloc : () -> !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.7782170Z       %279 = arith.addi %275, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7782290Z       %280 = tt.addptr %9, %279 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7782368Z       %281 = tt.load %280 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7782588Z       %282 = ttg.memdesc_index %278[%c0_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7782754Z       ttg.local_store %281, %282 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7782842Z       %283 = arith.addi %275, %55 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7782960Z       %284 = tt.addptr %9, %283 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7783034Z       %285 = tt.load %284 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7783251Z       %286 = ttg.memdesc_index %278[%c1_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7783432Z       ttg.local_store %285, %286 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7783502Z       %287 = arith.addi %275, %62 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7783619Z       %288 = tt.addptr %9, %287 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7783688Z       %289 = tt.load %288 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7783888Z       %290 = ttg.memdesc_index %278[%c2_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7784047Z       ttg.local_store %289, %290 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7784535Z       %291:5 = scf.for %arg4 = %c0_i32 to %c500_i32 step %c4_i32 iter_args(%arg5 = %cst_4, %arg6 = %c2_i32, %arg7 = %282, %arg8 = %286, %arg9 = %290) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>)  : i32 {
2026-02-21T09:55:13.7784646Z         %360 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7784767Z         %361 = arith.addi %360, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7784822Z         %362 = arith.addi %arg4, %c12_i32 : i32
2026-02-21T09:55:13.7784870Z         %363 = arith.muli %362, %c2_i32 : i32
2026-02-21T09:55:13.7784976Z         %364 = tt.splat %363 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7785076Z         %365 = arith.addi %364, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7785238Z         %366 = tt.expand_dims %365 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:55:13.7785348Z         %367 = tt.broadcast %366 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7785418Z         %368 = arith.addi %275, %367 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7785534Z         %369 = tt.addptr %9, %368 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7785611Z         %370 = tt.load %369 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7785835Z         %371 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7786058Z         %372 = arith.extf %371 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7786237Z         %373 = tt.expand_dims %361 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7786308Z         %374 = arith.muli %373, %cst_2 : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7786410Z         %375 = tt.broadcast %374 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7786482Z         %376 = arith.addi %375, %277 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7786596Z         %377 = tt.addptr %10, %376 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7786680Z         %378 = tt.load %377 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7786848Z         %379 = ttg.convert_layout %378 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7786958Z         %380 = arith.shli %379, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7787069Z         %381 = arith.shrsi %380, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7787196Z         %382 = arith.shrsi %379, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7787363Z         %383 = tt.expand_dims %381 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7787528Z         %384 = tt.expand_dims %382 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7787639Z         %385 = tt.broadcast %383 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7787758Z         %386 = arith.select %15, %385, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7787863Z         %387 = tt.broadcast %384 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7787979Z         %388 = arith.select %17, %387, %386 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7788080Z         %389 = tt.reshape %388 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7788184Z         %390 = arith.sitofp %389 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7788319Z         %391 = ttg.local_alloc %390 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7788525Z         %392 = ttg.local_load %391 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7788822Z         %393 = tt.dot %372, %392, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7788879Z         %394 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.7788933Z         %395 = arith.cmpi slt, %394, %c3_i32 : i32
2026-02-21T09:55:13.7788988Z         %396 = arith.select %395, %394, %c0_i32 : i32
2026-02-21T09:55:13.7789197Z         %397 = ttg.memdesc_index %278[%396] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7789356Z         ttg.local_store %370, %397 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7789695Z         scf.yield %393, %396, %arg8, %arg9, %397 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7798446Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32}
2026-02-21T09:55:13.7798718Z       %292 = ttg.local_load %291#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7798972Z       %293 = arith.extf %292 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7799036Z       %294 = arith.addi %73, %277 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7799141Z       %295 = tt.addptr %10, %294 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7799204Z       %296 = tt.load %295 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7799353Z       %297 = ttg.convert_layout %296 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7799473Z       %298 = arith.shli %297, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7799573Z       %299 = arith.shrsi %298, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7799673Z       %300 = arith.shrsi %297, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7799844Z       %301 = tt.expand_dims %299 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7799993Z       %302 = tt.expand_dims %300 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7800093Z       %303 = tt.broadcast %301 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7800199Z       %304 = arith.select %15, %303, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7800294Z       %305 = tt.broadcast %302 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7800396Z       %306 = arith.select %17, %305, %304 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7800506Z       %307 = tt.reshape %306 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7800598Z       %308 = arith.sitofp %307 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7800719Z       %309 = ttg.local_alloc %308 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7800894Z       %310 = ttg.local_load %309 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7801177Z       %311 = tt.dot %293, %310, %291#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7801377Z       %312 = ttg.local_load %291#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7801579Z       %313 = arith.extf %312 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7801642Z       %314 = arith.addi %97, %277 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7801746Z       %315 = tt.addptr %10, %314 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7801808Z       %316 = tt.load %315 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7801953Z       %317 = ttg.convert_layout %316 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7802053Z       %318 = arith.shli %317, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7802154Z       %319 = arith.shrsi %318, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7802252Z       %320 = arith.shrsi %317, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7802429Z       %321 = tt.expand_dims %319 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7802658Z       %322 = tt.expand_dims %320 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7802757Z       %323 = tt.broadcast %321 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7802862Z       %324 = arith.select %15, %323, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7802960Z       %325 = tt.broadcast %322 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7803083Z       %326 = arith.select %17, %325, %324 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7803173Z       %327 = tt.reshape %326 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7803270Z       %328 = arith.sitofp %327 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7803406Z       %329 = ttg.local_alloc %328 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7803573Z       %330 = ttg.local_load %329 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7803839Z       %331 = tt.dot %313, %330, %311, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7804035Z       %332 = ttg.local_load %291#4 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7804229Z       %333 = arith.extf %332 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7804295Z       %334 = arith.addi %121, %277 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7804395Z       %335 = tt.addptr %10, %334 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7804454Z       %336 = tt.load %335 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7804601Z       %337 = ttg.convert_layout %336 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7804716Z       %338 = arith.shli %337, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7804816Z       %339 = arith.shrsi %338, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7804915Z       %340 = arith.shrsi %337, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7805063Z       %341 = tt.expand_dims %339 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7805212Z       %342 = tt.expand_dims %340 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7805310Z       %343 = tt.broadcast %341 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7805412Z       %344 = arith.select %15, %343, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7805506Z       %345 = tt.broadcast %342 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7805606Z       %346 = arith.select %17, %345, %344 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7805696Z       %347 = tt.reshape %346 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7805789Z       %348 = arith.sitofp %347 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7805924Z       %349 = ttg.local_alloc %348 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7806091Z       %350 = ttg.local_load %349 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7807549Z       %351 = tt.dot %333, %350, %331, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7807650Z       ttg.local_dealloc %278 : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.7807757Z       %352 = arith.truncf %351 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.7807898Z       %353 = tt.expand_dims %267 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.7807958Z       %354 = arith.muli %353, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.7808094Z       %355 = tt.expand_dims %271 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.7808192Z       %356 = tt.broadcast %354 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7808276Z       %357 = tt.broadcast %355 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7808337Z       %358 = arith.addi %356, %357 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7808434Z       %359 = tt.addptr %18, %358 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7808500Z       tt.store %359, %352 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.7808542Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:55:13.7808594Z     scf.for %arg3 = %22 to %2 step %c1_i32  : i32 {
2026-02-21T09:55:13.7808641Z       %23 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:55:13.7808684Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:55:13.7808727Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:55:13.7808769Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:55:13.7808816Z       %27 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:55:13.7808857Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:55:13.7808897Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:55:13.7808937Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:55:13.7808979Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:55:13.7809086Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.7809169Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.7809263Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.7809342Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.7809383Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:55:13.7809467Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.7809554Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.7809634Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.7809722Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.7809870Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.7809934Z       %42 = arith.muli %41, %cst_3 : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:13.7810027Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7810170Z       %44 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:55:13.7810272Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7810360Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.7810499Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:55:13.7810586Z       %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7810648Z       %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7810747Z       %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7810822Z       %51 = tt.load %50 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7811006Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7811146Z       ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7811257Z       %53 = arith.addi %8, %cst_8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7811400Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:55:13.7811488Z       %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7811545Z       %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7811647Z       %57 = tt.addptr %9, %56 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7811705Z       %58 = tt.load %57 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7811884Z       %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7812025Z       ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7812115Z       %60 = arith.addi %8, %cst_9 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7812253Z       %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:55:13.7812356Z       %62 = tt.broadcast %61 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7812413Z       %63 = arith.addi %43, %62 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7812511Z       %64 = tt.addptr %9, %63 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7812568Z       %65 = tt.load %64 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7812749Z       %66 = ttg.memdesc_index %46[%c2_i32] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7812885Z       ttg.local_store %65, %66 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7813318Z       %67:5 = scf.for %arg4 = %c0_i32 to %c500_i32 step %c4_i32 iter_args(%arg5 = %cst_4, %arg6 = %c2_i32, %arg7 = %52, %arg8 = %59, %arg9 = %66) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>)  : i32 {
2026-02-21T09:55:13.7813419Z         %148 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7813511Z         %149 = arith.addi %148, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7813560Z         %150 = arith.addi %arg4, %c12_i32 : i32
2026-02-21T09:55:13.7813618Z         %151 = arith.muli %150, %c2_i32 : i32
2026-02-21T09:55:13.7813709Z         %152 = tt.splat %151 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7813798Z         %153 = arith.addi %152, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.7813943Z         %154 = tt.expand_dims %153 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:55:13.7814036Z         %155 = tt.broadcast %154 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7814097Z         %156 = arith.addi %43, %155 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7814216Z         %157 = tt.addptr %9, %156 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:13.7814280Z         %158 = tt.load %157 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:13.7814479Z         %159 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7814694Z         %160 = arith.extf %159 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7814837Z         %161 = tt.expand_dims %149 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7814903Z         %162 = arith.muli %161, %cst_2 : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7814998Z         %163 = tt.broadcast %162 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7815062Z         %164 = arith.addi %163, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7815163Z         %165 = tt.addptr %10, %164 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7815226Z         %166 = tt.load %165 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7815374Z         %167 = ttg.convert_layout %166 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7815473Z         %168 = arith.shli %167, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7815574Z         %169 = arith.shrsi %168, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7815671Z         %170 = arith.shrsi %167, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7815834Z         %171 = tt.expand_dims %169 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7815985Z         %172 = tt.expand_dims %170 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7816081Z         %173 = tt.broadcast %171 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7816188Z         %174 = arith.select %15, %173, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7816285Z         %175 = tt.broadcast %172 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7816386Z         %176 = arith.select %17, %175, %174 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7816477Z         %177 = tt.reshape %176 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7816573Z         %178 = arith.sitofp %177 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7816690Z         %179 = ttg.local_alloc %178 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7816862Z         %180 = ttg.local_load %179 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7817132Z         %181 = tt.dot %160, %180, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7817193Z         %182 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.7817240Z         %183 = arith.cmpi slt, %182, %c3_i32 : i32
2026-02-21T09:55:13.7817291Z         %184 = arith.select %183, %182, %c0_i32 : i32
2026-02-21T09:55:13.7817471Z         %185 = ttg.memdesc_index %46[%184] : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7817612Z         ttg.local_store %158, %185 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7817934Z         scf.yield %181, %184, %arg8, %arg9, %185 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8>
2026-02-21T09:55:13.7818017Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32}
2026-02-21T09:55:13.7818123Z       %68 = arith.addi %7, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7818319Z       %69 = ttg.local_load %67#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7818513Z       %70 = arith.extf %69 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7818654Z       %71 = tt.expand_dims %68 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7818717Z       %72 = arith.muli %71, %cst_2 : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7818805Z       %73 = tt.broadcast %72 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7818863Z       %74 = arith.addi %73, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7818962Z       %75 = tt.addptr %10, %74 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7819020Z       %76 = tt.load %75 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7819174Z       %77 = ttg.convert_layout %76 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7819281Z       %78 = arith.shli %77, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7819377Z       %79 = arith.shrsi %78, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7819474Z       %80 = arith.shrsi %77, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7819620Z       %81 = tt.expand_dims %79 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7819767Z       %82 = tt.expand_dims %80 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7819862Z       %83 = tt.broadcast %81 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7819961Z       %84 = arith.select %15, %83, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7820051Z       %85 = tt.broadcast %82 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7820146Z       %86 = arith.select %17, %85, %84 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7820232Z       %87 = tt.reshape %86 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7820320Z       %88 = arith.sitofp %87 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7820435Z       %89 = ttg.local_alloc %88 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7820618Z       %90 = ttg.local_load %89 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7820878Z       %91 = tt.dot %70, %90, %67#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7820971Z       %92 = arith.addi %7, %cst_6 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7821165Z       %93 = ttg.local_load %67#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7821371Z       %94 = arith.extf %93 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7821514Z       %95 = tt.expand_dims %92 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7821590Z       %96 = arith.muli %95, %cst_2 : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7821678Z       %97 = tt.broadcast %96 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7821735Z       %98 = arith.addi %97, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7821834Z       %99 = tt.addptr %10, %98 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7821895Z       %100 = tt.load %99 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7822043Z       %101 = ttg.convert_layout %100 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7822142Z       %102 = arith.shli %101, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7822242Z       %103 = arith.shrsi %102, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7822339Z       %104 = arith.shrsi %101, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7822488Z       %105 = tt.expand_dims %103 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7822634Z       %106 = tt.expand_dims %104 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7822745Z       %107 = tt.broadcast %105 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7822850Z       %108 = arith.select %15, %107, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7822944Z       %109 = tt.broadcast %106 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7823045Z       %110 = arith.select %17, %109, %108 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7823135Z       %111 = tt.reshape %110 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7823227Z       %112 = arith.sitofp %111 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7823345Z       %113 = ttg.local_alloc %112 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7823514Z       %114 = ttg.local_load %113 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7823769Z       %115 = tt.dot %94, %114, %91, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7823864Z       %116 = arith.addi %7, %cst_7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.7824071Z       %117 = ttg.local_load %67#4 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 3x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7824266Z       %118 = arith.extf %117 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7824409Z       %119 = tt.expand_dims %116 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7824474Z       %120 = arith.muli %119, %cst_2 : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:13.7824564Z       %121 = tt.broadcast %120 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7824648Z       %122 = arith.addi %121, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7824748Z       %123 = tt.addptr %10, %122 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:13.7824808Z       %124 = tt.load %123 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:13.7824957Z       %125 = ttg.convert_layout %124 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7825070Z       %126 = arith.shli %125, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7825168Z       %127 = arith.shrsi %126, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7825265Z       %128 = arith.shrsi %125, %cst_11 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.7825416Z       %129 = tt.expand_dims %127 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7825562Z       %130 = tt.expand_dims %128 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.7825659Z       %131 = tt.broadcast %129 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7825765Z       %132 = arith.select %15, %131, %cst_10 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7825858Z       %133 = tt.broadcast %130 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7825955Z       %134 = arith.select %17, %133, %132 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.7826046Z       %135 = tt.reshape %134 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked1>
2026-02-21T09:55:13.7826151Z       %136 = arith.sitofp %135 : tensor<8x128xi8, #blocked1> to tensor<8x128xf32, #blocked1>
2026-02-21T09:55:13.7826269Z       %137 = ttg.local_alloc %136 : (tensor<8x128xf32, #blocked1>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.7826438Z       %138 = ttg.local_load %137 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.7826699Z       %139 = tt.dot %118, %138, %115, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.7826787Z       ttg.local_dealloc %46 : !ttg.memdesc<3x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.7826877Z       %140 = arith.truncf %139 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.7827015Z       %141 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.7827073Z       %142 = arith.muli %141, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.7827210Z       %143 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.7827294Z       %144 = tt.broadcast %142 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7827390Z       %145 = tt.broadcast %143 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7827449Z       %146 = arith.addi %144, %145 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7827546Z       %147 = tt.addptr %18, %146 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.7827608Z       tt.store %147, %140 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.7827650Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:55:13.7827683Z     tt.return
2026-02-21T09:55:13.7827715Z   }
2026-02-21T09:55:13.7827747Z }
2026-02-21T09:55:13.7827753Z 
2026-02-21T09:55:13.7827783Z {-#
2026-02-21T09:55:13.7827822Z   external_resources: {
2026-02-21T09:55:13.7827874Z     mlir_reproducer: {
2026-02-21T09:55:13.7828820Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:55:13.7828863Z       disable_threading: false,
2026-02-21T09:55:13.7828901Z       verify_each: true
2026-02-21T09:55:13.7828932Z     }
2026-02-21T09:55:13.7828961Z   }
2026-02-21T09:55:13.7828992Z #-}
2026-02-21T09:55:13.7829232Z /tmp/torchinductor_root/da/cdarxfmep5fjd6wacwiqv2hhflmegucy3f3xcz54fqqe7jhe47c5.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:55:13.7829648Z /tmp/torchinductor_root/da/cdarxfmep5fjd6wacwiqv2hhflmegucy3f3xcz54fqqe7jhe47c5.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:55:13.7829760Z [644s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:55:13.7830412Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[1, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:55:13.7830468Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:55:13.7830551Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:55:13.9087284Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:55:13.9098857Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:55:13.9099224Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:55:13.9099525Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:55:13.9099815Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:55:13.9100092Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:55:13.9100340Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:55:13.9100572Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:55:13.9100820Z #smem = #ttg.shared_memory
2026-02-21T09:55:13.9101050Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:55:13.9101524Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:55:13.9101902Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9102077Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.9102273Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.9102454Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9102621Z     %c127_i32 = arith.constant 127 : i32
2026-02-21T09:55:13.9102739Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:55:13.9102857Z     %c504_i32 = arith.constant 504 : i32
2026-02-21T09:55:13.9103035Z     %cst_3 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9103297Z     %cst_4 = arith.constant dense<4> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9103506Z     %cst_5 = arith.constant dense<8192> : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9103677Z     %cst_6 = arith.constant dense<0> : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9103847Z     %cst_7 = arith.constant dense<512> : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9104019Z     %cst_8 = arith.constant dense<0> : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9104196Z     %cst_9 = arith.constant dense<8192> : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9104372Z     %cst_10 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.9104519Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:55:13.9104628Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:55:13.9104741Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:55:13.9104855Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:55:13.9104970Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:55:13.9105109Z     %cst_11 = arith.constant dense<0> : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9105278Z     %cst_12 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9105422Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:55:13.9105531Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:55:13.9105666Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:55:13.9105801Z     %cst_13 = arith.constant dense<4> : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9106009Z     %cst_14 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9106193Z     %0 = tt.get_program_id x : i32
2026-02-21T09:55:13.9106304Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:55:13.9106418Z     %2 = arith.minsi %1, %c8192_i32 : i32
2026-02-21T09:55:13.9106621Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.9106899Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9107166Z     %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9107432Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9107691Z     %7 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9107928Z     %8 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9108128Z     %9 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9108355Z     %10 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9108684Z     %11 = arith.extsi %10 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9109044Z     %12 = arith.extsi %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9109395Z     %13 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:55:13.9109798Z     %14 = tt.expand_dims %13 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:55:13.9110210Z     %15 = tt.expand_dims %14 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.9110456Z     %16 = arith.cmpi eq, %15, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.9110648Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:55:13.9110856Z     %18 = arith.cmpi eq, %15, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:13.9111040Z     %19 = tt.broadcast %18 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:55:13.9111244Z     %20 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.9111398Z     %21 = arith.subi %2, %0 : i32
2026-02-21T09:55:13.9111508Z     %22 = arith.remsi %21, %c3_i32 : i32
2026-02-21T09:55:13.9111619Z     %23 = arith.subi %21, %22 : i32
2026-02-21T09:55:13.9111729Z     %24 = arith.addi %0, %23 : i32
2026-02-21T09:55:13.9111848Z     scf.for %arg3 = %0 to %24 step %c3_i32  : i32 {
2026-02-21T09:55:13.9111984Z       %147 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:55:13.9112107Z       %148 = arith.muli %147, %c4_i32 : i32
2026-02-21T09:55:13.9112224Z       %149 = arith.subi %c128_i32, %148 : i32
2026-02-21T09:55:13.9112342Z       %150 = arith.minsi %149, %c4_i32 : i32
2026-02-21T09:55:13.9112459Z       %151 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:55:13.9112577Z       %152 = arith.remsi %151, %150 : i32
2026-02-21T09:55:13.9112686Z       %153 = arith.addi %148, %152 : i32
2026-02-21T09:55:13.9112799Z       %154 = arith.divsi %151, %150 : i32
2026-02-21T09:55:13.9112912Z       %155 = arith.muli %153, %c128_i32 : i32
2026-02-21T09:55:13.9113103Z       %156 = tt.splat %155 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.9113322Z       %157 = tt.splat %155 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9113540Z       %158 = arith.addi %156, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.9113753Z       %159 = arith.addi %157, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9113918Z       %160 = arith.muli %154, %c128_i32 : i32
2026-02-21T09:55:13.9114078Z       %161 = tt.splat %160 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9114286Z       %162 = arith.addi %161, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9114557Z       %163 = tt.expand_dims %158 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.9114811Z       %164 = arith.muli %163, %cst_10 : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.9115008Z       %165 = tt.broadcast %164 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9115183Z       %166 = arith.extsi %160 : i32 to i64
2026-02-21T09:55:13.9115351Z       %167 = tt.splat %166 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9115572Z       %168 = arith.addi %167, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9115850Z       %169 = tt.expand_dims %168 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9116154Z       %170 = tt.broadcast %169 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9116358Z       %171 = arith.cmpi sge, %169, %cst_8 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9116533Z       %172 = arith.cmpi slt, %169, %cst_9 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9116698Z       %173 = arith.andi %171, %172 : tensor<1x128xi1, #blocked2>
2026-02-21T09:55:13.9116886Z       %174 = tt.broadcast %173 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9117104Z       %175 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.9117387Z       %176 = tt.expand_dims %7 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:55:13.9117656Z       %177 = tt.broadcast %176 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9117848Z       %178 = arith.addi %165, %177 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9118063Z       %179 = tt.addptr %8, %178 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9118268Z       %180 = tt.load %179 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9118508Z       %181 = tt.expand_dims %11 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9118751Z       %182 = arith.muli %181, %cst_5 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9118940Z       %183 = tt.broadcast %182 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9119138Z       %184 = arith.addi %183, %170 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9119334Z       %185 = tt.addptr %9, %184 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9119544Z       %186 = arith.cmpi sge, %181, %cst_6 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9119722Z       %187 = arith.cmpi slt, %181, %cst_7 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9119882Z       %188 = arith.andi %186, %187 : tensor<4x1xi1, #blocked2>
2026-02-21T09:55:13.9120066Z       %189 = tt.broadcast %188 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9120252Z       %190 = arith.andi %189, %174 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9120488Z       %191 = tt.load %185, %190, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9120834Z       %192 = ttg.memdesc_index %175[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9121196Z       ttg.local_store %180, %192 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9121472Z       %193 = arith.addi %7, %cst_3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9121747Z       %194 = tt.expand_dims %193 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:55:13.9122021Z       %195 = tt.broadcast %194 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9122214Z       %196 = arith.addi %165, %195 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9122411Z       %197 = tt.addptr %8, %196 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9122719Z       %198 = tt.load %197 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9122909Z       %199 = arith.addi %11, %cst_4 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9123183Z       %200 = tt.expand_dims %199 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9123425Z       %201 = arith.muli %200, %cst_5 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9123636Z       %202 = tt.broadcast %201 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9123826Z       %203 = arith.addi %202, %170 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9124020Z       %204 = tt.addptr %9, %203 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9124225Z       %205 = arith.cmpi sge, %200, %cst_6 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9124394Z       %206 = arith.cmpi slt, %200, %cst_7 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9124553Z       %207 = arith.andi %205, %206 : tensor<4x1xi1, #blocked2>
2026-02-21T09:55:13.9124734Z       %208 = tt.broadcast %207 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9124938Z       %209 = arith.andi %208, %174 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9125150Z       %210 = tt.load %204, %209, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9125491Z       %211 = ttg.memdesc_index %175[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9125865Z       ttg.local_store %198, %211 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9126503Z       %212:6 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %192, %arg8 = %211, %arg9 = %191, %arg10 = %210) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2>)  : i32 {
2026-02-21T09:55:13.9127038Z         %439 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:55:13.9127162Z         %440 = arith.muli %439, %c2_i32 : i32
2026-02-21T09:55:13.9127332Z         %441 = tt.splat %440 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9127558Z         %442 = arith.addi %441, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9127832Z         %443 = tt.expand_dims %442 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:55:13.9128109Z         %444 = tt.broadcast %443 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9128305Z         %445 = arith.addi %165, %444 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9128526Z         %446 = tt.addptr %8, %445 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9128729Z         %447 = tt.load %446 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9129031Z         %448 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9129465Z         %449 = arith.extf %448 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9129748Z         %450 = arith.extsi %439 : i32 to i64
2026-02-21T09:55:13.9129918Z         %451 = tt.splat %450 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9130140Z         %452 = arith.addi %451, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9130414Z         %453 = tt.expand_dims %452 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9130660Z         %454 = arith.muli %453, %cst_5 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9130852Z         %455 = tt.broadcast %454 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9131045Z         %456 = arith.addi %455, %170 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9131243Z         %457 = tt.addptr %9, %456 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9131464Z         %458 = arith.cmpi sge, %453, %cst_6 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9131633Z         %459 = arith.cmpi slt, %453, %cst_7 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9131795Z         %460 = arith.andi %458, %459 : tensor<4x1xi1, #blocked2>
2026-02-21T09:55:13.9131980Z         %461 = tt.broadcast %460 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9132170Z         %462 = arith.andi %461, %174 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9132342Z         %463 = tt.load %457, %462, %cst_11 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9132538Z         %464 = arith.shli %arg9, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9132704Z         %465 = arith.shrsi %464, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9132950Z         %466 = ttg.convert_layout %465 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9133203Z         %467 = arith.shrsi %arg9, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9133468Z         %468 = ttg.convert_layout %467 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9133799Z         %469 = tt.expand_dims %466 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9134136Z         %470 = tt.expand_dims %468 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9134417Z         %471 = tt.broadcast %469 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9134661Z         %472 = arith.select %17, %471, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9134902Z         %473 = tt.broadcast %470 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9135137Z         %474 = arith.select %19, %473, %472 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9135367Z         %475 = tt.reshape %474 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9135595Z         %476 = arith.sitofp %475 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9135846Z         %477 = ttg.local_alloc %476 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9136187Z         %478 = ttg.local_load %477 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9136668Z         %479 = tt.dot %449, %478, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9137019Z         %480 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.9137148Z         %481 = arith.cmpi slt, %480, %c2_i32 : i32
2026-02-21T09:55:13.9137280Z         %482 = arith.select %481, %480, %c0_i32 : i32
2026-02-21T09:55:13.9137553Z         %483 = ttg.memdesc_index %175[%482] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9137917Z         ttg.local_store %447, %483 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9138403Z         scf.yield %479, %482, %arg8, %483, %arg10, %463 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9138831Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:13.9139144Z       %213 = ttg.local_load %212#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9139593Z       %214 = arith.extf %213 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9139893Z       %215 = arith.shli %212#4, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9140060Z       %216 = arith.shrsi %215, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9140308Z       %217 = ttg.convert_layout %216 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9140554Z       %218 = arith.shrsi %212#4, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9140816Z       %219 = ttg.convert_layout %218 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9141147Z       %220 = tt.expand_dims %217 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9141484Z       %221 = tt.expand_dims %219 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9141780Z       %222 = tt.broadcast %220 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9142020Z       %223 = arith.select %17, %222, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9142261Z       %224 = tt.broadcast %221 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9142491Z       %225 = arith.select %19, %224, %223 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9142722Z       %226 = tt.reshape %225 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9142946Z       %227 = arith.sitofp %226 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9143198Z       %228 = ttg.local_alloc %227 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9143523Z       %229 = ttg.local_load %228 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9144015Z       %230 = tt.dot %214, %229, %212#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9144536Z       %231 = ttg.local_load %212#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9144969Z       %232 = arith.extf %231 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9145271Z       %233 = arith.shli %212#5, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9145439Z       %234 = arith.shrsi %233, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9145684Z       %235 = ttg.convert_layout %234 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9145930Z       %236 = arith.shrsi %212#5, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9146175Z       %237 = ttg.convert_layout %236 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9146505Z       %238 = tt.expand_dims %235 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9146841Z       %239 = tt.expand_dims %237 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9147123Z       %240 = tt.broadcast %238 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9147373Z       %241 = arith.select %17, %240, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9147610Z       %242 = tt.broadcast %239 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9147843Z       %243 = arith.select %19, %242, %241 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9148074Z       %244 = tt.reshape %243 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9148298Z       %245 = arith.sitofp %244 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9148548Z       %246 = ttg.local_alloc %245 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9148889Z       %247 = ttg.local_load %246 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9149363Z       %248 = tt.dot %232, %247, %230, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9149763Z       ttg.local_dealloc %175 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.9149979Z       %249 = arith.truncf %248 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.9150251Z       %250 = tt.expand_dims %159 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9150491Z       %251 = arith.muli %250, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9150721Z       %252 = tt.expand_dims %162 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.9150983Z       %253 = tt.broadcast %251 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9151187Z       %254 = tt.broadcast %252 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9151369Z       %255 = arith.addi %253, %254 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9151561Z       %256 = tt.addptr %20, %255 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9151762Z       tt.store %256, %249 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.9151906Z       %257 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:55:13.9152028Z       %258 = arith.divsi %257, %c256_i32 : i32
2026-02-21T09:55:13.9152147Z       %259 = arith.muli %258, %c4_i32 : i32
2026-02-21T09:55:13.9152285Z       %260 = arith.subi %c128_i32, %259 : i32
2026-02-21T09:55:13.9152403Z       %261 = arith.minsi %260, %c4_i32 : i32
2026-02-21T09:55:13.9152521Z       %262 = arith.remsi %257, %c256_i32 : i32
2026-02-21T09:55:13.9152635Z       %263 = arith.remsi %262, %261 : i32
2026-02-21T09:55:13.9152749Z       %264 = arith.addi %259, %263 : i32
2026-02-21T09:55:13.9152862Z       %265 = arith.divsi %262, %261 : i32
2026-02-21T09:55:13.9152977Z       %266 = arith.muli %264, %c128_i32 : i32
2026-02-21T09:55:13.9153149Z       %267 = tt.splat %266 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.9153370Z       %268 = tt.splat %266 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9153591Z       %269 = arith.addi %267, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.9153803Z       %270 = arith.addi %268, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9153973Z       %271 = arith.muli %265, %c128_i32 : i32
2026-02-21T09:55:13.9154134Z       %272 = tt.splat %271 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9154340Z       %273 = arith.addi %272, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9154614Z       %274 = tt.expand_dims %269 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.9154885Z       %275 = arith.muli %274, %cst_10 : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.9155081Z       %276 = tt.broadcast %275 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9155259Z       %277 = arith.extsi %271 : i32 to i64
2026-02-21T09:55:13.9155427Z       %278 = tt.splat %277 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9155654Z       %279 = arith.addi %278, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9155936Z       %280 = tt.expand_dims %279 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9156243Z       %281 = tt.broadcast %280 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9156448Z       %282 = arith.cmpi sge, %280, %cst_8 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9156625Z       %283 = arith.cmpi slt, %280, %cst_9 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9156793Z       %284 = arith.andi %282, %283 : tensor<1x128xi1, #blocked2>
2026-02-21T09:55:13.9156984Z       %285 = tt.broadcast %284 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9157222Z       %286 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.9157410Z       %287 = arith.addi %276, %177 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9157613Z       %288 = tt.addptr %8, %287 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9157825Z       %289 = tt.load %288 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9157987Z       %290 = arith.addi %183, %281 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9158187Z       %291 = tt.addptr %9, %290 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9158387Z       %292 = arith.andi %189, %285 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9158607Z       %293 = tt.load %291, %292, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9158959Z       %294 = ttg.memdesc_index %286[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9159323Z       ttg.local_store %289, %294 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9159568Z       %295 = arith.addi %276, %195 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9159784Z       %296 = tt.addptr %8, %295 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9159990Z       %297 = tt.load %296 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9160155Z       %298 = arith.addi %202, %281 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9160348Z       %299 = tt.addptr %9, %298 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9160550Z       %300 = arith.andi %208, %285 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9160763Z       %301 = tt.load %299, %300, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9161111Z       %302 = ttg.memdesc_index %286[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9161475Z       ttg.local_store %297, %302 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9162108Z       %303:6 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %294, %arg8 = %302, %arg9 = %293, %arg10 = %301) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2>)  : i32 {
2026-02-21T09:55:13.9162697Z         %439 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:55:13.9162825Z         %440 = arith.muli %439, %c2_i32 : i32
2026-02-21T09:55:13.9162997Z         %441 = tt.splat %440 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9163224Z         %442 = arith.addi %441, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9163502Z         %443 = tt.expand_dims %442 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:55:13.9163785Z         %444 = tt.broadcast %443 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9163988Z         %445 = arith.addi %276, %444 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9164210Z         %446 = tt.addptr %8, %445 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9164418Z         %447 = tt.load %446 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9164720Z         %448 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9165183Z         %449 = arith.extf %448 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9165466Z         %450 = arith.extsi %439 : i32 to i64
2026-02-21T09:55:13.9165639Z         %451 = tt.splat %450 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9165865Z         %452 = arith.addi %451, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9166144Z         %453 = tt.expand_dims %452 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9166396Z         %454 = arith.muli %453, %cst_5 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9166596Z         %455 = tt.broadcast %454 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9166790Z         %456 = arith.addi %455, %281 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9166995Z         %457 = tt.addptr %9, %456 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9167202Z         %458 = arith.cmpi sge, %453, %cst_6 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9167374Z         %459 = arith.cmpi slt, %453, %cst_7 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9167535Z         %460 = arith.andi %458, %459 : tensor<4x1xi1, #blocked2>
2026-02-21T09:55:13.9167746Z         %461 = tt.broadcast %460 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9167946Z         %462 = arith.andi %461, %285 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9168117Z         %463 = tt.load %457, %462, %cst_11 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9168298Z         %464 = arith.shli %arg9, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9168466Z         %465 = arith.shrsi %464, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9168717Z         %466 = ttg.convert_layout %465 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9168973Z         %467 = arith.shrsi %arg9, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9169222Z         %468 = ttg.convert_layout %467 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9169561Z         %469 = tt.expand_dims %466 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9169901Z         %470 = tt.expand_dims %468 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9170189Z         %471 = tt.broadcast %469 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9170439Z         %472 = arith.select %17, %471, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9170697Z         %473 = tt.broadcast %470 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9170938Z         %474 = arith.select %19, %473, %472 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9171173Z         %475 = tt.reshape %474 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9171402Z         %476 = arith.sitofp %475 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9171659Z         %477 = ttg.local_alloc %476 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9172004Z         %478 = ttg.local_load %477 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9172485Z         %479 = tt.dot %449, %478, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9172838Z         %480 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.9172999Z         %481 = arith.cmpi slt, %480, %c2_i32 : i32
2026-02-21T09:55:13.9173137Z         %482 = arith.select %481, %480, %c0_i32 : i32
2026-02-21T09:55:13.9173408Z         %483 = ttg.memdesc_index %286[%482] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9173773Z         ttg.local_store %447, %483 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9174269Z         scf.yield %479, %482, %arg8, %483, %arg10, %463 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9174695Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:13.9175014Z       %304 = ttg.local_load %303#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9175451Z       %305 = arith.extf %304 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9175781Z       %306 = arith.shli %303#4, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9175957Z       %307 = arith.shrsi %306, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9176206Z       %308 = ttg.convert_layout %307 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9176458Z       %309 = arith.shrsi %303#4, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9176706Z       %310 = ttg.convert_layout %309 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9177041Z       %311 = tt.expand_dims %308 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9177382Z       %312 = tt.expand_dims %310 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9177669Z       %313 = tt.broadcast %311 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9177914Z       %314 = arith.select %17, %313, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9178161Z       %315 = tt.broadcast %312 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9178401Z       %316 = arith.select %19, %315, %314 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9178636Z       %317 = tt.reshape %316 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9178879Z       %318 = arith.sitofp %317 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9179139Z       %319 = ttg.local_alloc %318 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9179471Z       %320 = ttg.local_load %319 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9179951Z       %321 = tt.dot %305, %320, %303#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9180465Z       %322 = ttg.local_load %303#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9180899Z       %323 = arith.extf %322 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9181202Z       %324 = arith.shli %303#5, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9181387Z       %325 = arith.shrsi %324, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9181633Z       %326 = ttg.convert_layout %325 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9181883Z       %327 = arith.shrsi %303#5, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9182131Z       %328 = ttg.convert_layout %327 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9182466Z       %329 = tt.expand_dims %326 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9182805Z       %330 = tt.expand_dims %328 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9183090Z       %331 = tt.broadcast %329 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9183334Z       %332 = arith.select %17, %331, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9183576Z       %333 = tt.broadcast %330 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9183833Z       %334 = arith.select %19, %333, %332 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9184065Z       %335 = tt.reshape %334 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9184292Z       %336 = arith.sitofp %335 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9184545Z       %337 = ttg.local_alloc %336 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9184872Z       %338 = ttg.local_load %337 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9185339Z       %339 = tt.dot %323, %338, %321, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9185730Z       ttg.local_dealloc %286 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.9185952Z       %340 = arith.truncf %339 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.9186224Z       %341 = tt.expand_dims %270 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9186470Z       %342 = arith.muli %341, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9186701Z       %343 = tt.expand_dims %273 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.9186981Z       %344 = tt.broadcast %342 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9187187Z       %345 = tt.broadcast %343 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9187368Z       %346 = arith.addi %344, %345 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9187563Z       %347 = tt.addptr %20, %346 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9187762Z       tt.store %347, %340 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.9187909Z       %348 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:55:13.9188030Z       %349 = arith.divsi %348, %c256_i32 : i32
2026-02-21T09:55:13.9188171Z       %350 = arith.muli %349, %c4_i32 : i32
2026-02-21T09:55:13.9188292Z       %351 = arith.subi %c128_i32, %350 : i32
2026-02-21T09:55:13.9188412Z       %352 = arith.minsi %351, %c4_i32 : i32
2026-02-21T09:55:13.9188529Z       %353 = arith.remsi %348, %c256_i32 : i32
2026-02-21T09:55:13.9188646Z       %354 = arith.remsi %353, %352 : i32
2026-02-21T09:55:13.9188760Z       %355 = arith.addi %350, %354 : i32
2026-02-21T09:55:13.9188873Z       %356 = arith.divsi %353, %352 : i32
2026-02-21T09:55:13.9189010Z       %357 = arith.muli %355, %c128_i32 : i32
2026-02-21T09:55:13.9189186Z       %358 = tt.splat %357 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.9189405Z       %359 = tt.splat %357 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9189629Z       %360 = arith.addi %358, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.9189844Z       %361 = arith.addi %359, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9190016Z       %362 = arith.muli %356, %c128_i32 : i32
2026-02-21T09:55:13.9190178Z       %363 = tt.splat %362 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9190388Z       %364 = arith.addi %363, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9190664Z       %365 = tt.expand_dims %360 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.9190917Z       %366 = arith.muli %365, %cst_10 : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.9191115Z       %367 = tt.broadcast %366 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9191288Z       %368 = arith.extsi %362 : i32 to i64
2026-02-21T09:55:13.9191476Z       %369 = tt.splat %368 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9191701Z       %370 = arith.addi %369, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9191983Z       %371 = tt.expand_dims %370 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9192268Z       %372 = tt.broadcast %371 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9192473Z       %373 = arith.cmpi sge, %371, %cst_8 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9192649Z       %374 = arith.cmpi slt, %371, %cst_9 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9192818Z       %375 = arith.andi %373, %374 : tensor<1x128xi1, #blocked2>
2026-02-21T09:55:13.9193009Z       %376 = tt.broadcast %375 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9193231Z       %377 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.9193420Z       %378 = arith.addi %367, %177 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9193620Z       %379 = tt.addptr %8, %378 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9193826Z       %380 = tt.load %379 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9193986Z       %381 = arith.addi %183, %372 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9194198Z       %382 = tt.addptr %9, %381 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9194397Z       %383 = arith.andi %189, %376 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9194613Z       %384 = tt.load %382, %383, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9194957Z       %385 = ttg.memdesc_index %377[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9195324Z       ttg.local_store %380, %385 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9195584Z       %386 = arith.addi %367, %195 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9195780Z       %387 = tt.addptr %8, %386 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9195983Z       %388 = tt.load %387 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9196143Z       %389 = arith.addi %202, %372 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9196337Z       %390 = tt.addptr %9, %389 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9196547Z       %391 = arith.andi %208, %376 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9196759Z       %392 = tt.load %390, %391, %cst_11 {amd.pipeliner_part = "prologue"} : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9197100Z       %393 = ttg.memdesc_index %377[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9197458Z       ttg.local_store %388, %393 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9198093Z       %394:6 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %385, %arg8 = %393, %arg9 = %384, %arg10 = %392) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2>)  : i32 {
2026-02-21T09:55:13.9198624Z         %439 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:55:13.9198751Z         %440 = arith.muli %439, %c2_i32 : i32
2026-02-21T09:55:13.9198954Z         %441 = tt.splat %440 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9199182Z         %442 = arith.addi %441, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9199477Z         %443 = tt.expand_dims %442 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:55:13.9199757Z         %444 = tt.broadcast %443 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9199950Z         %445 = arith.addi %367, %444 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9200155Z         %446 = tt.addptr %8, %445 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9200361Z         %447 = tt.load %446 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9200661Z         %448 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9201100Z         %449 = arith.extf %448 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9201380Z         %450 = arith.extsi %439 : i32 to i64
2026-02-21T09:55:13.9201552Z         %451 = tt.splat %450 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9201777Z         %452 = arith.addi %451, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9202050Z         %453 = tt.expand_dims %452 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9202317Z         %454 = arith.muli %453, %cst_5 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9202510Z         %455 = tt.broadcast %454 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9202751Z         %456 = arith.addi %455, %372 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9202951Z         %457 = tt.addptr %9, %456 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9203157Z         %458 = arith.cmpi sge, %453, %cst_6 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9203327Z         %459 = arith.cmpi slt, %453, %cst_7 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9203517Z         %460 = arith.andi %458, %459 : tensor<4x1xi1, #blocked2>
2026-02-21T09:55:13.9203705Z         %461 = tt.broadcast %460 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9203898Z         %462 = arith.andi %461, %376 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9204068Z         %463 = tt.load %457, %462, %cst_11 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9204248Z         %464 = arith.shli %arg9, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9204431Z         %465 = arith.shrsi %464, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9204682Z         %466 = ttg.convert_layout %465 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9204933Z         %467 = arith.shrsi %arg9, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9205180Z         %468 = ttg.convert_layout %467 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9205520Z         %469 = tt.expand_dims %466 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9205855Z         %470 = tt.expand_dims %468 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9206141Z         %471 = tt.broadcast %469 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9206386Z         %472 = arith.select %17, %471, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9206628Z         %473 = tt.broadcast %470 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9206865Z         %474 = arith.select %19, %473, %472 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9207115Z         %475 = tt.reshape %474 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9207341Z         %476 = arith.sitofp %475 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9207594Z         %477 = ttg.local_alloc %476 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9207921Z         %478 = ttg.local_load %477 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9208401Z         %479 = tt.dot %449, %478, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9208748Z         %480 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:55:13.9208875Z         %481 = arith.cmpi slt, %480, %c2_i32 : i32
2026-02-21T09:55:13.9209013Z         %482 = arith.select %481, %480, %c0_i32 : i32
2026-02-21T09:55:13.9209282Z         %483 = ttg.memdesc_index %377[%482] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9209643Z         ttg.local_store %447, %483 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9210129Z         scf.yield %479, %482, %arg8, %483, %arg10, %463 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi8, #blocked2>, tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9210570Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:13.9210886Z       %395 = ttg.local_load %394#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9211321Z       %396 = arith.extf %395 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9211645Z       %397 = arith.shli %394#4, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9211815Z       %398 = arith.shrsi %397, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9212059Z       %399 = ttg.convert_layout %398 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9212308Z       %400 = arith.shrsi %394#4, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9212563Z       %401 = ttg.convert_layout %400 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9212893Z       %402 = tt.expand_dims %399 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9213230Z       %403 = tt.expand_dims %401 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9213514Z       %404 = tt.broadcast %402 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9213758Z       %405 = arith.select %17, %404, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9213995Z       %406 = tt.broadcast %403 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9214228Z       %407 = arith.select %19, %406, %405 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9214462Z       %408 = tt.reshape %407 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9214684Z       %409 = arith.sitofp %408 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9214935Z       %410 = ttg.local_alloc %409 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9215279Z       %411 = ttg.local_load %410 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9215752Z       %412 = tt.dot %396, %411, %394#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9216251Z       %413 = ttg.local_load %394#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9216685Z       %414 = arith.extf %413 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9216994Z       %415 = arith.shli %394#5, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9217166Z       %416 = arith.shrsi %415, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9217410Z       %417 = ttg.convert_layout %416 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9217659Z       %418 = arith.shrsi %394#5, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9217901Z       %419 = ttg.convert_layout %418 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9218250Z       %420 = tt.expand_dims %417 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9218587Z       %421 = tt.expand_dims %419 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9218866Z       %422 = tt.broadcast %420 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9219108Z       %423 = arith.select %17, %422, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9219348Z       %424 = tt.broadcast %421 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9219597Z       %425 = arith.select %19, %424, %423 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9219827Z       %426 = tt.reshape %425 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9220049Z       %427 = arith.sitofp %426 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9220301Z       %428 = ttg.local_alloc %427 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9220638Z       %429 = ttg.local_load %428 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9221107Z       %430 = tt.dot %414, %429, %412, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9221195Z       ttg.local_dealloc %377 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.9221287Z       %431 = arith.truncf %430 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.9221425Z       %432 = tt.expand_dims %361 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9221486Z       %433 = arith.muli %432, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9221622Z       %434 = tt.expand_dims %364 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.9221705Z       %435 = tt.broadcast %433 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9221789Z       %436 = tt.broadcast %434 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9221849Z       %437 = arith.addi %435, %436 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9221961Z       %438 = tt.addptr %20, %437 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9222026Z       tt.store %438, %431 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.9222075Z     } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:55:13.9222116Z     %25 = arith.subi %2, %24 : i32
2026-02-21T09:55:13.9222158Z     %26 = arith.muli %25, %c128_i32 : i32
2026-02-21T09:55:13.9222246Z     %27 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.9222291Z     %28 = arith.cmpi sgt, %26, %c0_i32 : i32
2026-02-21T09:55:13.9222332Z     %29 = arith.divsi %24, %c256_i32 : i32
2026-02-21T09:55:13.9222373Z     %30 = arith.muli %29, %c4_i32 : i32
2026-02-21T09:55:13.9222413Z     %31 = arith.subi %c128_i32, %30 : i32
2026-02-21T09:55:13.9222453Z     %32 = arith.minsi %31, %c4_i32 : i32
2026-02-21T09:55:13.9222493Z     %33 = arith.remsi %24, %c256_i32 : i32
2026-02-21T09:55:13.9222535Z     %34 = arith.remsi %33, %32 : i32
2026-02-21T09:55:13.9222574Z     %35 = arith.addi %30, %34 : i32
2026-02-21T09:55:13.9222611Z     %36 = arith.divsi %33, %32 : i32
2026-02-21T09:55:13.9222655Z     %37 = arith.muli %35, %c128_i32 : i32
2026-02-21T09:55:13.9222745Z     %38 = tt.splat %37 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.9222835Z     %39 = arith.addi %38, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.9222891Z     %40 = arith.muli %36, %c128_i32 : i32
2026-02-21T09:55:13.9223036Z     %41 = tt.expand_dims %39 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.9223100Z     %42 = arith.muli %41, %cst_10 : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.9223192Z     %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9223233Z     %44 = arith.extsi %40 : i32 to i64
2026-02-21T09:55:13.9223322Z     %45 = tt.splat %44 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9223407Z     %46 = arith.addi %45, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9223571Z     %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9223658Z     %48 = tt.broadcast %47 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9223727Z     %49 = arith.cmpi sge, %47, %cst_8 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9223796Z     %50 = arith.cmpi slt, %47, %cst_9 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9223867Z     %51 = arith.andi %49, %50 : tensor<1x128xi1, #blocked2>
2026-02-21T09:55:13.9223953Z     %52 = tt.broadcast %51 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9224095Z     %53 = tt.expand_dims %7 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:55:13.9224180Z     %54 = tt.broadcast %53 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9224239Z     %55 = arith.addi %43, %54 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9224340Z     %56 = tt.addptr %8, %55 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9224398Z     %57 = tt.splat %28 : i1 -> tensor<128x8xi1, #blocked1>
2026-02-21T09:55:13.9224503Z     %58 = tt.load %56, %57 {amd.pipeliner_part = "prologue"} : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9224694Z     %59 = ttg.memdesc_index %27[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9224833Z     ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9224878Z     %60 = arith.cmpi sgt, %26, %c1_i32 : i32
2026-02-21T09:55:13.9224987Z     %61 = arith.addi %7, %cst_3 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9225123Z     %62 = tt.expand_dims %61 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:55:13.9225208Z     %63 = tt.broadcast %62 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9225268Z     %64 = arith.addi %43, %63 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9225368Z     %65 = tt.addptr %8, %64 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9225424Z     %66 = tt.splat %60 : i1 -> tensor<128x8xi1, #blocked1>
2026-02-21T09:55:13.9225527Z     %67 = tt.load %65, %66 {amd.pipeliner_part = "prologue"} : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9225709Z     %68 = ttg.memdesc_index %27[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9225846Z     ttg.local_store %67, %68 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9225887Z     %69 = arith.subi %26, %c2_i32 : i32
2026-02-21T09:55:13.9226680Z     %70:18 = scf.for %arg3 = %c0_i32 to %69 step %c1_i32 iter_args(%arg4 = %c1_i32, %arg5 = %24, %arg6 = %cst_2, %arg7 = %37, %arg8 = %40, %arg9 = %43, %arg10 = %48, %arg11 = %52, %arg12 = %c1_i32, %arg13 = %c0_i32, %arg14 = %c4_i32, %arg15 = %59, %arg16 = %68, %arg17 = %48, %arg18 = %52, %arg19 = %c0_i32, %arg20 = %37, %arg21 = %40) -> (i32, i32, tensor<128x128xf32, #mma>, i32, i32, tensor<128x8xi32, #blocked1>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32, i32, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32, i32, i32)  : i32 {
2026-02-21T09:55:13.9226745Z       %147 = arith.addi %arg14, %c4_i32 : i32
2026-02-21T09:55:13.9226791Z       %148 = arith.addi %arg4, %c1_i32 : i32
2026-02-21T09:55:13.9226841Z       %149 = arith.cmpi eq, %arg4, %c127_i32 : i32
2026-02-21T09:55:13.9226905Z       %150 = arith.select %149, %c0_i32, %148 : i32
2026-02-21T09:55:13.9226953Z       %151 = arith.cmpi eq, %150, %c0_i32 : i32
2026-02-21T09:55:13.9226999Z       %152 = arith.select %151, %c0_i32, %147 : i32
2026-02-21T09:55:13.9227148Z       %153:6 = scf.if %151 -> (i32, i32, tensor<128x8xi32, #blocked1>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32) {
2026-02-21T09:55:13.9227194Z         %199 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T09:55:13.9227238Z         %200 = arith.divsi %199, %c256_i32 : i32
2026-02-21T09:55:13.9227294Z         %201 = arith.muli %200, %c4_i32 : i32
2026-02-21T09:55:13.9227336Z         %202 = arith.subi %c128_i32, %201 : i32
2026-02-21T09:55:13.9227380Z         %203 = arith.minsi %202, %c4_i32 : i32
2026-02-21T09:55:13.9227424Z         %204 = arith.remsi %199, %c256_i32 : i32
2026-02-21T09:55:13.9227466Z         %205 = arith.remsi %204, %203 : i32
2026-02-21T09:55:13.9227509Z         %206 = arith.addi %201, %205 : i32
2026-02-21T09:55:13.9227549Z         %207 = arith.divsi %204, %203 : i32
2026-02-21T09:55:13.9227592Z         %208 = arith.muli %206, %c128_i32 : i32
2026-02-21T09:55:13.9227690Z         %209 = tt.splat %208 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.9227783Z         %210 = arith.addi %209, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:13.9227827Z         %211 = arith.muli %207, %c128_i32 : i32
2026-02-21T09:55:13.9227978Z         %212 = tt.expand_dims %210 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.9228046Z         %213 = arith.muli %212, %cst_10 : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:13.9228140Z         %214 = tt.broadcast %213 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9228199Z         %215 = arith.extsi %211 : i32 to i64
2026-02-21T09:55:13.9228294Z         %216 = tt.splat %215 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9228388Z         %217 = arith.addi %216, %12 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:13.9228537Z         %218 = tt.expand_dims %217 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9228633Z         %219 = tt.broadcast %218 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9228705Z         %220 = arith.cmpi sge, %218, %cst_8 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9228774Z         %221 = arith.cmpi slt, %218, %cst_9 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:13.9228839Z         %222 = arith.andi %220, %221 : tensor<1x128xi1, #blocked2>
2026-02-21T09:55:13.9228931Z         %223 = tt.broadcast %222 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9229097Z         scf.yield %208, %211, %214, %219, %223, %199 : i32, i32, tensor<128x8xi32, #blocked1>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32
2026-02-21T09:55:13.9229134Z       } else {
2026-02-21T09:55:13.9229318Z         scf.yield %arg7, %arg8, %arg9, %arg10, %arg11, %arg5 : i32, i32, tensor<128x8xi32, #blocked1>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32
2026-02-21T09:55:13.9229350Z       }
2026-02-21T09:55:13.9229406Z       %154 = arith.muli %152, %c2_i32 : i32
2026-02-21T09:55:13.9229499Z       %155 = tt.splat %154 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9229590Z       %156 = arith.addi %155, %7 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:13.9229734Z       %157 = tt.expand_dims %156 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:55:13.9229825Z       %158 = tt.broadcast %157 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9229889Z       %159 = arith.addi %153#2, %158 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9230102Z       %160 = tt.addptr %8, %159 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:13.9230167Z       %161 = tt.load %160 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:13.9230369Z       %162 = ttg.local_load %arg15 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9230595Z       %163 = arith.extf %162 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9230640Z       %164 = arith.extsi %arg13 : i32 to i64
2026-02-21T09:55:13.9230731Z       %165 = tt.splat %164 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9230821Z       %166 = arith.addi %165, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9230965Z       %167 = tt.expand_dims %166 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9231031Z       %168 = arith.muli %167, %cst_5 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9231120Z       %169 = tt.broadcast %168 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9231188Z       %170 = arith.addi %169, %arg17 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9231288Z       %171 = tt.addptr %9, %170 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9231355Z       %172 = arith.cmpi sge, %167, %cst_6 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9231423Z       %173 = arith.cmpi slt, %167, %cst_7 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9231481Z       %174 = arith.andi %172, %173 : tensor<4x1xi1, #blocked2>
2026-02-21T09:55:13.9231584Z       %175 = tt.broadcast %174 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9231649Z       %176 = arith.andi %175, %arg18 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9231722Z       %177 = tt.load %171, %176, %cst_11 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9231866Z       %178 = ttg.convert_layout %177 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9231968Z       %179 = arith.shli %178, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9232066Z       %180 = arith.shrsi %179, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9232163Z       %181 = arith.shrsi %178, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9232314Z       %182 = tt.expand_dims %180 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9232461Z       %183 = tt.expand_dims %181 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9232559Z       %184 = tt.broadcast %182 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9232665Z       %185 = arith.select %17, %184, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9232756Z       %186 = tt.broadcast %183 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9232868Z       %187 = arith.select %19, %186, %185 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9232958Z       %188 = tt.reshape %187 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9233050Z       %189 = arith.sitofp %188 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9233167Z       %190 = ttg.local_alloc %189 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9233338Z       %191 = ttg.local_load %190 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9233619Z       %192 = tt.dot %163, %191, %arg6, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9233670Z       %193 = arith.cmpi eq, %arg19, %c127_i32 : i32
2026-02-21T09:55:13.9233741Z       %194 = arith.select %193, %cst_2, %192 : tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9233776Z       scf.if %193 {
2026-02-21T09:55:13.9233880Z         %199 = tt.splat %arg20 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9233965Z         %200 = arith.addi %199, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9234056Z         %201 = tt.splat %arg21 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9234140Z         %202 = arith.addi %201, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9234229Z         %203 = arith.truncf %192 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.9234370Z         %204 = tt.expand_dims %200 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9234431Z         %205 = arith.muli %204, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9234568Z         %206 = tt.expand_dims %202 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.9234655Z         %207 = tt.broadcast %205 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9234737Z         %208 = tt.broadcast %206 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9234796Z         %209 = arith.addi %207, %208 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9234910Z         %210 = tt.addptr %20, %209 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9234975Z         tt.store %210, %203 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.9235007Z       }
2026-02-21T09:55:13.9235054Z       %195 = arith.addi %arg12, %c1_i32 : i32
2026-02-21T09:55:13.9235099Z       %196 = arith.cmpi slt, %195, %c2_i32 : i32
2026-02-21T09:55:13.9235145Z       %197 = arith.select %196, %195, %c0_i32 : i32
2026-02-21T09:55:13.9235328Z       %198 = ttg.memdesc_index %27[%197] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9235471Z       ttg.local_store %161, %198 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:13.9236042Z       scf.yield %150, %153#5, %194, %153#0, %153#1, %153#2, %153#3, %153#4, %197, %arg14, %152, %arg16, %198, %arg10, %arg11, %arg4, %arg7, %arg8 : i32, i32, tensor<128x128xf32, #mma>, i32, i32, tensor<128x8xi32, #blocked1>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32, i32, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, tensor<4x128xi64, #blocked2>, tensor<4x128xi1, #blocked2>, i32, i32, i32
2026-02-21T09:55:13.9236086Z     } {tt.num_stages = 3 : i32}
2026-02-21T09:55:13.9236132Z     %71 = arith.cmpi sge, %26, %c1_i32 : i32
2026-02-21T09:55:13.9236187Z     %72 = arith.cmpi sge, %26, %c2_i32 : i32
2026-02-21T09:55:13.9236383Z     %73 = ttg.local_load %70#11 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9236576Z     %74 = arith.extf %73 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9236618Z     %75 = arith.extsi %70#9 : i32 to i64
2026-02-21T09:55:13.9236707Z     %76 = tt.splat %75 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9236792Z     %77 = arith.addi %76, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9236947Z     %78 = tt.expand_dims %77 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9237009Z     %79 = arith.muli %78, %cst_5 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9237097Z     %80 = tt.broadcast %79 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9237157Z     %81 = arith.addi %80, %70#13 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9237267Z     %82 = tt.addptr %9, %81 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9237335Z     %83 = arith.cmpi sge, %78, %cst_6 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9237398Z     %84 = arith.cmpi slt, %78, %cst_7 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9237454Z     %85 = arith.andi %83, %84 : tensor<4x1xi1, #blocked2>
2026-02-21T09:55:13.9237542Z     %86 = tt.broadcast %85 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9237602Z     %87 = arith.andi %86, %70#14 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9237659Z     %88 = tt.splat %71 : i1 -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9237716Z     %89 = arith.andi %88, %87 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9237785Z     %90 = tt.load %82, %89, %cst_11 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9237844Z     %91 = arith.shli %90, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9237908Z     %92 = arith.shrsi %91, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9238049Z     %93 = ttg.convert_layout %92 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9238108Z     %94 = arith.shrsi %90, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9238264Z     %95 = ttg.convert_layout %94 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9238407Z     %96 = tt.expand_dims %93 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9238549Z     %97 = tt.expand_dims %95 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9238643Z     %98 = tt.broadcast %96 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9238742Z     %99 = arith.select %17, %98, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9238834Z     %100 = tt.broadcast %97 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9238933Z     %101 = arith.select %19, %100, %99 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9239023Z     %102 = tt.reshape %101 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9239113Z     %103 = arith.sitofp %102 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9239232Z     %104 = ttg.local_alloc %103 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9239402Z     %105 = ttg.local_load %104 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9239470Z     %106 = scf.if %71 -> (tensor<128x128xf32, #mma>) {
2026-02-21T09:55:13.9239733Z       %147 = tt.dot %74, %105, %70#2, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9239780Z       scf.yield %147 : tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9239813Z     } else {
2026-02-21T09:55:13.9239861Z       scf.yield %70#2 : tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9239894Z     }
2026-02-21T09:55:13.9239941Z     %107 = arith.cmpi eq, %70#15, %c127_i32 : i32
2026-02-21T09:55:13.9240010Z     %108 = arith.select %107, %cst_2, %106 : tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9240069Z     %109 = arith.andi %71, %107 : i1
2026-02-21T09:55:13.9240102Z     scf.if %109 {
2026-02-21T09:55:13.9240189Z       %147 = tt.splat %70#16 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9240276Z       %148 = arith.addi %147, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9240363Z       %149 = tt.splat %70#17 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9240458Z       %150 = arith.addi %149, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9240544Z       %151 = arith.truncf %106 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.9240688Z       %152 = tt.expand_dims %148 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9240746Z       %153 = arith.muli %152, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9240885Z       %154 = tt.expand_dims %150 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.9240969Z       %155 = tt.broadcast %153 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9241051Z       %156 = tt.broadcast %154 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9241110Z       %157 = arith.addi %155, %156 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9241209Z       %158 = tt.addptr %20, %157 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9241271Z       tt.store %158, %151 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.9241302Z     }
2026-02-21T09:55:13.9241372Z     %110 = arith.select %71, %108, %70#2 : tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9241582Z     %111 = ttg.local_load %70#12 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9241777Z     %112 = arith.extf %111 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9241824Z     %113 = arith.extsi %70#10 : i32 to i64
2026-02-21T09:55:13.9241914Z     %114 = tt.splat %113 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9242002Z     %115 = arith.addi %114, %11 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:13.9242147Z     %116 = tt.expand_dims %115 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9242209Z     %117 = arith.muli %116, %cst_5 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9242299Z     %118 = tt.broadcast %117 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9242363Z     %119 = arith.addi %118, %70#6 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9242463Z     %120 = tt.addptr %9, %119 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:13.9242530Z     %121 = arith.cmpi sge, %116, %cst_6 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9248194Z     %122 = arith.cmpi slt, %116, %cst_7 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:13.9248314Z     %123 = arith.andi %121, %122 : tensor<4x1xi1, #blocked2>
2026-02-21T09:55:13.9248408Z     %124 = tt.broadcast %123 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9248473Z     %125 = arith.andi %124, %70#7 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9248532Z     %126 = tt.splat %72 : i1 -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9248591Z     %127 = arith.andi %126, %125 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:13.9248665Z     %128 = tt.load %120, %127, %cst_11 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:13.9248731Z     %129 = arith.shli %128, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9248794Z     %130 = arith.shrsi %129, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9248968Z     %131 = ttg.convert_layout %130 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9249032Z     %132 = arith.shrsi %128, %cst_13 : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:13.9249175Z     %133 = ttg.convert_layout %132 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:13.9249343Z     %134 = tt.expand_dims %131 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9249491Z     %135 = tt.expand_dims %133 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:13.9249588Z     %136 = tt.broadcast %134 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9249692Z     %137 = arith.select %17, %136, %cst_12 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9249787Z     %138 = tt.broadcast %135 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9249885Z     %139 = arith.select %19, %138, %137 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:13.9249975Z     %140 = tt.reshape %139 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:13.9250067Z     %141 = arith.sitofp %140 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:13.9250184Z     %142 = ttg.local_alloc %141 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:13.9250354Z     %143 = ttg.local_load %142 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:13.9250429Z     %144 = scf.if %72 -> (tensor<128x128xf32, #mma>) {
2026-02-21T09:55:13.9250695Z       %147 = tt.dot %112, %143, %110, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9250745Z       scf.yield %147 : tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9250780Z     } else {
2026-02-21T09:55:13.9250829Z       scf.yield %110 : tensor<128x128xf32, #mma>
2026-02-21T09:55:13.9250860Z     }
2026-02-21T09:55:13.9250906Z     %145 = arith.cmpi eq, %70#0, %c127_i32 : i32
2026-02-21T09:55:13.9250947Z     %146 = arith.andi %72, %145 : i1
2026-02-21T09:55:13.9250982Z     scf.if %146 {
2026-02-21T09:55:13.9251071Z       %147 = tt.splat %70#3 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9251158Z       %148 = arith.addi %147, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:13.9251242Z       %149 = tt.splat %70#4 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9251324Z       %150 = arith.addi %149, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:13.9251415Z       %151 = arith.truncf %144 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:13.9251556Z       %152 = tt.expand_dims %148 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9251629Z       %153 = arith.muli %152, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:13.9251768Z       %154 = tt.expand_dims %150 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:13.9251852Z       %155 = tt.broadcast %153 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9251933Z       %156 = tt.broadcast %154 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9251993Z       %157 = arith.addi %155, %156 : tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9252091Z       %158 = tt.addptr %20, %157 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:13.9252170Z       tt.store %158, %151 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:13.9252203Z     }
2026-02-21T09:55:13.9252288Z     ttg.local_dealloc %27 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:13.9252320Z     tt.return
2026-02-21T09:55:13.9252351Z   }
2026-02-21T09:55:13.9252386Z }
2026-02-21T09:55:13.9252390Z 
2026-02-21T09:55:13.9252420Z {-#
2026-02-21T09:55:13.9252460Z   external_resources: {
2026-02-21T09:55:13.9252497Z     mlir_reproducer: {
2026-02-21T09:55:13.9253456Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:55:13.9253499Z       disable_threading: false,
2026-02-21T09:55:13.9253535Z       verify_each: true
2026-02-21T09:55:13.9253566Z     }
2026-02-21T09:55:13.9253596Z   }
2026-02-21T09:55:13.9253627Z #-}
2026-02-21T09:55:13.9253867Z /tmp/torchinductor_root/3p/c3pbjzzhxqydmil5toh4ejnohqlb63egl3ekytu6xyigao735epj.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:55:13.9254276Z /tmp/torchinductor_root/3p/c3pbjzzhxqydmil5toh4ejnohqlb63egl3ekytu6xyigao735epj.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:55:13.9254403Z [644s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:55:13.9255027Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T09:55:13.9255084Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:55:13.9255166Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:55:14.1819626Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:55:14.1824127Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:55:14.1824812Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T09:55:14.1825230Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}>
2026-02-21T09:55:14.1825882Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 4], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:55:14.1826163Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:55:14.1826432Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:55:14.1826552Z #smem = #ttg.shared_memory
2026-02-21T09:55:14.1827012Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:55:14.1827759Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:55:14.1828036Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:55:14.1828234Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:14.1828417Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:14.1828627Z     %cst_2 = arith.constant dense<1024> : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:14.1828747Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:55:14.1829007Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:55:14.1829117Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:55:14.1829225Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:55:14.1829341Z     %c504_i32 = arith.constant 504 : i32
2026-02-21T09:55:14.1829636Z     %cst_4 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:14.1829934Z     %cst_5 = arith.constant dense<504> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:14.1830216Z     %cst_6 = arith.constant dense<508> : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:14.1830407Z     %cst_7 = arith.constant dense<8192> : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1830597Z     %cst_8 = arith.constant dense<0> : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1830785Z     %cst_9 = arith.constant dense<512> : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1830974Z     %cst_10 = arith.constant dense<0> : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:14.1831170Z     %cst_11 = arith.constant dense<8192> : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:14.1831354Z     %cst_12 = arith.constant dense<0> : tensor<4x128xi8, #blocked2>
2026-02-21T09:55:14.1831464Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:55:14.1831635Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:55:14.1831821Z     %cst_13 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1831933Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:55:14.1832021Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:55:14.1832186Z     %cst_14 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1832238Z     %0 = tt.get_program_id x : i32
2026-02-21T09:55:14.1832285Z     %1 = arith.divsi %0, %c256_i32 : i32
2026-02-21T09:55:14.1832334Z     %2 = arith.muli %1, %c4_i32 : i32
2026-02-21T09:55:14.1832380Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:55:14.1832430Z     %4 = arith.minsi %3, %c4_i32 : i32
2026-02-21T09:55:14.1832477Z     %5 = arith.remsi %0, %c256_i32 : i32
2026-02-21T09:55:14.1832535Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:55:14.1832582Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:55:14.1832623Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:55:14.1832666Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:55:14.1832818Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:14.1832956Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:14.1833084Z     %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:14.1833239Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:14.1833353Z     %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:14.1833446Z     %15 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:14.1833551Z     %16 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:14.1833639Z     %17 = arith.addi %15, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:14.1833687Z     %18 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:55:14.1833798Z     %19 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:14.1833888Z     %20 = arith.addi %19, %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:14.1834029Z     %21 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:14.1834200Z     %22 = tt.expand_dims %16 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<128x1xi32, #blocked1>
2026-02-21T09:55:14.1834300Z     %23 = arith.muli %22, %cst_2 : tensor<128x1xi32, #blocked1>
2026-02-21T09:55:14.1834408Z     %24 = tt.broadcast %23 : tensor<128x1xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:14.1834505Z     %25 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:14.1834554Z     %26 = arith.extsi %18 : i32 to i64
2026-02-21T09:55:14.1834646Z     %27 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:14.1834785Z     %28 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:14.1834967Z     %29 = arith.extsi %28 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> to tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:14.1835071Z     %30 = tt.splat %26 : i64 -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:14.1835268Z     %31 = arith.extsi %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> to tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:14.1835370Z     %32 = arith.addi %30, %31 : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:14.1835568Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x128xi64, #blocked2>
2026-02-21T09:55:14.1835674Z     %34 = tt.broadcast %33 : tensor<1x128xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:14.1835759Z     %35 = arith.cmpi sge, %33, %cst_10 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:14.1835834Z     %36 = arith.cmpi slt, %33, %cst_11 : tensor<1x128xi64, #blocked2>
2026-02-21T09:55:14.1835904Z     %37 = arith.andi %35, %36 : tensor<1x128xi1, #blocked2>
2026-02-21T09:55:14.1836005Z     %38 = tt.broadcast %37 : tensor<1x128xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:14.1836188Z     %39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:55:14.1836443Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:55:14.1836608Z     %41 = tt.expand_dims %40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:14.1836692Z     %42 = arith.cmpi eq, %41, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:14.1836794Z     %43 = tt.broadcast %42 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:55:14.1836864Z     %44 = arith.cmpi eq, %41, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:14.1836985Z     %45 = tt.broadcast %44 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:55:14.1837082Z     %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:14.1837246Z     %47 = tt.expand_dims %21 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:55:14.1837346Z     %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:14.1837412Z     %49 = arith.addi %24, %48 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:14.1837528Z     %50 = tt.addptr %25, %49 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:14.1837618Z     %51 = tt.load %50 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:14.1837834Z     %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:14.1837997Z     ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:14.1838110Z     %53 = arith.addi %21, %cst_4 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:14.1838286Z     %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:55:14.1838386Z     %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:14.1838451Z     %56 = arith.addi %24, %55 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:14.1838572Z     %57 = tt.addptr %25, %56 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:14.1838638Z     %58 = tt.load %57 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:14.1838846Z     %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:14.1839006Z     ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:14.1839415Z     %60:4 = scf.for %arg3 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg4 = %cst_3, %arg5 = %c1_i32, %arg6 = %52, %arg7 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>)  : i32 {
2026-02-21T09:55:14.1839468Z       %128 = arith.addi %arg3, %c8_i32 : i32
2026-02-21T09:55:14.1839541Z       %129 = arith.muli %128, %c2_i32 : i32
2026-02-21T09:55:14.1839652Z       %130 = tt.splat %129 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:14.1839758Z       %131 = arith.addi %130, %21 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:14.1839926Z       %132 = tt.expand_dims %131 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x8xi32, #blocked1>
2026-02-21T09:55:14.1840034Z       %133 = tt.broadcast %132 : tensor<1x8xi32, #blocked1> -> tensor<128x8xi32, #blocked1>
2026-02-21T09:55:14.1840110Z       %134 = arith.addi %24, %133 : tensor<128x8xi32, #blocked1>
2026-02-21T09:55:14.1840233Z       %135 = tt.addptr %25, %134 : tensor<128x8x!tt.ptr<bf16>, #blocked1>, tensor<128x8xi32, #blocked1>
2026-02-21T09:55:14.1840308Z       %136 = tt.load %135 : tensor<128x8x!tt.ptr<bf16>, #blocked1>
2026-02-21T09:55:14.1840541Z       %137 = ttg.local_load %arg6 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.1840777Z       %138 = arith.extf %137 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.1840829Z       %139 = arith.extsi %arg3 : i32 to i64
2026-02-21T09:55:14.1840934Z       %140 = tt.splat %139 : i64 -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:14.1841053Z       %141 = arith.addi %140, %29 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:14.1841218Z       %142 = tt.expand_dims %141 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1841297Z       %143 = arith.muli %142, %cst_7 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1841404Z       %144 = tt.broadcast %143 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:14.1841472Z       %145 = arith.addi %144, %34 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:14.1841595Z       %146 = tt.addptr %27, %145 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:14.1841694Z       %147 = arith.cmpi sge, %142, %cst_8 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1841769Z       %148 = arith.cmpi slt, %142, %cst_9 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1841835Z       %149 = arith.andi %147, %148 : tensor<4x1xi1, #blocked2>
2026-02-21T09:55:14.1841924Z       %150 = tt.broadcast %149 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:14.1841984Z       %151 = arith.andi %150, %38 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:14.1842086Z       %152 = tt.load %146, %151, %cst_12 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:14.1842244Z       %153 = ttg.convert_layout %152 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1842342Z       %154 = arith.shli %153, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1842440Z       %155 = arith.shrsi %154, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1842541Z       %156 = arith.shrsi %153, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1842753Z       %157 = tt.expand_dims %155 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:14.1842900Z       %158 = tt.expand_dims %156 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:14.1842997Z       %159 = tt.broadcast %157 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1843102Z       %160 = arith.select %43, %159, %cst_13 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1843215Z       %161 = tt.broadcast %158 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1843316Z       %162 = arith.select %45, %161, %160 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1843404Z       %163 = tt.reshape %162 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked2>
2026-02-21T09:55:14.1843495Z       %164 = arith.sitofp %163 : tensor<8x128xi8, #blocked2> to tensor<8x128xf32, #blocked2>
2026-02-21T09:55:14.1843614Z       %165 = ttg.local_alloc %164 : (tensor<8x128xf32, #blocked2>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:14.1843784Z       %166 = ttg.local_load %165 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.1844054Z       %167 = tt.dot %138, %166, %arg4, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:14.1844101Z       %168 = arith.addi %arg5, %c1_i32 : i32
2026-02-21T09:55:14.1844148Z       %169 = arith.cmpi slt, %168, %c2_i32 : i32
2026-02-21T09:55:14.1844196Z       %170 = arith.select %169, %168, %c0_i32 : i32
2026-02-21T09:55:14.1844375Z       %171 = ttg.memdesc_index %46[%170] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:14.1844514Z       ttg.local_store %136, %171 : tensor<128x8xbf16, #blocked1> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:14.1844749Z       scf.yield %167, %170, %arg7, %171 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:55:14.1844831Z     } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:55:14.1845022Z     %61 = ttg.local_load %60#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.1845213Z     %62 = arith.extf %61 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.1845325Z     %63 = arith.addi %29, %cst_5 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:14.1845465Z     %64 = tt.expand_dims %63 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1845526Z     %65 = arith.muli %64, %cst_7 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1845634Z     %66 = tt.broadcast %65 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:14.1845692Z     %67 = arith.addi %66, %34 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:14.1845788Z     %68 = tt.addptr %27, %67 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:14.1845854Z     %69 = arith.cmpi sge, %64, %cst_8 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1845917Z     %70 = arith.cmpi slt, %64, %cst_9 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1845974Z     %71 = arith.andi %69, %70 : tensor<4x1xi1, #blocked2>
2026-02-21T09:55:14.1846059Z     %72 = tt.broadcast %71 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:14.1846115Z     %73 = arith.andi %72, %38 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:14.1846181Z     %74 = tt.load %68, %73, %cst_12 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:14.1846321Z     %75 = ttg.convert_layout %74 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1846417Z     %76 = arith.shli %75, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1846512Z     %77 = arith.shrsi %76, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1846604Z     %78 = arith.shrsi %75, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1846769Z     %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:14.1846910Z     %80 = tt.expand_dims %78 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:14.1847001Z     %81 = tt.broadcast %79 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1847102Z     %82 = arith.select %43, %81, %cst_13 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1847189Z     %83 = tt.broadcast %80 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1847283Z     %84 = arith.select %45, %83, %82 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1847368Z     %85 = tt.reshape %84 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked2>
2026-02-21T09:55:14.1847455Z     %86 = arith.sitofp %85 : tensor<8x128xi8, #blocked2> to tensor<8x128xf32, #blocked2>
2026-02-21T09:55:14.1847567Z     %87 = ttg.local_alloc %86 : (tensor<8x128xf32, #blocked2>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:14.1847732Z     %88 = ttg.local_load %87 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.1847988Z     %89 = tt.dot %62, %88, %60#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:14.1848197Z     %90 = ttg.local_load %60#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.1848388Z     %91 = arith.extf %90 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.1848480Z     %92 = arith.addi %29, %cst_6 : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:14.1848636Z     %93 = tt.expand_dims %92 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1848695Z     %94 = arith.muli %93, %cst_7 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1848781Z     %95 = tt.broadcast %94 : tensor<4x1xi64, #blocked2> -> tensor<4x128xi64, #blocked2>
2026-02-21T09:55:14.1848837Z     %96 = arith.addi %95, %34 : tensor<4x128xi64, #blocked2>
2026-02-21T09:55:14.1848933Z     %97 = tt.addptr %27, %96 : tensor<4x128x!tt.ptr<i8>, #blocked2>, tensor<4x128xi64, #blocked2>
2026-02-21T09:55:14.1849010Z     %98 = arith.cmpi sge, %93, %cst_8 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1849072Z     %99 = arith.cmpi slt, %93, %cst_9 : tensor<4x1xi64, #blocked2>
2026-02-21T09:55:14.1849130Z     %100 = arith.andi %98, %99 : tensor<4x1xi1, #blocked2>
2026-02-21T09:55:14.1849217Z     %101 = tt.broadcast %100 : tensor<4x1xi1, #blocked2> -> tensor<4x128xi1, #blocked2>
2026-02-21T09:55:14.1849274Z     %102 = arith.andi %101, %38 : tensor<4x128xi1, #blocked2>
2026-02-21T09:55:14.1849346Z     %103 = tt.load %97, %102, %cst_12 : tensor<4x128x!tt.ptr<i8>, #blocked2>
2026-02-21T09:55:14.1849488Z     %104 = ttg.convert_layout %103 : tensor<4x128xi8, #blocked2> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1849587Z     %105 = arith.shli %104, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1849685Z     %106 = arith.shrsi %105, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1849779Z     %107 = arith.shrsi %104, %cst_14 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.1849925Z     %108 = tt.expand_dims %106 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:14.1850088Z     %109 = tt.expand_dims %107 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:14.1850182Z     %110 = tt.broadcast %108 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1850284Z     %111 = arith.select %43, %110, %cst_13 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1850377Z     %112 = tt.broadcast %109 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1850475Z     %113 = arith.select %45, %112, %111 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.1850562Z     %114 = tt.reshape %113 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked2>
2026-02-21T09:55:14.1850654Z     %115 = arith.sitofp %114 : tensor<8x128xi8, #blocked2> to tensor<8x128xf32, #blocked2>
2026-02-21T09:55:14.1850771Z     %116 = ttg.local_alloc %115 : (tensor<8x128xf32, #blocked2>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:14.1850938Z     %117 = ttg.local_load %116 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.1851198Z     %118 = tt.dot %91, %117, %89, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:14.1851294Z     ttg.local_dealloc %46 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:55:14.1851380Z     %119 = arith.truncf %118 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:14.1851519Z     %120 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:14.1851575Z     %121 = arith.muli %120, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:14.1851708Z     %122 = tt.expand_dims %20 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:14.1851793Z     %123 = tt.broadcast %121 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:14.1851888Z     %124 = tt.broadcast %122 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:14.1851944Z     %125 = arith.addi %123, %124 : tensor<128x128xi32, #mma>
2026-02-21T09:55:14.1852023Z     %126 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:14.1852122Z     %127 = tt.addptr %126, %125 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:14.1852185Z     tt.store %127, %119 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:14.1852219Z     tt.return
2026-02-21T09:55:14.1852268Z   }
2026-02-21T09:55:14.1852299Z }
2026-02-21T09:55:14.1852304Z 
2026-02-21T09:55:14.1852333Z {-#
2026-02-21T09:55:14.1852375Z   external_resources: {
2026-02-21T09:55:14.1852411Z     mlir_reproducer: {
2026-02-21T09:55:14.1853346Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:55:14.1853392Z       disable_threading: false,
2026-02-21T09:55:14.1853427Z       verify_each: true
2026-02-21T09:55:14.1853458Z     }
2026-02-21T09:55:14.1853489Z   }
2026-02-21T09:55:14.1853518Z #-}
2026-02-21T09:55:14.1853754Z /tmp/torchinductor_root/mm/cmmpqft2zdf5a5jb2psuhnxqw36eba67uxhlyoxpyotpi5naouph.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:55:14.1854180Z /tmp/torchinductor_root/mm/cmmpqft2zdf5a5jb2psuhnxqw36eba67uxhlyoxpyotpi5naouph.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:55:14.1854293Z [645s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:55:14.1854861Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:55:14.1854918Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:55:14.1854998Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:55:14.8219658Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:55:14.8222479Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:55:14.8222950Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:55:14.8223257Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:55:14.8223550Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:55:14.8223822Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:55:14.8224072Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:55:14.8224301Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:55:14.8224530Z #smem = #ttg.shared_memory
2026-02-21T09:55:14.8224758Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:55:14.8225225Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:55:14.8225629Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:55:14.8225807Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:14.8225979Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:14.8226159Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:55:14.8226344Z     %cst_3 = arith.constant dense<8192> : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:14.8226525Z     %cst_4 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:14.8226674Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:55:14.8226792Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:55:14.8226933Z     %cst_5 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.8227080Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:55:14.8227194Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:55:14.8227305Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:55:14.8227493Z     %cst_6 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.8227681Z     %0 = tt.get_program_id x : i32
2026-02-21T09:55:14.8227792Z     %1 = arith.divsi %0, %c128_i32 : i32
2026-02-21T09:55:14.8227907Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T09:55:14.8228050Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T09:55:14.8228164Z     %4 = arith.minsi %3, %c2_i32 : i32
2026-02-21T09:55:14.8228273Z     %5 = arith.remsi %0, %c128_i32 : i32
2026-02-21T09:55:14.8228391Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T09:55:14.8228498Z     %7 = arith.addi %2, %6 : i32
2026-02-21T09:55:14.8228603Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T09:55:14.8228707Z     %9 = arith.muli %7, %c128_i32 : i32
2026-02-21T09:55:14.8228910Z     %10 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:14.8229189Z     %11 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:14.8229458Z     %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:14.8229729Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:14.8229975Z     %14 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:14.8230190Z     %15 = tt.splat %9 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:14.8230399Z     %16 = arith.addi %14, %10 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:55:14.8230606Z     %17 = arith.addi %15, %11 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:55:14.8230786Z     %18 = arith.muli %8, %c128_i32 : i32
2026-02-21T09:55:14.8230949Z     %19 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:14.8231156Z     %20 = tt.splat %18 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:14.8231360Z     %21 = arith.addi %19, %12 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:55:14.8231563Z     %22 = arith.addi %20, %13 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:55:14.8231799Z     %23 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:14.8232085Z     %24 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:14.8232388Z     %25 = tt.expand_dims %16 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:55:14.8232640Z     %26 = arith.muli %25, %cst_4 : tensor<128x1xi32, #blocked2>
2026-02-21T09:55:14.8232833Z     %27 = tt.broadcast %26 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:14.8233063Z     %28 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:14.8233325Z     %29 = tt.expand_dims %21 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:55:14.8233601Z     %30 = tt.broadcast %29 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:14.8233809Z     %31 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:14.8234080Z     %32 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:55:14.8234487Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:55:14.8234878Z     %34 = tt.expand_dims %33 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:14.8235127Z     %35 = arith.cmpi eq, %34, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:14.8235321Z     %36 = tt.broadcast %35 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:55:14.8235511Z     %37 = arith.cmpi eq, %34, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:55:14.8235716Z     %38 = tt.broadcast %37 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:55:14.8235974Z     %39 = scf.for %arg3 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg4 = %cst_2) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:55:14.8236239Z       %49 = tt.splat %arg3 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:14.8236459Z       %50 = arith.addi %49, %23 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:55:14.8236630Z       %51 = arith.muli %arg3, %c2_i32 : i32
2026-02-21T09:55:14.8236792Z       %52 = tt.splat %51 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:14.8237000Z       %53 = arith.addi %52, %24 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:55:14.8237265Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:55:14.8237533Z       %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:55:14.8237719Z       %56 = arith.addi %27, %55 : tensor<128x8xi32, #blocked2>
2026-02-21T09:55:14.8237917Z       %57 = tt.addptr %28, %56 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:55:14.8238115Z       %58 = tt.load %57 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:55:14.8238332Z       %59 = ttg.local_alloc %58 : (tensor<128x8xbf16, #blocked2>) -> !ttg.memdesc<128x8xbf16, #shared, #smem>
2026-02-21T09:55:14.8238681Z       %60 = ttg.local_load %59 : !ttg.memdesc<128x8xbf16, #shared, #smem> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.8239082Z       %61 = arith.extf %60 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.8239463Z       %62 = tt.expand_dims %50 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:55:14.8239701Z       %63 = arith.muli %62, %cst_3 : tensor<4x1xi32, #blocked1>
2026-02-21T09:55:14.8239907Z       %64 = tt.broadcast %63 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:55:14.8240093Z       %65 = arith.addi %64, %30 : tensor<4x128xi32, #blocked1>
2026-02-21T09:55:14.8240285Z       %66 = tt.addptr %31, %65 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:55:14.8240479Z       %67 = tt.load %66 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:55:14.8240750Z       %68 = ttg.convert_layout %67 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.8241025Z       %69 = arith.shli %68, %cst_6 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.8241255Z       %70 = arith.shrsi %69, %cst_6 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.8241480Z       %71 = arith.shrsi %68, %cst_6 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:55:14.8241761Z       %72 = tt.expand_dims %70 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:14.8242092Z       %73 = tt.expand_dims %71 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:55:14.8242370Z       %74 = tt.broadcast %72 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.8242676Z       %75 = arith.select %36, %74, %cst_5 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.8242906Z       %76 = tt.broadcast %73 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.8243135Z       %77 = arith.select %38, %76, %75 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:55:14.8243380Z       %78 = tt.reshape %77 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:55:14.8243595Z       %79 = arith.sitofp %78 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:55:14.8243846Z       %80 = ttg.local_alloc %79 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:55:14.8244165Z       %81 = ttg.local_load %80 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:55:14.8244638Z       %82 = tt.dot %61, %81, %arg4, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:55:14.8244988Z       scf.yield %82 : tensor<128x128xf32, #mma>
2026-02-21T09:55:14.8245134Z     } {tt.disallow_acc_multi_buffer, tt.num_stages = 1 : i32}
2026-02-21T09:55:14.8245322Z     %40 = arith.truncf %39 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:55:14.8245586Z     %41 = tt.expand_dims %17 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:55:14.8245820Z     %42 = arith.muli %41, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:55:14.8246044Z     %43 = tt.expand_dims %22 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:55:14.8246294Z     %44 = tt.broadcast %42 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:14.8246510Z     %45 = tt.broadcast %43 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:55:14.8246684Z     %46 = arith.addi %44, %45 : tensor<128x128xi32, #mma>
2026-02-21T09:55:14.8246856Z     %47 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:14.8247068Z     %48 = tt.addptr %47, %46 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:55:14.8247259Z     tt.store %48, %40 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:55:14.8247390Z     tt.return
2026-02-21T09:55:14.8247469Z   }
2026-02-21T09:55:14.8247540Z }
2026-02-21T09:55:14.8247583Z 
2026-02-21T09:55:14.8247634Z {-#
2026-02-21T09:55:14.8247714Z   external_resources: {
2026-02-21T09:55:14.8247810Z     mlir_reproducer: {
2026-02-21T09:55:14.8248817Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:55:14.8249799Z       disable_threading: false,
2026-02-21T09:55:14.8249904Z       verify_each: true
2026-02-21T09:55:14.8249993Z     }
2026-02-21T09:55:14.8250064Z   }
2026-02-21T09:55:14.8250132Z #-}
2026-02-21T09:55:14.8250407Z /tmp/torchinductor_root/4s/c4s22dpmjftsnnhkpn6wss5xkukvnvtbi4gqvbg7rc7f6mabm3hf.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:55:14.8251088Z /tmp/torchinductor_root/4s/c4s22dpmjftsnnhkpn6wss5xkukvnvtbi4gqvbg7rc7f6mabm3hf.py:13:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:55:14.8251642Z [645s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:55:14.8252386Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:55:14.8253037Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:55:14.8253203Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:55:15.4173893Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 61/61 12.6 configs/s
2026-02-21T09:55:19.3748194Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 214/214 36.2 configs/s
2026-02-21T09:55:22.7464716Z [653s] Generation 10 complete: 
2026-02-21T09:55:22.7464962Z error=10
2026-02-21T09:55:22.7465059Z ok=53
2026-02-21T09:55:22.7465190Z min=0.9404
2026-02-21T09:55:22.7465276Z mid=1.4120
2026-02-21T09:55:22.7465369Z max=29.3351
2026-02-21T09:55:22.7465476Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:55:22.7465654Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:55:22.7465814Z  'l2_groupings': [2],
2026-02-21T09:55:22.7465943Z  'load_eviction_policies': ['', ''],
2026-02-21T09:55:22.7466083Z  'loop_orders': [[0, 1]],
2026-02-21T09:55:22.7466226Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:55:22.7466344Z  'num_stages': 1,
2026-02-21T09:55:22.7466446Z  'num_warps': 4,
2026-02-21T09:55:22.7466551Z  'pid_type': 'flat',
2026-02-21T09:55:22.7466677Z  'range_flattens': [None, None],
2026-02-21T09:55:22.7466809Z  'range_multi_buffers': [None, False],
2026-02-21T09:55:22.7467204Z  'range_num_stages': [0, 1],
2026-02-21T09:55:22.7467334Z  'range_unroll_factors': [0, 0],
2026-02-21T09:55:22.7467462Z  'range_warp_specializes': [],
2026-02-21T09:55:22.7467588Z  'waves_per_eu': 2}
2026-02-21T09:55:22.7537408Z [653s] Fitting surrogate: 973 points, 973 targets
2026-02-21T09:55:23.3132565Z [654s] Generation 11 starting: 42 neighbors, 2 active search path(s)
2026-02-21T09:55:44.1868171Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 1.3 configs/s
2026-02-21T09:55:48.6099098Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 9.7 configs/s
2026-02-21T09:55:51.4537167Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━ 214/214 36.5 configs/s
2026-02-21T09:55:53.8901585Z [684s] Generation 11 complete: 
2026-02-21T09:55:53.8901978Z error=2
2026-02-21T09:55:53.8902186Z ok=43
2026-02-21T09:55:53.8902392Z min=0.9382
2026-02-21T09:55:53.8902596Z mid=1.4550
2026-02-21T09:55:53.8902793Z max=32.6063
2026-02-21T09:55:53.8903021Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:55:53.8903434Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:55:53.8903790Z  'l2_groupings': [2],
2026-02-21T09:55:53.8904070Z  'load_eviction_policies': ['', ''],
2026-02-21T09:55:53.8904782Z  'loop_orders': [[0, 1]],
2026-02-21T09:55:53.8905060Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:55:53.8905327Z  'num_stages': 1,
2026-02-21T09:55:53.8905552Z  'num_warps': 4,
2026-02-21T09:55:53.8905787Z  'pid_type': 'flat',
2026-02-21T09:55:53.8906043Z  'range_flattens': [None, None],
2026-02-21T09:55:53.8906358Z  'range_multi_buffers': [None, False],
2026-02-21T09:55:53.8906664Z  'range_num_stages': [0, 1],
2026-02-21T09:55:53.8906942Z  'range_unroll_factors': [0, 0],
2026-02-21T09:55:53.8907246Z  'range_warp_specializes': [],
2026-02-21T09:55:53.8907520Z  'waves_per_eu': 2}
2026-02-21T09:55:53.8964464Z [684s] Fitting surrogate: 1018 points, 1018 targets
2026-02-21T09:55:54.4155352Z [685s] Generation 12 starting: 42 neighbors, 2 active search path(s)
2026-02-21T09:56:07.7178842Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 1.9 configs/s
2026-02-21T09:56:10.1147525Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:56:10.1157283Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:56:10.1158535Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:56:10.1159217Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:56:10.1159896Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:56:10.1160525Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:56:10.1161107Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:56:10.1161633Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:56:10.1162043Z #smem = #ttg.shared_memory
2026-02-21T09:56:10.1162642Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:56:10.1163701Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:56:10.1164693Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:56:10.1165085Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:10.1165477Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:10.1165936Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1166229Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T09:56:10.1166438Z     %c504_i32 = arith.constant 504 : i32
2026-02-21T09:56:10.1166780Z     %cst_3 = arith.constant dense<504> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1167232Z     %cst_4 = arith.constant dense<508> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1167669Z     %cst_5 = arith.constant dense<8> : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1168053Z     %cst_6 = arith.constant dense<8192> : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1168378Z     %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.1168645Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:56:10.1168848Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:56:10.1169053Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:56:10.1169265Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:56:10.1169467Z     %c3_i32 = arith.constant 3 : i32
2026-02-21T09:56:10.1169728Z     %cst_8 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1170068Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:56:10.1170270Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:56:10.1170460Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:56:10.1170782Z     %cst_9 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1171146Z     %0 = tt.get_program_id x : i32
2026-02-21T09:56:10.1171350Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:56:10.1171562Z     %2 = arith.minsi %1, %c8192_i32 : i32
2026-02-21T09:56:10.1171919Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.1172419Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.1172897Z     %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.1173387Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.1173862Z     %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1174337Z     %8 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1174774Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1175141Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1175640Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:56:10.1176276Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:56:10.1176812Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:10.1177154Z     %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:10.1177415Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:56:10.1177676Z     %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:10.1177959Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:56:10.1178232Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:10.1178450Z     %19 = arith.subi %2, %0 : i32
2026-02-21T09:56:10.1178598Z     %20 = arith.remsi %19, %c3_i32 : i32
2026-02-21T09:56:10.1178780Z     %21 = arith.subi %19, %20 : i32
2026-02-21T09:56:10.1178935Z     %22 = arith.addi %0, %21 : i32
2026-02-21T09:56:10.1179097Z     scf.for %arg3 = %0 to %22 step %c3_i32  : i32 {
2026-02-21T09:56:10.1179296Z       %23 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:56:10.1179459Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:56:10.1179622Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:10.1179779Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:56:10.1179947Z       %27 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:56:10.1180107Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:10.1180259Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:10.1180417Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:10.1180566Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:10.1180792Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.1181082Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.1181368Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.1181674Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.1181896Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:56:10.1182117Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.1182422Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.1182706Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.1182982Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.1183345Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.1183690Z       %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.1183948Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1184323Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:10.1184685Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1184982Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.1185344Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.1185708Z       %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1185929Z       %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1186198Z       %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1186418Z       %51 = tt.load %50 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1186728Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1187118Z       ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1187416Z       %53 = arith.addi %8, %cst_5 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1187722Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.1188013Z       %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1188218Z       %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1188440Z       %57 = tt.addptr %9, %56 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1188659Z       %58 = tt.load %57 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1188954Z       %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1189334Z       ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1189892Z       %60:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>)  : i32 {
2026-02-21T09:56:10.1190399Z         %281 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1190648Z         %282 = arith.addi %281, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1190840Z         %283 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:56:10.1190987Z         %284 = arith.muli %283, %c2_i32 : i32
2026-02-21T09:56:10.1191173Z         %285 = tt.splat %284 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1191407Z         %286 = arith.addi %285, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1191722Z         %287 = tt.expand_dims %286 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.1192025Z         %288 = tt.broadcast %287 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1192238Z         %289 = arith.addi %43, %288 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1192460Z         %290 = tt.addptr %9, %289 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1192687Z         %291 = tt.load %290 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1193021Z         %292 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1193494Z         %293 = arith.extf %292 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1193903Z         %294 = tt.expand_dims %282 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1194173Z         %295 = arith.muli %294, %cst_6 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1194380Z         %296 = tt.broadcast %295 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1194588Z         %297 = arith.addi %296, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1194811Z         %298 = tt.addptr %10, %297 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1195026Z         %299 = tt.load %298 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1195293Z         %300 = ttg.convert_layout %299 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1195593Z         %301 = arith.shli %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1195851Z         %302 = arith.shrsi %301, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1196107Z         %303 = arith.shrsi %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1196432Z         %304 = tt.expand_dims %302 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1196783Z         %305 = tt.expand_dims %303 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1197098Z         %306 = tt.broadcast %304 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1197349Z         %307 = arith.select %15, %306, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1197596Z         %308 = tt.broadcast %305 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1197835Z         %309 = arith.select %17, %308, %307 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1198074Z         %310 = tt.reshape %309 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1198303Z         %311 = arith.sitofp %310 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1198567Z         %312 = ttg.local_alloc %311 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1198902Z         %313 = ttg.local_load %312 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1199401Z         %314 = tt.dot %293, %313, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1199761Z         %315 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:10.1199914Z         %316 = arith.cmpi slt, %315, %c2_i32 : i32
2026-02-21T09:56:10.1200052Z         %317 = arith.select %316, %315, %c0_i32 : i32
2026-02-21T09:56:10.1200332Z         %318 = ttg.memdesc_index %46[%317] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1200694Z         ttg.local_store %291, %318 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1201102Z         scf.yield %314, %317, %arg8, %318 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1201452Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:10.1201667Z       %61 = arith.addi %7, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1202018Z       %62 = ttg.local_load %60#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1202445Z       %63 = arith.extf %62 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1202863Z       %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1203107Z       %65 = arith.muli %64, %cst_6 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1203293Z       %66 = tt.broadcast %65 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1203485Z       %67 = arith.addi %66, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1203680Z       %68 = tt.addptr %10, %67 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1203874Z       %69 = tt.load %68 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1204115Z       %70 = ttg.convert_layout %69 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1204405Z       %71 = arith.shli %70, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1204634Z       %72 = arith.shrsi %71, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1204861Z       %73 = arith.shrsi %70, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1205157Z       %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1205488Z       %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1205765Z       %76 = tt.broadcast %74 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1206002Z       %77 = arith.select %15, %76, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1206232Z       %78 = tt.broadcast %75 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1206458Z       %79 = arith.select %17, %78, %77 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1206681Z       %80 = tt.reshape %79 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1206899Z       %81 = arith.sitofp %80 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1207147Z       %82 = ttg.local_alloc %81 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1207485Z       %83 = ttg.local_load %82 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1207961Z       %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1208348Z       %85 = arith.addi %7, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1208673Z       %86 = ttg.local_load %60#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1209095Z       %87 = arith.extf %86 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1209472Z       %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1209710Z       %89 = arith.muli %88, %cst_6 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1209897Z       %90 = tt.broadcast %89 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1210086Z       %91 = arith.addi %90, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1210275Z       %92 = tt.addptr %10, %91 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1210470Z       %93 = tt.load %92 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1210704Z       %94 = ttg.convert_layout %93 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1210979Z       %95 = arith.shli %94, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1211208Z       %96 = arith.shrsi %95, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1211435Z       %97 = arith.shrsi %94, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1211718Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1212047Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1212344Z       %100 = tt.broadcast %98 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1212585Z       %101 = arith.select %15, %100, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1212818Z       %102 = tt.broadcast %99 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1213064Z       %103 = arith.select %17, %102, %101 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1213296Z       %104 = tt.reshape %103 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1213522Z       %105 = arith.sitofp %104 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1213783Z       %106 = ttg.local_alloc %105 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1214111Z       %107 = ttg.local_load %106 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1214580Z       %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1214972Z       ttg.local_dealloc %46 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.1215190Z       %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:10.1215478Z       %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:10.1215714Z       %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:10.1215959Z       %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:10.1216222Z       %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1216430Z       %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1216613Z       %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1216805Z       %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1217009Z       tt.store %116, %109 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:10.1217151Z       %117 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:56:10.1217277Z       %118 = arith.divsi %117, %c256_i32 : i32
2026-02-21T09:56:10.1217401Z       %119 = arith.muli %118, %c4_i32 : i32
2026-02-21T09:56:10.1217519Z       %120 = arith.subi %c128_i32, %119 : i32
2026-02-21T09:56:10.1217639Z       %121 = arith.minsi %120, %c4_i32 : i32
2026-02-21T09:56:10.1217757Z       %122 = arith.remsi %117, %c256_i32 : i32
2026-02-21T09:56:10.1217875Z       %123 = arith.remsi %122, %121 : i32
2026-02-21T09:56:10.1217989Z       %124 = arith.addi %119, %123 : i32
2026-02-21T09:56:10.1218104Z       %125 = arith.divsi %122, %121 : i32
2026-02-21T09:56:10.1218220Z       %126 = arith.muli %124, %c128_i32 : i32
2026-02-21T09:56:10.1218395Z       %127 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.1218616Z       %128 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.1218835Z       %129 = arith.addi %127, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.1219052Z       %130 = arith.addi %128, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.1219217Z       %131 = arith.muli %125, %c128_i32 : i32
2026-02-21T09:56:10.1219388Z       %132 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.1219606Z       %133 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.1219837Z       %134 = arith.addi %132, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.1220051Z       %135 = arith.addi %133, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.1220322Z       %136 = tt.expand_dims %129 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.1220596Z       %137 = arith.muli %136, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.1220794Z       %138 = tt.broadcast %137 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1221075Z       %139 = tt.expand_dims %134 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:10.1221359Z       %140 = tt.broadcast %139 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1221581Z       %141 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.1221770Z       %142 = arith.addi %138, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1221971Z       %143 = tt.addptr %9, %142 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1222176Z       %144 = tt.load %143 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1222465Z       %145 = ttg.memdesc_index %141[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1222838Z       ttg.local_store %144, %145 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1223082Z       %146 = arith.addi %138, %55 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1223282Z       %147 = tt.addptr %9, %146 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1223502Z       %148 = tt.load %147 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1223785Z       %149 = ttg.memdesc_index %141[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1224140Z       ttg.local_store %148, %149 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1224672Z       %150:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %145, %arg8 = %149) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>)  : i32 {
2026-02-21T09:56:10.1225151Z         %281 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1225379Z         %282 = arith.addi %281, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1225559Z         %283 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:56:10.1225679Z         %284 = arith.muli %283, %c2_i32 : i32
2026-02-21T09:56:10.1225851Z         %285 = tt.splat %284 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1226073Z         %286 = arith.addi %285, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1226348Z         %287 = tt.expand_dims %286 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.1226628Z         %288 = tt.broadcast %287 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1226825Z         %289 = arith.addi %138, %288 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1227030Z         %290 = tt.addptr %9, %289 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1227237Z         %291 = tt.load %290 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1227539Z         %292 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1228005Z         %293 = arith.extf %292 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1228388Z         %294 = tt.expand_dims %282 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1228652Z         %295 = arith.muli %294, %cst_6 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1228847Z         %296 = tt.broadcast %295 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1229038Z         %297 = arith.addi %296, %140 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1229241Z         %298 = tt.addptr %10, %297 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1229443Z         %299 = tt.load %298 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1229692Z         %300 = ttg.convert_layout %299 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1229979Z         %301 = arith.shli %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1230216Z         %302 = arith.shrsi %301, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1230454Z         %303 = arith.shrsi %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1230766Z         %304 = tt.expand_dims %302 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1231102Z         %305 = tt.expand_dims %303 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1231407Z         %306 = tt.broadcast %304 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1231649Z         %307 = arith.select %15, %306, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1231891Z         %308 = tt.broadcast %305 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1232127Z         %309 = arith.select %17, %308, %307 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1232359Z         %310 = tt.reshape %309 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1232589Z         %311 = arith.sitofp %310 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1232844Z         %312 = ttg.local_alloc %311 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1233176Z         %313 = ttg.local_load %312 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1233655Z         %314 = tt.dot %293, %313, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1234006Z         %315 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:10.1234135Z         %316 = arith.cmpi slt, %315, %c2_i32 : i32
2026-02-21T09:56:10.1234268Z         %317 = arith.select %316, %315, %c0_i32 : i32
2026-02-21T09:56:10.1234539Z         %318 = ttg.memdesc_index %141[%317] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1234900Z         ttg.local_store %291, %318 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1235297Z         scf.yield %314, %317, %arg8, %318 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1235638Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:10.1235974Z       %151 = ttg.local_load %150#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1236409Z       %152 = arith.extf %151 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1236732Z       %153 = arith.addi %66, %140 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1236933Z       %154 = tt.addptr %10, %153 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1237135Z       %155 = tt.load %154 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1237378Z       %156 = ttg.convert_layout %155 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1237657Z       %157 = arith.shli %156, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1237897Z       %158 = arith.shrsi %157, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1238130Z       %159 = arith.shrsi %156, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1238422Z       %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1238774Z       %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1239057Z       %162 = tt.broadcast %160 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1239297Z       %163 = arith.select %15, %162, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1239547Z       %164 = tt.broadcast %161 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1239784Z       %165 = arith.select %17, %164, %163 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1240017Z       %166 = tt.reshape %165 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1240241Z       %167 = arith.sitofp %166 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1240497Z       %168 = ttg.local_alloc %167 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1240823Z       %169 = ttg.local_load %168 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1241300Z       %170 = tt.dot %152, %169, %150#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1241803Z       %171 = ttg.local_load %150#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1242234Z       %172 = arith.extf %171 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1242533Z       %173 = arith.addi %90, %140 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1242774Z       %174 = tt.addptr %10, %173 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1242973Z       %175 = tt.load %174 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1243220Z       %176 = ttg.convert_layout %175 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1243501Z       %177 = arith.shli %176, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1243739Z       %178 = arith.shrsi %177, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1244000Z       %179 = arith.shrsi %176, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1244286Z       %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1244639Z       %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1244922Z       %182 = tt.broadcast %180 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1245162Z       %183 = arith.select %15, %182, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1245401Z       %184 = tt.broadcast %181 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1245633Z       %185 = arith.select %17, %184, %183 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1245867Z       %186 = tt.reshape %185 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1246088Z       %187 = arith.sitofp %186 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1246340Z       %188 = ttg.local_alloc %187 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1246669Z       %189 = ttg.local_load %188 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1247152Z       %190 = tt.dot %172, %189, %170, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1247557Z       ttg.local_dealloc %141 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.1247775Z       %191 = arith.truncf %190 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:10.1248044Z       %192 = tt.expand_dims %130 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:10.1248286Z       %193 = arith.muli %192, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:10.1248515Z       %194 = tt.expand_dims %135 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:10.1248776Z       %195 = tt.broadcast %193 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1248983Z       %196 = tt.broadcast %194 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1249168Z       %197 = arith.addi %195, %196 : tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1249362Z       %198 = tt.addptr %18, %197 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1249560Z       tt.store %198, %191 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:10.1249705Z       %199 = arith.addi %arg3, %c2_i32 : i32
2026-02-21T09:56:10.1249826Z       %200 = arith.divsi %199, %c256_i32 : i32
2026-02-21T09:56:10.1249946Z       %201 = arith.muli %200, %c4_i32 : i32
2026-02-21T09:56:10.1250067Z       %202 = arith.subi %c128_i32, %201 : i32
2026-02-21T09:56:10.1250185Z       %203 = arith.minsi %202, %c4_i32 : i32
2026-02-21T09:56:10.1250306Z       %204 = arith.remsi %199, %c256_i32 : i32
2026-02-21T09:56:10.1250421Z       %205 = arith.remsi %204, %203 : i32
2026-02-21T09:56:10.1250538Z       %206 = arith.addi %201, %205 : i32
2026-02-21T09:56:10.1250651Z       %207 = arith.divsi %204, %203 : i32
2026-02-21T09:56:10.1250771Z       %208 = arith.muli %206, %c128_i32 : i32
2026-02-21T09:56:10.1250943Z       %209 = tt.splat %208 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.1251162Z       %210 = tt.splat %208 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.1251386Z       %211 = arith.addi %209, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.1251617Z       %212 = arith.addi %210, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.1251788Z       %213 = arith.muli %207, %c128_i32 : i32
2026-02-21T09:56:10.1251954Z       %214 = tt.splat %213 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.1252191Z       %215 = tt.splat %213 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.1252410Z       %216 = arith.addi %214, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.1252621Z       %217 = arith.addi %215, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.1252901Z       %218 = tt.expand_dims %211 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.1253156Z       %219 = arith.muli %218, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.1253357Z       %220 = tt.broadcast %219 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1253640Z       %221 = tt.expand_dims %216 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:10.1253919Z       %222 = tt.broadcast %221 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1254142Z       %223 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.1254342Z       %224 = arith.addi %220, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1254541Z       %225 = tt.addptr %9, %224 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1254747Z       %226 = tt.load %225 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1255051Z       %227 = ttg.memdesc_index %223[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1255414Z       ttg.local_store %226, %227 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1255654Z       %228 = arith.addi %220, %55 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1255854Z       %229 = tt.addptr %9, %228 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1256060Z       %230 = tt.load %229 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1256342Z       %231 = ttg.memdesc_index %223[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1256704Z       ttg.local_store %230, %231 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1257229Z       %232:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %227, %arg8 = %231) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>)  : i32 {
2026-02-21T09:56:10.1257704Z         %281 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1257938Z         %282 = arith.addi %281, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1258116Z         %283 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:56:10.1258242Z         %284 = arith.muli %283, %c2_i32 : i32
2026-02-21T09:56:10.1258412Z         %285 = tt.splat %284 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1258630Z         %286 = arith.addi %285, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1258911Z         %287 = tt.expand_dims %286 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.1259187Z         %288 = tt.broadcast %287 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1267175Z         %289 = arith.addi %220, %288 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1267404Z         %290 = tt.addptr %9, %289 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1267616Z         %291 = tt.load %290 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1267965Z         %292 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1268407Z         %293 = arith.extf %292 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1268793Z         %294 = tt.expand_dims %282 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1269042Z         %295 = arith.muli %294, %cst_6 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1269239Z         %296 = tt.broadcast %295 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1269431Z         %297 = arith.addi %296, %222 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1269634Z         %298 = tt.addptr %10, %297 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1269844Z         %299 = tt.load %298 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1270109Z         %300 = ttg.convert_layout %299 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1270392Z         %301 = arith.shli %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1270627Z         %302 = arith.shrsi %301, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1270883Z         %303 = arith.shrsi %300, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1271175Z         %304 = tt.expand_dims %302 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1271512Z         %305 = tt.expand_dims %303 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1271802Z         %306 = tt.broadcast %304 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1272045Z         %307 = arith.select %15, %306, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1272288Z         %308 = tt.broadcast %305 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1272524Z         %309 = arith.select %17, %308, %307 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1272755Z         %310 = tt.reshape %309 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1272980Z         %311 = arith.sitofp %310 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1273234Z         %312 = ttg.local_alloc %311 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1273566Z         %313 = ttg.local_load %312 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1274050Z         %314 = tt.dot %293, %313, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1274402Z         %315 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:10.1274532Z         %316 = arith.cmpi slt, %315, %c2_i32 : i32
2026-02-21T09:56:10.1274671Z         %317 = arith.select %316, %315, %c0_i32 : i32
2026-02-21T09:56:10.1274938Z         %318 = ttg.memdesc_index %223[%317] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1275317Z         ttg.local_store %291, %318 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1275715Z         scf.yield %314, %317, %arg8, %318 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1276073Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:10.1276393Z       %233 = ttg.local_load %232#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1276824Z       %234 = arith.extf %233 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1277122Z       %235 = arith.addi %66, %222 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1277328Z       %236 = tt.addptr %10, %235 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1277527Z       %237 = tt.load %236 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1277773Z       %238 = ttg.convert_layout %237 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1278054Z       %239 = arith.shli %238, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1278305Z       %240 = arith.shrsi %239, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1278539Z       %241 = arith.shrsi %238, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1278845Z       %242 = tt.expand_dims %240 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1279183Z       %243 = tt.expand_dims %241 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1279467Z       %244 = tt.broadcast %242 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1279709Z       %245 = arith.select %15, %244, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1279951Z       %246 = tt.broadcast %243 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1280187Z       %247 = arith.select %17, %246, %245 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1280423Z       %248 = tt.reshape %247 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1280649Z       %249 = arith.sitofp %248 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1280908Z       %250 = ttg.local_alloc %249 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1281238Z       %251 = ttg.local_load %250 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1281712Z       %252 = tt.dot %234, %251, %232#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1282215Z       %253 = ttg.local_load %232#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1282693Z       %254 = arith.extf %253 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1282992Z       %255 = arith.addi %90, %222 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1283195Z       %256 = tt.addptr %10, %255 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1283416Z       %257 = tt.load %256 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1283659Z       %258 = ttg.convert_layout %257 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1283956Z       %259 = arith.shli %258, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1284189Z       %260 = arith.shrsi %259, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1284426Z       %261 = arith.shrsi %258, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1284713Z       %262 = tt.expand_dims %260 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1285051Z       %263 = tt.expand_dims %261 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1285338Z       %264 = tt.broadcast %262 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1285575Z       %265 = arith.select %15, %264, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1285813Z       %266 = tt.broadcast %263 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1286045Z       %267 = arith.select %17, %266, %265 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1286291Z       %268 = tt.reshape %267 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1286518Z       %269 = arith.sitofp %268 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1286770Z       %270 = ttg.local_alloc %269 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1287114Z       %271 = ttg.local_load %270 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1287585Z       %272 = tt.dot %254, %271, %252, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1287972Z       ttg.local_dealloc %223 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.1288189Z       %273 = arith.truncf %272 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:10.1288459Z       %274 = tt.expand_dims %212 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:10.1288698Z       %275 = arith.muli %274, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:10.1288931Z       %276 = tt.expand_dims %217 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:10.1289191Z       %277 = tt.broadcast %275 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1289399Z       %278 = tt.broadcast %276 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1289580Z       %279 = arith.addi %277, %278 : tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1289773Z       %280 = tt.addptr %18, %279 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1289974Z       tt.store %280, %273 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:10.1290108Z     }
2026-02-21T09:56:10.1290207Z     scf.for %arg3 = %22 to %2 step %c1_i32  : i32 {
2026-02-21T09:56:10.1290344Z       %23 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:56:10.1290468Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:56:10.1290583Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:10.1290700Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:56:10.1290820Z       %27 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:56:10.1290939Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:10.1291068Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:10.1291179Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:10.1291292Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:10.1291459Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.1291691Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.1291904Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.1292118Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.1292283Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:56:10.1292449Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.1292661Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.1292872Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.1293083Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.1293352Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.1293608Z       %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.1293816Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1294093Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:10.1294370Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1294610Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.1294882Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.1295149Z       %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1295338Z       %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1295537Z       %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1295736Z       %51 = tt.load %50 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1296019Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1296383Z       ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1296650Z       %53 = arith.addi %8, %cst_5 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1296925Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.1297195Z       %55 = tt.broadcast %54 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1297383Z       %56 = arith.addi %43, %55 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1297577Z       %57 = tt.addptr %9, %56 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1297775Z       %58 = tt.load %57 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1298053Z       %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1298406Z       ttg.local_store %58, %59 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1298928Z       %60:4 = scf.for %arg4 = %c0_i32 to %c504_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>)  : i32 {
2026-02-21T09:56:10.1299436Z         %117 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1299663Z         %118 = arith.addi %117, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1299841Z         %119 = arith.addi %arg4, %c8_i32 : i32
2026-02-21T09:56:10.1299966Z         %120 = arith.muli %119, %c2_i32 : i32
2026-02-21T09:56:10.1300134Z         %121 = tt.splat %120 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1300358Z         %122 = arith.addi %121, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.1300630Z         %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.1300908Z         %124 = tt.broadcast %123 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1301105Z         %125 = arith.addi %43, %124 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1301306Z         %126 = tt.addptr %9, %125 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.1301513Z         %127 = tt.load %126 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.1301832Z         %128 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1302285Z         %129 = arith.extf %128 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1302667Z         %130 = tt.expand_dims %118 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1302914Z         %131 = arith.muli %130, %cst_6 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1303110Z         %132 = tt.broadcast %131 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1303302Z         %133 = arith.addi %132, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1303505Z         %134 = tt.addptr %10, %133 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1303711Z         %135 = tt.load %134 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1303955Z         %136 = ttg.convert_layout %135 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1304238Z         %137 = arith.shli %136, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1304473Z         %138 = arith.shrsi %137, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1304710Z         %139 = arith.shrsi %136, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1305000Z         %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1305336Z         %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1305627Z         %142 = tt.broadcast %140 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1305869Z         %143 = arith.select %15, %142, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1306111Z         %144 = tt.broadcast %141 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1306348Z         %145 = arith.select %17, %144, %143 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1306596Z         %146 = tt.reshape %145 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1306822Z         %147 = arith.sitofp %146 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1307076Z         %148 = ttg.local_alloc %147 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1307421Z         %149 = ttg.local_load %148 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1307900Z         %150 = tt.dot %129, %149, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1308246Z         %151 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:10.1308379Z         %152 = arith.cmpi slt, %151, %c2_i32 : i32
2026-02-21T09:56:10.1308513Z         %153 = arith.select %152, %151, %c0_i32 : i32
2026-02-21T09:56:10.1308780Z         %154 = ttg.memdesc_index %46[%153] : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1309139Z         ttg.local_store %127, %154 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1309552Z         scf.yield %150, %153, %arg8, %154 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8>
2026-02-21T09:56:10.1309891Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:10.1310105Z       %61 = arith.addi %7, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1310446Z       %62 = ttg.local_load %60#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1310874Z       %63 = arith.extf %62 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1311252Z       %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1311495Z       %65 = arith.muli %64, %cst_6 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1311685Z       %66 = tt.broadcast %65 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1311873Z       %67 = arith.addi %66, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1312065Z       %68 = tt.addptr %10, %67 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1312261Z       %69 = tt.load %68 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1312499Z       %70 = ttg.convert_layout %69 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1312773Z       %71 = arith.shli %70, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1312998Z       %72 = arith.shrsi %71, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1313228Z       %73 = arith.shrsi %70, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1313509Z       %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1313838Z       %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1314116Z       %76 = tt.broadcast %74 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1314349Z       %77 = arith.select %15, %76, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1314597Z       %78 = tt.broadcast %75 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1314821Z       %79 = arith.select %17, %78, %77 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1315040Z       %80 = tt.reshape %79 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1315273Z       %81 = arith.sitofp %80 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1315517Z       %82 = ttg.local_alloc %81 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1315837Z       %83 = ttg.local_load %82 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1316304Z       %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1316691Z       %85 = arith.addi %7, %cst_4 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.1317014Z       %86 = ttg.local_load %60#3 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 2x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1317438Z       %87 = arith.extf %86 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1317825Z       %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1318066Z       %89 = arith.muli %88, %cst_6 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.1318268Z       %90 = tt.broadcast %89 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1318456Z       %91 = arith.addi %90, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1318645Z       %92 = tt.addptr %10, %91 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.1318839Z       %93 = tt.load %92 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.1319075Z       %94 = ttg.convert_layout %93 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1319346Z       %95 = arith.shli %94, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1319575Z       %96 = arith.shrsi %95, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1319799Z       %97 = arith.shrsi %94, %cst_9 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.1320080Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1320409Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.1320687Z       %100 = tt.broadcast %98 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1320925Z       %101 = arith.select %15, %100, %cst_8 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1321161Z       %102 = tt.broadcast %99 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1321389Z       %103 = arith.select %17, %102, %101 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.1321622Z       %104 = tt.reshape %103 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.1321842Z       %105 = arith.sitofp %104 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.1322095Z       %106 = ttg.local_alloc %105 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.1322418Z       %107 = ttg.local_load %106 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.1322931Z       %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.1323328Z       ttg.local_dealloc %46 : !ttg.memdesc<2x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.1323543Z       %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:10.1323817Z       %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:10.1324058Z       %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:10.1324289Z       %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:10.1324553Z       %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1324761Z       %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1324944Z       %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1325140Z       %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:10.1325339Z       tt.store %116, %109 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:10.1325502Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:56:10.1325602Z     tt.return
2026-02-21T09:56:10.1325682Z   }
2026-02-21T09:56:10.1325759Z }
2026-02-21T09:56:10.1325805Z 
2026-02-21T09:56:10.1325835Z {-#
2026-02-21T09:56:10.1325912Z   external_resources: {
2026-02-21T09:56:10.1326010Z     mlir_reproducer: {
2026-02-21T09:56:10.1327018Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:56:10.1328003Z       disable_threading: false,
2026-02-21T09:56:10.1328106Z       verify_each: true
2026-02-21T09:56:10.1328195Z     }
2026-02-21T09:56:10.1328263Z   }
2026-02-21T09:56:10.1328331Z #-}
2026-02-21T09:56:10.1328605Z /tmp/torchinductor_root/sm/csmwdsu7lmd7uxkmcghpo2ohq4qfl5uu3uhxadnjsecl6z57rpnl.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:56:10.1329304Z /tmp/torchinductor_root/sm/csmwdsu7lmd7uxkmcghpo2ohq4qfl5uu3uhxadnjsecl6z57rpnl.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:56:10.1329858Z [700s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:56:10.1330628Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[0, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:56:10.1331329Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:56:10.1331496Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:56:10.9030045Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:56:10.9036101Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T09:56:10.9037028Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:56:10.9037665Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:56:10.9038273Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:56:10.9038853Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [2, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:56:10.9039374Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 8, maxPhase = 2, order = [1, 0]}>
2026-02-21T09:56:10.9039864Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:56:10.9040233Z #smem = #ttg.shared_memory
2026-02-21T09:56:10.9040705Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:56:10.9041743Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:56:10.9042524Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:56:10.9042952Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:10.9043409Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:10.9043780Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:56:10.9044116Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:56:10.9044499Z     %cst_3 = arith.constant dense<508> : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.9044952Z     %cst_4 = arith.constant dense<8192> : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.9045324Z     %cst_5 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.9045632Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:56:10.9045846Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:56:10.9046023Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:56:10.9046210Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:56:10.9046456Z     %cst_6 = arith.constant dense<0> : tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9046688Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:56:10.9046874Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:56:10.9047052Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:56:10.9047338Z     %cst_7 = arith.constant dense<4> : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9047644Z     %0 = tt.get_program_id x : i32
2026-02-21T09:56:10.9047830Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:56:10.9048010Z     %2 = arith.minsi %1, %c8192_i32 : i32
2026-02-21T09:56:10.9048345Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.9048795Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.9049233Z     %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.9049676Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.9050104Z     %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.9050529Z     %8 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.9050992Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.9051316Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.9051793Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:56:10.9052468Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:56:10.9053120Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:10.9053530Z     %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:10.9053840Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:56:10.9054165Z     %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:10.9054471Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<4x2x128xi1, #blocked>
2026-02-21T09:56:10.9054812Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:10.9055067Z     %19 = arith.subi %2, %0 : i32
2026-02-21T09:56:10.9055238Z     %20 = arith.remsi %19, %c2_i32 : i32
2026-02-21T09:56:10.9055452Z     %21 = arith.subi %19, %20 : i32
2026-02-21T09:56:10.9055630Z     %22 = arith.addi %0, %21 : i32
2026-02-21T09:56:10.9055853Z     scf.for %arg3 = %0 to %22 step %c2_i32  : i32 {
2026-02-21T09:56:10.9056022Z       %23 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:56:10.9056167Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:56:10.9056339Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:10.9056476Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:56:10.9056624Z       %27 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:56:10.9056768Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:10.9056908Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:10.9057036Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:10.9057178Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:10.9057386Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.9057649Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.9057911Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.9058169Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.9058369Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:56:10.9058570Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.9058830Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.9059088Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.9059345Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.9059684Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.9059997Z       %42 = arith.muli %41, %cst_5 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.9060239Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9060586Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:10.9060924Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9061220Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.9061583Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.9061934Z       %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9062168Z       %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9062408Z       %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9062661Z       %51 = tt.load %50 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.9063014Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9063463Z       ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9064002Z       %53:3 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c0_i32, %arg7 = %52) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>)  : i32 {
2026-02-21T09:56:10.9064486Z         %144 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.9064765Z         %145 = arith.addi %144, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.9065004Z         %146 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:10.9065153Z         %147 = arith.muli %146, %c2_i32 : i32
2026-02-21T09:56:10.9065361Z         %148 = tt.splat %147 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.9065644Z         %149 = arith.addi %148, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.9065923Z         %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.9066204Z         %151 = tt.broadcast %150 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9066399Z         %152 = arith.addi %43, %151 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9066606Z         %153 = tt.addptr %9, %152 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9066814Z         %154 = tt.load %153 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.9067123Z         %155 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9067572Z         %156 = arith.extf %155 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9067961Z         %157 = tt.expand_dims %145 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.9068216Z         %158 = arith.muli %157, %cst_4 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.9068412Z         %159 = tt.broadcast %158 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9068610Z         %160 = arith.addi %159, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9068812Z         %161 = tt.addptr %10, %160 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9069017Z         %162 = tt.load %161 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.9069286Z         %163 = ttg.convert_layout %162 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9069571Z         %164 = arith.shli %163, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9069814Z         %165 = arith.shrsi %164, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9070068Z         %166 = arith.shrsi %163, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9070363Z         %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9070720Z         %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9071009Z         %169 = tt.broadcast %167 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9071256Z         %170 = arith.select %15, %169, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9071498Z         %171 = tt.broadcast %168 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9071737Z         %172 = arith.select %17, %171, %170 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9071973Z         %173 = tt.reshape %172 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.9072199Z         %174 = arith.sitofp %173 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.9072458Z         %175 = ttg.local_alloc %174 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.9072809Z         %176 = ttg.local_load %175 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9073292Z         %177 = tt.dot %156, %176, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.9073676Z         %178 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:10.9073805Z         %179 = arith.cmpi slt, %178, %c1_i32 : i32
2026-02-21T09:56:10.9073944Z         %180 = arith.select %179, %178, %c0_i32 : i32
2026-02-21T09:56:10.9074215Z         %181 = ttg.memdesc_index %46[%180] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9074576Z         ttg.local_store %154, %181 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9074894Z         scf.yield %177, %180, %181 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9075137Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:56:10.9075338Z       %54 = arith.addi %7, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.9075670Z       %55 = ttg.local_load %53#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9076100Z       %56 = arith.extf %55 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9076484Z       %57 = tt.expand_dims %54 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.9076726Z       %58 = arith.muli %57, %cst_4 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.9076918Z       %59 = tt.broadcast %58 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9077112Z       %60 = arith.addi %59, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9077304Z       %61 = tt.addptr %10, %60 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9077504Z       %62 = tt.load %61 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.9077743Z       %63 = ttg.convert_layout %62 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9078040Z       %64 = arith.shli %63, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9078269Z       %65 = arith.shrsi %64, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9078500Z       %66 = arith.shrsi %63, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9078804Z       %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9079135Z       %68 = tt.expand_dims %66 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9079417Z       %69 = tt.broadcast %67 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9079657Z       %70 = arith.select %15, %69, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9079889Z       %71 = tt.broadcast %68 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9080121Z       %72 = arith.select %17, %71, %70 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9080345Z       %73 = tt.reshape %72 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.9080570Z       %74 = arith.sitofp %73 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.9080825Z       %75 = ttg.local_alloc %74 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.9081164Z       %76 = ttg.local_load %75 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9081655Z       %77 = tt.dot %56, %76, %53#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.9082046Z       ttg.local_dealloc %46 : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.9082261Z       %78 = arith.truncf %77 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:10.9082536Z       %79 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:10.9082822Z       %80 = arith.muli %79, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:10.9083059Z       %81 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:10.9083326Z       %82 = tt.broadcast %80 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9083530Z       %83 = tt.broadcast %81 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9083717Z       %84 = arith.addi %82, %83 : tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9083906Z       %85 = tt.addptr %18, %84 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9084113Z       tt.store %85, %78 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:10.9084259Z       %86 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:56:10.9084389Z       %87 = arith.divsi %86, %c256_i32 : i32
2026-02-21T09:56:10.9084516Z       %88 = arith.muli %87, %c4_i32 : i32
2026-02-21T09:56:10.9084637Z       %89 = arith.subi %c128_i32, %88 : i32
2026-02-21T09:56:10.9084762Z       %90 = arith.minsi %89, %c4_i32 : i32
2026-02-21T09:56:10.9084883Z       %91 = arith.remsi %86, %c256_i32 : i32
2026-02-21T09:56:10.9085006Z       %92 = arith.remsi %91, %90 : i32
2026-02-21T09:56:10.9085122Z       %93 = arith.addi %88, %92 : i32
2026-02-21T09:56:10.9085241Z       %94 = arith.divsi %91, %90 : i32
2026-02-21T09:56:10.9085356Z       %95 = arith.muli %93, %c128_i32 : i32
2026-02-21T09:56:10.9085529Z       %96 = tt.splat %95 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.9085750Z       %97 = tt.splat %95 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.9085990Z       %98 = arith.addi %96, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.9086210Z       %99 = arith.addi %97, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.9086395Z       %100 = arith.muli %94, %c128_i32 : i32
2026-02-21T09:56:10.9086576Z       %101 = tt.splat %100 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.9086803Z       %102 = tt.splat %100 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.9087026Z       %103 = arith.addi %101, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.9087247Z       %104 = arith.addi %102, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.9087528Z       %105 = tt.expand_dims %98 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.9087794Z       %106 = arith.muli %105, %cst_5 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.9088000Z       %107 = tt.broadcast %106 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9088288Z       %108 = tt.expand_dims %103 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:10.9088582Z       %109 = tt.broadcast %108 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9088826Z       %110 = ttg.local_alloc : () -> !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.9089021Z       %111 = arith.addi %107, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9089229Z       %112 = tt.addptr %9, %111 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9089458Z       %113 = tt.load %112 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.9089754Z       %114 = ttg.memdesc_index %110[%c0_i32] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9090127Z       ttg.local_store %113, %114 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9090574Z       %115:3 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c0_i32, %arg7 = %114) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>)  : i32 {
2026-02-21T09:56:10.9090968Z         %144 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.9091200Z         %145 = arith.addi %144, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.9091383Z         %146 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:10.9091509Z         %147 = arith.muli %146, %c2_i32 : i32
2026-02-21T09:56:10.9091686Z         %148 = tt.splat %147 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.9091915Z         %149 = arith.addi %148, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.9092193Z         %150 = tt.expand_dims %149 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.9092479Z         %151 = tt.broadcast %150 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9092679Z         %152 = arith.addi %107, %151 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9092892Z         %153 = tt.addptr %9, %152 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9093107Z         %154 = tt.load %153 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.9093418Z         %155 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9093874Z         %156 = arith.extf %155 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9094286Z         %157 = tt.expand_dims %145 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.9094564Z         %158 = arith.muli %157, %cst_4 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.9094765Z         %159 = tt.broadcast %158 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9094962Z         %160 = arith.addi %159, %109 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9095172Z         %161 = tt.addptr %10, %160 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9095379Z         %162 = tt.load %161 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.9095635Z         %163 = ttg.convert_layout %162 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9095928Z         %164 = arith.shli %163, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9096167Z         %165 = arith.shrsi %164, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9096411Z         %166 = arith.shrsi %163, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9096711Z         %167 = tt.expand_dims %165 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9097076Z         %168 = tt.expand_dims %166 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9097370Z         %169 = tt.broadcast %167 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9097631Z         %170 = arith.select %15, %169, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9097882Z         %171 = tt.broadcast %168 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9098121Z         %172 = arith.select %17, %171, %170 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9098360Z         %173 = tt.reshape %172 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.9098596Z         %174 = arith.sitofp %173 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.9098856Z         %175 = ttg.local_alloc %174 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.9099196Z         %176 = ttg.local_load %175 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9099684Z         %177 = tt.dot %156, %176, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.9100037Z         %178 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:10.9100177Z         %179 = arith.cmpi slt, %178, %c1_i32 : i32
2026-02-21T09:56:10.9100314Z         %180 = arith.select %179, %178, %c0_i32 : i32
2026-02-21T09:56:10.9100593Z         %181 = ttg.memdesc_index %110[%180] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9100965Z         ttg.local_store %154, %181 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9101284Z         scf.yield %177, %180, %181 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9101533Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:56:10.9101843Z       %116 = ttg.local_load %115#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9102298Z       %117 = arith.extf %116 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9102606Z       %118 = arith.addi %59, %109 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9102825Z       %119 = tt.addptr %10, %118 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9103036Z       %120 = tt.load %119 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.9103283Z       %121 = ttg.convert_layout %120 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9103572Z       %122 = arith.shli %121, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9103818Z       %123 = arith.shrsi %122, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9104060Z       %124 = arith.shrsi %121, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9104358Z       %125 = tt.expand_dims %123 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9104704Z       %126 = tt.expand_dims %124 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9105002Z       %127 = tt.broadcast %125 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9105273Z       %128 = arith.select %15, %127, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9105515Z       %129 = tt.broadcast %126 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9105771Z       %130 = arith.select %17, %129, %128 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9106009Z       %131 = tt.reshape %130 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.9106238Z       %132 = arith.sitofp %131 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.9106499Z       %133 = ttg.local_alloc %132 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.9106828Z       %134 = ttg.local_load %133 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9107310Z       %135 = tt.dot %117, %134, %115#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.9107707Z       ttg.local_dealloc %110 : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.9107928Z       %136 = arith.truncf %135 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:10.9108210Z       %137 = tt.expand_dims %99 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:10.9108452Z       %138 = arith.muli %137, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:10.9108691Z       %139 = tt.expand_dims %104 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:10.9108961Z       %140 = tt.broadcast %138 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9109171Z       %141 = tt.broadcast %139 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9109360Z       %142 = arith.addi %140, %141 : tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9109555Z       %143 = tt.addptr %18, %142 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9109766Z       tt.store %143, %136 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:10.9109907Z     }
2026-02-21T09:56:10.9110005Z     scf.for %arg3 = %22 to %2 step %c1_i32  : i32 {
2026-02-21T09:56:10.9110167Z       %23 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:56:10.9110292Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:56:10.9110417Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:10.9110538Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:56:10.9110680Z       %27 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:56:10.9110801Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:10.9110919Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:10.9111037Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:10.9111152Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:10.9111325Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.9111541Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.9111765Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:10.9111983Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:10.9112155Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:56:10.9112325Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.9112539Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.9112760Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:10.9112987Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:10.9113265Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.9113539Z       %42 = arith.muli %41, %cst_5 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:10.9113733Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9114018Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:10.9114298Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9114522Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.9114798Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.9115071Z       %48 = tt.broadcast %47 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9115268Z       %49 = arith.addi %43, %48 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9115468Z       %50 = tt.addptr %9, %49 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9115675Z       %51 = tt.load %50 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.9115968Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9116326Z       ttg.local_store %51, %52 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9116762Z       %53:3 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c4_i32 iter_args(%arg5 = %cst_2, %arg6 = %c0_i32, %arg7 = %52) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>)  : i32 {
2026-02-21T09:56:10.9117143Z         %86 = tt.splat %arg4 : i32 -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.9117375Z         %87 = arith.addi %86, %7 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.9117556Z         %88 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:10.9117678Z         %89 = arith.muli %88, %c2_i32 : i32
2026-02-21T09:56:10.9117864Z         %90 = tt.splat %89 : i32 -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.9118077Z         %91 = arith.addi %90, %8 : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:10.9118350Z         %92 = tt.expand_dims %91 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x8xi32, #blocked2>
2026-02-21T09:56:10.9118638Z         %93 = tt.broadcast %92 : tensor<1x8xi32, #blocked2> -> tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9118828Z         %94 = arith.addi %43, %93 : tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9119025Z         %95 = tt.addptr %9, %94 : tensor<128x8x!tt.ptr<bf16>, #blocked2>, tensor<128x8xi32, #blocked2>
2026-02-21T09:56:10.9119225Z         %96 = tt.load %95 : tensor<128x8x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:10.9119529Z         %97 = ttg.local_load %arg7 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9119967Z         %98 = arith.extf %97 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9120346Z         %99 = tt.expand_dims %87 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.9120592Z         %100 = arith.muli %99, %cst_4 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.9120803Z         %101 = tt.broadcast %100 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9120999Z         %102 = arith.addi %101, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9121204Z         %103 = tt.addptr %10, %102 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9121423Z         %104 = tt.load %103 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.9121671Z         %105 = ttg.convert_layout %104 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9121958Z         %106 = arith.shli %105, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9122199Z         %107 = arith.shrsi %106, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9122440Z         %108 = arith.shrsi %105, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9122765Z         %109 = tt.expand_dims %107 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9123109Z         %110 = tt.expand_dims %108 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9123399Z         %111 = tt.broadcast %109 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9123644Z         %112 = arith.select %15, %111, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9123889Z         %113 = tt.broadcast %110 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9124125Z         %114 = arith.select %17, %113, %112 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9124360Z         %115 = tt.reshape %114 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.9124585Z         %116 = arith.sitofp %115 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.9124841Z         %117 = ttg.local_alloc %116 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.9125173Z         %118 = ttg.local_load %117 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9125651Z         %119 = tt.dot %98, %118, %arg5, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.9126027Z         %120 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:10.9126157Z         %121 = arith.cmpi slt, %120, %c1_i32 : i32
2026-02-21T09:56:10.9126290Z         %122 = arith.select %121, %120, %c0_i32 : i32
2026-02-21T09:56:10.9126576Z         %123 = ttg.memdesc_index %46[%122] : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9126935Z         ttg.local_store %96, %123 : tensor<128x8xbf16, #blocked2> -> !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9127251Z         scf.yield %119, %122, %123 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8>
2026-02-21T09:56:10.9127494Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 2 : i32}
2026-02-21T09:56:10.9127690Z       %54 = arith.addi %7, %cst_3 : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:10.9128027Z       %55 = ttg.local_load %53#2 : !ttg.memdesc<128x8xbf16, #shared, #smem, mutable, 1x128x8> -> tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9128456Z       %56 = arith.extf %55 : tensor<128x8xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9128853Z       %57 = tt.expand_dims %54 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.9129098Z       %58 = arith.muli %57, %cst_4 : tensor<4x1xi32, #blocked1>
2026-02-21T09:56:10.9129286Z       %59 = tt.broadcast %58 : tensor<4x1xi32, #blocked1> -> tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9129496Z       %60 = arith.addi %59, %45 : tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9129687Z       %61 = tt.addptr %10, %60 : tensor<4x128x!tt.ptr<i8>, #blocked1>, tensor<4x128xi32, #blocked1>
2026-02-21T09:56:10.9129888Z       %62 = tt.load %61 : tensor<4x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:10.9130129Z       %63 = ttg.convert_layout %62 : tensor<4x128xi8, #blocked1> -> tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9130402Z       %64 = arith.shli %63, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9130632Z       %65 = arith.shrsi %64, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9130862Z       %66 = arith.shrsi %63, %cst_7 : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:10.9131145Z       %67 = tt.expand_dims %65 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9131479Z       %68 = tt.expand_dims %66 {axis = 1 : i32} : tensor<4x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<4x1x128xi8, #blocked>
2026-02-21T09:56:10.9131755Z       %69 = tt.broadcast %67 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9131991Z       %70 = arith.select %15, %69, %cst_6 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9132220Z       %71 = tt.broadcast %68 : tensor<4x1x128xi8, #blocked> -> tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9132449Z       %72 = arith.select %17, %71, %70 : tensor<4x2x128xi1, #blocked>, tensor<4x2x128xi8, #blocked>
2026-02-21T09:56:10.9132674Z       %73 = tt.reshape %72 : tensor<4x2x128xi8, #blocked> -> tensor<8x128xi8, #blocked3>
2026-02-21T09:56:10.9132891Z       %74 = arith.sitofp %73 : tensor<8x128xi8, #blocked3> to tensor<8x128xf32, #blocked3>
2026-02-21T09:56:10.9133140Z       %75 = ttg.local_alloc %74 : (tensor<8x128xf32, #blocked3>) -> !ttg.memdesc<8x128xf32, #shared1, #smem>
2026-02-21T09:56:10.9133466Z       %76 = ttg.local_load %75 : !ttg.memdesc<8x128xf32, #shared1, #smem> -> tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:10.9133951Z       %77 = tt.dot %56, %76, %53#0, inputPrecision = tf32 : tensor<128x8xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<8x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:10.9134337Z       ttg.local_dealloc %46 : !ttg.memdesc<1x128x8xbf16, #shared, #smem, mutable>
2026-02-21T09:56:10.9134571Z       %78 = arith.truncf %77 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:10.9134842Z       %79 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:10.9135077Z       %80 = arith.muli %79, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:10.9135302Z       %81 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:10.9135563Z       %82 = tt.broadcast %80 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9135767Z       %83 = tt.broadcast %81 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9135947Z       %84 = arith.addi %82, %83 : tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9136133Z       %85 = tt.addptr %18, %84 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:10.9136330Z       tt.store %85, %78 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:10.9136471Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:56:10.9136576Z     tt.return
2026-02-21T09:56:10.9136660Z   }
2026-02-21T09:56:10.9136734Z }
2026-02-21T09:56:10.9136794Z 
2026-02-21T09:56:10.9136826Z {-#
2026-02-21T09:56:10.9136903Z   external_resources: {
2026-02-21T09:56:10.9137003Z     mlir_reproducer: {
2026-02-21T09:56:10.9138028Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:56:10.9139018Z       disable_threading: false,
2026-02-21T09:56:10.9139124Z       verify_each: true
2026-02-21T09:56:10.9139215Z     }
2026-02-21T09:56:10.9139285Z   }
2026-02-21T09:56:10.9139355Z #-}
2026-02-21T09:56:10.9139629Z /tmp/torchinductor_root/lj/cljgqngrf6gusqnmjzzg3742xw7s5ov2mbua3vqk27mj2ej4ybxp.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:56:10.9140309Z /tmp/torchinductor_root/lj/cljgqngrf6gusqnmjzzg3742xw7s5ov2mbua3vqk27mj2ej4ybxp.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:56:10.9140856Z [701s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:56:10.9141632Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[0, 2], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:56:10.9142336Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:56:10.9142505Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:56:11.0128334Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:56:11.0133929Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:56:11.0134856Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:56:11.0135672Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:56:11.0135968Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:56:11.0136251Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:56:11.0136504Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:56:11.0136740Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:56:11.0136922Z #smem = #ttg.shared_memory
2026-02-21T09:56:11.0137157Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:56:11.0137628Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:56:11.0138000Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.0138203Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.0138375Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.0138558Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:56:11.0138770Z     %cst_3 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0138951Z     %cst_4 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.0139107Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:56:11.0139224Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:56:11.0139341Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:56:11.0139455Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:56:11.0139606Z     %cst_5 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0139750Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:56:11.0139865Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T09:56:11.0139982Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:56:11.0140090Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:56:11.0140269Z     %cst_6 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0140456Z     %0 = tt.get_program_id x : i32
2026-02-21T09:56:11.0140573Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:56:11.0140687Z     %2 = arith.minsi %1, %c8192_i32 : i32
2026-02-21T09:56:11.0140890Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.0141167Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.0141438Z     %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.0141708Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.0141971Z     %7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0142235Z     %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0142476Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.0142676Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.0142969Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:56:11.0143381Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:56:11.0143798Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.0144053Z     %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.0144247Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:56:11.0144446Z     %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.0144633Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:56:11.0144844Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.0145002Z     %19 = arith.subi %2, %0 : i32
2026-02-21T09:56:11.0145111Z     %20 = arith.remsi %19, %c2_i32 : i32
2026-02-21T09:56:11.0145227Z     %21 = arith.subi %19, %20 : i32
2026-02-21T09:56:11.0145338Z     %22 = arith.addi %0, %21 : i32
2026-02-21T09:56:11.0145461Z     scf.for %arg3 = %0 to %22 step %c2_i32  : i32 {
2026-02-21T09:56:11.0145597Z       %23 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:56:11.0145737Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:56:11.0145856Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:11.0145971Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:56:11.0146091Z       %27 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:56:11.0146225Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:11.0146339Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:11.0146448Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:11.0146563Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:11.0146753Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.0146967Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.0147179Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.0147393Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.0147554Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:56:11.0147721Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.0147928Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.0148140Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.0148351Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.0148621Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.0148875Z       %42 = arith.muli %41, %cst_4 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.0149067Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0149349Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:11.0149627Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0149893Z       %46 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:56:11.0150162Z         %88 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0150413Z         %89 = arith.addi %88, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0150587Z         %90 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:56:11.0150754Z         %91 = tt.splat %90 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0150983Z         %92 = arith.addi %91, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0151254Z         %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.0151525Z         %94 = tt.broadcast %93 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0151719Z         %95 = arith.addi %43, %94 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0151920Z         %96 = tt.addptr %9, %95 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0152123Z         %97 = tt.load %96 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.0152347Z         %98 = ttg.local_alloc %97 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:56:11.0152675Z         %99 = ttg.local_load %98 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0153107Z         %100 = arith.extf %99 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0153494Z         %101 = tt.expand_dims %89 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0153740Z         %102 = arith.muli %101, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0153954Z         %103 = tt.broadcast %102 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0154152Z         %104 = arith.addi %103, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0154353Z         %105 = tt.addptr %10, %104 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0154559Z         %106 = tt.load %105 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.0154805Z         %107 = ttg.convert_layout %106 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0155093Z         %108 = arith.shli %107, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0155331Z         %109 = arith.shrsi %108, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0155571Z         %110 = arith.shrsi %107, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0155865Z         %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0156206Z         %112 = tt.expand_dims %110 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0156497Z         %113 = tt.broadcast %111 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0156747Z         %114 = arith.select %15, %113, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0156990Z         %115 = tt.broadcast %112 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0157232Z         %116 = arith.select %17, %115, %114 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0157463Z         %117 = tt.reshape %116 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.0157695Z         %118 = arith.sitofp %117 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.0157952Z         %119 = ttg.local_alloc %118 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.0158308Z         %120 = ttg.local_load %119 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0158792Z         %121 = tt.dot %100, %120, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.0159163Z         %122 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:56:11.0159338Z         %123 = tt.splat %122 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0159565Z         %124 = arith.addi %123, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0159739Z         %125 = arith.muli %122, %c2_i32 : i32
2026-02-21T09:56:11.0159910Z         %126 = tt.splat %125 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0160132Z         %127 = arith.addi %126, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0160410Z         %128 = tt.expand_dims %127 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.0160692Z         %129 = tt.broadcast %128 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0160889Z         %130 = arith.addi %43, %129 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0169004Z         %131 = tt.addptr %9, %130 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0169221Z         %132 = tt.load %131 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.0169452Z         %133 = ttg.local_alloc %132 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:56:11.0169810Z         %134 = ttg.local_load %133 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0170228Z         %135 = arith.extf %134 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0170620Z         %136 = tt.expand_dims %124 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0170875Z         %137 = arith.muli %136, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0171079Z         %138 = tt.broadcast %137 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0171277Z         %139 = arith.addi %138, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0171478Z         %140 = tt.addptr %10, %139 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0171686Z         %141 = tt.load %140 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.0171930Z         %142 = ttg.convert_layout %141 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0172215Z         %143 = arith.shli %142, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0172454Z         %144 = arith.shrsi %143, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0172691Z         %145 = arith.shrsi %142, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0172984Z         %146 = tt.expand_dims %144 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0173322Z         %147 = tt.expand_dims %145 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0173613Z         %148 = tt.broadcast %146 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0173860Z         %149 = arith.select %15, %148, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0174122Z         %150 = tt.broadcast %147 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0174360Z         %151 = arith.select %17, %150, %149 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0174607Z         %152 = tt.reshape %151 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.0174836Z         %153 = arith.sitofp %152 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.0175095Z         %154 = ttg.local_alloc %153 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.0175423Z         %155 = ttg.local_load %154 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0175906Z         %156 = tt.dot %135, %155, %121, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.0176270Z         scf.yield %156 : tensor<128x128xf32, #mma>
2026-02-21T09:56:11.0176406Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:56:11.0176581Z       %47 = arith.truncf %46 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:11.0176852Z       %48 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:11.0177111Z       %49 = arith.muli %48, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.0177337Z       %50 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:11.0177613Z       %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0177818Z       %52 = tt.broadcast %50 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0177996Z       %53 = arith.addi %51, %52 : tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0178189Z       %54 = tt.addptr %18, %53 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0178383Z       tt.store %54, %47 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.0178527Z       %55 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:56:11.0178652Z       %56 = arith.divsi %55, %c256_i32 : i32
2026-02-21T09:56:11.0178769Z       %57 = arith.muli %56, %c4_i32 : i32
2026-02-21T09:56:11.0178890Z       %58 = arith.subi %c128_i32, %57 : i32
2026-02-21T09:56:11.0179004Z       %59 = arith.minsi %58, %c4_i32 : i32
2026-02-21T09:56:11.0179122Z       %60 = arith.remsi %55, %c256_i32 : i32
2026-02-21T09:56:11.0179236Z       %61 = arith.remsi %60, %59 : i32
2026-02-21T09:56:11.0179353Z       %62 = arith.addi %57, %61 : i32
2026-02-21T09:56:11.0179464Z       %63 = arith.divsi %60, %59 : i32
2026-02-21T09:56:11.0179577Z       %64 = arith.muli %62, %c128_i32 : i32
2026-02-21T09:56:11.0179744Z       %65 = tt.splat %64 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.0179958Z       %66 = tt.splat %64 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.0180172Z       %67 = arith.addi %65, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.0180383Z       %68 = arith.addi %66, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.0180547Z       %69 = arith.muli %63, %c128_i32 : i32
2026-02-21T09:56:11.0180712Z       %70 = tt.splat %69 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.0180921Z       %71 = tt.splat %69 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.0181133Z       %72 = arith.addi %70, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.0181342Z       %73 = arith.addi %71, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.0181630Z       %74 = tt.expand_dims %67 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.0181880Z       %75 = arith.muli %74, %cst_4 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.0182074Z       %76 = tt.broadcast %75 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0182367Z       %77 = tt.expand_dims %72 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:11.0182645Z       %78 = tt.broadcast %77 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0182912Z       %79 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:56:11.0183180Z         %88 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0183402Z         %89 = arith.addi %88, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0183580Z         %90 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:56:11.0183745Z         %91 = tt.splat %90 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0183960Z         %92 = arith.addi %91, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0184232Z         %93 = tt.expand_dims %92 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.0184527Z         %94 = tt.broadcast %93 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0184720Z         %95 = arith.addi %76, %94 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0184915Z         %96 = tt.addptr %9, %95 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0185136Z         %97 = tt.load %96 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.0185353Z         %98 = ttg.local_alloc %97 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:56:11.0185684Z         %99 = ttg.local_load %98 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0186095Z         %100 = arith.extf %99 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0186479Z         %101 = tt.expand_dims %89 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0186732Z         %102 = arith.muli %101, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0186928Z         %103 = tt.broadcast %102 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0187125Z         %104 = arith.addi %103, %78 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0187328Z         %105 = tt.addptr %10, %104 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0187534Z         %106 = tt.load %105 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.0187780Z         %107 = ttg.convert_layout %106 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0188066Z         %108 = arith.shli %107, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0188306Z         %109 = arith.shrsi %108, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0188544Z         %110 = arith.shrsi %107, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0188832Z         %111 = tt.expand_dims %109 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0189171Z         %112 = tt.expand_dims %110 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0189477Z         %113 = tt.broadcast %111 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0189717Z         %114 = arith.select %15, %113, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0189975Z         %115 = tt.broadcast %112 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0190210Z         %116 = arith.select %17, %115, %114 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0190446Z         %117 = tt.reshape %116 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.0190673Z         %118 = arith.sitofp %117 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.0190927Z         %119 = ttg.local_alloc %118 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.0191257Z         %120 = ttg.local_load %119 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0191742Z         %121 = tt.dot %100, %120, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.0192096Z         %122 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:56:11.0192287Z         %123 = tt.splat %122 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0192512Z         %124 = arith.addi %123, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0192689Z         %125 = arith.muli %122, %c2_i32 : i32
2026-02-21T09:56:11.0192885Z         %126 = tt.splat %125 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0193108Z         %127 = arith.addi %126, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0193387Z         %128 = tt.expand_dims %127 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.0193666Z         %129 = tt.broadcast %128 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0193864Z         %130 = arith.addi %76, %129 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0194066Z         %131 = tt.addptr %9, %130 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0194278Z         %132 = tt.load %131 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.0194506Z         %133 = ttg.local_alloc %132 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:56:11.0194844Z         %134 = ttg.local_load %133 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0195255Z         %135 = arith.extf %134 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0195641Z         %136 = tt.expand_dims %124 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0195894Z         %137 = arith.muli %136, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0196090Z         %138 = tt.broadcast %137 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0196284Z         %139 = arith.addi %138, %78 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0196483Z         %140 = tt.addptr %10, %139 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0196687Z         %141 = tt.load %140 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.0196935Z         %142 = ttg.convert_layout %141 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0197246Z         %143 = arith.shli %142, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0197482Z         %144 = arith.shrsi %143, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0197719Z         %145 = arith.shrsi %142, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0198033Z         %146 = tt.expand_dims %144 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0198376Z         %147 = tt.expand_dims %145 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0198664Z         %148 = tt.broadcast %146 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0198908Z         %149 = arith.select %15, %148, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0199149Z         %150 = tt.broadcast %147 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0199384Z         %151 = arith.select %17, %150, %149 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0199620Z         %152 = tt.reshape %151 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.0199850Z         %153 = arith.sitofp %152 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.0200126Z         %154 = ttg.local_alloc %153 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.0200458Z         %155 = ttg.local_load %154 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0200949Z         %156 = tt.dot %135, %155, %121, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.0201302Z         scf.yield %156 : tensor<128x128xf32, #mma>
2026-02-21T09:56:11.0201439Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:56:11.0201607Z       %80 = arith.truncf %79 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:11.0201878Z       %81 = tt.expand_dims %68 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:11.0202115Z       %82 = arith.muli %81, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.0202343Z       %83 = tt.expand_dims %73 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:11.0202659Z       %84 = tt.broadcast %82 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0202860Z       %85 = tt.broadcast %83 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0203040Z       %86 = arith.addi %84, %85 : tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0203227Z       %87 = tt.addptr %18, %86 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0203423Z       tt.store %87, %80 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.0203554Z     }
2026-02-21T09:56:11.0203649Z     scf.for %arg3 = %22 to %2 step %c1_i32  : i32 {
2026-02-21T09:56:11.0203786Z       %23 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:56:11.0203905Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:56:11.0204025Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:11.0204142Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:56:11.0204263Z       %27 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:56:11.0204382Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:11.0204492Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:11.0204604Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:11.0204715Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:11.0204883Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.0205115Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.0205331Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.0205560Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.0205723Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:56:11.0205888Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.0206095Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.0206306Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.0206515Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.0206787Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.0207044Z       %42 = arith.muli %41, %cst_4 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.0207236Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0207522Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:11.0207824Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0208093Z       %46 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c4_i32 iter_args(%arg5 = %cst_2) -> (tensor<128x128xf32, #mma>)  : i32 {
2026-02-21T09:56:11.0208380Z         %55 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0208601Z         %56 = arith.addi %55, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0208780Z         %57 = arith.muli %arg4, %c2_i32 : i32
2026-02-21T09:56:11.0208945Z         %58 = tt.splat %57 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0209162Z         %59 = arith.addi %58, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0209434Z         %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.0209705Z         %61 = tt.broadcast %60 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0209898Z         %62 = arith.addi %43, %61 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0210097Z         %63 = tt.addptr %9, %62 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0210302Z         %64 = tt.load %63 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.0210521Z         %65 = ttg.local_alloc %64 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:56:11.0210849Z         %66 = ttg.local_load %65 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0211255Z         %67 = arith.extf %66 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0211643Z         %68 = tt.expand_dims %56 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0211888Z         %69 = arith.muli %68, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0212075Z         %70 = tt.broadcast %69 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0212266Z         %71 = arith.addi %70, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0212457Z         %72 = tt.addptr %10, %71 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0212671Z         %73 = tt.load %72 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.0212911Z         %74 = ttg.convert_layout %73 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0213201Z         %75 = arith.shli %74, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0213432Z         %76 = arith.shrsi %75, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0213664Z         %77 = arith.shrsi %74, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0213948Z         %78 = tt.expand_dims %76 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0214284Z         %79 = tt.expand_dims %77 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0214583Z         %80 = tt.broadcast %78 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0214817Z         %81 = arith.select %15, %80, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0215051Z         %82 = tt.broadcast %79 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0215278Z         %83 = arith.select %17, %82, %81 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0215526Z         %84 = tt.reshape %83 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.0215747Z         %85 = arith.sitofp %84 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.0215994Z         %86 = ttg.local_alloc %85 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.0216331Z         %87 = ttg.local_load %86 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0216799Z         %88 = tt.dot %67, %87, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.0217145Z         %89 = arith.addi %arg4, %c2_i32 : i32
2026-02-21T09:56:11.0217315Z         %90 = tt.splat %89 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0217532Z         %91 = arith.addi %90, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.0217703Z         %92 = arith.muli %89, %c2_i32 : i32
2026-02-21T09:56:11.0217865Z         %93 = tt.splat %92 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0218082Z         %94 = arith.addi %93, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.0218352Z         %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.0218621Z         %96 = tt.broadcast %95 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0218811Z         %97 = arith.addi %43, %96 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0219005Z         %98 = tt.addptr %9, %97 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.0219206Z         %99 = tt.load %98 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.0219426Z         %100 = ttg.local_alloc %99 : (tensor<128x4xbf16, #blocked2>) -> !ttg.memdesc<128x4xbf16, #shared, #smem>
2026-02-21T09:56:11.0219756Z         %101 = ttg.local_load %100 : !ttg.memdesc<128x4xbf16, #shared, #smem> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0220166Z         %102 = arith.extf %101 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0220564Z         %103 = tt.expand_dims %91 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0220811Z         %104 = arith.muli %103, %cst_3 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.0221025Z         %105 = tt.broadcast %104 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0221218Z         %106 = arith.addi %105, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0221420Z         %107 = tt.addptr %10, %106 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.0221622Z         %108 = tt.load %107 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.0221867Z         %109 = ttg.convert_layout %108 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0222151Z         %110 = arith.shli %109, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0222392Z         %111 = arith.shrsi %110, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0222630Z         %112 = arith.shrsi %109, %cst_6 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.0222920Z         %113 = tt.expand_dims %111 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0223274Z         %114 = tt.expand_dims %112 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.0223561Z         %115 = tt.broadcast %113 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0223801Z         %116 = arith.select %15, %115, %cst_5 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0224064Z         %117 = tt.broadcast %114 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0224299Z         %118 = arith.select %17, %117, %116 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.0224531Z         %119 = tt.reshape %118 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.0224758Z         %120 = arith.sitofp %119 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.0225011Z         %121 = ttg.local_alloc %120 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.0225339Z         %122 = ttg.local_load %121 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.0225816Z         %123 = tt.dot %102, %122, %88, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.0226164Z         scf.yield %123 : tensor<128x128xf32, #mma>
2026-02-21T09:56:11.0226298Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T09:56:11.0226465Z       %47 = arith.truncf %46 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:11.0226730Z       %48 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:11.0226964Z       %49 = arith.muli %48, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.0227194Z       %50 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:11.0227450Z       %51 = tt.broadcast %49 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0227650Z       %52 = tt.broadcast %50 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0227827Z       %53 = arith.addi %51, %52 : tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0228012Z       %54 = tt.addptr %18, %53 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:11.0228222Z       tt.store %54, %47 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.0228359Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:56:11.0228460Z     tt.return
2026-02-21T09:56:11.0228539Z   }
2026-02-21T09:56:11.0228609Z }
2026-02-21T09:56:11.0228652Z 
2026-02-21T09:56:11.0228699Z {-#
2026-02-21T09:56:11.0228775Z   external_resources: {
2026-02-21T09:56:11.0228874Z     mlir_reproducer: {
2026-02-21T09:56:11.0229871Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:56:11.0230859Z       disable_threading: false,
2026-02-21T09:56:11.0230963Z       verify_each: true
2026-02-21T09:56:11.0231050Z     }
2026-02-21T09:56:11.0231119Z   }
2026-02-21T09:56:11.0231186Z #-}
2026-02-21T09:56:11.0231461Z /tmp/torchinductor_root/6o/c6ofv26dcuti2kka3crljighg33iusjmljdwlhhl27molm6wtgw6.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:56:11.0232165Z /tmp/torchinductor_root/6o/c6ofv26dcuti2kka3crljighg33iusjmljdwlhhl27molm6wtgw6.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:56:11.0232707Z [701s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:56:11.0233498Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[0, 3], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:56:11.0234203Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:56:11.0234367Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:56:11.5973539Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:56:11.5980977Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:56:11.5981327Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:56:11.5981638Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:56:11.5981930Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:56:11.5982212Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:56:11.5982466Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:56:11.5982697Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:56:11.5982876Z #smem = #ttg.shared_memory
2026-02-21T09:56:11.5983107Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:56:11.5983580Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:56:11.5984156Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.5984331Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.5984575Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.5984756Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:56:11.5984918Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:56:11.5985037Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:56:11.5985224Z     %cst_3 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.5985480Z     %cst_4 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.5985727Z     %cst_5 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.5985943Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.5986118Z     %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.5986268Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:56:11.5986386Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:56:11.5986500Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:56:11.5986647Z     %cst_8 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.5986831Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:56:11.5986941Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:56:11.5987114Z     %cst_9 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.5987301Z     %0 = tt.get_program_id x : i32
2026-02-21T09:56:11.5987487Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:56:11.5987602Z     %2 = arith.minsi %1, %c8192_i32 : i32
2026-02-21T09:56:11.5987805Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.5988078Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.5988348Z     %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.5988615Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.5988877Z     %7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.5989139Z     %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.5989380Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.5989579Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.5989848Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:56:11.5990263Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:56:11.5990663Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.5990912Z     %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.5991106Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:56:11.5991306Z     %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.5991493Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:56:11.5991719Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.5991875Z     %19 = arith.subi %2, %0 : i32
2026-02-21T09:56:11.5992005Z     %20 = arith.remsi %19, %c2_i32 : i32
2026-02-21T09:56:11.5992148Z     %21 = arith.subi %19, %20 : i32
2026-02-21T09:56:11.5992256Z     %22 = arith.addi %0, %21 : i32
2026-02-21T09:56:11.5992378Z     scf.for %arg3 = %0 to %22 step %c2_i32  : i32 {
2026-02-21T09:56:11.5992515Z       %23 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:56:11.5992637Z       %24 = arith.muli %23, %c2_i32 : i32
2026-02-21T09:56:11.5992754Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:11.5992867Z       %26 = arith.minsi %25, %c2_i32 : i32
2026-02-21T09:56:11.5992986Z       %27 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:56:11.5993103Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:11.5993217Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:11.5993329Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:11.5993438Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:11.5993607Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.5993817Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.5994034Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.5994263Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.5994426Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:56:11.5994590Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.5994796Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.5995026Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.5995237Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.5995506Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.5995761Z       %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.5995952Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.5996231Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:11.5996507Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.5996728Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.5996992Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.5997261Z       %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.5997452Z       %49 = arith.addi %43, %48 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.5997650Z       %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.5997875Z       %51 = tt.load %50 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.5998161Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.5998515Z       ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.5998790Z       %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.5999061Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.5999349Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.5999537Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.5999744Z       %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.5999945Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.6000222Z       %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6000578Z       ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6001105Z       %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:56:11.6001583Z         %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.6001817Z         %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.6001993Z         %201 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:11.6002113Z         %202 = arith.muli %201, %c2_i32 : i32
2026-02-21T09:56:11.6002302Z         %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.6002521Z         %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.6002895Z         %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.6003171Z         %206 = tt.broadcast %205 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6003369Z         %207 = arith.addi %43, %206 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6003572Z         %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6003780Z         %209 = tt.load %208 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.6004087Z         %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6004526Z         %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6004915Z         %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6005165Z         %213 = arith.muli %212, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6005360Z         %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6005554Z         %215 = arith.addi %214, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6005756Z         %216 = tt.addptr %10, %215 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6005957Z         %217 = tt.load %216 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.6006206Z         %218 = ttg.convert_layout %217 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6006488Z         %219 = arith.shli %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6006726Z         %220 = arith.shrsi %219, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6006963Z         %221 = arith.shrsi %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6007271Z         %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6007609Z         %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6007914Z         %224 = tt.broadcast %222 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6008161Z         %225 = arith.select %15, %224, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6008404Z         %226 = tt.broadcast %223 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6008641Z         %227 = arith.select %17, %226, %225 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6008879Z         %228 = tt.reshape %227 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.6009107Z         %229 = arith.sitofp %228 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.6009372Z         %230 = ttg.local_alloc %229 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.6009710Z         %231 = ttg.local_load %230 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6010231Z         %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.6010589Z         %233 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:11.6010720Z         %234 = arith.cmpi slt, %233, %c2_i32 : i32
2026-02-21T09:56:11.6010881Z         %235 = arith.select %234, %233, %c0_i32 : i32
2026-02-21T09:56:11.6011156Z         %236 = ttg.memdesc_index %46[%235] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6011522Z         ttg.local_store %209, %236 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6011930Z         scf.yield %232, %235, %arg8, %236 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6012280Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:11.6012495Z       %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.6012829Z       %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6013258Z       %63 = arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6013646Z       %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6013898Z       %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6014089Z       %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6014283Z       %67 = arith.addi %66, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6014479Z       %68 = tt.addptr %10, %67 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6014682Z       %69 = tt.load %68 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.6014924Z       %70 = ttg.convert_layout %69 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6015200Z       %71 = arith.shli %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6015451Z       %72 = arith.shrsi %71, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6015681Z       %73 = arith.shrsi %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6015986Z       %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6016323Z       %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6016603Z       %76 = tt.broadcast %74 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6016845Z       %77 = arith.select %15, %76, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6017080Z       %78 = tt.broadcast %75 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6017314Z       %79 = arith.select %17, %78, %77 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6017543Z       %80 = tt.reshape %79 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.6017764Z       %81 = arith.sitofp %80 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.6018019Z       %82 = ttg.local_alloc %81 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.6018359Z       %83 = ttg.local_load %82 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6018849Z       %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.6019244Z       %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.6019577Z       %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6020011Z       %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6020396Z       %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6020641Z       %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6020834Z       %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6021023Z       %91 = arith.addi %90, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6021219Z       %92 = tt.addptr %10, %91 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6021422Z       %93 = tt.load %92 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.6021661Z       %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6021941Z       %95 = arith.shli %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6022172Z       %96 = arith.shrsi %95, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6022407Z       %97 = arith.shrsi %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6022695Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6023026Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6023309Z       %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6023571Z       %101 = arith.select %15, %100, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6023813Z       %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6024068Z       %103 = arith.select %17, %102, %101 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6024303Z       %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.6024535Z       %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.6024790Z       %106 = ttg.local_alloc %105 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.6025124Z       %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6025600Z       %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.6025983Z       ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.6026206Z       %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:11.6026501Z       %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:11.6026743Z       %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.6026996Z       %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:11.6027258Z       %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6027472Z       %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6027656Z       %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6027857Z       %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6028064Z       tt.store %116, %109 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.6028211Z       %117 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:56:11.6028341Z       %118 = arith.divsi %117, %c128_i32 : i32
2026-02-21T09:56:11.6028468Z       %119 = arith.muli %118, %c2_i32 : i32
2026-02-21T09:56:11.6028596Z       %120 = arith.subi %c128_i32, %119 : i32
2026-02-21T09:56:11.6028720Z       %121 = arith.minsi %120, %c2_i32 : i32
2026-02-21T09:56:11.6028846Z       %122 = arith.remsi %117, %c128_i32 : i32
2026-02-21T09:56:11.6028971Z       %123 = arith.remsi %122, %121 : i32
2026-02-21T09:56:11.6029089Z       %124 = arith.addi %119, %123 : i32
2026-02-21T09:56:11.6029211Z       %125 = arith.divsi %122, %121 : i32
2026-02-21T09:56:11.6029332Z       %126 = arith.muli %124, %c128_i32 : i32
2026-02-21T09:56:11.6029511Z       %127 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.6029731Z       %128 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.6029958Z       %129 = arith.addi %127, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.6030178Z       %130 = arith.addi %128, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.6030350Z       %131 = arith.muli %125, %c128_i32 : i32
2026-02-21T09:56:11.6030525Z       %132 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.6030743Z       %133 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.6030965Z       %134 = arith.addi %132, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.6031203Z       %135 = arith.addi %133, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.6031478Z       %136 = tt.expand_dims %129 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.6031755Z       %137 = arith.muli %136, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.6031955Z       %138 = tt.broadcast %137 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6032246Z       %139 = tt.expand_dims %134 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:11.6032534Z       %140 = tt.broadcast %139 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6032759Z       %141 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.6032952Z       %142 = arith.addi %138, %48 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6033153Z       %143 = tt.addptr %9, %142 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6033365Z       %144 = tt.load %143 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.6033656Z       %145 = ttg.memdesc_index %141[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6034040Z       ttg.local_store %144, %145 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6034287Z       %146 = arith.addi %138, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6034489Z       %147 = tt.addptr %9, %146 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6034714Z       %148 = tt.load %147 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.6035006Z       %149 = ttg.memdesc_index %141[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6035371Z       ttg.local_store %148, %149 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6035903Z       %150:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %145, %arg8 = %149) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:56:11.6036390Z         %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.6036620Z         %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.6036804Z         %201 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:11.6036930Z         %202 = arith.muli %201, %c2_i32 : i32
2026-02-21T09:56:11.6037106Z         %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.6037328Z         %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.6037611Z         %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.6037896Z         %206 = tt.broadcast %205 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6038094Z         %207 = arith.addi %138, %206 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6038303Z         %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6038513Z         %209 = tt.load %208 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.6038823Z         %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6039285Z         %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6039670Z         %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6039945Z         %213 = arith.muli %212, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6040146Z         %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6040342Z         %215 = arith.addi %214, %140 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6040549Z         %216 = tt.addptr %10, %215 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6040755Z         %217 = tt.load %216 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.6041007Z         %218 = ttg.convert_layout %217 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6041292Z         %219 = arith.shli %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6041534Z         %220 = arith.shrsi %219, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6041776Z         %221 = arith.shrsi %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6042088Z         %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6042434Z         %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6042800Z         %224 = tt.broadcast %222 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6043044Z         %225 = arith.select %15, %224, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6043295Z         %226 = tt.broadcast %223 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6043533Z         %227 = arith.select %17, %226, %225 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6043774Z         %228 = tt.reshape %227 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.6044007Z         %229 = arith.sitofp %228 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.6044266Z         %230 = ttg.local_alloc %229 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.6044601Z         %231 = ttg.local_load %230 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6045084Z         %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.6045440Z         %233 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:11.6045574Z         %234 = arith.cmpi slt, %233, %c2_i32 : i32
2026-02-21T09:56:11.6045710Z         %235 = arith.select %234, %233, %c0_i32 : i32
2026-02-21T09:56:11.6045985Z         %236 = ttg.memdesc_index %141[%235] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6046371Z         ttg.local_store %209, %236 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6046778Z         scf.yield %232, %235, %arg8, %236 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6047120Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:11.6047464Z       %151 = ttg.local_load %150#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6047904Z       %152 = arith.extf %151 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6048234Z       %153 = arith.addi %66, %140 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6048441Z       %154 = tt.addptr %10, %153 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6048644Z       %155 = tt.load %154 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.6048894Z       %156 = ttg.convert_layout %155 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6049187Z       %157 = arith.shli %156, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6049425Z       %158 = arith.shrsi %157, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6049660Z       %159 = arith.shrsi %156, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6049951Z       %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6050308Z       %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6050594Z       %162 = tt.broadcast %160 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6050833Z       %163 = arith.select %15, %162, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6051087Z       %164 = tt.broadcast %161 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6051319Z       %165 = arith.select %17, %164, %163 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6051558Z       %166 = tt.reshape %165 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.6051786Z       %167 = arith.sitofp %166 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.6052040Z       %168 = ttg.local_alloc %167 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.6052369Z       %169 = ttg.local_load %168 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6052841Z       %170 = tt.dot %152, %169, %150#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.6053339Z       %171 = ttg.local_load %150#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6053772Z       %172 = arith.extf %171 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6054069Z       %173 = arith.addi %90, %140 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6054269Z       %174 = tt.addptr %10, %173 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6054473Z       %175 = tt.load %174 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.6054715Z       %176 = ttg.convert_layout %175 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6054996Z       %177 = arith.shli %176, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6055230Z       %178 = arith.shrsi %177, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6055488Z       %179 = arith.shrsi %176, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6055776Z       %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6056125Z       %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6056412Z       %182 = tt.broadcast %180 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6056649Z       %183 = arith.select %15, %182, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6056887Z       %184 = tt.broadcast %181 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6057123Z       %185 = arith.select %17, %184, %183 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6057353Z       %186 = tt.reshape %185 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.6057583Z       %187 = arith.sitofp %186 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.6057834Z       %188 = ttg.local_alloc %187 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.6058163Z       %189 = ttg.local_load %188 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6058650Z       %190 = tt.dot %172, %189, %170, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.6059037Z       ttg.local_dealloc %141 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.6059271Z       %191 = arith.truncf %190 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:11.6059545Z       %192 = tt.expand_dims %130 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:11.6059786Z       %193 = arith.muli %192, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.6060021Z       %194 = tt.expand_dims %135 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:11.6060281Z       %195 = tt.broadcast %193 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6060492Z       %196 = tt.broadcast %194 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6060676Z       %197 = arith.addi %195, %196 : tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6060867Z       %198 = tt.addptr %18, %197 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6061069Z       tt.store %198, %191 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.6061201Z     }
2026-02-21T09:56:11.6061298Z     scf.for %arg3 = %22 to %2 step %c1_i32  : i32 {
2026-02-21T09:56:11.6061434Z       %23 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:56:11.6061559Z       %24 = arith.muli %23, %c2_i32 : i32
2026-02-21T09:56:11.6061677Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:11.6061796Z       %26 = arith.minsi %25, %c2_i32 : i32
2026-02-21T09:56:11.6061917Z       %27 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:56:11.6062033Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:11.6062146Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:11.6062257Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:11.6062370Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:11.6062535Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.6062750Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.6062966Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.6063195Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.6063358Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:56:11.6063519Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.6063746Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.6063956Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.6064165Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.6064436Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.6064686Z       %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.6064879Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6065158Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:11.6065435Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6065655Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.6065934Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.6066203Z       %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6066392Z       %49 = arith.addi %43, %48 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6066600Z       %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6066801Z       %51 = tt.load %50 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.6067083Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6067439Z       ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6067713Z       %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.6067986Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.6068256Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6068444Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6068640Z       %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6068838Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.6069117Z       %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6069475Z       ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6069994Z       %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:56:11.6070467Z         %117 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.6070700Z         %118 = arith.addi %117, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.6070875Z         %119 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:11.6071022Z         %120 = arith.muli %119, %c2_i32 : i32
2026-02-21T09:56:11.6071190Z         %121 = tt.splat %120 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.6071413Z         %122 = arith.addi %121, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.6071706Z         %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.6071981Z         %124 = tt.broadcast %123 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6072177Z         %125 = arith.addi %43, %124 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6072377Z         %126 = tt.addptr %9, %125 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.6072588Z         %127 = tt.load %126 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.6072892Z         %128 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6073331Z         %129 = arith.extf %128 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6073718Z         %130 = tt.expand_dims %118 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6073981Z         %131 = arith.muli %130, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6074177Z         %132 = tt.broadcast %131 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6074376Z         %133 = arith.addi %132, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6074592Z         %134 = tt.addptr %10, %133 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6074796Z         %135 = tt.load %134 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.6075044Z         %136 = ttg.convert_layout %135 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6075327Z         %137 = arith.shli %136, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6075566Z         %138 = arith.shrsi %137, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6075802Z         %139 = arith.shrsi %136, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6076093Z         %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6076432Z         %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6076722Z         %142 = tt.broadcast %140 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6076971Z         %143 = arith.select %15, %142, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6077212Z         %144 = tt.broadcast %141 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6077452Z         %145 = arith.select %17, %144, %143 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6077682Z         %146 = tt.reshape %145 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.6077909Z         %147 = arith.sitofp %146 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.6078164Z         %148 = ttg.local_alloc %147 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.6078493Z         %149 = ttg.local_load %148 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6078988Z         %150 = tt.dot %129, %149, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.6079340Z         %151 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:11.6079484Z         %152 = arith.cmpi slt, %151, %c2_i32 : i32
2026-02-21T09:56:11.6079619Z         %153 = arith.select %152, %151, %c0_i32 : i32
2026-02-21T09:56:11.6079885Z         %154 = ttg.memdesc_index %46[%153] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6080244Z         ttg.local_store %127, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6080646Z         scf.yield %150, %153, %arg8, %154 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.6080987Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:11.6081201Z       %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.6081524Z       %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6081965Z       %63 = arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6082343Z       %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6082661Z       %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6082852Z       %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6083044Z       %67 = arith.addi %66, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6083231Z       %68 = tt.addptr %10, %67 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6083430Z       %69 = tt.load %68 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.6083666Z       %70 = ttg.convert_layout %69 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6083944Z       %71 = arith.shli %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6084172Z       %72 = arith.shrsi %71, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6084404Z       %73 = arith.shrsi %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6084687Z       %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6085014Z       %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6085295Z       %76 = tt.broadcast %74 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6085531Z       %77 = arith.select %15, %76, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6085761Z       %78 = tt.broadcast %75 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6085989Z       %79 = arith.select %17, %78, %77 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6086210Z       %80 = tt.reshape %79 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.6086430Z       %81 = arith.sitofp %80 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.6086680Z       %82 = ttg.local_alloc %81 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.6087028Z       %83 = ttg.local_load %82 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6087493Z       %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.6087899Z       %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.6088225Z       %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6088652Z       %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6089025Z       %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6089269Z       %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.6089460Z       %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6089647Z       %91 = arith.addi %90, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6089839Z       %92 = tt.addptr %10, %91 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.6090052Z       %93 = tt.load %92 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.6090290Z       %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6090580Z       %95 = arith.shli %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6090809Z       %96 = arith.shrsi %95, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6091040Z       %97 = arith.shrsi %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.6091320Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6091652Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.6091933Z       %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6092171Z       %101 = arith.select %15, %100, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6092407Z       %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6092640Z       %103 = arith.select %17, %102, %101 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.6092872Z       %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.6093096Z       %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.6093347Z       %106 = ttg.local_alloc %105 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.6093676Z       %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.6094141Z       %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.6094522Z       ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.6094737Z       %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:11.6095027Z       %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:11.6095265Z       %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.6095511Z       %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:11.6095772Z       %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6095981Z       %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6096161Z       %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6096354Z       %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:11.6096553Z       tt.store %116, %109 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.6096695Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:56:11.6096799Z     tt.return
2026-02-21T09:56:11.6096880Z   }
2026-02-21T09:56:11.6096958Z }
2026-02-21T09:56:11.6097001Z 
2026-02-21T09:56:11.6097033Z {-#
2026-02-21T09:56:11.6097116Z   external_resources: {
2026-02-21T09:56:11.6097216Z     mlir_reproducer: {
2026-02-21T09:56:11.6098244Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:56:11.6099239Z       disable_threading: false,
2026-02-21T09:56:11.6099346Z       verify_each: true
2026-02-21T09:56:11.6099437Z     }
2026-02-21T09:56:11.6099507Z   }
2026-02-21T09:56:11.6099581Z #-}
2026-02-21T09:56:11.6099856Z /tmp/torchinductor_root/gb/cgbsjwqgofmihm2crsdbzbb7mfznt7moa5ficlcmjnnpxx5kut7y.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:56:11.6100547Z /tmp/torchinductor_root/gb/cgbsjwqgofmihm2crsdbzbb7mfznt7moa5ficlcmjnnpxx5kut7y.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:56:11.6101097Z [702s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:56:11.6101872Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[0, 3], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:56:11.6102573Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:56:11.6102741Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:56:11.7584434Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:56:11.7591464Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T09:56:11.7592004Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:56:11.7592485Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T09:56:11.7593167Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T09:56:11.7593600Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 2], instrShape = [16, 16], isTransposed = true}>
2026-02-21T09:56:11.7594054Z #shared = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [0, 1]}>
2026-02-21T09:56:11.7594417Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:56:11.7594701Z #smem = #ttg.shared_memory
2026-02-21T09:56:11.7595071Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:56:11.7595800Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:56:11.7607252Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.7607456Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.7607646Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.7607855Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x128xf32, #mma>
2026-02-21T09:56:11.7608040Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:56:11.7608307Z     %cst_3 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7608595Z     %cst_4 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7608867Z     %cst_5 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.7609157Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7609356Z     %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.7609525Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:56:11.7609651Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:56:11.7609780Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:56:11.7609906Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T09:56:11.7610075Z     %cst_8 = arith.constant dense<0> : tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7610241Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:56:11.7610364Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:56:11.7610494Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:56:11.7610691Z     %cst_9 = arith.constant dense<4> : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7610898Z     %0 = tt.get_program_id x : i32
2026-02-21T09:56:11.7611023Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:56:11.7611154Z     %2 = arith.minsi %1, %c8192_i32 : i32
2026-02-21T09:56:11.7611379Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.7611692Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.7611999Z     %5 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.7612306Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.7612609Z     %7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7612909Z     %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.7613178Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.7613408Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.7613780Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:56:11.7614242Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:56:11.7614711Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.7615000Z     %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.7615216Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:56:11.7615435Z     %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:11.7615627Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x128xi1, #blocked>
2026-02-21T09:56:11.7615835Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.7615992Z     %19 = arith.subi %2, %0 : i32
2026-02-21T09:56:11.7616100Z     %20 = arith.remsi %19, %c2_i32 : i32
2026-02-21T09:56:11.7616215Z     %21 = arith.subi %19, %20 : i32
2026-02-21T09:56:11.7616323Z     %22 = arith.addi %0, %21 : i32
2026-02-21T09:56:11.7616445Z     scf.for %arg3 = %0 to %22 step %c2_i32  : i32 {
2026-02-21T09:56:11.7616580Z       %23 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:56:11.7616701Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:56:11.7616831Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:11.7616947Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:56:11.7617065Z       %27 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:56:11.7617180Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:11.7617309Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:11.7617416Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:11.7617529Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:11.7617697Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.7617911Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.7618124Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.7618331Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.7618495Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:56:11.7618655Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.7618864Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.7619070Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.7619280Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.7619549Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.7619796Z       %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.7619989Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7620269Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:11.7620545Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7620764Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.7621028Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.7621323Z       %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7621510Z       %49 = arith.addi %43, %48 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7621706Z       %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7621932Z       %51 = tt.load %50 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.7622213Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7622571Z       ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7622841Z       %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.7623114Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.7623386Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7623573Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7623770Z       %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7623969Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.7624277Z       %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7624632Z       ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7625181Z       %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:56:11.7625677Z         %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7625905Z         %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7626079Z         %201 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:11.7626204Z         %202 = arith.muli %201, %c2_i32 : i32
2026-02-21T09:56:11.7626373Z         %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.7626593Z         %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.7626868Z         %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.7627143Z         %206 = tt.broadcast %205 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7627339Z         %207 = arith.addi %43, %206 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7627539Z         %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7627746Z         %209 = tt.load %208 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.7628050Z         %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7628492Z         %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7628875Z         %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7629123Z         %213 = arith.muli %212, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7629345Z         %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7629537Z         %215 = arith.addi %214, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7629737Z         %216 = tt.addptr %10, %215 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7629957Z         %217 = tt.load %216 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.7630199Z         %218 = ttg.convert_layout %217 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7630485Z         %219 = arith.shli %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7630723Z         %220 = arith.shrsi %219, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7630959Z         %221 = arith.shrsi %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7631250Z         %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7631586Z         %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7631872Z         %224 = tt.broadcast %222 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7632116Z         %225 = arith.select %15, %224, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7632377Z         %226 = tt.broadcast %223 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7632615Z         %227 = arith.select %17, %226, %225 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7632861Z         %228 = tt.reshape %227 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.7633089Z         %229 = arith.sitofp %228 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.7633347Z         %230 = ttg.local_alloc %229 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.7633673Z         %231 = ttg.local_load %230 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7634161Z         %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.7634516Z         %233 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:11.7634643Z         %234 = arith.cmpi slt, %233, %c2_i32 : i32
2026-02-21T09:56:11.7634778Z         %235 = arith.select %234, %233, %c0_i32 : i32
2026-02-21T09:56:11.7635050Z         %236 = ttg.memdesc_index %46[%235] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7635414Z         ttg.local_store %209, %236 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7635819Z         scf.yield %232, %235, %arg8, %236 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7636202Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:11.7636462Z       %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7636789Z       %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7637219Z       %63 = arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7637621Z       %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7637861Z       %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7638180Z       %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7638367Z       %67 = arith.addi %66, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7638557Z       %68 = tt.addptr %10, %67 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7638755Z       %69 = tt.load %68 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.7638990Z       %70 = ttg.convert_layout %69 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7639263Z       %71 = arith.shli %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7639491Z       %72 = arith.shrsi %71, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7639719Z       %73 = arith.shrsi %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7640002Z       %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7640347Z       %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7640624Z       %76 = tt.broadcast %74 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7640857Z       %77 = arith.select %15, %76, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7641103Z       %78 = tt.broadcast %75 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7641331Z       %79 = arith.select %17, %78, %77 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7641553Z       %80 = tt.reshape %79 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.7641772Z       %81 = arith.sitofp %80 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.7642016Z       %82 = ttg.local_alloc %81 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.7642336Z       %83 = ttg.local_load %82 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7642859Z       %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.7643244Z       %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7643573Z       %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7644007Z       %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7644381Z       %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7644627Z       %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7644815Z       %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7645005Z       %91 = arith.addi %90, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7645198Z       %92 = tt.addptr %10, %91 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7645423Z       %93 = tt.load %92 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.7645660Z       %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7645931Z       %95 = arith.shli %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7646177Z       %96 = arith.shrsi %95, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7646405Z       %97 = arith.shrsi %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7646683Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7647019Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7647296Z       %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7647542Z       %101 = arith.select %15, %100, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7647774Z       %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7648009Z       %103 = arith.select %17, %102, %101 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7648242Z       %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.7648478Z       %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.7648730Z       %106 = ttg.local_alloc %105 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.7649071Z       %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7649536Z       %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.7649916Z       ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.7650129Z       %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:11.7650400Z       %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:11.7650636Z       %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.7650863Z       %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:11.7651123Z       %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7651327Z       %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7651511Z       %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7651705Z       %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7651907Z       tt.store %116, %109 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.7652050Z       %117 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:56:11.7652171Z       %118 = arith.divsi %117, %c256_i32 : i32
2026-02-21T09:56:11.7652294Z       %119 = arith.muli %118, %c4_i32 : i32
2026-02-21T09:56:11.7652413Z       %120 = arith.subi %c128_i32, %119 : i32
2026-02-21T09:56:11.7652532Z       %121 = arith.minsi %120, %c4_i32 : i32
2026-02-21T09:56:11.7652650Z       %122 = arith.remsi %117, %c256_i32 : i32
2026-02-21T09:56:11.7652770Z       %123 = arith.remsi %122, %121 : i32
2026-02-21T09:56:11.7652885Z       %124 = arith.addi %119, %123 : i32
2026-02-21T09:56:11.7652996Z       %125 = arith.divsi %122, %121 : i32
2026-02-21T09:56:11.7677557Z       %126 = arith.muli %124, %c128_i32 : i32
2026-02-21T09:56:11.7677727Z       %127 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.7677947Z       %128 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.7678191Z       %129 = arith.addi %127, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.7678409Z       %130 = arith.addi %128, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.7678578Z       %131 = arith.muli %125, %c128_i32 : i32
2026-02-21T09:56:11.7678744Z       %132 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.7678961Z       %133 = tt.splat %131 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.7679174Z       %134 = arith.addi %132, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.7679389Z       %135 = arith.addi %133, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.7679664Z       %136 = tt.expand_dims %129 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.7679922Z       %137 = arith.muli %136, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.7680122Z       %138 = tt.broadcast %137 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7680425Z       %139 = tt.expand_dims %134 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:11.7680706Z       %140 = tt.broadcast %139 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7680944Z       %141 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.7681128Z       %142 = arith.addi %138, %48 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7681330Z       %143 = tt.addptr %9, %142 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7681533Z       %144 = tt.load %143 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.7681822Z       %145 = ttg.memdesc_index %141[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7682187Z       ttg.local_store %144, %145 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7682428Z       %146 = arith.addi %138, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7682674Z       %147 = tt.addptr %9, %146 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7682878Z       %148 = tt.load %147 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.7683162Z       %149 = ttg.memdesc_index %141[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7683521Z       ttg.local_store %148, %149 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7684049Z       %150:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %145, %arg8 = %149) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:56:11.7684527Z         %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7684759Z         %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7684934Z         %201 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:11.7685058Z         %202 = arith.muli %201, %c2_i32 : i32
2026-02-21T09:56:11.7685225Z         %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.7685471Z         %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.7685743Z         %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.7686038Z         %206 = tt.broadcast %205 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7686233Z         %207 = arith.addi %138, %206 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7686435Z         %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7686644Z         %209 = tt.load %208 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.7686945Z         %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7687386Z         %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7687768Z         %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7688015Z         %213 = arith.muli %212, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7688207Z         %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7688420Z         %215 = arith.addi %214, %140 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7688621Z         %216 = tt.addptr %10, %215 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7688825Z         %217 = tt.load %216 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.7689094Z         %218 = ttg.convert_layout %217 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7689378Z         %219 = arith.shli %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7689612Z         %220 = arith.shrsi %219, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7689851Z         %221 = arith.shrsi %218, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7690142Z         %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7690478Z         %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7690763Z         %224 = tt.broadcast %222 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7691009Z         %225 = arith.select %15, %224, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7691247Z         %226 = tt.broadcast %223 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7691486Z         %227 = arith.select %17, %226, %225 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7691714Z         %228 = tt.reshape %227 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.7691940Z         %229 = arith.sitofp %228 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.7692196Z         %230 = ttg.local_alloc %229 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.7692521Z         %231 = ttg.local_load %230 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7692997Z         %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.7693365Z         %233 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:11.7693495Z         %234 = arith.cmpi slt, %233, %c2_i32 : i32
2026-02-21T09:56:11.7693629Z         %235 = arith.select %234, %233, %c0_i32 : i32
2026-02-21T09:56:11.7693913Z         %236 = ttg.memdesc_index %141[%235] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7694275Z         ttg.local_store %209, %236 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7694671Z         scf.yield %232, %235, %arg8, %236 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7695057Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:11.7695425Z       %151 = ttg.local_load %150#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7695858Z       %152 = arith.extf %151 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7696158Z       %153 = arith.addi %66, %140 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7696281Z       %154 = tt.addptr %10, %153 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7696345Z       %155 = tt.load %154 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.7696489Z       %156 = ttg.convert_layout %155 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7696600Z       %157 = arith.shli %156, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7696704Z       %158 = arith.shrsi %157, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7696799Z       %159 = arith.shrsi %156, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7696950Z       %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7697101Z       %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7697197Z       %162 = tt.broadcast %160 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7697299Z       %163 = arith.select %15, %162, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7697396Z       %164 = tt.broadcast %161 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7697494Z       %165 = arith.select %17, %164, %163 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7697586Z       %166 = tt.reshape %165 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.7697682Z       %167 = arith.sitofp %166 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.7697800Z       %168 = ttg.local_alloc %167 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.7697969Z       %169 = ttg.local_load %168 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7698236Z       %170 = tt.dot %152, %169, %150#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.7698431Z       %171 = ttg.local_load %150#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7698639Z       %172 = arith.extf %171 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7698716Z       %173 = arith.addi %90, %140 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7698816Z       %174 = tt.addptr %10, %173 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7698876Z       %175 = tt.load %174 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.7699021Z       %176 = ttg.convert_layout %175 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7699118Z       %177 = arith.shli %176, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7699216Z       %178 = arith.shrsi %177, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7699314Z       %179 = arith.shrsi %176, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7699462Z       %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7699607Z       %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7699703Z       %182 = tt.broadcast %180 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7699822Z       %183 = arith.select %15, %182, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7699914Z       %184 = tt.broadcast %181 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7700029Z       %185 = arith.select %17, %184, %183 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7700119Z       %186 = tt.reshape %185 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.7700212Z       %187 = arith.sitofp %186 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.7700331Z       %188 = ttg.local_alloc %187 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.7700499Z       %189 = ttg.local_load %188 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7700760Z       %190 = tt.dot %172, %189, %170, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.7700848Z       ttg.local_dealloc %141 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.7700938Z       %191 = arith.truncf %190 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:11.7701077Z       %192 = tt.expand_dims %130 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:11.7701138Z       %193 = arith.muli %192, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.7701276Z       %194 = tt.expand_dims %135 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:11.7701360Z       %195 = tt.broadcast %193 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7701443Z       %196 = tt.broadcast %194 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7701502Z       %197 = arith.addi %195, %196 : tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7701599Z       %198 = tt.addptr %18, %197 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7701665Z       tt.store %198, %191 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.7701709Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:56:11.7701776Z     scf.for %arg3 = %22 to %2 step %c1_i32  : i32 {
2026-02-21T09:56:11.7701821Z       %23 = arith.divsi %arg3, %c256_i32 : i32
2026-02-21T09:56:11.7701866Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:56:11.7701907Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:11.7701947Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:56:11.7702041Z       %27 = arith.remsi %arg3, %c256_i32 : i32
2026-02-21T09:56:11.7702081Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:11.7702121Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:11.7702160Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:11.7702203Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:11.7702293Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.7702377Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.7702469Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:11.7702551Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:11.7702592Z       %36 = arith.muli %30, %c128_i32 : i32
2026-02-21T09:56:11.7702683Z       %37 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.7702763Z       %38 = tt.splat %36 : i32 -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.7702849Z       %39 = arith.addi %37, %5 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:11.7702945Z       %40 = arith.addi %38, %6 : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:11.7703090Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.7703171Z       %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:11.7703268Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7703412Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x128xi32, #blocked1>
2026-02-21T09:56:11.7703501Z       %45 = tt.broadcast %44 : tensor<1x128xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7703589Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.7703728Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.7703814Z       %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7703875Z       %49 = arith.addi %43, %48 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7703975Z       %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7704035Z       %51 = tt.load %50 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.7704218Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7704359Z       ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7704450Z       %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.7704591Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.7704680Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7704739Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7704839Z       %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7704915Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.7705093Z       %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7705229Z       ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7705584Z       %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:56:11.7705682Z         %117 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7705774Z         %118 = arith.addi %117, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7705820Z         %119 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:11.7705863Z         %120 = arith.muli %119, %c2_i32 : i32
2026-02-21T09:56:11.7705954Z         %121 = tt.splat %120 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.7706045Z         %122 = arith.addi %121, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:11.7706191Z         %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:11.7706303Z         %124 = tt.broadcast %123 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7706368Z         %125 = arith.addi %43, %124 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7706469Z         %126 = tt.addptr %9, %125 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:11.7706546Z         %127 = tt.load %126 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:11.7706748Z         %128 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7706945Z         %129 = arith.extf %128 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7707090Z         %130 = tt.expand_dims %118 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7707157Z         %131 = arith.muli %130, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7707248Z         %132 = tt.broadcast %131 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7707307Z         %133 = arith.addi %132, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7707411Z         %134 = tt.addptr %10, %133 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7707473Z         %135 = tt.load %134 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.7707619Z         %136 = ttg.convert_layout %135 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7707718Z         %137 = arith.shli %136, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7707817Z         %138 = arith.shrsi %137, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7707914Z         %139 = arith.shrsi %136, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7708063Z         %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7708209Z         %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7708305Z         %142 = tt.broadcast %140 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7708426Z         %143 = arith.select %15, %142, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7708520Z         %144 = tt.broadcast %141 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7708633Z         %145 = arith.select %17, %144, %143 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7708725Z         %146 = tt.reshape %145 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.7708819Z         %147 = arith.sitofp %146 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.7708937Z         %148 = ttg.local_alloc %147 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.7709111Z         %149 = ttg.local_load %148 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7709376Z         %150 = tt.dot %129, %149, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.7709423Z         %151 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:11.7709473Z         %152 = arith.cmpi slt, %151, %c2_i32 : i32
2026-02-21T09:56:11.7709523Z         %153 = arith.select %152, %151, %c0_i32 : i32
2026-02-21T09:56:11.7709715Z         %154 = ttg.memdesc_index %46[%153] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7709858Z         ttg.local_store %127, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7710091Z         scf.yield %150, %153, %arg8, %154 : tensor<128x128xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:11.7710214Z       } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:11.7710308Z       %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7710506Z       %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7710701Z       %63 = arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7710846Z       %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7710907Z       %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7710995Z       %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7711057Z       %67 = arith.addi %66, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7711153Z       %68 = tt.addptr %10, %67 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7711211Z       %69 = tt.load %68 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.7711355Z       %70 = ttg.convert_layout %69 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7711449Z       %71 = arith.shli %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7711545Z       %72 = arith.shrsi %71, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7711639Z       %73 = arith.shrsi %70, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7711786Z       %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7711943Z       %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7712037Z       %76 = tt.broadcast %74 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7712149Z       %77 = arith.select %15, %76, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7712240Z       %78 = tt.broadcast %75 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7712336Z       %79 = arith.select %17, %78, %77 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7712422Z       %80 = tt.reshape %79 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.7712511Z       %81 = arith.sitofp %80 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.7712625Z       %82 = ttg.local_alloc %81 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.7712796Z       %83 = ttg.local_load %82 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7713052Z       %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.7713158Z       %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:11.7713350Z       %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7713555Z       %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7713699Z       %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7713759Z       %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:11.7713846Z       %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7713905Z       %91 = arith.addi %90, %45 : tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7714003Z       %92 = tt.addptr %10, %91 : tensor<2x128x!tt.ptr<i8>, #blocked1>, tensor<2x128xi32, #blocked1>
2026-02-21T09:56:11.7714059Z       %93 = tt.load %92 : tensor<2x128x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:11.7714200Z       %94 = ttg.convert_layout %93 : tensor<2x128xi8, #blocked1> -> tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7714294Z       %95 = arith.shli %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7714391Z       %96 = arith.shrsi %95, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7714482Z       %97 = arith.shrsi %94, %cst_9 : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:11.7714630Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7714773Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x128xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x128xi8, #blocked>
2026-02-21T09:56:11.7714867Z       %100 = tt.broadcast %98 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7714972Z       %101 = arith.select %15, %100, %cst_8 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7715063Z       %102 = tt.broadcast %99 : tensor<2x1x128xi8, #blocked> -> tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7715162Z       %103 = arith.select %17, %102, %101 : tensor<2x2x128xi1, #blocked>, tensor<2x2x128xi8, #blocked>
2026-02-21T09:56:11.7715269Z       %104 = tt.reshape %103 : tensor<2x2x128xi8, #blocked> -> tensor<4x128xi8, #blocked3>
2026-02-21T09:56:11.7715360Z       %105 = arith.sitofp %104 : tensor<4x128xi8, #blocked3> to tensor<4x128xf32, #blocked3>
2026-02-21T09:56:11.7715499Z       %106 = ttg.local_alloc %105 : (tensor<4x128xf32, #blocked3>) -> !ttg.memdesc<4x128xf32, #shared1, #smem>
2026-02-21T09:56:11.7715670Z       %107 = ttg.local_load %106 : !ttg.memdesc<4x128xf32, #shared1, #smem> -> tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:11.7715929Z       %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x128xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x128xf32, #mma>
2026-02-21T09:56:11.7716015Z       ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:11.7716108Z       %109 = arith.truncf %108 : tensor<128x128xf32, #mma> to tensor<128x128xbf16, #mma>
2026-02-21T09:56:11.7716246Z       %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:11.7716304Z       %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:11.7716442Z       %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x128xi32, #mma>
2026-02-21T09:56:11.7716538Z       %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7716619Z       %114 = tt.broadcast %112 : tensor<1x128xi32, #mma> -> tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7716678Z       %115 = arith.addi %113, %114 : tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7716787Z       %116 = tt.addptr %18, %115 : tensor<128x128x!tt.ptr<bf16>, #mma>, tensor<128x128xi32, #mma>
2026-02-21T09:56:11.7716850Z       tt.store %116, %109 : tensor<128x128x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:11.7716895Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:56:11.7716931Z     tt.return
2026-02-21T09:56:11.7716962Z   }
2026-02-21T09:56:11.7716996Z }
2026-02-21T09:56:11.7717003Z 
2026-02-21T09:56:11.7717035Z {-#
2026-02-21T09:56:11.7717076Z   external_resources: {
2026-02-21T09:56:11.7717114Z     mlir_reproducer: {
2026-02-21T09:56:11.7718057Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:56:11.7718100Z       disable_threading: false,
2026-02-21T09:56:11.7718139Z       verify_each: true
2026-02-21T09:56:11.7718170Z     }
2026-02-21T09:56:11.7718200Z   }
2026-02-21T09:56:11.7718229Z #-}
2026-02-21T09:56:11.7718469Z /tmp/torchinductor_root/ua/cuarvwdhzhdpgdiqi4kmk3qzuzohxvqdbtvdai46jfdu5h44yne7.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:56:11.7718888Z /tmp/torchinductor_root/ua/cuarvwdhzhdpgdiqi4kmk3qzuzohxvqdbtvdai46jfdu5h44yne7.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:56:11.7719001Z [702s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:56:11.7719627Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[1, 3], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:56:11.7719717Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:56:11.7719801Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:56:11.9500620Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 42/42 10.2 configs/s
2026-02-21T09:56:15.6226814Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━ 214/214 33.0 configs/s
2026-02-21T09:56:18.2676565Z [709s] Generation 12 complete: 
2026-02-21T09:56:18.2676944Z error=6
2026-02-21T09:56:18.2677153Z ok=39
2026-02-21T09:56:18.2677359Z min=0.8961
2026-02-21T09:56:18.2677590Z mid=1.1081
2026-02-21T09:56:18.2677792Z max=32.6472
2026-02-21T09:56:18.2678033Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:56:18.2678434Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:56:18.2678794Z  'l2_groupings': [2],
2026-02-21T09:56:18.2679066Z  'load_eviction_policies': ['', ''],
2026-02-21T09:56:18.2679374Z  'loop_orders': [[0, 1]],
2026-02-21T09:56:18.2679656Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:56:18.2679937Z  'num_stages': 1,
2026-02-21T09:56:18.2680170Z  'num_warps': 4,
2026-02-21T09:56:18.2680407Z  'pid_type': 'flat',
2026-02-21T09:56:18.2680665Z  'range_flattens': [None, None],
2026-02-21T09:56:18.2680975Z  'range_multi_buffers': [None, False],
2026-02-21T09:56:18.2681298Z  'range_num_stages': [0, 1],
2026-02-21T09:56:18.2681582Z  'range_unroll_factors': [0, 0],
2026-02-21T09:56:18.2681880Z  'range_warp_specializes': [],
2026-02-21T09:56:18.2682159Z  'waves_per_eu': 2}
2026-02-21T09:56:18.2762566Z [709s] Fitting surrogate: 1063 points, 1063 targets
2026-02-21T09:56:18.6038077Z [709s] Generation 13 starting: 21 neighbors, 1 active search path(s)
2026-02-21T09:56:23.5749151Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 4.1 configs/s
2026-02-21T09:56:25.3396425Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 12.0 configs/s
2026-02-21T09:56:25.6386785Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━ 223/223 270.2 configs/s
2026-02-21T09:56:28.7577544Z [719s] Generation 13 complete: 
2026-02-21T09:56:28.7577765Z ok=23
2026-02-21T09:56:28.7577855Z min=0.9161
2026-02-21T09:56:28.7577946Z mid=1.3263
2026-02-21T09:56:28.7578027Z max=13.0921
2026-02-21T09:56:28.7578122Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:56:28.7578294Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:56:28.7578437Z  'l2_groupings': [2],
2026-02-21T09:56:28.7578543Z  'load_eviction_policies': ['', ''],
2026-02-21T09:56:28.7578665Z  'loop_orders': [[0, 1]],
2026-02-21T09:56:28.7578773Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:56:28.7578890Z  'num_stages': 1,
2026-02-21T09:56:28.7578980Z  'num_warps': 4,
2026-02-21T09:56:28.7579085Z  'pid_type': 'flat',
2026-02-21T09:56:28.7579209Z  'range_flattens': [None, None],
2026-02-21T09:56:28.7579330Z  'range_multi_buffers': [None, False],
2026-02-21T09:56:28.7579454Z  'range_num_stages': [0, 1],
2026-02-21T09:56:28.7579563Z  'range_unroll_factors': [0, 0],
2026-02-21T09:56:28.7579682Z  'range_warp_specializes': [],
2026-02-21T09:56:28.7580087Z  'waves_per_eu': 2}
2026-02-21T09:56:28.7629289Z [719s] Fitting surrogate: 1086 points, 1086 targets
2026-02-21T09:56:29.0900377Z [719s] Generation 14 starting: 22 neighbors, 1 active search path(s)
2026-02-21T09:56:34.7399453Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 2.5 configs/s
2026-02-21T09:56:36.6326768Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/ADT/SmallVector.h:292: reference llvm::SmallVectorTemplateCommon<mlir::Value>::operator[](size_type) [T = mlir::Value]: Assertion `idx < size()' failed.
2026-02-21T09:56:36.6333512Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 4], order = [2, 1, 0]}>
2026-02-21T09:56:36.6334347Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T09:56:36.6334717Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 2], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T09:56:36.6335099Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [1, 0]}>
2026-02-21T09:56:36.6335533Z #mma = #ttg.amd_mfma<{version = 3, warpsPerCTA = [1, 4], instrShape = [32, 32], isTransposed = true}>
2026-02-21T09:56:36.6335852Z #shared = #ttg.swizzled_shared<{vec = 2, perPhase = 16, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:56:36.6336150Z #shared1 = #ttg.swizzled_shared<{vec = 1, perPhase = 1, maxPhase = 1, order = [1, 0]}>
2026-02-21T09:56:36.6336377Z #smem = #ttg.shared_memory
2026-02-21T09:56:36.6336668Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T09:56:36.6337278Z   tt.func public @_helion_matmul_bf16_int4(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<i8> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T09:56:36.6337764Z     %cst = arith.constant dense<8192> : tensor<128x1xi32, #mma>
2026-02-21T09:56:36.6337990Z     %cst_0 = arith.constant dense<1> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:36.6338205Z     %cst_1 = arith.constant dense<0> : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:36.6338435Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x256xf32, #mma>
2026-02-21T09:56:36.6338638Z     %c508_i32 = arith.constant 508 : i32
2026-02-21T09:56:36.6338862Z     %cst_3 = arith.constant dense<508> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6339280Z     %cst_4 = arith.constant dense<510> : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6339583Z     %cst_5 = arith.constant dense<4> : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:36.6339853Z     %cst_6 = arith.constant dense<8192> : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6340080Z     %cst_7 = arith.constant dense<1024> : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:36.6340269Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T09:56:36.6340413Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T09:56:36.6340558Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T09:56:36.6340709Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T09:56:36.6340853Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T09:56:36.6341046Z     %cst_8 = arith.constant dense<0> : tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6341224Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T09:56:36.6341368Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T09:56:36.6341589Z     %cst_9 = arith.constant dense<4> : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6341822Z     %0 = tt.get_program_id x : i32
2026-02-21T09:56:36.6341970Z     %1 = arith.addi %0, %c1_i32 : i32
2026-02-21T09:56:36.6342113Z     %2 = arith.minsi %1, %c4096_i32 : i32
2026-02-21T09:56:36.6342375Z     %3 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:36.6342799Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:36.6346349Z     %5 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:36.6346645Z     %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:36.6346927Z     %7 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6347212Z     %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:36.6347501Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:36.6347719Z     %10 = tt.splat %arg1 : !tt.ptr<i8> -> tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:36.6348014Z     %11 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>>
2026-02-21T09:56:36.6348477Z     %12 = tt.expand_dims %11 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #blocked}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T09:56:36.6348920Z     %13 = tt.expand_dims %12 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:36.6349195Z     %14 = arith.cmpi eq, %13, %cst_1 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:36.6349409Z     %15 = tt.broadcast %14 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:56:36.6349620Z     %16 = arith.cmpi eq, %13, %cst_0 : tensor<1x2x1xi32, #blocked>
2026-02-21T09:56:36.6349826Z     %17 = tt.broadcast %16 : tensor<1x2x1xi1, #blocked> -> tensor<2x2x256xi1, #blocked>
2026-02-21T09:56:36.6350047Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:36.6350221Z     %19 = arith.subi %2, %0 : i32
2026-02-21T09:56:36.6350347Z     %20 = arith.remsi %19, %c2_i32 : i32
2026-02-21T09:56:36.6350474Z     %21 = arith.subi %19, %20 : i32
2026-02-21T09:56:36.6350597Z     %22 = arith.addi %0, %21 : i32
2026-02-21T09:56:36.6350733Z     scf.for %arg3 = %0 to %22 step %c2_i32  : i32 {
2026-02-21T09:56:36.6350885Z       %23 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:56:36.6351016Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:56:36.6351151Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:36.6351302Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:56:36.6351437Z       %27 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:56:36.6351570Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:36.6351692Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:36.6351813Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:36.6351933Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:36.6352121Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:36.6352350Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:36.6352586Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:36.6352813Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:36.6352976Z       %36 = arith.muli %30, %c256_i32 : i32
2026-02-21T09:56:36.6353147Z       %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:36.6353358Z       %38 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:36.6353576Z       %39 = arith.addi %37, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:36.6353792Z       %40 = arith.addi %38, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:36.6354067Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:36.6354326Z       %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:36.6354618Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6354893Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:56:36.6355173Z       %45 = tt.broadcast %44 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6355392Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:36.6355682Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:36.6355954Z       %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6356142Z       %49 = arith.addi %43, %48 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6356340Z       %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6356561Z       %51 = tt.load %50 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:36.6356844Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6357202Z       ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6357473Z       %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:36.6357747Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:36.6358013Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6358206Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6358402Z       %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6358601Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:36.6358879Z       %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6359254Z       ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6359782Z       %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:56:36.6360256Z         %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6360482Z         %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6360661Z         %201 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:36.6360787Z         %202 = arith.muli %201, %c2_i32 : i32
2026-02-21T09:56:36.6360954Z         %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:36.6361178Z         %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:36.6361453Z         %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:36.6361737Z         %206 = tt.broadcast %205 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6361932Z         %207 = arith.addi %43, %206 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6362134Z         %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6362367Z         %209 = tt.load %208 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:36.6362752Z         %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6363202Z         %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6363588Z         %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6363861Z         %213 = arith.muli %212, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6364057Z         %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6364249Z         %215 = arith.addi %214, %45 : tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6364453Z         %216 = tt.addptr %10, %215 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6364694Z         %217 = tt.load %216 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:36.6364939Z         %218 = ttg.convert_layout %217 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6365224Z         %219 = arith.shli %218, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6365459Z         %220 = arith.shrsi %219, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6365700Z         %221 = arith.shrsi %218, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6365993Z         %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6366330Z         %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6366620Z         %224 = tt.broadcast %222 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6366862Z         %225 = arith.select %15, %224, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6367125Z         %226 = tt.broadcast %223 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6367393Z         %227 = arith.select %17, %226, %225 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6367627Z         %228 = tt.reshape %227 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:56:36.6367855Z         %229 = arith.sitofp %228 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:56:36.6368108Z         %230 = ttg.local_alloc %229 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:56:36.6368442Z         %231 = ttg.local_load %230 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6368924Z         %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:56:36.6369275Z         %233 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:36.6369405Z         %234 = arith.cmpi slt, %233, %c2_i32 : i32
2026-02-21T09:56:36.6369542Z         %235 = arith.select %234, %233, %c0_i32 : i32
2026-02-21T09:56:36.6369809Z         %236 = ttg.memdesc_index %46[%235] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6370173Z         ttg.local_store %209, %236 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6370574Z         scf.yield %232, %235, %arg8, %236 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6370934Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:36.6371150Z       %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6371474Z       %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6371920Z       %63 = arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6372297Z       %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6372537Z       %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6372745Z       %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6372933Z       %67 = arith.addi %66, %45 : tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6373128Z       %68 = tt.addptr %10, %67 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6373325Z       %69 = tt.load %68 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:36.6373564Z       %70 = ttg.convert_layout %69 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6373843Z       %71 = arith.shli %70, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6374074Z       %72 = arith.shrsi %71, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6374305Z       %73 = arith.shrsi %70, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6374589Z       %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6374921Z       %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6375215Z       %76 = tt.broadcast %74 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6375450Z       %77 = arith.select %15, %76, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6375685Z       %78 = tt.broadcast %75 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6375911Z       %79 = arith.select %17, %78, %77 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6376137Z       %80 = tt.reshape %79 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:56:36.6376357Z       %81 = arith.sitofp %80 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:56:36.6376605Z       %82 = ttg.local_alloc %81 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:56:36.6376926Z       %83 = ttg.local_load %82 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6377395Z       %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:56:36.6377783Z       %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6378113Z       %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6378537Z       %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6378927Z       %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6379171Z       %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6379362Z       %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6379551Z       %91 = arith.addi %90, %45 : tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6379760Z       %92 = tt.addptr %10, %91 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6379956Z       %93 = tt.load %92 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:36.6380193Z       %94 = ttg.convert_layout %93 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6380470Z       %95 = arith.shli %94, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6380714Z       %96 = arith.shrsi %95, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6380940Z       %97 = arith.shrsi %94, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6381226Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6381557Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6381839Z       %100 = tt.broadcast %98 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6382079Z       %101 = arith.select %15, %100, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6382314Z       %102 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6382547Z       %103 = arith.select %17, %102, %101 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6382779Z       %104 = tt.reshape %103 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:56:36.6383003Z       %105 = arith.sitofp %104 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:56:36.6394185Z       %106 = ttg.local_alloc %105 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:56:36.6394528Z       %107 = ttg.local_load %106 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6394992Z       %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:56:36.6395377Z       ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:36.6395596Z       %109 = arith.truncf %108 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:56:36.6395870Z       %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:36.6396111Z       %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:36.6396341Z       %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:56:36.6396605Z       %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6396812Z       %114 = tt.broadcast %112 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6396996Z       %115 = arith.addi %113, %114 : tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6397192Z       %116 = tt.addptr %18, %115 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6397416Z       tt.store %116, %109 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:36.6397563Z       %117 = arith.addi %arg3, %c1_i32 : i32
2026-02-21T09:56:36.6397685Z       %118 = arith.divsi %117, %c128_i32 : i32
2026-02-21T09:56:36.6397807Z       %119 = arith.muli %118, %c4_i32 : i32
2026-02-21T09:56:36.6397926Z       %120 = arith.subi %c128_i32, %119 : i32
2026-02-21T09:56:36.6398047Z       %121 = arith.minsi %120, %c4_i32 : i32
2026-02-21T09:56:36.6398167Z       %122 = arith.remsi %117, %c128_i32 : i32
2026-02-21T09:56:36.6398302Z       %123 = arith.remsi %122, %121 : i32
2026-02-21T09:56:36.6398419Z       %124 = arith.addi %119, %123 : i32
2026-02-21T09:56:36.6398532Z       %125 = arith.divsi %122, %121 : i32
2026-02-21T09:56:36.6398651Z       %126 = arith.muli %124, %c128_i32 : i32
2026-02-21T09:56:36.6398822Z       %127 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:36.6399043Z       %128 = tt.splat %126 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:36.6399289Z       %129 = arith.addi %127, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:36.6399502Z       %130 = arith.addi %128, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:36.6399671Z       %131 = arith.muli %125, %c256_i32 : i32
2026-02-21T09:56:36.6399841Z       %132 = tt.splat %131 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:36.6400057Z       %133 = tt.splat %131 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:36.6400274Z       %134 = arith.addi %132, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:36.6400488Z       %135 = arith.addi %133, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:36.6400763Z       %136 = tt.expand_dims %129 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:36.6401023Z       %137 = arith.muli %136, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:36.6401221Z       %138 = tt.broadcast %137 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6401504Z       %139 = tt.expand_dims %134 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:56:36.6401801Z       %140 = tt.broadcast %139 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6402025Z       %141 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:36.6402209Z       %142 = arith.addi %138, %48 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6402410Z       %143 = tt.addptr %9, %142 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6402678Z       %144 = tt.load %143 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:36.6402972Z       %145 = ttg.memdesc_index %141[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6403336Z       ttg.local_store %144, %145 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6403577Z       %146 = arith.addi %138, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6403778Z       %147 = tt.addptr %9, %146 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6403980Z       %148 = tt.load %147 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:36.6404264Z       %149 = ttg.memdesc_index %141[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6404621Z       ttg.local_store %148, %149 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6405149Z       %150:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %145, %arg8 = %149) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:56:36.6405646Z         %199 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6405877Z         %200 = arith.addi %199, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6406051Z         %201 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:36.6406200Z         %202 = arith.muli %201, %c2_i32 : i32
2026-02-21T09:56:36.6406370Z         %203 = tt.splat %202 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:36.6406590Z         %204 = arith.addi %203, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:36.6406864Z         %205 = tt.expand_dims %204 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:36.6407161Z         %206 = tt.broadcast %205 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6407357Z         %207 = arith.addi %138, %206 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6407559Z         %208 = tt.addptr %9, %207 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6407768Z         %209 = tt.load %208 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:36.6408070Z         %210 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6408510Z         %211 = arith.extf %210 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6408894Z         %212 = tt.expand_dims %200 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6409144Z         %213 = arith.muli %212, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6409340Z         %214 = tt.broadcast %213 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6409534Z         %215 = arith.addi %214, %140 : tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6409752Z         %216 = tt.addptr %10, %215 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6409958Z         %217 = tt.load %216 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:36.6410205Z         %218 = ttg.convert_layout %217 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6410490Z         %219 = arith.shli %218, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6410727Z         %220 = arith.shrsi %219, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6410964Z         %221 = arith.shrsi %218, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6411256Z         %222 = tt.expand_dims %220 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6411595Z         %223 = tt.expand_dims %221 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6411882Z         %224 = tt.broadcast %222 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6412129Z         %225 = arith.select %15, %224, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6412371Z         %226 = tt.broadcast %223 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6412609Z         %227 = arith.select %17, %226, %225 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6412845Z         %228 = tt.reshape %227 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:56:36.6413092Z         %229 = arith.sitofp %228 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:56:36.6413348Z         %230 = ttg.local_alloc %229 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:56:36.6413676Z         %231 = ttg.local_load %230 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6414176Z         %232 = tt.dot %211, %231, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:56:36.6414528Z         %233 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:36.6414656Z         %234 = arith.cmpi slt, %233, %c2_i32 : i32
2026-02-21T09:56:36.6414793Z         %235 = arith.select %234, %233, %c0_i32 : i32
2026-02-21T09:56:36.6415078Z         %236 = ttg.memdesc_index %141[%235] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6415442Z         ttg.local_store %209, %236 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6415845Z         scf.yield %232, %235, %arg8, %236 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6416185Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:36.6416512Z       %151 = ttg.local_load %150#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6416943Z       %152 = arith.extf %151 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6417246Z       %153 = arith.addi %66, %140 : tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6417448Z       %154 = tt.addptr %10, %153 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6417649Z       %155 = tt.load %154 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:36.6417911Z       %156 = ttg.convert_layout %155 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6418194Z       %157 = arith.shli %156, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6418430Z       %158 = arith.shrsi %157, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6418665Z       %159 = arith.shrsi %156, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6418955Z       %160 = tt.expand_dims %158 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6419297Z       %161 = tt.expand_dims %159 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6419584Z       %162 = tt.broadcast %160 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6419825Z       %163 = arith.select %15, %162, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6420067Z       %164 = tt.broadcast %161 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6420297Z       %165 = arith.select %17, %164, %163 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6420534Z       %166 = tt.reshape %165 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:56:36.6420760Z       %167 = arith.sitofp %166 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:56:36.6421030Z       %168 = ttg.local_alloc %167 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:56:36.6421357Z       %169 = ttg.local_load %168 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6421831Z       %170 = tt.dot %152, %169, %150#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:56:36.6422348Z       %171 = ttg.local_load %150#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6422780Z       %172 = arith.extf %171 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6423080Z       %173 = arith.addi %90, %140 : tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6423296Z       %174 = tt.addptr %10, %173 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6423500Z       %175 = tt.load %174 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:36.6423745Z       %176 = ttg.convert_layout %175 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6431378Z       %177 = arith.shli %176, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6431646Z       %178 = arith.shrsi %177, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6431889Z       %179 = arith.shrsi %176, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6432188Z       %180 = tt.expand_dims %178 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6432536Z       %181 = tt.expand_dims %179 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6432830Z       %182 = tt.broadcast %180 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6433078Z       %183 = arith.select %15, %182, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6433373Z       %184 = tt.broadcast %181 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6433610Z       %185 = arith.select %17, %184, %183 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6433841Z       %186 = tt.reshape %185 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:56:36.6434067Z       %187 = arith.sitofp %186 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:56:36.6434320Z       %188 = ttg.local_alloc %187 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:56:36.6434649Z       %189 = ttg.local_load %188 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6435128Z       %190 = tt.dot %172, %189, %170, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:56:36.6435519Z       ttg.local_dealloc %141 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:36.6435739Z       %191 = arith.truncf %190 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:56:36.6436015Z       %192 = tt.expand_dims %130 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:36.6436260Z       %193 = arith.muli %192, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:36.6436495Z       %194 = tt.expand_dims %135 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:56:36.6436774Z       %195 = tt.broadcast %193 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6436982Z       %196 = tt.broadcast %194 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6437166Z       %197 = arith.addi %195, %196 : tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6437360Z       %198 = tt.addptr %18, %197 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6437580Z       tt.store %198, %191 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:36.6437712Z     }
2026-02-21T09:56:36.6437810Z     scf.for %arg3 = %22 to %2 step %c1_i32  : i32 {
2026-02-21T09:56:36.6437946Z       %23 = arith.divsi %arg3, %c128_i32 : i32
2026-02-21T09:56:36.6438072Z       %24 = arith.muli %23, %c4_i32 : i32
2026-02-21T09:56:36.6438191Z       %25 = arith.subi %c128_i32, %24 : i32
2026-02-21T09:56:36.6438304Z       %26 = arith.minsi %25, %c4_i32 : i32
2026-02-21T09:56:36.6438427Z       %27 = arith.remsi %arg3, %c128_i32 : i32
2026-02-21T09:56:36.6438558Z       %28 = arith.remsi %27, %26 : i32
2026-02-21T09:56:36.6438671Z       %29 = arith.addi %24, %28 : i32
2026-02-21T09:56:36.6438779Z       %30 = arith.divsi %27, %26 : i32
2026-02-21T09:56:36.6438893Z       %31 = arith.muli %29, %c128_i32 : i32
2026-02-21T09:56:36.6439061Z       %32 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:36.6439275Z       %33 = tt.splat %31 : i32 -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:36.6439491Z       %34 = arith.addi %32, %3 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>>
2026-02-21T09:56:36.6439701Z       %35 = arith.addi %33, %4 : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>>
2026-02-21T09:56:36.6439864Z       %36 = arith.muli %30, %c256_i32 : i32
2026-02-21T09:56:36.6440024Z       %37 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:36.6440239Z       %38 = tt.splat %36 : i32 -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:36.6440449Z       %39 = arith.addi %37, %5 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>>
2026-02-21T09:56:36.6440657Z       %40 = arith.addi %38, %6 : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>>
2026-02-21T09:56:36.6440946Z       %41 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked2}>> -> tensor<128x1xi32, #blocked2>
2026-02-21T09:56:36.6441199Z       %42 = arith.muli %41, %cst_7 : tensor<128x1xi32, #blocked2>
2026-02-21T09:56:36.6441390Z       %43 = tt.broadcast %42 : tensor<128x1xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6441671Z       %44 = tt.expand_dims %39 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x256xi32, #blocked1>
2026-02-21T09:56:36.6441947Z       %45 = tt.broadcast %44 : tensor<1x256xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6442165Z       %46 = ttg.local_alloc : () -> !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:36.6442429Z       %47 = tt.expand_dims %8 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:36.6442752Z       %48 = tt.broadcast %47 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6442946Z       %49 = arith.addi %43, %48 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6443141Z       %50 = tt.addptr %9, %49 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6443348Z       %51 = tt.load %50 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:36.6443631Z       %52 = ttg.memdesc_index %46[%c0_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6443989Z       ttg.local_store %51, %52 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6444280Z       %53 = arith.addi %8, %cst_5 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:36.6444548Z       %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:36.6444819Z       %55 = tt.broadcast %54 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6445004Z       %56 = arith.addi %43, %55 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6445218Z       %57 = tt.addptr %9, %56 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6445419Z       %58 = tt.load %57 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:36.6445695Z       %59 = ttg.memdesc_index %46[%c1_i32] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6446050Z       ttg.local_store %58, %59 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6446592Z       %60:4 = scf.for %arg4 = %c0_i32 to %c508_i32 step %c2_i32 iter_args(%arg5 = %cst_2, %arg6 = %c1_i32, %arg7 = %52, %arg8 = %59) -> (tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>)  : i32 {
2026-02-21T09:56:36.6447066Z         %117 = tt.splat %arg4 : i32 -> tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6447299Z         %118 = arith.addi %117, %7 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6447475Z         %119 = arith.addi %arg4, %c4_i32 : i32
2026-02-21T09:56:36.6447599Z         %120 = arith.muli %119, %c2_i32 : i32
2026-02-21T09:56:36.6447767Z         %121 = tt.splat %120 : i32 -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:36.6447989Z         %122 = arith.addi %121, %8 : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>>
2026-02-21T09:56:36.6448268Z         %123 = tt.expand_dims %122 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked2}>> -> tensor<1x4xi32, #blocked2>
2026-02-21T09:56:36.6448543Z         %124 = tt.broadcast %123 : tensor<1x4xi32, #blocked2> -> tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6448739Z         %125 = arith.addi %43, %124 : tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6448961Z         %126 = tt.addptr %9, %125 : tensor<128x4x!tt.ptr<bf16>, #blocked2>, tensor<128x4xi32, #blocked2>
2026-02-21T09:56:36.6449172Z         %127 = tt.load %126 : tensor<128x4x!tt.ptr<bf16>, #blocked2>
2026-02-21T09:56:36.6449475Z         %128 = ttg.local_load %arg7 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6449911Z         %129 = arith.extf %128 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6450295Z         %130 = tt.expand_dims %118 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6450549Z         %131 = arith.muli %130, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6450744Z         %132 = tt.broadcast %131 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6450939Z         %133 = arith.addi %132, %45 : tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6451139Z         %134 = tt.addptr %10, %133 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6451344Z         %135 = tt.load %134 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:36.6451591Z         %136 = ttg.convert_layout %135 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6451872Z         %137 = arith.shli %136, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6452125Z         %138 = arith.shrsi %137, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6452361Z         %139 = arith.shrsi %136, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6452654Z         %140 = tt.expand_dims %138 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6452992Z         %141 = tt.expand_dims %139 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6453297Z         %142 = tt.broadcast %140 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6453542Z         %143 = arith.select %15, %142, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6453781Z         %144 = tt.broadcast %141 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6454036Z         %145 = arith.select %17, %144, %143 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6454271Z         %146 = tt.reshape %145 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:56:36.6454495Z         %147 = arith.sitofp %146 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:56:36.6454752Z         %148 = ttg.local_alloc %147 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:56:36.6455082Z         %149 = ttg.local_load %148 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6455561Z         %150 = tt.dot %129, %149, %arg5, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:56:36.6455912Z         %151 = arith.addi %arg6, %c1_i32 : i32
2026-02-21T09:56:36.6456041Z         %152 = arith.cmpi slt, %151, %c2_i32 : i32
2026-02-21T09:56:36.6456177Z         %153 = arith.select %152, %151, %c0_i32 : i32
2026-02-21T09:56:36.6456444Z         %154 = ttg.memdesc_index %46[%153] : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6456826Z         ttg.local_store %127, %154 : tensor<128x4xbf16, #blocked2> -> !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6457230Z         scf.yield %150, %153, %arg8, %154 : tensor<128x256xf32, #mma>, i32, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>, !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4>
2026-02-21T09:56:36.6457570Z       } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 3 : i32}
2026-02-21T09:56:36.6457785Z       %61 = arith.addi %7, %cst_3 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6458113Z       %62 = ttg.local_load %60#2 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6458541Z       %63 = arith.extf %62 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6458920Z       %64 = tt.expand_dims %61 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6459162Z       %65 = arith.muli %64, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6459352Z       %66 = tt.broadcast %65 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6459543Z       %67 = arith.addi %66, %45 : tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6459735Z       %68 = tt.addptr %10, %67 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6459933Z       %69 = tt.load %68 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:36.6460184Z       %70 = ttg.convert_layout %69 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6460461Z       %71 = arith.shli %70, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6460691Z       %72 = arith.shrsi %71, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6460919Z       %73 = arith.shrsi %70, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6461225Z       %74 = tt.expand_dims %72 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6461552Z       %75 = tt.expand_dims %73 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6461832Z       %76 = tt.broadcast %74 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6462085Z       %77 = arith.select %15, %76, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6462315Z       %78 = tt.broadcast %75 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6462544Z       %79 = arith.select %17, %78, %77 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6462767Z       %80 = tt.reshape %79 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:56:36.6462987Z       %81 = arith.sitofp %80 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:56:36.6463239Z       %82 = ttg.local_alloc %81 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:56:36.6463557Z       %83 = ttg.local_load %82 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6464021Z       %84 = tt.dot %63, %83, %60#0, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:56:36.6464408Z       %85 = arith.addi %7, %cst_4 : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>>
2026-02-21T09:56:36.6464750Z       %86 = ttg.local_load %60#3 : !ttg.memdesc<128x4xbf16, #shared, #smem, mutable, 2x128x4> -> tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6465177Z       %87 = arith.extf %86 : tensor<128x4xbf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> to tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6465552Z       %88 = tt.expand_dims %85 {axis = 1 : i32} : tensor<2xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6465796Z       %89 = arith.muli %88, %cst_6 : tensor<2x1xi32, #blocked1>
2026-02-21T09:56:36.6465987Z       %90 = tt.broadcast %89 : tensor<2x1xi32, #blocked1> -> tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6466178Z       %91 = arith.addi %90, %45 : tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6466373Z       %92 = tt.addptr %10, %91 : tensor<2x256x!tt.ptr<i8>, #blocked1>, tensor<2x256xi32, #blocked1>
2026-02-21T09:56:36.6466566Z       %93 = tt.load %92 : tensor<2x256x!tt.ptr<i8>, #blocked1>
2026-02-21T09:56:36.6466804Z       %94 = ttg.convert_layout %93 : tensor<2x256xi8, #blocked1> -> tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6467078Z       %95 = arith.shli %94, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6467306Z       %96 = arith.shrsi %95, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6467534Z       %97 = arith.shrsi %94, %cst_9 : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>>
2026-02-21T09:56:36.6467811Z       %98 = tt.expand_dims %96 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6468156Z       %99 = tt.expand_dims %97 {axis = 1 : i32} : tensor<2x256xi8, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<2x1x256xi8, #blocked>
2026-02-21T09:56:36.6468439Z       %100 = tt.broadcast %98 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6468679Z       %101 = arith.select %15, %100, %cst_8 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6468916Z       %102 = tt.broadcast %99 : tensor<2x1x256xi8, #blocked> -> tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6469162Z       %103 = arith.select %17, %102, %101 : tensor<2x2x256xi1, #blocked>, tensor<2x2x256xi8, #blocked>
2026-02-21T09:56:36.6469395Z       %104 = tt.reshape %103 : tensor<2x2x256xi8, #blocked> -> tensor<4x256xi8, #blocked3>
2026-02-21T09:56:36.6469620Z       %105 = arith.sitofp %104 : tensor<4x256xi8, #blocked3> to tensor<4x256xf32, #blocked3>
2026-02-21T09:56:36.6469872Z       %106 = ttg.local_alloc %105 : (tensor<4x256xf32, #blocked3>) -> !ttg.memdesc<4x256xf32, #shared1, #smem>
2026-02-21T09:56:36.6470214Z       %107 = ttg.local_load %106 : !ttg.memdesc<4x256xf32, #shared1, #smem> -> tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>
2026-02-21T09:56:36.6470681Z       %108 = tt.dot %87, %107, %84, inputPrecision = tf32 : tensor<128x4xf32, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 2}>> * tensor<4x256xf32, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<128x256xf32, #mma>
2026-02-21T09:56:36.6471064Z       ttg.local_dealloc %46 : !ttg.memdesc<2x128x4xbf16, #shared, #smem, mutable>
2026-02-21T09:56:36.6471282Z       %109 = arith.truncf %108 : tensor<128x256xf32, #mma> to tensor<128x256xbf16, #mma>
2026-02-21T09:56:36.6471554Z       %110 = tt.expand_dims %35 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #mma}>> -> tensor<128x1xi32, #mma>
2026-02-21T09:56:36.6471794Z       %111 = arith.muli %110, %cst : tensor<128x1xi32, #mma>
2026-02-21T09:56:36.6472028Z       %112 = tt.expand_dims %40 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #mma}>> -> tensor<1x256xi32, #mma>
2026-02-21T09:56:36.6472288Z       %113 = tt.broadcast %111 : tensor<128x1xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6472496Z       %114 = tt.broadcast %112 : tensor<1x256xi32, #mma> -> tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6472690Z       %115 = arith.addi %113, %114 : tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6472885Z       %116 = tt.addptr %18, %115 : tensor<128x256x!tt.ptr<bf16>, #mma>, tensor<128x256xi32, #mma>
2026-02-21T09:56:36.6473086Z       tt.store %116, %109 : tensor<128x256x!tt.ptr<bf16>, #mma>
2026-02-21T09:56:36.6473227Z     } {tt.num_stages = 1 : i32}
2026-02-21T09:56:36.6473330Z     tt.return
2026-02-21T09:56:36.6473410Z   }
2026-02-21T09:56:36.6473487Z }
2026-02-21T09:56:36.6473531Z 
2026-02-21T09:56:36.6473562Z {-#
2026-02-21T09:56:36.6473645Z   external_resources: {
2026-02-21T09:56:36.6473743Z     mlir_reproducer: {
2026-02-21T09:56:36.6474744Z       pipeline: "builtin.module(optimize-amd-lds-usage{lds-limit=0 target-arch=gfx942}, convert-scf-to-cf, convert-index-to-llvm{index-bitwidth=0}, allocate-amdgpu-shared-memory, convert-triton-amdgpu-to-llvm{arch=gfx942 ftz=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, convert-cf-to-llvm{index-bitwidth=0}, convert-arith-to-llvm{index-bitwidth=0}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce, enable-line-info, convert-builtin-func-to-llvm{ftz=true})",
2026-02-21T09:56:36.6475731Z       disable_threading: false,
2026-02-21T09:56:36.6475839Z       verify_each: true
2026-02-21T09:56:36.6475928Z     }
2026-02-21T09:56:36.6476000Z   }
2026-02-21T09:56:36.6476068Z #-}
2026-02-21T09:56:36.6476349Z /tmp/torchinductor_root/3r/c3rtn5cq6dgpubecss5ujyjziuowuzxapwooes2cp2khsntpawqe.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T09:56:36.6477057Z /tmp/torchinductor_root/3r/c3rtn5cq6dgpubecss5ujyjziuowuzxapwooes2cp2khsntpawqe.py:14:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T09:56:36.6477609Z [727s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T09:56:36.6478385Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 128, 256], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:56:36.6479106Z Error: RuntimeError: PassManager::run failed
2026-02-21T09:56:36.6479274Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T09:56:36.9305009Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 22/22 10.1 configs/s
2026-02-21T09:56:37.5833563Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━ 223/223 2027.1 configs/s
2026-02-21T09:56:40.4600339Z [731s] Generation 14 complete: 
2026-02-21T09:56:40.4600736Z error=3
2026-02-21T09:56:40.4600953Z ok=21
2026-02-21T09:56:40.4601141Z min=0.9121
2026-02-21T09:56:40.4601330Z mid=1.2230
2026-02-21T09:56:40.4601509Z max=38.3666
2026-02-21T09:56:40.4601719Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:56:40.4602057Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:56:40.4602383Z  'l2_groupings': [2],
2026-02-21T09:56:40.4602719Z  'load_eviction_policies': ['', ''],
2026-02-21T09:56:40.4603030Z  'loop_orders': [[0, 1]],
2026-02-21T09:56:40.4603307Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:56:40.4603552Z  'num_stages': 1,
2026-02-21T09:56:40.4603769Z  'num_warps': 4,
2026-02-21T09:56:40.4603975Z  'pid_type': 'flat',
2026-02-21T09:56:40.4604214Z  'range_flattens': [None, None],
2026-02-21T09:56:40.4604485Z  'range_multi_buffers': [None, False],
2026-02-21T09:56:40.4604774Z  'range_num_stages': [0, 1],
2026-02-21T09:56:40.4605404Z  'range_unroll_factors': [0, 0],
2026-02-21T09:56:40.4605674Z  'range_warp_specializes': [],
2026-02-21T09:56:40.4605924Z  'waves_per_eu': 2}
2026-02-21T09:56:40.4644951Z [731s] Fitting surrogate: 1110 points, 1110 targets
2026-02-21T09:56:41.7829554Z [732s] Generation 15 starting: 23 neighbors, 1 active search path(s)
2026-02-21T09:56:46.2743455Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23/23 4.8 configs/s
2026-02-21T09:56:49.2089557Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 23/23 8.7 configs/s
2026-02-21T09:56:49.5817978Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━ 223/223 38.9 configs/s
2026-02-21T09:56:52.1177904Z [742s] Generation 15 complete: 
2026-02-21T09:56:52.1178384Z error=5
2026-02-21T09:56:52.1178584Z ok=20
2026-02-21T09:56:52.1178789Z min=0.9190
2026-02-21T09:56:52.1178993Z mid=1.2972
2026-02-21T09:56:52.1179196Z max=35.8059
2026-02-21T09:56:52.1179433Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:56:52.1179823Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:56:52.1180190Z  'l2_groupings': [2],
2026-02-21T09:56:52.1180467Z  'load_eviction_policies': ['', ''],
2026-02-21T09:56:52.1180801Z  'loop_orders': [[0, 1]],
2026-02-21T09:56:52.1181086Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:56:52.1181285Z  'num_stages': 1,
2026-02-21T09:56:52.1181430Z  'num_warps': 4,
2026-02-21T09:56:52.1181580Z  'pid_type': 'flat',
2026-02-21T09:56:52.1181744Z  'range_flattens': [None, None],
2026-02-21T09:56:52.1181938Z  'range_multi_buffers': [None, False],
2026-02-21T09:56:52.1182159Z  'range_num_stages': [0, 1],
2026-02-21T09:56:52.1182337Z  'range_unroll_factors': [0, 0],
2026-02-21T09:56:52.1182944Z  'range_warp_specializes': [],
2026-02-21T09:56:52.1183126Z  'waves_per_eu': 2}
2026-02-21T09:56:52.1243529Z [742s] Fitting surrogate: 1135 points, 1135 targets
2026-02-21T09:56:52.5091517Z [743s] Generation 16 starting: 24 neighbors, 1 active search path(s)
2026-02-21T09:56:57.0986589Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 8.2 configs/s
2026-02-21T09:57:00.8350578Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 24/24 6.3 configs/s
2026-02-21T09:57:01.2005378Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━ 223/223 44.1 configs/s
2026-02-21T09:57:03.7989713Z [754s] Generation 16 complete: 
2026-02-21T09:57:03.7990093Z ok=26
2026-02-21T09:57:03.7990279Z min=0.9103
2026-02-21T09:57:03.7990479Z mid=1.5053
2026-02-21T09:57:03.7990662Z max=56.1355
2026-02-21T09:57:03.7990877Z best={'block_sizes': [16, 128, 128],
2026-02-21T09:57:03.7991206Z  'indexing': ['block_ptr', 'pointer', 'pointer'],
2026-02-21T09:57:03.7991580Z  'l2_groupings': [2],
2026-02-21T09:57:03.7991828Z  'load_eviction_policies': ['', ''],
2026-02-21T09:57:03.7992613Z  'loop_orders': [[0, 1]],
2026-02-21T09:57:03.7992891Z  'matrix_instr_nonkdim': 16,
2026-02-21T09:57:03.7993138Z  'num_stages': 1,
2026-02-21T09:57:03.7993343Z  'num_warps': 4,
2026-02-21T09:57:03.7993551Z  'pid_type': 'flat',
2026-02-21T09:57:03.7993793Z  'range_flattens': [None, None],
2026-02-21T09:57:03.7994077Z  'range_multi_buffers': [None, False],
2026-02-21T09:57:03.7994364Z  'range_num_stages': [0, 1],
2026-02-21T09:57:03.7994633Z  'range_unroll_factors': [0, 0],
2026-02-21T09:57:03.7994895Z  'range_warp_specializes': [],
2026-02-21T09:57:03.7995145Z  'waves_per_eu': 2}
2026-02-21T09:57:03.8065936Z [754s] Fitting surrogate: 1161 points, 1161 targets
2026-02-21T09:57:03.9572012Z [754s] Autotuning complete in 754.8s after searching 1110 configs.
2026-02-21T09:57:03.9572246Z One can hardcode the best config and skip autotuning with:
2026-02-21T09:57:03.9572958Z     @helion.kernel(config=helion.Config(block_sizes=[16, 128, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T09:57:03.9573906Z 
2026-02-21T09:57:03.9574071Z [754s] Code of selected kernel: /tmp/torchinductor_root/ib/cibwt6vl2xsbx2eg55v3tglyzonuw3nw5eikfaoheyb5o3uggj2c.py
2026-02-21T09:57:03.9775980Z from __future__ import annotations
2026-02-21T09:57:03.9776088Z 
2026-02-21T09:57:03.9776128Z import torch
2026-02-21T09:57:03.9776215Z import triton
2026-02-21T09:57:03.9776312Z import triton.language as tl
2026-02-21T09:57:03.9776461Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T09:57:03.9776586Z 
2026-02-21T09:57:03.9776634Z _BLOCK_SIZE_1 = tl.constexpr(128)
2026-02-21T09:57:03.9776755Z _BLOCK_SIZE_2 = tl.constexpr(128)
2026-02-21T09:57:03.9776877Z _BLOCK_SIZE_0 = tl.constexpr(16)
2026-02-21T09:57:03.9776950Z 
2026-02-21T09:57:03.9776990Z @triton.jit
2026-02-21T09:57:03.9777141Z def _helion_matmul_bf16_int4(A, B, C, mul_1: tl.constexpr, _SHAPE_DIM_2: tl.constexpr):
2026-02-21T09:57:03.9777369Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:57:03.9777533Z     num_pid_m = tl.cdiv(16384, _BLOCK_SIZE_1)
2026-02-21T09:57:03.9777667Z     num_pid_n = tl.cdiv(8192, _BLOCK_SIZE_2)
2026-02-21T09:57:03.9777800Z     inner_2d_pid = tl.program_id(0)
2026-02-21T09:57:03.9777925Z     num_pid_in_group = 2 * num_pid_n
2026-02-21T09:57:03.9778053Z     group_id = inner_2d_pid // num_pid_in_group
2026-02-21T09:57:03.9778190Z     first_pid_m = group_id * 2
2026-02-21T09:57:03.9778326Z     group_size_m = min(num_pid_m - first_pid_m, 2)
2026-02-21T09:57:03.9778515Z     pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m
2026-02-21T09:57:03.9778830Z     pid_1 = inner_2d_pid % num_pid_in_group // group_size_m
2026-02-21T09:57:03.9778979Z     offset_1 = pid_0 * _BLOCK_SIZE_1
2026-02-21T09:57:03.9779127Z     indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32)
2026-02-21T09:57:03.9779286Z     offset_2 = pid_1 * _BLOCK_SIZE_2
2026-02-21T09:57:03.9779430Z     indices_2 = (offset_2 + tl.arange(0, _BLOCK_SIZE_2)).to(tl.int32)
2026-02-21T09:57:03.9779633Z     # src[int4_gemm.py:58]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:57:03.9779832Z     acc = tl.full([_BLOCK_SIZE_1, _BLOCK_SIZE_2], 0.0, tl.float32)
2026-02-21T09:57:03.9780140Z     # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:57:03.9780409Z     # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:57:03.9780667Z     # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:57:03.9780851Z     # src[int4_gemm.py:60-89]: ...
2026-02-21T09:57:03.9781095Z     for offset_3 in tl.range(0, 512, _BLOCK_SIZE_0, num_stages=1, disallow_acc_multi_buffer=True):
2026-02-21T09:57:03.9781323Z         indices_3 = offset_3 + tl.arange(0, _BLOCK_SIZE_0).to(tl.int32)
2026-02-21T09:57:03.9781475Z         acc_copy = acc
2026-02-21T09:57:03.9781577Z         acc_copy_0 = acc_copy
2026-02-21T09:57:03.9781725Z         # src[int4_gemm.py:63]: a_tile_begin = tile_k_packed.begin * 2
2026-02-21T09:57:03.9781872Z         mul = 2 * offset_3
2026-02-21T09:57:03.9782044Z         # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:57:03.9782228Z         iota = mul + tl.arange(0, mul_1)
2026-02-21T09:57:03.9782393Z         load = tl.load(A + (indices_1[:, None] * 1024 + iota[None, :] * 1), None)
2026-02-21T09:57:03.9782609Z         # src[int4_gemm.py:65]: a_tile = A[tile_m, a_tile_begin : (a_tile_begin + a_tile_len)].to(
2026-02-21T09:57:03.9782794Z         # src[int4_gemm.py:66]:     torch.float32
2026-02-21T09:57:03.9782952Z         # src[int4_gemm.py:67]: )  # [BLOCK_SIZE_M, BLOCK_SIZE_K]
2026-02-21T09:57:03.9783093Z         v_0 = tl.cast(load, tl.float32)
2026-02-21T09:57:03.9783273Z         # src[int4_gemm.py:70]: b_tile = B[tile_k_packed, tile_n]  # [BLOCK_SIZE_K//2, BLOCK_SIZE_N]
2026-02-21T09:57:03.9783566Z         b_tile = tl.load(B + (indices_3[:, None] * 8192 + indices_2[None, :] * 1), None)
2026-02-21T09:57:03.9783794Z         # src[int4_gemm.py:74]: b_lo = ((b_tile << 4) >> 4).to(torch.int8)  # Sign-extend low 4 bits
2026-02-21T09:57:03.9783975Z         v_1 = tl.full([], 4, tl.int8)
2026-02-21T09:57:03.9784087Z         v_2 = b_tile << v_1
2026-02-21T09:57:03.9784197Z         v_3 = tl.full([], 4, tl.int8)
2026-02-21T09:57:03.9784305Z         v_4 = v_2 >> v_3
2026-02-21T09:57:03.9784460Z         # src[int4_gemm.py:75]: b_hi = (b_tile >> 4).to(torch.int8)  # Sign-extend high 4 bits
2026-02-21T09:57:03.9784628Z         v_5 = tl.full([], 4, tl.int8)
2026-02-21T09:57:03.9784740Z         v_6 = b_tile >> v_5
2026-02-21T09:57:03.9784878Z         # src[int4_gemm.py:79]: b_stacked = torch.stack([b_lo, b_hi], dim=1)
2026-02-21T09:57:03.9785037Z         stack_idx = tl.arange(0, 2)
2026-02-21T09:57:03.9785164Z         broadcast_idx = stack_idx[None, :, None]
2026-02-21T09:57:03.9785298Z         expanded_0 = tl.expand_dims(v_4, 1)
2026-02-21T09:57:03.9785426Z         expanded_1 = tl.expand_dims(v_6, 1)
2026-02-21T09:57:03.9785557Z         stacked_result = tl.zeros_like(expanded_0)
2026-02-21T09:57:03.9785692Z         mask_0 = broadcast_idx == 0
2026-02-21T09:57:03.9785838Z         stacked_result = tl.where(mask_0, expanded_0, stacked_result)
2026-02-21T09:57:03.9786000Z         mask_1 = broadcast_idx == 1
2026-02-21T09:57:03.9786146Z         stacked_result = tl.where(mask_1, expanded_1, stacked_result)
2026-02-21T09:57:03.9786322Z         # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:57:03.9786509Z         # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:57:03.9786707Z         # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:57:03.9786873Z         view = tl.reshape(stacked_result, [_SHAPE_DIM_2, _BLOCK_SIZE_2])
2026-02-21T09:57:03.9787026Z         v_7 = tl.cast(view, tl.float32)
2026-02-21T09:57:03.9787202Z         # src[int4_gemm.py:87]: a_tile = a_tile.unsqueeze(2)  # [BLOCK_SIZE_M, BLOCK_SIZE_K, 1]
2026-02-21T09:57:03.9787383Z         a_tile_1 = v_0[:, :, None]
2026-02-21T09:57:03.9787524Z         # src[int4_gemm.py:88]: b_unpacked = b_unpacked.unsqueeze(0)
2026-02-21T09:57:03.9787710Z         b_unpacked_1 = v_7[None, :, :]
2026-02-21T09:57:03.9787895Z         # src[int4_gemm.py:89]: acc = acc + (a_tile * b_unpacked).sum(dim=1)  # [BLOCK_SIZE_M, BLOCK_SIZE_N]
2026-02-21T09:57:03.9788093Z         v_8 = a_tile_1 * b_unpacked_1
2026-02-21T09:57:03.9788216Z         sum_1 = tl.cast(tl.sum(v_8, 1), tl.float32)
2026-02-21T09:57:03.9788349Z         acc = acc_copy_0 + sum_1
2026-02-21T09:57:03.9788495Z     # src[int4_gemm.py:91]: C[tile_m, tile_n] = acc.to(torch.bfloat16)
2026-02-21T09:57:03.9788672Z     v_10 = tl.cast(acc, tl.bfloat16)
2026-02-21T09:57:03.9788838Z     tl.store(C + (indices_1[:, None] * 8192 + indices_2[None, :] * 1), v_10, None)
2026-02-21T09:57:03.9788964Z 
2026-02-21T09:57:03.9789052Z def matmul_bf16_int4(A: Tensor, B: Tensor, *, _launcher=_default_launcher):
2026-02-21T09:57:03.9789213Z     """
2026-02-21T09:57:03.9789327Z     BFloat16 x INT4 General Matrix Multiplication (GEMM).
2026-02-21T09:57:03.9789433Z 
2026-02-21T09:57:03.9789497Z     This kernel performs matrix multiplication where:
2026-02-21T09:57:03.9789653Z     - A is a bfloat16 matrix of shape [M, K]
2026-02-21T09:57:03.9789818Z     - B is an int8 matrix of shape [K//2, N] containing packed int4 values
2026-02-21T09:57:03.9789992Z       (two 4-bit values packed into each int8)
2026-02-21T09:57:03.9790079Z 
2026-02-21T09:57:03.9790114Z     Args:
2026-02-21T09:57:03.9790234Z         A (Tensor): Input tensor of shape [M, K] in bfloat16 format.
2026-02-21T09:57:03.9790414Z         B (Tensor): Packed int4 tensor of shape [K//2, N] in int8 format.
2026-02-21T09:57:03.9790534Z 
2026-02-21T09:57:03.9790569Z     Returns:
2026-02-21T09:57:03.9790693Z         Tensor: Output tensor of shape [M, N] in bfloat16 format.
2026-02-21T09:57:03.9790830Z     """
2026-02-21T09:57:03.9790923Z     # src[int4_gemm.py:50]: M, K = A.shape
2026-02-21T09:57:03.9791037Z     M, K = A.shape
2026-02-21T09:57:03.9791169Z     # src[int4_gemm.py:51]: _, N = B.shape
2026-02-21T09:57:03.9791279Z     _, N = B.shape
2026-02-21T09:57:03.9791429Z     # src[int4_gemm.py:53]: C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:57:03.9791630Z     C = torch.zeros(M, N, dtype=torch.bfloat16, device=A.device)
2026-02-21T09:57:03.9791811Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:57:03.9791960Z     _BLOCK_SIZE_1 = 128
2026-02-21T09:57:03.9792057Z     _BLOCK_SIZE_2 = 128
2026-02-21T09:57:03.9792226Z     # src[int4_gemm.py:60]: for tile_k_packed in hl.tile(K // 2, block_size=block_size_k_packed):
2026-02-21T09:57:03.9792491Z     # src[int4_gemm.py:61]:     # Load corresponding tiles from A (need to load twice the packed tile size)
2026-02-21T09:57:03.9792748Z     # src[int4_gemm.py:62]:     # We need to map tile_k_packed to the corresponding range in A
2026-02-21T09:57:03.9792930Z     # src[int4_gemm.py:60-89]: ...
2026-02-21T09:57:03.9793040Z     _BLOCK_SIZE_0 = 16
2026-02-21T09:57:03.9793167Z     # src[int4_gemm.py:83]: b_unpacked = b_stacked.reshape(
2026-02-21T09:57:03.9793346Z     # src[int4_gemm.py:84]:     tile_k_packed.block_size * 2, tile_n.block_size
2026-02-21T09:57:03.9793517Z     # src[int4_gemm.py:85]: ).to(torch.float32)
2026-02-21T09:57:03.9793644Z     _SHAPE_DIM_2 = 2 * _BLOCK_SIZE_0
2026-02-21T09:57:03.9793789Z     # src[int4_gemm.py:57]: for tile_m, tile_n in hl.tile([M, N]):
2026-02-21T09:57:03.9793978Z     # src[int4_gemm.py:58]:     acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
2026-02-21T09:57:03.9794149Z     # src[int4_gemm.py:57-91]: ...
2026-02-21T09:57:03.9794316Z     _RDIM_SIZE_3 = triton.next_power_of_2(2 * _BLOCK_SIZE_0)
2026-02-21T09:57:03.9794686Z     _launcher(_helion_matmul_bf16_int4, (triton.cdiv(16384, _BLOCK_SIZE_1) * triton.cdiv(8192, _BLOCK_SIZE_2),), A, B, C, 2 * _BLOCK_SIZE_0, _SHAPE_DIM_2, num_warps=4, num_stages=1, waves_per_eu=2, matrix_instr_nonkdim=16)
2026-02-21T09:57:03.9795037Z     # src[int4_gemm.py:93]: return C
2026-02-21T09:57:03.9795147Z     return C
2026-02-21T09:57:04.8969725Z WARNING:tritonbench.utils.triton_op:Completed input ID 21:
2026-02-21T09:57:04.8970415Z x_val
2026-02-21T09:57:04.8970645Z ---------------------
2026-02-21T09:57:04.8970904Z (4, 4096, 8192, 1024)
2026-02-21T09:57:04.8971069Z 
2026-02-21T09:57:04.8984761Z  70%|███████   | 7/10 [54:26<24:18, 486.07s/it]WARNING:tritonbench.utils.triton_op:Running input ID 24:
2026-02-21T09:57:04.8985208Z x_val
2026-02-21T09:57:04.8985398Z ----------------------
2026-02-21T09:57:04.8985623Z (16, 4096, 1280, 8192)
2026-02-21T09:57:04.8987994Z INFO:tritonbench.utils.triton_op:Took 0.18ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T09:57:05.8923974Z INFO:tritonbench.utils.triton_op:Took 2.84ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T09:57:07.2664270Z INFO:tritonbench.utils.triton_op:Took 0.18ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T09:57:07.2674312Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:57:07.2674745Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:57:07.2675055Z               'dtype': 'torch.bfloat16',
2026-02-21T09:57:07.2675336Z               'shape': (16, 4096, 8192),
2026-02-21T09:57:07.2675620Z               'stride': (33554432, 8192, 1)},
2026-02-21T09:57:07.2675896Z             { 'device': 'cuda:0',
2026-02-21T09:57:07.2676160Z               'dtype': 'torch.int32',
2026-02-21T09:57:07.2676429Z               'shape': (8192, 1280),
2026-02-21T09:57:07.2676681Z               'stride': (1280, 1)}),
2026-02-21T09:57:07.2676935Z   'kwargs': {}}
2026-02-21T09:57:07.2717462Z INFO:tritonbench.utils.triton_op:Took 4.49ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T09:57:07.4578995Z [0s] Autotune random seed: 2138032649
2026-02-21T09:57:07.5313238Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:57:46.5740749Z [39s] Timeout after 30s compiling Config(block_sizes=[128, 4096, 1], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[1, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:57:46.7600062Z [39s] Timeout after 30s compiling Config(block_sizes=[512, 4, 128], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:57:47.6078925Z [40s] Timeout after 30s compiling Config(block_sizes=[2048, 4, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:57:48.3950732Z [40s] Timeout after 30s compiling Config(block_sizes=[512, 4, 256], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[4, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:57:49.2557159Z [41s] Timeout after 30s compiling Config(block_sizes=[16, 8192, 1], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[0, 1], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T09:57:49.9929676Z [42s] Timeout after 30s compiling Config(block_sizes=[128, 512, 1], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[2, 0], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:57:50.5377436Z [43s] Timeout after 30s compiling Config(block_sizes=[256, 2, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:57:51.4436006Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 32768, 4], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:57:51.4464738Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T09:58:29.7975757Z Initial population exploring neighbors   9% ━                9/100 0.1 configs/s
2026-02-21T09:58:29.7979787Z WARNING:tritonbench.utils.triton_op:Completed input ID 24:
2026-02-21T09:58:29.7980734Z x_val
2026-02-21T09:58:29.7980952Z ----------------------
2026-02-21T09:58:29.7981192Z (16, 4096, 1280, 8192)
2026-02-21T09:58:29.7981339Z 
2026-02-21T09:58:29.7998633Z  80%|████████  | 8/10 [55:51<11:56, 358.35s/it]WARNING:tritonbench.utils.triton_op:Running input ID 28:
2026-02-21T09:58:29.7999022Z x_val
2026-02-21T09:58:29.7999197Z ----------------------
2026-02-21T09:58:29.7999384Z (64, 4096, 1280, 8192)
2026-02-21T09:58:29.8040318Z INFO:tritonbench.utils.triton_op:Took 0.23ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T09:58:30.7955218Z INFO:tritonbench.utils.triton_op:Took 2.04ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T09:58:33.2696114Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T09:58:33.2705621Z WARNING:__main__:Input tensor metadata:
2026-02-21T09:58:33.2706007Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T09:58:33.2706336Z               'dtype': 'torch.bfloat16',
2026-02-21T09:58:33.2706651Z               'shape': (64, 4096, 8192),
2026-02-21T09:58:33.2706967Z               'stride': (33554432, 8192, 1)},
2026-02-21T09:58:33.2707295Z             { 'device': 'cuda:0',
2026-02-21T09:58:33.2707583Z               'dtype': 'torch.int32',
2026-02-21T09:58:33.2707877Z               'shape': (8192, 1280),
2026-02-21T09:58:33.2708157Z               'stride': (1280, 1)}),
2026-02-21T09:58:33.2708434Z   'kwargs': {}}
2026-02-21T09:58:33.2723380Z INFO:tritonbench.utils.triton_op:Took 1.96ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T09:58:33.4552902Z [0s] Autotune random seed: 2138032649
2026-02-21T09:58:33.6719796Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T09:59:08.0436758Z [34s] Timeout after 30s compiling Config(block_sizes=[16, 8192, 4], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[2, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:59:11.6909466Z [38s] Timeout after 30s compiling Config(block_sizes=[128, 4096, 1], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[1, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:59:12.1398018Z [38s] Timeout after 30s compiling Config(block_sizes=[512, 4, 128], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:59:12.8805255Z [39s] Timeout after 30s compiling Config(block_sizes=[2048, 4, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T09:59:13.5618988Z [39s] Timeout after 30s compiling Config(block_sizes=[512, 4, 256], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[4, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:59:14.2619290Z [40s] Timeout after 30s compiling Config(block_sizes=[16, 8192, 1], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[0, 1], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T09:59:14.9414949Z [41s] Timeout after 30s compiling Config(block_sizes=[128, 512, 1], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[2, 0], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:59:15.4541188Z [41s] Timeout after 30s compiling Config(block_sizes=[256, 2, 256], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:59:16.2589306Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 32768, 4], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T09:59:16.2613844Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T10:06:23.7350822Z Initial population exploring neighbors   9% ━                9/100 0.0 configs/s
2026-02-21T10:06:23.7355627Z WARNING:tritonbench.utils.triton_op:Completed input ID 28:
2026-02-21T10:06:23.7356049Z x_val
2026-02-21T10:06:23.7356289Z ----------------------
2026-02-21T10:06:23.7356548Z (64, 4096, 1280, 8192)
2026-02-21T10:06:23.7356705Z 
2026-02-21T10:06:23.7375744Z  90%|█████████ | 9/10 [1:03:44<06:34, 394.49s/it]WARNING:tritonbench.utils.triton_op:Running input ID 31:
2026-02-21T10:06:23.7376130Z x_val
2026-02-21T10:06:23.7376277Z ----------------------
2026-02-21T10:06:23.7376792Z (64, 4096, 8192, 3584)
2026-02-21T10:06:23.7425009Z INFO:tritonbench.utils.triton_op:Took 0.35ms to get benchmark function for preprocessed_eager_int4_gemm
2026-02-21T10:06:24.7379083Z INFO:tritonbench.utils.triton_op:Took 2.11ms to get benchmark function for preprocessed_torch_compile_int4_gemm
2026-02-21T10:06:26.9004879Z INFO:tritonbench.utils.triton_op:Took 0.19ms to get benchmark function for preprocessed_triton_int4_gemm
2026-02-21T10:06:26.9016782Z WARNING:__main__:Input tensor metadata:
2026-02-21T10:06:26.9017030Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T10:06:26.9017245Z               'dtype': 'torch.bfloat16',
2026-02-21T10:06:26.9017460Z               'shape': (64, 4096, 3584),
2026-02-21T10:06:26.9017673Z               'stride': (14680064, 3584, 1)},
2026-02-21T10:06:26.9017883Z             { 'device': 'cuda:0',
2026-02-21T10:06:26.9018076Z               'dtype': 'torch.int32',
2026-02-21T10:06:26.9018284Z               'shape': (3584, 8192),
2026-02-21T10:06:26.9018480Z               'stride': (8192, 1)}),
2026-02-21T10:06:26.9018669Z   'kwargs': {}}
2026-02-21T10:06:26.9035969Z INFO:tritonbench.utils.triton_op:Took 2.15ms to get benchmark function for helion_int4_gemm_tritonbench
2026-02-21T10:06:27.0861738Z [0s] Autotune random seed: 2138032649
2026-02-21T10:06:27.6553364Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T10:07:02.2713267Z [34s] Timeout after 30s compiling Config(block_sizes=[16, 8192, 4], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[2, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T10:07:04.6673111Z [37s] Timeout after 30s compiling Config(block_sizes=[256, 1024, 1], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[3, 3], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T10:07:04.8617632Z [37s] Timeout after 31s compiling Config(block_sizes=[512, 8, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[2, 0], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T10:07:07.6965103Z [40s] Timeout after 30s compiling Config(block_sizes=[2048, 4, 32], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T10:07:07.6991892Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s
2026-02-21T10:34:01.5247524Z Memory access fault by GPU node-2 (Agent handle: 0x5638f4c450c0) on address 0x7f273a000000. Reason: Unknown.
2026-02-21T10:44:03.9163612Z /__w/_temp/1e980714-c608-4472-8874-5b583ec57a31.sh: line 10: 185332 Aborted                 (core dumped) HELION_PRINT_OUTPUT_CODE=1 HELION_ASSERT_CACHE_HIT=1 python benchmarks/run.py --op $kernel --metrics speedup,accuracy --latency-measure-mode triton_do_bench --cudagraph --only $IMPLS --only-match-mode prefix-with-baseline --baseline $BASELINE --atol 1e-2 --rtol 1e-2 --input-sample-mode equally-spaced-k --output "$TEST_REPORTS_DIR/helionbench.json" --append-to-output --keep-going
2026-02-21T10:44:03.9166071Z ✅ Completed benchmark for kernel: int4_gemm
2026-02-21T10:44:03.9166346Z ==========================================
2026-02-21T10:44:03.9178514Z Running benchmark for kernel: flash_attention
2026-02-21T10:44:03.9178715Z ==========================================
2026-02-21T10:44:15.8028931Z Using baseline: aten
2026-02-21T10:44:15.8029540Z Available implementations for flash_attention: flex_attention,helion_attention,triton_tutorial_flash_v2_tma_ws_persistent
2026-02-21T10:44:26.1863190Z Applying custom args for flash_attention: {'d_head': 128, 'num_inputs': 6}
2026-02-21T10:44:26.2135896Z INFO:root:TMA benchmarks will be running without grid constant TMA descriptor.
2026-02-21T10:44:26.2222978Z TMA benchmarks will be running without grid constant TMA descriptor.
2026-02-21T10:44:26.2349575Z Running flash_attention benchmark with Helion implementation...
2026-02-21T10:44:26.2349864Z 
2026-02-21T10:44:26.3861417Z Equally-spaced-k mode: Selected 6 equally spaced inputs (total available: 7)
2026-02-21T10:44:26.3861887Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 1, 2, 4, 5, 6]
2026-02-21T10:44:26.3866689Z 
2026-02-21T10:44:26.3873440Z   0%|          | 0/6 [00:00<?, ?it/s]WARNING:tritonbench.utils.triton_op:Running input ID 0:
2026-02-21T10:44:26.3874299Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T10:44:26.3874520Z ------------------------------------------
2026-02-21T10:44:26.3874731Z (4, 48, 128, 128, 128)
2026-02-21T10:44:26.3876935Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for aten
2026-02-21T10:44:30.4358640Z INFO:tritonbench.utils.triton_op:Took 33.09ms to get benchmark function for flex_attention
2026-02-21T10:44:33.8675174Z WARNING:__main__:Input tensor metadata:
2026-02-21T10:44:33.8675662Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T10:44:33.8675991Z               'dtype': 'torch.bfloat16',
2026-02-21T10:44:33.8676353Z               'shape': (4, 48, 128, 128),
2026-02-21T10:44:33.8676683Z               'stride': (786432, 16384, 128, 1)},
2026-02-21T10:44:33.8677029Z             { 'device': 'cuda:0',
2026-02-21T10:44:33.8677325Z               'dtype': 'torch.bfloat16',
2026-02-21T10:44:33.8677624Z               'shape': (4, 48, 128, 128),
2026-02-21T10:44:33.8677942Z               'stride': (786432, 16384, 128, 1)},
2026-02-21T10:44:33.8678249Z             { 'device': 'cuda:0',
2026-02-21T10:44:33.8678534Z               'dtype': 'torch.bfloat16',
2026-02-21T10:44:33.8678846Z               'shape': (4, 48, 128, 128),
2026-02-21T10:44:33.8679126Z               'stride': (786432, 16384, 128, 1)}),
2026-02-21T10:44:33.8679367Z   'kwargs': {}}
2026-02-21T10:44:33.8679745Z INFO:tritonbench.utils.triton_op:Took 0.50ms to get benchmark function for helion_attention
2026-02-21T10:44:34.1517207Z [0s] Autotune random seed: 2144140282
2026-02-21T10:44:34.4038420Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T10:44:52.8253203Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 2.0 configs/s
2026-02-21T10:44:55.0556968Z /tmp/torchinductor_root/xd/cxd2hduf2l55d74tltebjgb4u7ej4fcfh4zszguf5nurbyjr3hpx.py:56:20: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T10:44:55.0558494Z         k = tl.load(tl.make_block_ptr(k_view, [192, 128, 128], [16384, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T10:44:55.0559847Z                    ^
2026-02-21T10:44:55.0561532Z /tmp/torchinductor_root/xd/cxd2hduf2l55d74tltebjgb4u7ej4fcfh4zszguf5nurbyjr3hpx.py:58:141: note: - use: %146 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x16xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x16xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>>
2026-02-21T10:44:55.0563230Z 
2026-02-21T10:44:55.0564316Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T10:44:55.0565571Z                                                                                                                                             ^
2026-02-21T10:44:55.0566083Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T10:44:55.0566826Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:55.0567710Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:55.0568563Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:55.0569427Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:55.0569951Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [16, 4, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:55.0570439Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:55.0570798Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T10:44:55.0571147Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T10:44:55.0571484Z #blocked8 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}>
2026-02-21T10:44:55.0571818Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}>
2026-02-21T10:44:55.0572191Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [16, 4, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:44:55.0572555Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T10:44:55.0572920Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:44:55.0573294Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:44:55.0573676Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:44:55.0574048Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:44:55.0574485Z #blocked16 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T10:44:55.0574876Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T10:44:55.0575478Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T10:44:55.0575973Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T10:44:55.0576122Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T10:44:55.0576270Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T10:44:55.0576415Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T10:44:55.0576610Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x128x16xbf16, #blocked>
2026-02-21T10:44:55.0576855Z     %cst_0 = arith.constant dense<0> : tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.0577090Z     %cst_1 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:55.0577306Z     %cst_2 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:55.0577520Z     %cst_3 = arith.constant dense<128> : tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.0577757Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<1x4x128xbf16, #blocked3>
2026-02-21T10:44:55.0577992Z     %cst_5 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:55.0578202Z     %cst_6 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:55.0578411Z     %cst_7 = arith.constant dense<0> : tensor<1x4x1xi64, #blocked4>
2026-02-21T10:44:55.0578613Z     %cst_8 = arith.constant dense<128> : tensor<1x4x1xi64, #blocked4>
2026-02-21T10:44:55.0578793Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T10:44:55.0578930Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T10:44:55.0579105Z     %cst_9 = arith.constant dense<128> : tensor<1x4x1xi32, #blocked4>
2026-02-21T10:44:55.0579317Z     %cst_10 = arith.constant dense<128> : tensor<1x16x1xi32, #blocked5>
2026-02-21T10:44:55.0579510Z     %cst_11 = arith.constant dense<0.127517432> : tensor<1x4x16xf32, #blocked>
2026-02-21T10:44:55.0579711Z     %cst_12 = arith.constant dense<0.127517432> : tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0579932Z     %cst_13 = arith.constant dense<0.000000e+00> : tensor<4x16xf32, #blocked7>
2026-02-21T10:44:55.0580103Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T10:44:55.0580265Z     %cst_14 = arith.constant dense<0.000000e+00> : tensor<1x4x128xf32, #blocked3>
2026-02-21T10:44:55.0580474Z     %cst_15 = arith.constant dense<1.000000e+00> : tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0580675Z     %cst_16 = arith.constant dense<0xFF800000> : tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0580836Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T10:44:55.0580958Z     %c192_i32 = arith.constant 192 : i32
2026-02-21T10:44:55.0581080Z     %0 = tt.get_program_id x : i32
2026-02-21T10:44:55.0581201Z     %1 = arith.divsi %0, %c128_i32 : i32
2026-02-21T10:44:55.0581317Z     %2 = arith.muli %1, %c4_i32 : i32
2026-02-21T10:44:55.0581434Z     %3 = arith.subi %c192_i32, %2 : i32
2026-02-21T10:44:55.0581554Z     %4 = arith.minsi %3, %c4_i32 : i32
2026-02-21T10:44:55.0581672Z     %5 = arith.remsi %0, %c128_i32 : i32
2026-02-21T10:44:55.0581790Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T10:44:55.0581902Z     %7 = arith.addi %2, %6 : i32
2026-02-21T10:44:55.0582015Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T10:44:55.0582125Z     %9 = arith.muli %8, %c4_i32 : i32
2026-02-21T10:44:55.0582283Z     %10 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked8>
2026-02-21T10:44:55.0582462Z     %11 = tt.splat %9 : i32 -> tensor<4xi32, #blocked8>
2026-02-21T10:44:55.0582616Z     %12 = arith.addi %11, %10 : tensor<4xi32, #blocked8>
2026-02-21T10:44:55.0582794Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked8>
2026-02-21T10:44:55.0582981Z     %14 = arith.extsi %7 : i32 to i64
2026-02-21T10:44:55.0583097Z     %15 = arith.extsi %9 : i32 to i64
2026-02-21T10:44:55.0583255Z     %16 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x4x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:44:55.0583432Z     %17 = arith.muli %14, %c16384_i64 : i64
2026-02-21T10:44:55.0583575Z     %18 = tt.splat %17 : i64 -> tensor<1x4x128xi64, #blocked3>
2026-02-21T10:44:55.0583734Z     %19 = tt.splat %15 : i64 -> tensor<4xi64, #blocked8>
2026-02-21T10:44:55.0583930Z     %20 = arith.extsi %10 : tensor<4xi32, #blocked8> to tensor<4xi64, #blocked8>
2026-02-21T10:44:55.0584106Z     %21 = arith.addi %19, %20 : tensor<4xi64, #blocked8>
2026-02-21T10:44:55.0584342Z     %22 = ttg.convert_layout %21 : tensor<4xi64, #blocked8> -> tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T10:44:55.0584665Z     %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x4xi64, #blocked9>
2026-02-21T10:44:55.0584971Z     %24 = ttg.convert_layout %23 : tensor<1x4xi64, #blocked9> -> tensor<1x4xi64, #blocked6>
2026-02-21T10:44:55.0585255Z     %25 = ttg.convert_layout %24 : tensor<1x4xi64, #blocked6> -> tensor<1x4xi64, #ttg.slice<{dim = 2, parent = #blocked10}>>
2026-02-21T10:44:55.0585588Z     %26 = tt.expand_dims %25 {axis = 2 : i32} : tensor<1x4xi64, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x4x1xi64, #blocked10>
2026-02-21T10:44:55.0585892Z     %27 = ttg.convert_layout %26 : tensor<1x4x1xi64, #blocked10> -> tensor<1x4x1xi64, #blocked4>
2026-02-21T10:44:55.0586101Z     %28 = arith.muli %27, %cst_8 : tensor<1x4x1xi64, #blocked4>
2026-02-21T10:44:55.0586304Z     %29 = tt.broadcast %28 : tensor<1x4x1xi64, #blocked4> -> tensor<1x4x128xi64, #blocked4>
2026-02-21T10:44:55.0586553Z     %30 = ttg.convert_layout %29 : tensor<1x4x128xi64, #blocked4> -> tensor<1x4x128xi64, #blocked3>
2026-02-21T10:44:55.0586788Z     %31 = arith.extsi %13 : tensor<128xi32, #blocked8> to tensor<128xi64, #blocked8>
2026-02-21T10:44:55.0587061Z     %32 = ttg.convert_layout %31 : tensor<128xi64, #blocked8> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T10:44:55.0587385Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi64, #blocked9>
2026-02-21T10:44:55.0587699Z     %34 = ttg.convert_layout %33 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #blocked11>
2026-02-21T10:44:55.0587992Z     %35 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked11> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked12}>>
2026-02-21T10:44:55.0588337Z     %36 = tt.expand_dims %35 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked12}>> -> tensor<1x1x128xi64, #blocked12>
2026-02-21T10:44:55.0588647Z     %37 = ttg.convert_layout %36 : tensor<1x1x128xi64, #blocked12> -> tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:55.0588896Z     %38 = tt.broadcast %37 : tensor<1x1x128xi64, #blocked3> -> tensor<1x4x128xi64, #blocked3>
2026-02-21T10:44:55.0589104Z     %39 = arith.addi %30, %38 : tensor<1x4x128xi64, #blocked3>
2026-02-21T10:44:55.0589267Z     %40 = arith.addi %18, %39 : tensor<1x4x128xi64, #blocked3>
2026-02-21T10:44:55.0589473Z     %41 = tt.addptr %16, %40 : tensor<1x4x128x!tt.ptr<bf16>, #blocked3>, tensor<1x4x128xi64, #blocked3>
2026-02-21T10:44:55.0589677Z     %42 = arith.cmpi sge, %14, %c0_i64 : i64
2026-02-21T10:44:55.0589803Z     %43 = arith.cmpi slt, %14, %c192_i64 : i64
2026-02-21T10:44:55.0589934Z     %44 = arith.andi %42, %43 : i1
2026-02-21T10:44:55.0590076Z     %45 = arith.cmpi sge, %27, %cst_7 : tensor<1x4x1xi64, #blocked4>
2026-02-21T10:44:55.0590256Z     %46 = arith.cmpi slt, %27, %cst_8 : tensor<1x4x1xi64, #blocked4>
2026-02-21T10:44:55.0590424Z     %47 = arith.andi %45, %46 : tensor<1x4x1xi1, #blocked4>
2026-02-21T10:44:55.0590579Z     %48 = tt.splat %44 : i1 -> tensor<1x4x1xi1, #blocked4>
2026-02-21T10:44:55.0590731Z     %49 = arith.andi %48, %47 : tensor<1x4x1xi1, #blocked4>
2026-02-21T10:44:55.0590938Z     %50 = tt.broadcast %49 : tensor<1x4x1xi1, #blocked4> -> tensor<1x4x128xi1, #blocked4>
2026-02-21T10:44:55.0591181Z     %51 = ttg.convert_layout %50 : tensor<1x4x128xi1, #blocked4> -> tensor<1x4x128xi1, #blocked3>
2026-02-21T10:44:55.0591401Z     %52 = arith.cmpi sge, %37, %cst_6 : tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:55.0591577Z     %53 = arith.cmpi slt, %37, %cst_5 : tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:55.0591749Z     %54 = arith.andi %52, %53 : tensor<1x1x128xi1, #blocked3>
2026-02-21T10:44:55.0591962Z     %55 = tt.broadcast %54 : tensor<1x1x128xi1, #blocked3> -> tensor<1x4x128xi1, #blocked3>
2026-02-21T10:44:55.0592160Z     %56 = arith.andi %51, %55 : tensor<1x4x128xi1, #blocked3>
2026-02-21T10:44:55.0592328Z     %57 = tt.load %41, %56, %cst_4 : tensor<1x4x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:44:55.0592527Z     %58 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked8>
2026-02-21T10:44:55.0592741Z     %59 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x16x!tt.ptr<bf16>, #blocked>
2026-02-21T10:44:55.0592942Z     %60 = tt.splat %17 : i64 -> tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.0593198Z     %61 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked11> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked13}>>
2026-02-21T10:44:55.0593547Z     %62 = tt.expand_dims %61 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x128x1xi64, #blocked13>
2026-02-21T10:44:55.0593856Z     %63 = ttg.convert_layout %62 : tensor<1x128x1xi64, #blocked13> -> tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:55.0594109Z     %64 = tt.broadcast %63 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x16xi64, #blocked2>
2026-02-21T10:44:55.0594353Z     %65 = ttg.convert_layout %64 : tensor<1x128x16xi64, #blocked2> -> tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.0594587Z     %66 = arith.extsi %58 : tensor<16xi32, #blocked8> to tensor<16xi64, #blocked8>
2026-02-21T10:44:55.0594781Z     %67 = arith.cmpi sge, %63, %cst_2 : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:55.0594959Z     %68 = arith.cmpi slt, %63, %cst_1 : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:55.0595144Z     %69 = arith.andi %67, %68 : tensor<1x128x1xi1, #blocked2>
2026-02-21T10:44:55.0595305Z     %70 = tt.splat %44 : i1 -> tensor<1x128x1xi1, #blocked2>
2026-02-21T10:44:55.0595497Z     %71 = arith.andi %70, %69 : tensor<1x128x1xi1, #blocked2>
2026-02-21T10:44:55.0595694Z     %72 = tt.broadcast %71 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x16xi1, #blocked2>
2026-02-21T10:44:55.0595936Z     %73 = ttg.convert_layout %72 : tensor<1x128x16xi1, #blocked2> -> tensor<1x128x16xi1, #blocked>
2026-02-21T10:44:55.0596180Z     %74 = tt.reshape %57 : tensor<1x4x128xbf16, #blocked3> -> tensor<4x128xbf16, #blocked11>
2026-02-21T10:44:55.0596366Z     %75 = arith.muli %7, %c16384_i32 : i32
2026-02-21T10:44:55.0596505Z     %76 = tt.splat %75 : i32 -> tensor<1x16x1xi32, #blocked5>
2026-02-21T10:44:55.0596751Z     %77 = ttg.convert_layout %13 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T10:44:55.0597080Z     %78 = tt.expand_dims %77 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9>
2026-02-21T10:44:55.0597377Z     %79 = ttg.convert_layout %78 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked11>
2026-02-21T10:44:55.0597669Z     %80 = ttg.convert_layout %79 : tensor<1x128xi32, #blocked11> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked12}>>
2026-02-21T10:44:55.0598014Z     %81 = tt.expand_dims %80 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked12}>> -> tensor<1x1x128xi32, #blocked12>
2026-02-21T10:44:55.0598322Z     %82 = ttg.convert_layout %81 : tensor<1x1x128xi32, #blocked12> -> tensor<1x1x128xi32, #blocked3>
2026-02-21T10:44:55.0598570Z     %83 = tt.broadcast %82 : tensor<1x1x128xi32, #blocked3> -> tensor<1x16x128xi32, #blocked3>
2026-02-21T10:44:55.0598814Z     %84 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x16x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:44:55.0599201Z     %85:3 = scf.for %arg4 = %c0_i32 to %c128_i32 step %c16_i32 iter_args(%arg5 = %cst_16, %arg6 = %cst_15, %arg7 = %cst_14) -> (tensor<1x4xf32, #blocked6>, tensor<1x4xf32, #blocked6>, tensor<1x4x128xf32, #blocked3>)  : i32 {
2026-02-21T10:44:55.0599554Z       %115 = tt.splat %arg4 : i32 -> tensor<16xi32, #blocked8>
2026-02-21T10:44:55.0599717Z       %116 = arith.addi %115, %58 : tensor<16xi32, #blocked8>
2026-02-21T10:44:55.0599876Z       %117 = arith.extsi %arg4 : i32 to i64
2026-02-21T10:44:55.0600017Z       %118 = tt.splat %117 : i64 -> tensor<16xi64, #blocked8>
2026-02-21T10:44:55.0600172Z       %119 = arith.addi %118, %66 : tensor<16xi64, #blocked8>
2026-02-21T10:44:55.0600408Z       %120 = ttg.convert_layout %119 : tensor<16xi64, #blocked8> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T10:44:55.0600743Z       %121 = tt.expand_dims %120 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x16xi64, #blocked9>
2026-02-21T10:44:55.0601055Z       %122 = ttg.convert_layout %121 : tensor<1x16xi64, #blocked9> -> tensor<1x16xi64, #blocked7>
2026-02-21T10:44:55.0601350Z       %123 = ttg.convert_layout %122 : tensor<1x16xi64, #blocked7> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked14}>>
2026-02-21T10:44:55.0601704Z       %124 = tt.expand_dims %123 {axis = 1 : i32} : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked14}>> -> tensor<1x1x16xi64, #blocked14>
2026-02-21T10:44:55.0602015Z       %125 = ttg.convert_layout %124 : tensor<1x1x16xi64, #blocked14> -> tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.0602236Z       %126 = arith.muli %125, %cst_3 : tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.0602445Z       %127 = tt.broadcast %126 : tensor<1x1x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked1>
2026-02-21T10:44:55.0602765Z       %128 = ttg.convert_layout %127 : tensor<1x128x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.0602990Z       %129 = arith.addi %65, %128 : tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.0603155Z       %130 = arith.addi %60, %129 : tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.0603374Z       %131 = tt.addptr %59, %130 : tensor<1x128x16x!tt.ptr<bf16>, #blocked>, tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.0603625Z       %132 = arith.cmpi sge, %125, %cst_0 : tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.0603810Z       %133 = arith.cmpi slt, %125, %cst_3 : tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.0603987Z       %134 = arith.andi %132, %133 : tensor<1x1x16xi1, #blocked1>
2026-02-21T10:44:55.0604196Z       %135 = tt.broadcast %134 : tensor<1x1x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked1>
2026-02-21T10:44:55.0604450Z       %136 = ttg.convert_layout %135 : tensor<1x128x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked>
2026-02-21T10:44:55.0604661Z       %137 = arith.andi %73, %136 : tensor<1x128x16xi1, #blocked>
2026-02-21T10:44:55.0604842Z       %138 = tt.load %131, %137, %cst : tensor<1x128x16x!tt.ptr<bf16>, #blocked>
2026-02-21T10:44:55.0605069Z       %139 = tt.reshape %138 : tensor<1x128x16xbf16, #blocked> -> tensor<128x16xbf16, #blocked7>
2026-02-21T10:44:55.0605377Z       %140 = ttg.convert_layout %74 : tensor<4x128xbf16, #blocked11> -> tensor<4x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked7}>>
2026-02-21T10:44:55.0605744Z       %141 = ttg.convert_layout %139 : tensor<128x16xbf16, #blocked7> -> tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked7}>>
2026-02-21T10:44:55.0606049Z       %142 = ttg.convert_layout %cst_13 : tensor<4x16xf32, #blocked7> -> tensor<4x16xf32, #blocked7>
2026-02-21T10:44:55.0606467Z       %143 = tt.dot %140, %141, %142, inputPrecision = tf32 : tensor<4x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked7}>> * tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked7}>> -> tensor<4x16xf32, #blocked7>
2026-02-21T10:44:55.0606894Z       %144 = tt.reshape %143 : tensor<4x16xf32, #blocked7> -> tensor<1x4x16xf32, #blocked>
2026-02-21T10:44:55.0607147Z       %145 = arith.truncf %144 : tensor<1x4x16xf32, #blocked> to tensor<1x4x16xbf16, #blocked>
2026-02-21T10:44:55.0607387Z       %146 = arith.extf %145 : tensor<1x4x16xbf16, #blocked> to tensor<1x4x16xf32, #blocked>
2026-02-21T10:44:55.0607581Z       %147 = "tt.reduce"(%146) <{axis = 2 : i32}> ({
2026-02-21T10:44:55.0607713Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T10:44:55.0607841Z         %203 = arith.maxnumf %arg8, %arg9 : f32
2026-02-21T10:44:55.0607969Z         tt.reduce.return %203 : f32
2026-02-21T10:44:55.0608182Z       }) : (tensor<1x4x16xf32, #blocked>) -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T10:44:55.0608474Z       %148 = ttg.convert_layout %147 : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0608750Z       %149 = arith.truncf %148 : tensor<1x4xf32, #blocked6> to tensor<1x4xbf16, #blocked6>
2026-02-21T10:44:55.0608976Z       %150 = arith.extf %149 : tensor<1x4xbf16, #blocked6> to tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0609190Z       %151 = arith.mulf %150, %cst_12 : tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0609387Z       %152 = arith.truncf %151 : tensor<1x4xf32, #blocked6> to tensor<1x4xbf16, #blocked6>
2026-02-21T10:44:55.0609605Z       %153 = arith.extf %152 : tensor<1x4xbf16, #blocked6> to tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0609809Z       %154 = arith.cmpf ogt, %arg5, %153 : tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0609987Z       %155 = arith.cmpf une, %arg5, %arg5 : tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0610154Z       %156 = arith.ori %154, %155 : tensor<1x4xi1, #blocked6>
2026-02-21T10:44:55.0610355Z       %157 = arith.select %156, %arg5, %153 : tensor<1x4xi1, #blocked6>, tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0610562Z       %158 = arith.mulf %146, %cst_11 : tensor<1x4x16xf32, #blocked>
2026-02-21T10:44:55.0610768Z       %159 = arith.truncf %158 : tensor<1x4x16xf32, #blocked> to tensor<1x4x16xbf16, #blocked>
2026-02-21T10:44:55.0611060Z       %160 = ttg.convert_layout %157 : tensor<1x4xf32, #blocked6> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>>
2026-02-21T10:44:55.0611404Z       %161 = tt.expand_dims %160 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x4x1xf32, #blocked10>
2026-02-21T10:44:55.0611729Z       %162 = ttg.convert_layout %161 : tensor<1x4x1xf32, #blocked10> -> tensor<1x4x1xf32, #blocked4>
2026-02-21T10:44:55.0611970Z       %163 = arith.extf %159 : tensor<1x4x16xbf16, #blocked> to tensor<1x4x16xf32, #blocked>
2026-02-21T10:44:55.0612211Z       %164 = tt.broadcast %162 : tensor<1x4x1xf32, #blocked4> -> tensor<1x4x16xf32, #blocked4>
2026-02-21T10:44:55.0612460Z       %165 = ttg.convert_layout %164 : tensor<1x4x16xf32, #blocked4> -> tensor<1x4x16xf32, #blocked>
2026-02-21T10:44:55.0612673Z       %166 = arith.subf %163, %165 : tensor<1x4x16xf32, #blocked>
2026-02-21T10:44:55.0612983Z       %167 = tt.extern_elementwise %166 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x4x16xf32, #blocked>) -> tensor<1x4x16xf32, #blocked>
2026-02-21T10:44:55.0613272Z       %168 = "tt.reduce"(%167) <{axis = 2 : i32}> ({
2026-02-21T10:44:55.0613406Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T10:44:55.0613530Z         %203 = arith.addf %arg8, %arg9 : f32
2026-02-21T10:44:55.0613652Z         tt.reduce.return %203 : f32
2026-02-21T10:44:55.0613846Z       }) : (tensor<1x4x16xf32, #blocked>) -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T10:44:55.0614137Z       %169 = ttg.convert_layout %168 : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0614385Z       %170 = arith.subf %arg5, %157 : tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0614675Z       %171 = tt.extern_elementwise %170 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x4xf32, #blocked6>) -> tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0614982Z       %172 = arith.mulf %arg6, %171 : tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0615150Z       %173 = arith.addf %172, %169 : tensor<1x4xf32, #blocked6>
2026-02-21T10:44:55.0615395Z       %174 = ttg.convert_layout %171 : tensor<1x4xf32, #blocked6> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>>
2026-02-21T10:44:55.0615747Z       %175 = tt.expand_dims %174 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x4x1xf32, #blocked10>
2026-02-21T10:44:55.0616051Z       %176 = ttg.convert_layout %175 : tensor<1x4x1xf32, #blocked10> -> tensor<1x4x1xf32, #blocked4>
2026-02-21T10:44:55.0616319Z       %177 = tt.broadcast %176 : tensor<1x4x1xf32, #blocked4> -> tensor<1x4x128xf32, #blocked4>
2026-02-21T10:44:55.0616573Z       %178 = ttg.convert_layout %177 : tensor<1x4x128xf32, #blocked4> -> tensor<1x4x128xf32, #blocked3>
2026-02-21T10:44:55.0616792Z       %179 = arith.mulf %arg7, %178 : tensor<1x4x128xf32, #blocked3>
2026-02-21T10:44:55.0617059Z       %180 = ttg.convert_layout %116 : tensor<16xi32, #blocked8> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T10:44:55.0617387Z       %181 = tt.expand_dims %180 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x16xi32, #blocked9>
2026-02-21T10:44:55.0617682Z       %182 = ttg.convert_layout %181 : tensor<1x16xi32, #blocked9> -> tensor<1x16xi32, #blocked7>
2026-02-21T10:44:55.0617977Z       %183 = ttg.convert_layout %182 : tensor<1x16xi32, #blocked7> -> tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked15}>>
2026-02-21T10:44:55.0618321Z       %184 = tt.expand_dims %183 {axis = 2 : i32} : tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x16x1xi32, #blocked15>
2026-02-21T10:44:55.0618630Z       %185 = ttg.convert_layout %184 : tensor<1x16x1xi32, #blocked15> -> tensor<1x16x1xi32, #blocked5>
2026-02-21T10:44:55.0618852Z       %186 = arith.muli %185, %cst_10 : tensor<1x16x1xi32, #blocked5>
2026-02-21T10:44:55.0619019Z       %187 = arith.addi %76, %186 : tensor<1x16x1xi32, #blocked5>
2026-02-21T10:44:55.0619225Z       %188 = tt.broadcast %187 : tensor<1x16x1xi32, #blocked5> -> tensor<1x16x128xi32, #blocked5>
2026-02-21T10:44:55.0619480Z       %189 = ttg.convert_layout %188 : tensor<1x16x128xi32, #blocked5> -> tensor<1x16x128xi32, #blocked3>
2026-02-21T10:44:55.0619704Z       %190 = arith.addi %189, %83 : tensor<1x16x128xi32, #blocked3>
2026-02-21T10:44:55.0619941Z       %191 = tt.addptr %84, %190 : tensor<1x16x128x!tt.ptr<bf16>, #blocked3>, tensor<1x16x128xi32, #blocked3>
2026-02-21T10:44:55.0620167Z       %192 = tt.load %191 : tensor<1x16x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:44:55.0620373Z       %193 = arith.truncf %167 : tensor<1x4x16xf32, #blocked> to tensor<1x4x16xbf16, #blocked>
2026-02-21T10:44:55.0620615Z       %194 = tt.reshape %179 : tensor<1x4x128xf32, #blocked3> -> tensor<4x128xf32, #blocked11>
2026-02-21T10:44:55.0620857Z       %195 = tt.reshape %193 : tensor<1x4x16xbf16, #blocked> -> tensor<4x16xbf16, #blocked7>
2026-02-21T10:44:55.0621100Z       %196 = tt.reshape %192 : tensor<1x16x128xbf16, #blocked3> -> tensor<16x128xbf16, #blocked11>
2026-02-21T10:44:55.0621405Z       %197 = ttg.convert_layout %195 : tensor<4x16xbf16, #blocked7> -> tensor<4x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>>
2026-02-21T10:44:55.0621768Z       %198 = ttg.convert_layout %196 : tensor<16x128xbf16, #blocked11> -> tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>>
2026-02-21T10:44:55.0622079Z       %199 = ttg.convert_layout %194 : tensor<4x128xf32, #blocked11> -> tensor<4x128xf32, #blocked16>
2026-02-21T10:44:55.0622508Z       %200 = tt.dot %197, %198, %199, inputPrecision = tf32 : tensor<4x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>> * tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>> -> tensor<4x128xf32, #blocked16>
2026-02-21T10:44:55.0622923Z       %201 = ttg.convert_layout %200 : tensor<4x128xf32, #blocked16> -> tensor<4x128xf32, #blocked11>
2026-02-21T10:44:55.0623170Z       %202 = tt.reshape %201 : tensor<4x128xf32, #blocked11> -> tensor<1x4x128xf32, #blocked3>
2026-02-21T10:44:55.0623466Z       scf.yield %157, %173, %202 : tensor<1x4xf32, #blocked6>, tensor<1x4xf32, #blocked6>, tensor<1x4x128xf32, #blocked3>
2026-02-21T10:44:55.0623682Z     } {tt.num_stages = 4 : i32}
2026-02-21T10:44:55.0623905Z     %86 = ttg.convert_layout %85#1 : tensor<1x4xf32, #blocked6> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>>
2026-02-21T10:44:55.0624247Z     %87 = tt.expand_dims %86 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x4x1xf32, #blocked10>
2026-02-21T10:44:55.0624556Z     %88 = ttg.convert_layout %87 : tensor<1x4x1xf32, #blocked10> -> tensor<1x4x1xf32, #blocked4>
2026-02-21T10:44:55.0624799Z     %89 = tt.broadcast %88 : tensor<1x4x1xf32, #blocked4> -> tensor<1x4x128xf32, #blocked4>
2026-02-21T10:44:55.0625041Z     %90 = ttg.convert_layout %89 : tensor<1x4x128xf32, #blocked4> -> tensor<1x4x128xf32, #blocked3>
2026-02-21T10:44:55.0625260Z     %91 = arith.divf %85#2, %90 : tensor<1x4x128xf32, #blocked3>
2026-02-21T10:44:55.0625490Z     %92 = arith.truncf %91 : tensor<1x4x128xf32, #blocked3> to tensor<1x4x128xbf16, #blocked3>
2026-02-21T10:44:55.0625676Z     %93 = arith.muli %7, %c16384_i32 : i32
2026-02-21T10:44:55.0625894Z     %94 = ttg.convert_layout %12 : tensor<4xi32, #blocked8> -> tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T10:44:55.0626214Z     %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<4xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x4xi32, #blocked9>
2026-02-21T10:44:55.0626498Z     %96 = ttg.convert_layout %95 : tensor<1x4xi32, #blocked9> -> tensor<1x4xi32, #blocked6>
2026-02-21T10:44:55.0626779Z     %97 = ttg.convert_layout %96 : tensor<1x4xi32, #blocked6> -> tensor<1x4xi32, #ttg.slice<{dim = 2, parent = #blocked10}>>
2026-02-21T10:44:55.0627108Z     %98 = tt.expand_dims %97 {axis = 2 : i32} : tensor<1x4xi32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x4x1xi32, #blocked10>
2026-02-21T10:44:55.0627408Z     %99 = ttg.convert_layout %98 : tensor<1x4x1xi32, #blocked10> -> tensor<1x4x1xi32, #blocked4>
2026-02-21T10:44:55.0627617Z     %100 = arith.muli %99, %cst_9 : tensor<1x4x1xi32, #blocked4>
2026-02-21T10:44:55.0627785Z     %101 = tt.splat %93 : i32 -> tensor<1x4x1xi32, #blocked4>
2026-02-21T10:44:55.0627949Z     %102 = arith.addi %101, %100 : tensor<1x4x1xi32, #blocked4>
2026-02-21T10:44:55.0628209Z     %103 = ttg.convert_layout %13 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T10:44:55.0628547Z     %104 = tt.expand_dims %103 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9>
2026-02-21T10:44:55.0628841Z     %105 = ttg.convert_layout %104 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked11>
2026-02-21T10:44:55.0629143Z     %106 = ttg.convert_layout %105 : tensor<1x128xi32, #blocked11> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked12}>>
2026-02-21T10:44:55.0629503Z     %107 = tt.expand_dims %106 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked12}>> -> tensor<1x1x128xi32, #blocked12>
2026-02-21T10:44:55.0629810Z     %108 = ttg.convert_layout %107 : tensor<1x1x128xi32, #blocked12> -> tensor<1x1x128xi32, #blocked3>
2026-02-21T10:44:55.0630063Z     %109 = tt.broadcast %102 : tensor<1x4x1xi32, #blocked4> -> tensor<1x4x128xi32, #blocked4>
2026-02-21T10:44:55.0630310Z     %110 = ttg.convert_layout %109 : tensor<1x4x128xi32, #blocked4> -> tensor<1x4x128xi32, #blocked3>
2026-02-21T10:44:55.0630564Z     %111 = tt.broadcast %108 : tensor<1x1x128xi32, #blocked3> -> tensor<1x4x128xi32, #blocked3>
2026-02-21T10:44:55.0630771Z     %112 = arith.addi %110, %111 : tensor<1x4x128xi32, #blocked3>
2026-02-21T10:44:55.0630963Z     %113 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x4x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:44:55.0631207Z     %114 = tt.addptr %113, %112 : tensor<1x4x128x!tt.ptr<bf16>, #blocked3>, tensor<1x4x128xi32, #blocked3>
2026-02-21T10:44:55.0631441Z     tt.store %114, %92 : tensor<1x4x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:44:55.0631587Z     tt.return
2026-02-21T10:44:55.0631671Z   }
2026-02-21T10:44:55.0631758Z }
2026-02-21T10:44:55.0631802Z 
2026-02-21T10:44:55.0631840Z {-#
2026-02-21T10:44:55.0631923Z   external_resources: {
2026-02-21T10:44:55.0632030Z     mlir_reproducer: {
2026-02-21T10:44:55.0634289Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T10:44:55.0636612Z       disable_threading: false,
2026-02-21T10:44:55.0636720Z       verify_each: true
2026-02-21T10:44:55.0636818Z     }
2026-02-21T10:44:55.0636898Z   }
2026-02-21T10:44:55.0636970Z #-}
2026-02-21T10:44:55.0637253Z /tmp/torchinductor_root/xd/cxd2hduf2l55d74tltebjgb4u7ej4fcfh4zszguf5nurbyjr3hpx.py:17:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T10:44:55.0637968Z /tmp/torchinductor_root/xd/cxd2hduf2l55d74tltebjgb4u7ej4fcfh4zszguf5nurbyjr3hpx.py:17:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T10:44:55.0638539Z [20s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T10:44:55.0639271Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 4, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T10:44:55.0639938Z Error: RuntimeError: PassManager::run failed
2026-02-21T10:44:55.0640114Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T10:44:55.1159372Z /tmp/torchinductor_root/qp/cqpelahacicx6ki3kthwus2gomtuw7rhjdarif5mfbgtnqccw44x.py:62:24: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T10:44:55.1160813Z             k = tl.load(tl.make_block_ptr(k_view, [192, 128, 128], [16384, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_3, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T10:44:55.1161685Z                        ^
2026-02-21T10:44:55.1163459Z /tmp/torchinductor_root/qp/cqpelahacicx6ki3kthwus2gomtuw7rhjdarif5mfbgtnqccw44x.py:64:145: note: - use: %105 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x16xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 4], order = [1, 0, 2]}>>) -> tensor<128x16xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 4], order = [0, 1]}>>
2026-02-21T10:44:55.1165144Z 
2026-02-21T10:44:55.1166077Z             qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T10:44:55.1167336Z                                                                                                                                                 ^
2026-02-21T10:44:55.1167926Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T10:44:55.1168656Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 4, 1], order = [2, 1, 0]}>
2026-02-21T10:44:55.1169536Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:55.1170456Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}>
2026-02-21T10:44:55.1171328Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:55.1172180Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T10:44:55.1172791Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T10:44:55.1173154Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [2, 1, 0]}>
2026-02-21T10:44:55.1173471Z #blocked7 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [4], order = [0]}>
2026-02-21T10:44:55.1173765Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [0, 1]}>
2026-02-21T10:44:55.1174064Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T10:44:55.1174383Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [0, 1, 2]}>
2026-02-21T10:44:55.1174711Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [0, 1, 2]}>
2026-02-21T10:44:55.1175057Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T10:44:55.1175375Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:44:55.1175693Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:44:55.1176010Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:55.1176332Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:44:55.1176668Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T10:44:55.1177196Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T10:44:55.1177604Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T10:44:55.1177730Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T10:44:55.1177863Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T10:44:55.1177990Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T10:44:55.1178162Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x128x16xbf16, #blocked>
2026-02-21T10:44:55.1178387Z     %cst_0 = arith.constant dense<0> : tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.1178574Z     %cst_1 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:55.1178761Z     %cst_2 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:55.1178941Z     %cst_3 = arith.constant dense<128> : tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.1179103Z     %c3072_i32 = arith.constant 3072 : i32
2026-02-21T10:44:55.1179224Z     %c4864_i32 = arith.constant 4864 : i32
2026-02-21T10:44:55.1179370Z     %c24576_i32 = arith.constant 24576 : i32
2026-02-21T10:44:55.1179524Z     %cst_4 = arith.constant dense<128> : tensor<1x16x1xi32, #blocked3>
2026-02-21T10:44:55.1179713Z     %cst_5 = arith.constant dense<0.127517432> : tensor<1x1x16xf32, #blocked1>
2026-02-21T10:44:55.1179915Z     %cst_6 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1180110Z     %cst_7 = arith.constant dense<0.000000e+00> : tensor<1x16xf32, #blocked5>
2026-02-21T10:44:55.1180281Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T10:44:55.1180459Z     %cst_8 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked6>
2026-02-21T10:44:55.1180666Z     %cst_9 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1180867Z     %cst_10 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1181028Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T10:44:55.1181150Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T10:44:55.1181274Z     %0 = tt.get_program_id x : i32
2026-02-21T10:44:55.1181439Z     %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked7>
2026-02-21T10:44:55.1181709Z     %2 = ttg.convert_layout %1 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:44:55.1182040Z     %3 = tt.expand_dims %2 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8>
2026-02-21T10:44:55.1182334Z     %4 = ttg.convert_layout %3 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked9>
2026-02-21T10:44:55.1182618Z     %5 = ttg.convert_layout %4 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T10:44:55.1182980Z     %6 = tt.expand_dims %5 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<1x1x128xi32, #blocked10>
2026-02-21T10:44:55.1183285Z     %7 = ttg.convert_layout %6 : tensor<1x1x128xi32, #blocked10> -> tensor<1x1x128xi32, #blocked6>
2026-02-21T10:44:55.1183521Z     %8 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T10:44:55.1183734Z     %9 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked7>
2026-02-21T10:44:55.1183940Z     %10 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x16x!tt.ptr<bf16>, #blocked>
2026-02-21T10:44:55.1184157Z     %11 = arith.extsi %1 : tensor<128xi32, #blocked7> to tensor<128xi64, #blocked7>
2026-02-21T10:44:55.1184433Z     %12 = ttg.convert_layout %11 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:44:55.1184764Z     %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8>
2026-02-21T10:44:55.1185062Z     %14 = ttg.convert_layout %13 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked9>
2026-02-21T10:44:55.1185350Z     %15 = ttg.convert_layout %14 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked11}>>
2026-02-21T10:44:55.1185695Z     %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x128x1xi64, #blocked11>
2026-02-21T10:44:55.1186001Z     %17 = ttg.convert_layout %16 : tensor<1x128x1xi64, #blocked11> -> tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:55.1186254Z     %18 = tt.broadcast %17 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x16xi64, #blocked2>
2026-02-21T10:44:55.1186520Z     %19 = ttg.convert_layout %18 : tensor<1x128x16xi64, #blocked2> -> tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.1186750Z     %20 = arith.extsi %9 : tensor<16xi32, #blocked7> to tensor<16xi64, #blocked7>
2026-02-21T10:44:55.1186945Z     %21 = arith.cmpi sge, %17, %cst_2 : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:55.1187124Z     %22 = arith.cmpi slt, %17, %cst_1 : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:55.1187291Z     %23 = arith.andi %21, %22 : tensor<1x128x1xi1, #blocked2>
2026-02-21T10:44:55.1187510Z     %24 = tt.broadcast %7 : tensor<1x1x128xi32, #blocked6> -> tensor<1x16x128xi32, #blocked6>
2026-02-21T10:44:55.1187760Z     %25 = ttg.convert_layout %24 : tensor<1x16x128xi32, #blocked6> -> tensor<1x16x128xi32, #blocked12>
2026-02-21T10:44:55.1187999Z     %26 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x16x128x!tt.ptr<bf16>, #blocked12>
2026-02-21T10:44:55.1188202Z     %27 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T10:44:55.1188411Z     scf.for %arg4 = %0 to %c24576_i32 step %c4864_i32  : i32 {
2026-02-21T10:44:55.1188564Z       %28 = arith.divsi %arg4, %c3072_i32 : i32
2026-02-21T10:44:55.1188690Z       %29 = arith.muli %28, %c16_i32 : i32
2026-02-21T10:44:55.1188811Z       %30 = arith.subi %c128_i32, %29 : i32
2026-02-21T10:44:55.1188928Z       %31 = arith.minsi %30, %c16_i32 : i32
2026-02-21T10:44:55.1189051Z       %32 = arith.remsi %arg4, %c3072_i32 : i32
2026-02-21T10:44:55.1189170Z       %33 = arith.remsi %32, %31 : i32
2026-02-21T10:44:55.1189284Z       %34 = arith.addi %29, %33 : i32
2026-02-21T10:44:55.1189395Z       %35 = arith.divsi %32, %31 : i32
2026-02-21T10:44:55.1189507Z       %36 = arith.muli %35, %c16384_i32 : i32
2026-02-21T10:44:55.1189625Z       %37 = arith.muli %34, %c128_i32 : i32
2026-02-21T10:44:55.1189736Z       %38 = arith.addi %36, %37 : i32
2026-02-21T10:44:55.1189871Z       %39 = tt.splat %38 : i32 -> tensor<1x1x128xi32, #blocked6>
2026-02-21T10:44:55.1190029Z       %40 = arith.addi %39, %7 : tensor<1x1x128xi32, #blocked6>
2026-02-21T10:44:55.1190241Z       %41 = tt.addptr %8, %40 : tensor<1x1x128x!tt.ptr<bf16>, #blocked6>, tensor<1x1x128xi32, #blocked6>
2026-02-21T10:44:55.1190448Z       %42 = tt.load %41 : tensor<1x1x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T10:44:55.1190588Z       %43 = arith.extsi %35 : i32 to i64
2026-02-21T10:44:55.1190728Z       %44 = arith.muli %43, %c16384_i64 : i64
2026-02-21T10:44:55.1190865Z       %45 = tt.splat %44 : i64 -> tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.1191009Z       %46 = arith.cmpi sge, %43, %c0_i64 : i64
2026-02-21T10:44:55.1191135Z       %47 = arith.cmpi slt, %43, %c192_i64 : i64
2026-02-21T10:44:55.1191257Z       %48 = arith.andi %46, %47 : i1
2026-02-21T10:44:55.1191387Z       %49 = tt.splat %48 : i1 -> tensor<1x128x1xi1, #blocked2>
2026-02-21T10:44:55.1191544Z       %50 = arith.andi %49, %23 : tensor<1x128x1xi1, #blocked2>
2026-02-21T10:44:55.1191739Z       %51 = tt.broadcast %50 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x16xi1, #blocked2>
2026-02-21T10:44:55.1191984Z       %52 = ttg.convert_layout %51 : tensor<1x128x16xi1, #blocked2> -> tensor<1x128x16xi1, #blocked>
2026-02-21T10:44:55.1192226Z       %53 = tt.reshape %42 : tensor<1x1x128xbf16, #blocked6> -> tensor<1x128xbf16, #blocked9>
2026-02-21T10:44:55.1192417Z       %54 = tt.splat %36 : i32 -> tensor<1x16x1xi32, #blocked3>
2026-02-21T10:44:55.1192774Z       %55:3 = scf.for %arg5 = %c0_i32 to %c128_i32 step %c16_i32 iter_args(%arg6 = %cst_10, %arg7 = %cst_9, %arg8 = %cst_8) -> (tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked6>)  : i32 {
2026-02-21T10:44:55.1193130Z         %64 = tt.splat %arg5 : i32 -> tensor<16xi32, #blocked7>
2026-02-21T10:44:55.1193284Z         %65 = arith.addi %64, %9 : tensor<16xi32, #blocked7>
2026-02-21T10:44:55.1193421Z         %66 = arith.extsi %arg5 : i32 to i64
2026-02-21T10:44:55.1193559Z         %67 = tt.splat %66 : i64 -> tensor<16xi64, #blocked7>
2026-02-21T10:44:55.1193714Z         %68 = arith.addi %67, %20 : tensor<16xi64, #blocked7>
2026-02-21T10:44:55.1193967Z         %69 = ttg.convert_layout %68 : tensor<16xi64, #blocked7> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:44:55.1194292Z         %70 = tt.expand_dims %69 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x16xi64, #blocked8>
2026-02-21T10:44:55.1194582Z         %71 = ttg.convert_layout %70 : tensor<1x16xi64, #blocked8> -> tensor<1x16xi64, #blocked5>
2026-02-21T10:44:55.1194883Z         %72 = ttg.convert_layout %71 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked13}>>
2026-02-21T10:44:55.1195225Z         %73 = tt.expand_dims %72 {axis = 1 : i32} : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked13}>> -> tensor<1x1x16xi64, #blocked13>
2026-02-21T10:44:55.1195534Z         %74 = ttg.convert_layout %73 : tensor<1x1x16xi64, #blocked13> -> tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.1195746Z         %75 = arith.muli %74, %cst_3 : tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.1195968Z         %76 = tt.broadcast %75 : tensor<1x1x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked1>
2026-02-21T10:44:55.1196214Z         %77 = ttg.convert_layout %76 : tensor<1x128x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.1196425Z         %78 = arith.addi %19, %77 : tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.1196588Z         %79 = arith.addi %45, %78 : tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.1196797Z         %80 = tt.addptr %10, %79 : tensor<1x128x16x!tt.ptr<bf16>, #blocked>, tensor<1x128x16xi64, #blocked>
2026-02-21T10:44:55.1197022Z         %81 = arith.cmpi sge, %74, %cst_0 : tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.1197199Z         %82 = arith.cmpi slt, %74, %cst_3 : tensor<1x1x16xi64, #blocked1>
2026-02-21T10:44:55.1197367Z         %83 = arith.andi %81, %82 : tensor<1x1x16xi1, #blocked1>
2026-02-21T10:44:55.1197563Z         %84 = tt.broadcast %83 : tensor<1x1x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked1>
2026-02-21T10:44:55.1197815Z         %85 = ttg.convert_layout %84 : tensor<1x128x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked>
2026-02-21T10:44:55.1198024Z         %86 = arith.andi %52, %85 : tensor<1x128x16xi1, #blocked>
2026-02-21T10:44:55.1198189Z         %87 = tt.load %80, %86, %cst : tensor<1x128x16x!tt.ptr<bf16>, #blocked>
2026-02-21T10:44:55.1198417Z         %88 = tt.reshape %87 : tensor<1x128x16xbf16, #blocked> -> tensor<128x16xbf16, #blocked5>
2026-02-21T10:44:55.1198708Z         %89 = ttg.convert_layout %53 : tensor<1x128xbf16, #blocked9> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>>
2026-02-21T10:44:55.1199062Z         %90 = ttg.convert_layout %88 : tensor<128x16xbf16, #blocked5> -> tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>>
2026-02-21T10:44:55.1199364Z         %91 = ttg.convert_layout %cst_7 : tensor<1x16xf32, #blocked5> -> tensor<1x16xf32, #blocked5>
2026-02-21T10:44:55.1199767Z         %92 = tt.dot %89, %90, %91, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>> * tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>> -> tensor<1x16xf32, #blocked5>
2026-02-21T10:44:55.1200153Z         %93 = tt.reshape %92 : tensor<1x16xf32, #blocked5> -> tensor<1x1x16xf32, #blocked1>
2026-02-21T10:44:55.1200390Z         %94 = arith.truncf %93 : tensor<1x1x16xf32, #blocked1> to tensor<1x1x16xbf16, #blocked1>
2026-02-21T10:44:55.1200620Z         %95 = arith.extf %94 : tensor<1x1x16xbf16, #blocked1> to tensor<1x1x16xf32, #blocked1>
2026-02-21T10:44:55.1200809Z         %96 = "tt.reduce"(%95) <{axis = 2 : i32}> ({
2026-02-21T10:44:55.1200936Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T10:44:55.1201062Z           %151 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T10:44:55.1201190Z           tt.reduce.return %151 : f32
2026-02-21T10:44:55.1201381Z         }) : (tensor<1x1x16xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T10:44:55.1201690Z         %97 = ttg.convert_layout %96 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1201953Z         %98 = arith.truncf %97 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4>
2026-02-21T10:44:55.1202173Z         %99 = arith.extf %98 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1202364Z         %100 = arith.mulf %99, %cst_6 : tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1202558Z         %101 = arith.truncf %100 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4>
2026-02-21T10:44:55.1202856Z         %102 = arith.extf %101 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1203053Z         %103 = arith.cmpf ogt, %arg6, %102 : tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1203230Z         %104 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1203395Z         %105 = arith.ori %103, %104 : tensor<1x1xi1, #blocked4>
2026-02-21T10:44:55.1203612Z         %106 = arith.select %105, %arg6, %102 : tensor<1x1xi1, #blocked4>, tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1203846Z         %107 = arith.mulf %95, %cst_5 : tensor<1x1x16xf32, #blocked1>
2026-02-21T10:44:55.1204053Z         %108 = arith.truncf %107 : tensor<1x1x16xf32, #blocked1> to tensor<1x1x16xbf16, #blocked1>
2026-02-21T10:44:55.1204342Z         %109 = ttg.convert_layout %106 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T10:44:55.1204685Z         %110 = tt.expand_dims %109 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T10:44:55.1204990Z         %111 = ttg.convert_layout %110 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15>
2026-02-21T10:44:55.1205237Z         %112 = arith.extf %108 : tensor<1x1x16xbf16, #blocked1> to tensor<1x1x16xf32, #blocked1>
2026-02-21T10:44:55.1205472Z         %113 = tt.broadcast %111 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x16xf32, #blocked15>
2026-02-21T10:44:55.1205729Z         %114 = ttg.convert_layout %113 : tensor<1x1x16xf32, #blocked15> -> tensor<1x1x16xf32, #blocked1>
2026-02-21T10:44:55.1205943Z         %115 = arith.subf %112, %114 : tensor<1x1x16xf32, #blocked1>
2026-02-21T10:44:55.1206263Z         %116 = tt.extern_elementwise %115 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x16xf32, #blocked1>) -> tensor<1x1x16xf32, #blocked1>
2026-02-21T10:44:55.1206554Z         %117 = "tt.reduce"(%116) <{axis = 2 : i32}> ({
2026-02-21T10:44:55.1206682Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T10:44:55.1206804Z           %151 = arith.addf %arg9, %arg10 : f32
2026-02-21T10:44:55.1206926Z           tt.reduce.return %151 : f32
2026-02-21T10:44:55.1207117Z         }) : (tensor<1x1x16xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T10:44:55.1207413Z         %118 = ttg.convert_layout %117 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1207660Z         %119 = arith.subf %arg6, %106 : tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1207948Z         %120 = tt.extern_elementwise %119 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked4>) -> tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1208238Z         %121 = arith.mulf %arg7, %120 : tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1208397Z         %122 = arith.addf %121, %118 : tensor<1x1xf32, #blocked4>
2026-02-21T10:44:55.1208645Z         %123 = ttg.convert_layout %120 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T10:44:55.1208985Z         %124 = tt.expand_dims %123 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T10:44:55.1209288Z         %125 = ttg.convert_layout %124 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15>
2026-02-21T10:44:55.1209560Z         %126 = tt.broadcast %125 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15>
2026-02-21T10:44:55.1209815Z         %127 = ttg.convert_layout %126 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked6>
2026-02-21T10:44:55.1210052Z         %128 = arith.mulf %arg8, %127 : tensor<1x1x128xf32, #blocked6>
2026-02-21T10:44:55.1210294Z         %129 = ttg.convert_layout %65 : tensor<16xi32, #blocked7> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:44:55.1210623Z         %130 = tt.expand_dims %129 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x16xi32, #blocked8>
2026-02-21T10:44:55.1210934Z         %131 = ttg.convert_layout %130 : tensor<1x16xi32, #blocked8> -> tensor<1x16xi32, #blocked5>
2026-02-21T10:44:55.1211224Z         %132 = ttg.convert_layout %131 : tensor<1x16xi32, #blocked5> -> tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked16}>>
2026-02-21T10:44:55.1211592Z         %133 = tt.expand_dims %132 {axis = 2 : i32} : tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked16}>> -> tensor<1x16x1xi32, #blocked16>
2026-02-21T10:44:55.1211899Z         %134 = ttg.convert_layout %133 : tensor<1x16x1xi32, #blocked16> -> tensor<1x16x1xi32, #blocked3>
2026-02-21T10:44:55.1212114Z         %135 = arith.muli %134, %cst_4 : tensor<1x16x1xi32, #blocked3>
2026-02-21T10:44:55.1212282Z         %136 = arith.addi %54, %135 : tensor<1x16x1xi32, #blocked3>
2026-02-21T10:44:55.1212482Z         %137 = tt.broadcast %136 : tensor<1x16x1xi32, #blocked3> -> tensor<1x16x128xi32, #blocked3>
2026-02-21T10:44:55.1212745Z         %138 = ttg.convert_layout %137 : tensor<1x16x128xi32, #blocked3> -> tensor<1x16x128xi32, #blocked12>
2026-02-21T10:44:55.1212965Z         %139 = arith.addi %138, %25 : tensor<1x16x128xi32, #blocked12>
2026-02-21T10:44:55.1213190Z         %140 = tt.addptr %26, %139 : tensor<1x16x128x!tt.ptr<bf16>, #blocked12>, tensor<1x16x128xi32, #blocked12>
2026-02-21T10:44:55.1213421Z         %141 = tt.load %140 : tensor<1x16x128x!tt.ptr<bf16>, #blocked12>
2026-02-21T10:44:55.1213629Z         %142 = arith.truncf %116 : tensor<1x1x16xf32, #blocked1> to tensor<1x1x16xbf16, #blocked1>
2026-02-21T10:44:55.1213872Z         %143 = tt.reshape %128 : tensor<1x1x128xf32, #blocked6> -> tensor<1x128xf32, #blocked9>
2026-02-21T10:44:55.1214102Z         %144 = tt.reshape %142 : tensor<1x1x16xbf16, #blocked1> -> tensor<1x16xbf16, #blocked5>
2026-02-21T10:44:55.1214361Z         %145 = tt.reshape %141 : tensor<1x16x128xbf16, #blocked12> -> tensor<16x128xbf16, #blocked9>
2026-02-21T10:44:55.1214663Z         %146 = ttg.convert_layout %144 : tensor<1x16xbf16, #blocked5> -> tensor<1x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>>
2026-02-21T10:44:55.1215015Z         %147 = ttg.convert_layout %145 : tensor<16x128xbf16, #blocked9> -> tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>>
2026-02-21T10:44:55.1215317Z         %148 = ttg.convert_layout %143 : tensor<1x128xf32, #blocked9> -> tensor<1x128xf32, #blocked9>
2026-02-21T10:44:55.1215731Z         %149 = tt.dot %146, %147, %148, inputPrecision = tf32 : tensor<1x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>> * tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>> -> tensor<1x128xf32, #blocked9>
2026-02-21T10:44:55.1216128Z         %150 = tt.reshape %149 : tensor<1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked6>
2026-02-21T10:44:55.1216408Z         scf.yield %106, %122, %150 : tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked6>
2026-02-21T10:44:55.1216650Z       } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32}
2026-02-21T10:44:55.1216903Z       %56 = ttg.convert_layout %55#1 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T10:44:55.1217237Z       %57 = tt.expand_dims %56 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T10:44:55.1217536Z       %58 = ttg.convert_layout %57 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15>
2026-02-21T10:44:55.1217803Z       %59 = tt.broadcast %58 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15>
2026-02-21T10:44:55.1218049Z       %60 = ttg.convert_layout %59 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked6>
2026-02-21T10:44:55.1218262Z       %61 = arith.divf %55#2, %60 : tensor<1x1x128xf32, #blocked6>
2026-02-21T10:44:55.1218464Z       %62 = arith.truncf %61 : tensor<1x1x128xf32, #blocked6> to tensor<1x1x128xbf16, #blocked6>
2026-02-21T10:44:55.1218731Z       %63 = tt.addptr %27, %40 : tensor<1x1x128x!tt.ptr<bf16>, #blocked6>, tensor<1x1x128xi32, #blocked6>
2026-02-21T10:44:55.1220794Z       tt.store %63, %62 : tensor<1x1x128x!tt.ptr<bf16>, #blocked6>
2026-02-21T10:44:55.1220965Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32}
2026-02-21T10:44:55.1221107Z     tt.return
2026-02-21T10:44:55.1221186Z   }
2026-02-21T10:44:55.1221259Z }
2026-02-21T10:44:55.1221301Z 
2026-02-21T10:44:55.1221334Z {-#
2026-02-21T10:44:55.1221417Z   external_resources: {
2026-02-21T10:44:55.1221515Z     mlir_reproducer: {
2026-02-21T10:44:55.1223782Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=2 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=2}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T10:44:55.1226058Z       disable_threading: false,
2026-02-21T10:44:55.1226164Z       verify_each: true
2026-02-21T10:44:55.1226257Z     }
2026-02-21T10:44:55.1226327Z   }
2026-02-21T10:44:55.1226405Z #-}
2026-02-21T10:44:55.1226696Z /tmp/torchinductor_root/qp/cqpelahacicx6ki3kthwus2gomtuw7rhjdarif5mfbgtnqccw44x.py:17:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T10:44:55.1227379Z /tmp/torchinductor_root/qp/cqpelahacicx6ki3kthwus2gomtuw7rhjdarif5mfbgtnqccw44x.py:17:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T10:44:55.1227946Z [20s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T10:44:55.1228754Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[4, 4], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T10:44:55.1229493Z Error: RuntimeError: PassManager::run failed
2026-02-21T10:44:55.1229663Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T10:44:56.7216073Z /tmp/torchinductor_root/lz/clzaf7w6wcyqtwtghizt7yfmicm4hjzvfbzhjpc2zvux6s7fsoth.py:57:20: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T10:44:56.7217884Z         k = tl.load(tl.make_block_ptr(k_view, [192, 128, 128], [16384, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T10:44:56.7218756Z                    ^
2026-02-21T10:44:56.7220358Z /tmp/torchinductor_root/lz/clzaf7w6wcyqtwtghizt7yfmicm4hjzvfbzhjpc2zvux6s7fsoth.py:59:141: note: - use: %124 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x32xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 4], order = [1, 0, 2]}>>) -> tensor<128x32xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 4], order = [0, 1]}>>
2026-02-21T10:44:56.7221626Z 
2026-02-21T10:44:56.7222295Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T10:44:56.7223192Z                                                                                                                                             ^
2026-02-21T10:44:56.7223641Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T10:44:56.7224105Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [1, 4, 1], order = [2, 1, 0]}>
2026-02-21T10:44:56.7224733Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 1, 32], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:56.7225351Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}>
2026-02-21T10:44:56.7225966Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [2, 1, 0]}>
2026-02-21T10:44:56.7226587Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 32, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:56.7227185Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T10:44:56.7227880Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T10:44:56.7228434Z #blocked7 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [4], order = [0]}>
2026-02-21T10:44:56.7228987Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [0, 1]}>
2026-02-21T10:44:56.7229565Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T10:44:56.7230076Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [0, 1, 2]}>
2026-02-21T10:44:56.7230544Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [0, 1, 2]}>
2026-02-21T10:44:56.7230991Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T10:44:56.7231439Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 1, 32], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:44:56.7231887Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:44:56.7232331Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:44:56.7232773Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 32, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:44:56.7233252Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T10:44:56.7234004Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T10:44:56.7234567Z     %c16384_i32 = arith.constant 16384 : i32
2026-02-21T10:44:56.7234747Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T10:44:56.7234941Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T10:44:56.7235108Z     %c128_i64 = arith.constant 128 : i64
2026-02-21T10:44:56.7235311Z     %c16384_i64 = arith.constant 16384 : i64
2026-02-21T10:44:56.7235547Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x128x32xbf16, #blocked>
2026-02-21T10:44:56.7235828Z     %cst_0 = arith.constant dense<0> : tensor<1x1x32xi64, #blocked1>
2026-02-21T10:44:56.7236088Z     %cst_1 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:56.7236344Z     %cst_2 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:56.7236600Z     %cst_3 = arith.constant dense<128> : tensor<1x1x32xi64, #blocked1>
2026-02-21T10:44:56.7236875Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<1x1x128xbf16, #blocked3>
2026-02-21T10:44:56.7237149Z     %cst_5 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:56.7237405Z     %cst_6 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:56.7237617Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T10:44:56.7237786Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T10:44:56.7237952Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T10:44:56.7238157Z     %cst_7 = arith.constant dense<128> : tensor<1x32x1xi32, #blocked4>
2026-02-21T10:44:56.7238426Z     %cst_8 = arith.constant dense<0.127517432> : tensor<1x1x32xf32, #blocked1>
2026-02-21T10:44:56.7238703Z     %cst_9 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7238981Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<1x32xf32, #blocked6>
2026-02-21T10:44:56.7239236Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T10:44:56.7239461Z     %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked3>
2026-02-21T10:44:56.7239771Z     %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7240047Z     %cst_13 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7240258Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T10:44:56.7240392Z     %c192_i32 = arith.constant 192 : i32
2026-02-21T10:44:56.7240527Z     %0 = tt.get_program_id x : i32
2026-02-21T10:44:56.7240662Z     %1 = tt.get_program_id y : i32
2026-02-21T10:44:56.7240793Z     %2 = arith.muli %1, %c192_i32 : i32
2026-02-21T10:44:56.7251176Z     %3 = arith.addi %0, %2 : i32
2026-02-21T10:44:56.7251322Z     %4 = arith.divsi %3, %c512_i32 : i32
2026-02-21T10:44:56.7251453Z     %5 = arith.muli %4, %c4_i32 : i32
2026-02-21T10:44:56.7251569Z     %6 = arith.subi %c192_i32, %5 : i32
2026-02-21T10:44:56.7251714Z     %7 = arith.minsi %6, %c4_i32 : i32
2026-02-21T10:44:56.7251832Z     %8 = arith.remsi %3, %c512_i32 : i32
2026-02-21T10:44:56.7251953Z     %9 = arith.remsi %8, %7 : i32
2026-02-21T10:44:56.7252064Z     %10 = arith.addi %5, %9 : i32
2026-02-21T10:44:56.7252184Z     %11 = arith.divsi %8, %7 : i32
2026-02-21T10:44:56.7252346Z     %12 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked7>
2026-02-21T10:44:56.7252528Z     %13 = arith.extsi %10 : i32 to i64
2026-02-21T10:44:56.7252646Z     %14 = arith.extsi %11 : i32 to i64
2026-02-21T10:44:56.7252810Z     %15 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:44:56.7252989Z     %16 = arith.muli %13, %c16384_i64 : i64
2026-02-21T10:44:56.7253135Z     %17 = tt.splat %16 : i64 -> tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:56.7253281Z     %18 = arith.muli %14, %c128_i64 : i64
2026-02-21T10:44:56.7253468Z     %19 = tt.splat %18 : i64 -> tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:56.7253657Z     %20 = arith.extsi %12 : tensor<128xi32, #blocked7> to tensor<128xi64, #blocked7>
2026-02-21T10:44:56.7253934Z     %21 = ttg.convert_layout %20 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:44:56.7254273Z     %22 = tt.expand_dims %21 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8>
2026-02-21T10:44:56.7254596Z     %23 = ttg.convert_layout %22 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked9>
2026-02-21T10:44:56.7254907Z     %24 = ttg.convert_layout %23 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T10:44:56.7255258Z     %25 = tt.expand_dims %24 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<1x1x128xi64, #blocked10>
2026-02-21T10:44:56.7255569Z     %26 = ttg.convert_layout %25 : tensor<1x1x128xi64, #blocked10> -> tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:56.7255780Z     %27 = arith.addi %19, %26 : tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:56.7255944Z     %28 = arith.addi %17, %27 : tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:56.7256151Z     %29 = tt.addptr %15, %28 : tensor<1x1x128x!tt.ptr<bf16>, #blocked3>, tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:56.7256351Z     %30 = arith.cmpi sge, %13, %c0_i64 : i64
2026-02-21T10:44:56.7256487Z     %31 = arith.cmpi slt, %13, %c192_i64 : i64
2026-02-21T10:44:56.7256611Z     %32 = arith.andi %30, %31 : i1
2026-02-21T10:44:56.7256732Z     %33 = arith.cmpi sge, %14, %c0_i64 : i64
2026-02-21T10:44:56.7256858Z     %34 = arith.cmpi slt, %14, %c128_i64 : i64
2026-02-21T10:44:56.7256982Z     %35 = arith.andi %33, %34 : i1
2026-02-21T10:44:56.7257094Z     %36 = arith.andi %32, %35 : i1
2026-02-21T10:44:56.7257231Z     %37 = tt.splat %36 : i1 -> tensor<1x1x128xi1, #blocked3>
2026-02-21T10:44:56.7257399Z     %38 = arith.cmpi sge, %26, %cst_6 : tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:56.7257579Z     %39 = arith.cmpi slt, %26, %cst_5 : tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:56.7257753Z     %40 = arith.andi %38, %39 : tensor<1x1x128xi1, #blocked3>
2026-02-21T10:44:56.7257912Z     %41 = arith.andi %37, %40 : tensor<1x1x128xi1, #blocked3>
2026-02-21T10:44:56.7258115Z     %42 = tt.load %29, %41, %cst_4 : tensor<1x1x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:44:56.7258315Z     %43 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked7>
2026-02-21T10:44:56.7258535Z     %44 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x32x!tt.ptr<bf16>, #blocked>
2026-02-21T10:44:56.7258729Z     %45 = tt.splat %16 : i64 -> tensor<1x128x32xi64, #blocked>
2026-02-21T10:44:56.7258978Z     %46 = ttg.convert_layout %23 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked11}>>
2026-02-21T10:44:56.7259331Z     %47 = tt.expand_dims %46 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x128x1xi64, #blocked11>
2026-02-21T10:44:56.7259641Z     %48 = ttg.convert_layout %47 : tensor<1x128x1xi64, #blocked11> -> tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:56.7259898Z     %49 = tt.broadcast %48 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x32xi64, #blocked2>
2026-02-21T10:44:56.7260157Z     %50 = ttg.convert_layout %49 : tensor<1x128x32xi64, #blocked2> -> tensor<1x128x32xi64, #blocked>
2026-02-21T10:44:56.7260391Z     %51 = arith.extsi %43 : tensor<32xi32, #blocked7> to tensor<32xi64, #blocked7>
2026-02-21T10:44:56.7260592Z     %52 = arith.cmpi sge, %48, %cst_2 : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:56.7260767Z     %53 = arith.cmpi slt, %48, %cst_1 : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:44:56.7260942Z     %54 = arith.andi %52, %53 : tensor<1x128x1xi1, #blocked2>
2026-02-21T10:44:56.7261103Z     %55 = tt.splat %32 : i1 -> tensor<1x128x1xi1, #blocked2>
2026-02-21T10:44:56.7261283Z     %56 = arith.andi %55, %54 : tensor<1x128x1xi1, #blocked2>
2026-02-21T10:44:56.7261485Z     %57 = tt.broadcast %56 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x32xi1, #blocked2>
2026-02-21T10:44:56.7261732Z     %58 = ttg.convert_layout %57 : tensor<1x128x32xi1, #blocked2> -> tensor<1x128x32xi1, #blocked>
2026-02-21T10:44:56.7261990Z     %59 = tt.reshape %42 : tensor<1x1x128xbf16, #blocked3> -> tensor<1x128xbf16, #blocked9>
2026-02-21T10:44:56.7262175Z     %60 = arith.muli %10, %c16384_i32 : i32
2026-02-21T10:44:56.7262347Z     %61 = tt.splat %60 : i32 -> tensor<1x32x1xi32, #blocked4>
2026-02-21T10:44:56.7262591Z     %62 = ttg.convert_layout %12 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:44:56.7262936Z     %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8>
2026-02-21T10:44:56.7263237Z     %64 = ttg.convert_layout %63 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked9>
2026-02-21T10:44:56.7263529Z     %65 = ttg.convert_layout %64 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked10}>>
2026-02-21T10:44:56.7263883Z     %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked10}>> -> tensor<1x1x128xi32, #blocked10>
2026-02-21T10:44:56.7264196Z     %67 = ttg.convert_layout %66 : tensor<1x1x128xi32, #blocked10> -> tensor<1x1x128xi32, #blocked3>
2026-02-21T10:44:56.7264450Z     %68 = tt.broadcast %67 : tensor<1x1x128xi32, #blocked3> -> tensor<1x32x128xi32, #blocked3>
2026-02-21T10:44:56.7264711Z     %69 = ttg.convert_layout %68 : tensor<1x32x128xi32, #blocked3> -> tensor<1x32x128xi32, #blocked12>
2026-02-21T10:44:56.7264954Z     %70 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x32x128x!tt.ptr<bf16>, #blocked12>
2026-02-21T10:44:56.7265344Z     %71:3 = scf.for %arg4 = %c0_i32 to %c128_i32 step %c32_i32 iter_args(%arg5 = %cst_13, %arg6 = %cst_12, %arg7 = %cst_11) -> (tensor<1x1xf32, #blocked5>, tensor<1x1xf32, #blocked5>, tensor<1x1x128xf32, #blocked3>)  : i32 {
2026-02-21T10:44:56.7265702Z       %81 = tt.splat %arg4 : i32 -> tensor<32xi32, #blocked7>
2026-02-21T10:44:56.7265863Z       %82 = arith.addi %81, %43 : tensor<32xi32, #blocked7>
2026-02-21T10:44:56.7266013Z       %83 = arith.extsi %arg4 : i32 to i64
2026-02-21T10:44:56.7266170Z       %84 = tt.splat %83 : i64 -> tensor<32xi64, #blocked7>
2026-02-21T10:44:56.7266326Z       %85 = arith.addi %84, %51 : tensor<32xi64, #blocked7>
2026-02-21T10:44:56.7266571Z       %86 = ttg.convert_layout %85 : tensor<32xi64, #blocked7> -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:44:56.7266904Z       %87 = tt.expand_dims %86 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x32xi64, #blocked8>
2026-02-21T10:44:56.7267200Z       %88 = ttg.convert_layout %87 : tensor<1x32xi64, #blocked8> -> tensor<1x32xi64, #blocked6>
2026-02-21T10:44:56.7267489Z       %89 = ttg.convert_layout %88 : tensor<1x32xi64, #blocked6> -> tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked13}>>
2026-02-21T10:44:56.7267836Z       %90 = tt.expand_dims %89 {axis = 1 : i32} : tensor<1x32xi64, #ttg.slice<{dim = 1, parent = #blocked13}>> -> tensor<1x1x32xi64, #blocked13>
2026-02-21T10:44:56.7268149Z       %91 = ttg.convert_layout %90 : tensor<1x1x32xi64, #blocked13> -> tensor<1x1x32xi64, #blocked1>
2026-02-21T10:44:56.7268367Z       %92 = arith.muli %91, %cst_3 : tensor<1x1x32xi64, #blocked1>
2026-02-21T10:44:56.7268579Z       %93 = tt.broadcast %92 : tensor<1x1x32xi64, #blocked1> -> tensor<1x128x32xi64, #blocked1>
2026-02-21T10:44:56.7268829Z       %94 = ttg.convert_layout %93 : tensor<1x128x32xi64, #blocked1> -> tensor<1x128x32xi64, #blocked>
2026-02-21T10:44:56.7269049Z       %95 = arith.addi %50, %94 : tensor<1x128x32xi64, #blocked>
2026-02-21T10:44:56.7269214Z       %96 = arith.addi %45, %95 : tensor<1x128x32xi64, #blocked>
2026-02-21T10:44:56.7269425Z       %97 = tt.addptr %44, %96 : tensor<1x128x32x!tt.ptr<bf16>, #blocked>, tensor<1x128x32xi64, #blocked>
2026-02-21T10:44:56.7269684Z       %98 = arith.cmpi sge, %91, %cst_0 : tensor<1x1x32xi64, #blocked1>
2026-02-21T10:44:56.7269867Z       %99 = arith.cmpi slt, %91, %cst_3 : tensor<1x1x32xi64, #blocked1>
2026-02-21T10:44:56.7270041Z       %100 = arith.andi %98, %99 : tensor<1x1x32xi1, #blocked1>
2026-02-21T10:44:56.7270253Z       %101 = tt.broadcast %100 : tensor<1x1x32xi1, #blocked1> -> tensor<1x128x32xi1, #blocked1>
2026-02-21T10:44:56.7270506Z       %102 = ttg.convert_layout %101 : tensor<1x128x32xi1, #blocked1> -> tensor<1x128x32xi1, #blocked>
2026-02-21T10:44:56.7270743Z       %103 = arith.andi %58, %102 : tensor<1x128x32xi1, #blocked>
2026-02-21T10:44:56.7270947Z       %104 = tt.load %97, %103, %cst : tensor<1x128x32x!tt.ptr<bf16>, #blocked>
2026-02-21T10:44:56.7271169Z       %105 = tt.reshape %104 : tensor<1x128x32xbf16, #blocked> -> tensor<128x32xbf16, #blocked6>
2026-02-21T10:44:56.7271477Z       %106 = ttg.convert_layout %59 : tensor<1x128xbf16, #blocked9> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>>
2026-02-21T10:44:56.7271836Z       %107 = ttg.convert_layout %105 : tensor<128x32xbf16, #blocked6> -> tensor<128x32xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>>
2026-02-21T10:44:56.7272149Z       %108 = ttg.convert_layout %cst_10 : tensor<1x32xf32, #blocked6> -> tensor<1x32xf32, #blocked6>
2026-02-21T10:44:56.7272569Z       %109 = tt.dot %106, %107, %108, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * tensor<128x32xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<1x32xf32, #blocked6>
2026-02-21T10:44:56.7272970Z       %110 = tt.reshape %109 : tensor<1x32xf32, #blocked6> -> tensor<1x1x32xf32, #blocked1>
2026-02-21T10:44:56.7273216Z       %111 = arith.truncf %110 : tensor<1x1x32xf32, #blocked1> to tensor<1x1x32xbf16, #blocked1>
2026-02-21T10:44:56.7273461Z       %112 = arith.extf %111 : tensor<1x1x32xbf16, #blocked1> to tensor<1x1x32xf32, #blocked1>
2026-02-21T10:44:56.7273666Z       %113 = "tt.reduce"(%112) <{axis = 2 : i32}> ({
2026-02-21T10:44:56.7273807Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T10:44:56.7273937Z         %168 = arith.maxnumf %arg8, %arg9 : f32
2026-02-21T10:44:56.7274073Z         tt.reduce.return %168 : f32
2026-02-21T10:44:56.7274266Z       }) : (tensor<1x1x32xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T10:44:56.7274589Z       %114 = ttg.convert_layout %113 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7274873Z       %115 = arith.truncf %114 : tensor<1x1xf32, #blocked5> to tensor<1x1xbf16, #blocked5>
2026-02-21T10:44:56.7275101Z       %116 = arith.extf %115 : tensor<1x1xbf16, #blocked5> to tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7275309Z       %117 = arith.mulf %116, %cst_9 : tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7275506Z       %118 = arith.truncf %117 : tensor<1x1xf32, #blocked5> to tensor<1x1xbf16, #blocked5>
2026-02-21T10:44:56.7275733Z       %119 = arith.extf %118 : tensor<1x1xbf16, #blocked5> to tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7275933Z       %120 = arith.cmpf ogt, %arg5, %119 : tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7276119Z       %121 = arith.cmpf une, %arg5, %arg5 : tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7276294Z       %122 = arith.ori %120, %121 : tensor<1x1xi1, #blocked5>
2026-02-21T10:44:56.7276493Z       %123 = arith.select %122, %arg5, %119 : tensor<1x1xi1, #blocked5>, tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7276710Z       %124 = arith.mulf %112, %cst_8 : tensor<1x1x32xf32, #blocked1>
2026-02-21T10:44:56.7276920Z       %125 = arith.truncf %124 : tensor<1x1x32xf32, #blocked1> to tensor<1x1x32xbf16, #blocked1>
2026-02-21T10:44:56.7277216Z       %126 = ttg.convert_layout %123 : tensor<1x1xf32, #blocked5> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T10:44:56.7277567Z       %127 = tt.expand_dims %126 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T10:44:56.7277889Z       %128 = ttg.convert_layout %127 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15>
2026-02-21T10:44:56.7278148Z       %129 = arith.extf %125 : tensor<1x1x32xbf16, #blocked1> to tensor<1x1x32xf32, #blocked1>
2026-02-21T10:44:56.7278393Z       %130 = tt.broadcast %128 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x32xf32, #blocked15>
2026-02-21T10:44:56.7278650Z       %131 = ttg.convert_layout %130 : tensor<1x1x32xf32, #blocked15> -> tensor<1x1x32xf32, #blocked1>
2026-02-21T10:44:56.7278890Z       %132 = arith.subf %129, %131 : tensor<1x1x32xf32, #blocked1>
2026-02-21T10:44:56.7279225Z       %133 = tt.extern_elementwise %132 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x32xf32, #blocked1>) -> tensor<1x1x32xf32, #blocked1>
2026-02-21T10:44:56.7279524Z       %134 = "tt.reduce"(%133) <{axis = 2 : i32}> ({
2026-02-21T10:44:56.7279655Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T10:44:56.7279784Z         %168 = arith.addf %arg8, %arg9 : f32
2026-02-21T10:44:56.7279909Z         tt.reduce.return %168 : f32
2026-02-21T10:44:56.7280103Z       }) : (tensor<1x1x32xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T10:44:56.7280403Z       %135 = ttg.convert_layout %134 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7280650Z       %136 = arith.subf %arg5, %123 : tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7280946Z       %137 = tt.extern_elementwise %136 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked5>) -> tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7281242Z       %138 = arith.mulf %arg6, %137 : tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7281405Z       %139 = arith.addf %138, %135 : tensor<1x1xf32, #blocked5>
2026-02-21T10:44:56.7281655Z       %140 = ttg.convert_layout %137 : tensor<1x1xf32, #blocked5> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T10:44:56.7281998Z       %141 = tt.expand_dims %140 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T10:44:56.7282308Z       %142 = ttg.convert_layout %141 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15>
2026-02-21T10:44:56.7282641Z       %143 = tt.broadcast %142 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15>
2026-02-21T10:44:56.7282900Z       %144 = ttg.convert_layout %143 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked3>
2026-02-21T10:44:56.7283127Z       %145 = arith.mulf %arg7, %144 : tensor<1x1x128xf32, #blocked3>
2026-02-21T10:44:56.7283374Z       %146 = ttg.convert_layout %82 : tensor<32xi32, #blocked7> -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:44:56.7283708Z       %147 = tt.expand_dims %146 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x32xi32, #blocked8>
2026-02-21T10:44:56.7284005Z       %148 = ttg.convert_layout %147 : tensor<1x32xi32, #blocked8> -> tensor<1x32xi32, #blocked6>
2026-02-21T10:44:56.7284296Z       %149 = ttg.convert_layout %148 : tensor<1x32xi32, #blocked6> -> tensor<1x32xi32, #ttg.slice<{dim = 2, parent = #blocked16}>>
2026-02-21T10:44:56.7284648Z       %150 = tt.expand_dims %149 {axis = 2 : i32} : tensor<1x32xi32, #ttg.slice<{dim = 2, parent = #blocked16}>> -> tensor<1x32x1xi32, #blocked16>
2026-02-21T10:44:56.7284957Z       %151 = ttg.convert_layout %150 : tensor<1x32x1xi32, #blocked16> -> tensor<1x32x1xi32, #blocked4>
2026-02-21T10:44:56.7285179Z       %152 = arith.muli %151, %cst_7 : tensor<1x32x1xi32, #blocked4>
2026-02-21T10:44:56.7285352Z       %153 = arith.addi %61, %152 : tensor<1x32x1xi32, #blocked4>
2026-02-21T10:44:56.7285555Z       %154 = tt.broadcast %153 : tensor<1x32x1xi32, #blocked4> -> tensor<1x32x128xi32, #blocked4>
2026-02-21T10:44:56.7285825Z       %155 = ttg.convert_layout %154 : tensor<1x32x128xi32, #blocked4> -> tensor<1x32x128xi32, #blocked12>
2026-02-21T10:44:56.7286069Z       %156 = arith.addi %155, %69 : tensor<1x32x128xi32, #blocked12>
2026-02-21T10:44:56.7286299Z       %157 = tt.addptr %70, %156 : tensor<1x32x128x!tt.ptr<bf16>, #blocked12>, tensor<1x32x128xi32, #blocked12>
2026-02-21T10:44:56.7286534Z       %158 = tt.load %157 : tensor<1x32x128x!tt.ptr<bf16>, #blocked12>
2026-02-21T10:44:56.7286747Z       %159 = arith.truncf %133 : tensor<1x1x32xf32, #blocked1> to tensor<1x1x32xbf16, #blocked1>
2026-02-21T10:44:56.7286995Z       %160 = tt.reshape %145 : tensor<1x1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked9>
2026-02-21T10:44:56.7287249Z       %161 = tt.reshape %159 : tensor<1x1x32xbf16, #blocked1> -> tensor<1x32xbf16, #blocked6>
2026-02-21T10:44:56.7287516Z       %162 = tt.reshape %158 : tensor<1x32x128xbf16, #blocked12> -> tensor<32x128xbf16, #blocked9>
2026-02-21T10:44:56.7287819Z       %163 = ttg.convert_layout %161 : tensor<1x32xbf16, #blocked6> -> tensor<1x32xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>>
2026-02-21T10:44:56.7288182Z       %164 = ttg.convert_layout %162 : tensor<32x128xbf16, #blocked9> -> tensor<32x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>>
2026-02-21T10:44:56.7288493Z       %165 = ttg.convert_layout %160 : tensor<1x128xf32, #blocked9> -> tensor<1x128xf32, #blocked9>
2026-02-21T10:44:56.7288901Z       %166 = tt.dot %163, %164, %165, inputPrecision = tf32 : tensor<1x32xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>> * tensor<32x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>> -> tensor<1x128xf32, #blocked9>
2026-02-21T10:44:56.7289301Z       %167 = tt.reshape %166 : tensor<1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked3>
2026-02-21T10:44:56.7289581Z       scf.yield %123, %139, %167 : tensor<1x1xf32, #blocked5>, tensor<1x1xf32, #blocked5>, tensor<1x1x128xf32, #blocked3>
2026-02-21T10:44:56.7289838Z     } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32}
2026-02-21T10:44:56.7290111Z     %72 = ttg.convert_layout %71#1 : tensor<1x1xf32, #blocked5> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T10:44:56.7290450Z     %73 = tt.expand_dims %72 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T10:44:56.7290754Z     %74 = ttg.convert_layout %73 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15>
2026-02-21T10:44:56.7291025Z     %75 = tt.broadcast %74 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15>
2026-02-21T10:44:56.7291272Z     %76 = ttg.convert_layout %75 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked3>
2026-02-21T10:44:56.7291495Z     %77 = arith.divf %71#2, %76 : tensor<1x1x128xf32, #blocked3>
2026-02-21T10:44:56.7291706Z     %78 = arith.truncf %77 : tensor<1x1x128xf32, #blocked3> to tensor<1x1x128xbf16, #blocked3>
2026-02-21T10:44:56.7291936Z     %79 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:44:56.7292177Z     %80 = tt.addptr %79, %28 : tensor<1x1x128x!tt.ptr<bf16>, #blocked3>, tensor<1x1x128xi64, #blocked3>
2026-02-21T10:44:56.7292397Z     tt.store %80, %78, %41 : tensor<1x1x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:44:56.7292548Z     tt.return
2026-02-21T10:44:56.7292633Z   }
2026-02-21T10:44:56.7292724Z }
2026-02-21T10:44:56.7292771Z 
2026-02-21T10:44:56.7292809Z {-#
2026-02-21T10:44:56.7292895Z   external_resources: {
2026-02-21T10:44:56.7293009Z     mlir_reproducer: {
2026-02-21T10:44:56.7295263Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T10:44:56.7297577Z       disable_threading: false,
2026-02-21T10:44:56.7297695Z       verify_each: true
2026-02-21T10:44:56.7297797Z     }
2026-02-21T10:44:56.7297873Z   }
2026-02-21T10:44:56.7297955Z #-}
2026-02-21T10:44:56.7298238Z /tmp/torchinductor_root/lz/clzaf7w6wcyqtwtghizt7yfmicm4hjzvfbzhjpc2zvux6s7fsoth.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T10:44:56.7298957Z /tmp/torchinductor_root/lz/clzaf7w6wcyqtwtghizt7yfmicm4hjzvfbzhjpc2zvux6s7fsoth.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T10:44:56.7299518Z [22s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T10:44:56.7300254Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T10:44:56.7300922Z Error: RuntimeError: PassManager::run failed
2026-02-21T10:44:56.7301098Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T10:44:59.0286806Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 16.5 configs/s
2026-02-21T10:44:59.0293402Z [24s] Adaptive compile timeout: 30s (90% percentile=3.4s, bounds=[30.0s, 60s])
2026-02-21T10:44:59.9638588Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 991.8 configs/s
2026-02-21T10:45:00.2617261Z [25s] Initial random population of 100, 5 starting points: 
2026-02-21T10:45:00.2617438Z error=9
2026-02-21T10:45:00.2617526Z ok=91
2026-02-21T10:45:00.2657857Z min=0.0253
2026-02-21T10:45:00.2657959Z mid=0.1022
2026-02-21T10:45:00.2658048Z max=6.1846
2026-02-21T10:45:00.2658144Z best={'block_sizes': [1, 16, 16],
2026-02-21T10:45:00.2658313Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T10:45:00.2658478Z  'l2_groupings': [1],
2026-02-21T10:45:00.2658594Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:45:00.2658722Z  'loop_orders': [[0, 1]],
2026-02-21T10:45:00.2658835Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:45:00.2658938Z  'num_stages': 1,
2026-02-21T10:45:00.2659031Z  'num_warps': 1,
2026-02-21T10:45:00.2659122Z  'pid_type': 'xyz',
2026-02-21T10:45:00.2659235Z  'range_flattens': [None, False],
2026-02-21T10:45:00.2659355Z  'range_multi_buffers': [None, False],
2026-02-21T10:45:00.2659472Z  'range_num_stages': [0, 2],
2026-02-21T10:45:00.2659584Z  'range_unroll_factors': [0, 4],
2026-02-21T10:45:00.2659696Z  'range_warp_specializes': [],
2026-02-21T10:45:00.2659802Z  'waves_per_eu': 2}
2026-02-21T10:45:00.2659913Z [25s] Fitting surrogate: 100 points, 100 targets
2026-02-21T10:45:00.9279902Z [26s] Generation 1 starting: 68 neighbors, 5 active search path(s)
2026-02-21T10:45:11.0542516Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 2.7 configs/s
2026-02-21T10:45:15.4097731Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 16.6 configs/s
2026-02-21T10:45:18.0620516Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 370.4 configs/s
2026-02-21T10:45:18.4832388Z [44s] Generation 1 complete: 
2026-02-21T10:45:18.4832767Z ok=73
2026-02-21T10:45:18.4833008Z min=0.0202
2026-02-21T10:45:18.4833229Z mid=0.0289
2026-02-21T10:45:18.4833429Z max=0.2869
2026-02-21T10:45:18.4834103Z best={'block_sizes': [1, 128, 128],
2026-02-21T10:45:18.4834525Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T10:45:18.4834924Z  'l2_groupings': [1],
2026-02-21T10:45:18.4835342Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:45:18.4835664Z  'loop_orders': [[0, 1]],
2026-02-21T10:45:18.4835948Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:45:18.4836219Z  'num_stages': 1,
2026-02-21T10:45:18.4836457Z  'num_warps': 8,
2026-02-21T10:45:18.4836709Z  'pid_type': 'flat',
2026-02-21T10:45:18.4836965Z  'range_flattens': [None, True],
2026-02-21T10:45:18.4837260Z  'range_multi_buffers': [None, False],
2026-02-21T10:45:18.4837602Z  'range_num_stages': [0, 1],
2026-02-21T10:45:18.4838157Z  'range_unroll_factors': [0, 0],
2026-02-21T10:45:18.4838502Z  'range_warp_specializes': [],
2026-02-21T10:45:18.4838835Z  'waves_per_eu': 4}
2026-02-21T10:45:18.5123885Z [44s] Fitting surrogate: 173 points, 173 targets
2026-02-21T10:45:19.4409112Z [45s] Generation 2 starting: 67 neighbors, 5 active search path(s)
2026-02-21T10:45:28.6248480Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 5.6 configs/s
2026-02-21T10:45:32.9520259Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 67/67 16.0 configs/s
2026-02-21T10:45:36.5054313Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 316.7 configs/s
2026-02-21T10:45:37.0301226Z [62s] Generation 2 complete: 
2026-02-21T10:45:37.0301443Z ok=72
2026-02-21T10:45:37.0301533Z min=0.0218
2026-02-21T10:45:37.0301618Z mid=0.0294
2026-02-21T10:45:37.0301703Z max=0.1891
2026-02-21T10:45:37.0301799Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:45:37.0301961Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:45:37.0302119Z  'l2_groupings': [1],
2026-02-21T10:45:37.0302800Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:45:37.0302925Z  'loop_orders': [[0, 1]],
2026-02-21T10:45:37.0303053Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:45:37.0303157Z  'num_stages': 4,
2026-02-21T10:45:37.0303252Z  'num_warps': 4,
2026-02-21T10:45:37.0303348Z  'pid_type': 'flat',
2026-02-21T10:45:37.0303463Z  'range_flattens': [None, True],
2026-02-21T10:45:37.0303586Z  'range_multi_buffers': [None, None],
2026-02-21T10:45:37.0303704Z  'range_num_stages': [0, 4],
2026-02-21T10:45:37.0303816Z  'range_unroll_factors': [0, 2],
2026-02-21T10:45:37.0303930Z  'range_warp_specializes': [],
2026-02-21T10:45:37.0304049Z  'waves_per_eu': 2}
2026-02-21T10:45:37.1345960Z [62s] Fitting surrogate: 245 points, 245 targets
2026-02-21T10:45:37.8664744Z [63s] Generation 3 starting: 55 neighbors, 5 active search path(s)
2026-02-21T10:45:52.0099009Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 1.0 configs/s
2026-02-21T10:45:56.0940053Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 14.4 configs/s
2026-02-21T10:45:58.2165622Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 465.0 configs/s
2026-02-21T10:45:58.6175312Z [84s] Generation 3 complete: 
2026-02-21T10:45:58.6175736Z ok=60
2026-02-21T10:45:58.6175922Z min=0.0198
2026-02-21T10:45:58.6176118Z mid=0.0242
2026-02-21T10:45:58.6176293Z max=2.2780
2026-02-21T10:45:58.6176497Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:45:58.6176859Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:45:58.6177608Z  'l2_groupings': [1],
2026-02-21T10:45:58.6177849Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:45:58.6178124Z  'loop_orders': [[0, 1]],
2026-02-21T10:45:58.6178361Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:45:58.6178592Z  'num_stages': 3,
2026-02-21T10:45:58.6178793Z  'num_warps': 4,
2026-02-21T10:45:58.6178991Z  'pid_type': 'flat',
2026-02-21T10:45:58.6179223Z  'range_flattens': [None, True],
2026-02-21T10:45:58.6179465Z  'range_multi_buffers': [None, None],
2026-02-21T10:45:58.6179738Z  'range_num_stages': [0, 4],
2026-02-21T10:45:58.6179960Z  'range_unroll_factors': [0, 2],
2026-02-21T10:45:58.6180193Z  'range_warp_specializes': [],
2026-02-21T10:45:58.6180414Z  'waves_per_eu': 2}
2026-02-21T10:45:58.6809750Z [84s] Fitting surrogate: 305 points, 305 targets
2026-02-21T10:45:59.4721136Z [85s] Generation 4 starting: 69 neighbors, 5 active search path(s)
2026-02-21T10:46:10.3629234Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 13.0 configs/s
2026-02-21T10:46:14.9170146Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 70/70 15.9 configs/s
2026-02-21T10:46:18.1384323Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 308.8 configs/s
2026-02-21T10:46:18.6698366Z [104s] Generation 4 complete: 
2026-02-21T10:46:18.6703021Z ok=74
2026-02-21T10:46:18.6703299Z min=0.0200
2026-02-21T10:46:18.6703455Z mid=0.0242
2026-02-21T10:46:18.6703606Z max=0.0730
2026-02-21T10:46:18.6703768Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:46:18.6704094Z  'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T10:46:18.6704399Z  'l2_groupings': [64],
2026-02-21T10:46:18.6704580Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:46:18.6704786Z  'loop_orders': [[0, 1]],
2026-02-21T10:46:18.6704965Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:46:18.6705139Z  'num_stages': 3,
2026-02-21T10:46:18.6705297Z  'num_warps': 4,
2026-02-21T10:46:18.6705450Z  'pid_type': 'flat',
2026-02-21T10:46:18.6705625Z  'range_flattens': [None, True],
2026-02-21T10:46:18.6705834Z  'range_multi_buffers': [None, False],
2026-02-21T10:46:18.6706043Z  'range_num_stages': [0, 3],
2026-02-21T10:46:18.6706225Z  'range_unroll_factors': [0, 4],
2026-02-21T10:46:18.6706415Z  'range_warp_specializes': [],
2026-02-21T10:46:18.6706604Z  'waves_per_eu': 1}
2026-02-21T10:46:18.7751267Z [104s] Fitting surrogate: 379 points, 379 targets
2026-02-21T10:46:19.8463582Z [105s] Generation 5 starting: 52 neighbors, 4 active search path(s)
2026-02-21T10:46:29.2456626Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 2.6 configs/s
2026-02-21T10:46:32.6030047Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 16.2 configs/s
2026-02-21T10:46:34.7153173Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 467.0 configs/s
2026-02-21T10:46:35.1105013Z [120s] Generation 5 complete: 
2026-02-21T10:46:35.1105400Z ok=56
2026-02-21T10:46:35.1105627Z min=0.0198
2026-02-21T10:46:35.1105843Z mid=0.0232
2026-02-21T10:46:35.1106052Z max=0.0706
2026-02-21T10:46:35.1106289Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:46:35.1106712Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:46:35.1107147Z  'l2_groupings': [1],
2026-02-21T10:46:35.1107430Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:46:35.1108128Z  'loop_orders': [[0, 1]],
2026-02-21T10:46:35.1108411Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:46:35.1108683Z  'num_stages': 3,
2026-02-21T10:46:35.1108918Z  'num_warps': 4,
2026-02-21T10:46:35.1109176Z  'pid_type': 'flat',
2026-02-21T10:46:35.1109441Z  'range_flattens': [None, True],
2026-02-21T10:46:35.1109758Z  'range_multi_buffers': [None, None],
2026-02-21T10:46:35.1110068Z  'range_num_stages': [0, 3],
2026-02-21T10:46:35.1110342Z  'range_unroll_factors': [0, 2],
2026-02-21T10:46:35.1110640Z  'range_warp_specializes': [],
2026-02-21T10:46:35.1110916Z  'waves_per_eu': 2}
2026-02-21T10:46:35.1702295Z [120s] Fitting surrogate: 435 points, 435 targets
2026-02-21T10:46:36.2904778Z [121s] Generation 6 starting: 55 neighbors, 4 active search path(s)
2026-02-21T10:46:44.4055897Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 9.2 configs/s
2026-02-21T10:46:48.0412741Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 16.0 configs/s
2026-02-21T10:46:50.3449868Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 429.6 configs/s
2026-02-21T10:46:50.7618453Z [136s] Generation 6 complete: 
2026-02-21T10:46:50.7618778Z ok=59
2026-02-21T10:46:50.7618964Z min=0.0188
2026-02-21T10:46:50.7619138Z mid=0.0214
2026-02-21T10:46:50.7619311Z max=0.3548
2026-02-21T10:46:50.7619498Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:46:50.7619851Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:46:50.7620178Z  'l2_groupings': [1],
2026-02-21T10:46:50.7620413Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:46:50.7620674Z  'loop_orders': [[0, 1]],
2026-02-21T10:46:50.7620905Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:46:50.7621132Z  'num_stages': 3,
2026-02-21T10:46:50.7621325Z  'num_warps': 4,
2026-02-21T10:46:50.7621680Z  'pid_type': 'flat',
2026-02-21T10:46:50.7621900Z  'range_flattens': [None, True],
2026-02-21T10:46:50.7622180Z  'range_multi_buffers': [None, None],
2026-02-21T10:46:50.7622442Z  'range_num_stages': [0, 3],
2026-02-21T10:46:50.7622673Z  'range_unroll_factors': [0, 2],
2026-02-21T10:46:50.7622919Z  'range_warp_specializes': [],
2026-02-21T10:46:50.7623154Z  'waves_per_eu': 3}
2026-02-21T10:46:50.8339878Z [136s] Fitting surrogate: 494 points, 494 targets
2026-02-21T10:46:51.9942435Z [137s] Generation 7 starting: 59 neighbors, 4 active search path(s)
2026-02-21T10:47:01.5057177Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 6.8 configs/s
2026-02-21T10:47:05.3374455Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 16.0 configs/s
2026-02-21T10:47:07.4740702Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 463.7 configs/s
2026-02-21T10:47:07.9056876Z [153s] Generation 7 complete: 
2026-02-21T10:47:07.9057316Z ok=63
2026-02-21T10:47:07.9057530Z min=0.0186
2026-02-21T10:47:07.9057781Z mid=0.0247
2026-02-21T10:47:07.9057982Z max=0.1292
2026-02-21T10:47:07.9058213Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:47:07.9058623Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:47:07.9059043Z  'l2_groupings': [1],
2026-02-21T10:47:07.9059326Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:47:07.9059645Z  'loop_orders': [[0, 1]],
2026-02-21T10:47:07.9059926Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:47:07.9060199Z  'num_stages': 3,
2026-02-21T10:47:07.9060910Z  'num_warps': 4,
2026-02-21T10:47:07.9061118Z  'pid_type': 'flat',
2026-02-21T10:47:07.9061345Z  'range_flattens': [None, True],
2026-02-21T10:47:07.9061610Z  'range_multi_buffers': [None, None],
2026-02-21T10:47:07.9061884Z  'range_num_stages': [0, 2],
2026-02-21T10:47:07.9062120Z  'range_unroll_factors': [0, 2],
2026-02-21T10:47:07.9062391Z  'range_warp_specializes': [],
2026-02-21T10:47:07.9062629Z  'waves_per_eu': 3}
2026-02-21T10:47:07.9703495Z [153s] Fitting surrogate: 557 points, 557 targets
2026-02-21T10:47:08.9901353Z [154s] Generation 8 starting: 45 neighbors, 3 active search path(s)
2026-02-21T10:47:16.6652594Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 4.5 configs/s
2026-02-21T10:47:19.6064007Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 45/45 16.1 configs/s
2026-02-21T10:47:21.3626125Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 559.3 configs/s
2026-02-21T10:47:21.7391761Z [167s] Generation 8 complete: 
2026-02-21T10:47:21.7392214Z ok=48
2026-02-21T10:47:21.7392435Z min=0.0188
2026-02-21T10:47:21.7392654Z mid=0.0218
2026-02-21T10:47:21.7392858Z max=0.1299
2026-02-21T10:47:21.7393094Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:47:21.7393542Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:47:21.7393941Z  'l2_groupings': [1],
2026-02-21T10:47:21.7394221Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:47:21.7394571Z  'loop_orders': [[0, 1]],
2026-02-21T10:47:21.7394848Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:47:21.7395125Z  'num_stages': 3,
2026-02-21T10:47:21.7395356Z  'num_warps': 4,
2026-02-21T10:47:21.7396026Z  'pid_type': 'flat',
2026-02-21T10:47:21.7396294Z  'range_flattens': [None, True],
2026-02-21T10:47:21.7396593Z  'range_multi_buffers': [None, None],
2026-02-21T10:47:21.7396903Z  'range_num_stages': [0, 2],
2026-02-21T10:47:21.7397180Z  'range_unroll_factors': [0, 2],
2026-02-21T10:47:21.7397491Z  'range_warp_specializes': [],
2026-02-21T10:47:21.7397771Z  'waves_per_eu': 3}
2026-02-21T10:47:21.7968366Z [167s] Fitting surrogate: 605 points, 605 targets
2026-02-21T10:47:22.1455859Z [167s] Generation 9 starting: 22 neighbors, 2 active search path(s)
2026-02-21T10:47:26.9678138Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22/22 4.7 configs/s
2026-02-21T10:47:28.4536933Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 22/22 16.4 configs/s
2026-02-21T10:47:29.1965431Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1296.7 configs/s
2026-02-21T10:47:29.4830539Z [175s] Generation 9 complete: 
2026-02-21T10:47:29.4830839Z ok=24
2026-02-21T10:47:29.4830947Z min=0.0186
2026-02-21T10:47:29.4831062Z mid=0.0244
2026-02-21T10:47:29.4831167Z max=0.1235
2026-02-21T10:47:29.4831290Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:47:29.4831502Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:47:29.4831708Z  'l2_groupings': [1],
2026-02-21T10:47:29.4831849Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:47:29.4832003Z  'loop_orders': [[0, 1]],
2026-02-21T10:47:29.4832138Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:47:29.4832273Z  'num_stages': 3,
2026-02-21T10:47:29.4832877Z  'num_warps': 4,
2026-02-21T10:47:29.4832997Z  'pid_type': 'flat',
2026-02-21T10:47:29.4833319Z  'range_flattens': [None, True],
2026-02-21T10:47:29.4833862Z  'range_multi_buffers': [None, None],
2026-02-21T10:47:29.4833996Z  'range_num_stages': [0, 2],
2026-02-21T10:47:29.4834110Z  'range_unroll_factors': [0, 2],
2026-02-21T10:47:29.4834235Z  'range_warp_specializes': [],
2026-02-21T10:47:29.4834350Z  'waves_per_eu': 3}
2026-02-21T10:47:29.5000597Z [175s] Fitting surrogate: 629 points, 629 targets
2026-02-21T10:47:29.7568746Z [175s] Generation 10 starting: 16 neighbors, 1 active search path(s)
2026-02-21T10:47:33.0168509Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.4 configs/s
2026-02-21T10:47:34.1295632Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.5 configs/s
2026-02-21T10:47:35.1024484Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 992.7 configs/s
2026-02-21T10:47:35.4210674Z [181s] Generation 10 complete: 
2026-02-21T10:47:35.4211002Z ok=18
2026-02-21T10:47:35.4211247Z min=0.0202
2026-02-21T10:47:35.4211496Z mid=0.0224
2026-02-21T10:47:35.4211700Z max=0.0303
2026-02-21T10:47:35.4214132Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:47:35.4215008Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:47:35.4215416Z  'l2_groupings': [1],
2026-02-21T10:47:35.4215701Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:47:35.4216024Z  'loop_orders': [[0, 1]],
2026-02-21T10:47:35.4216305Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:47:35.4216581Z  'num_stages': 3,
2026-02-21T10:47:35.4216816Z  'num_warps': 4,
2026-02-21T10:47:35.4217043Z  'pid_type': 'flat',
2026-02-21T10:47:35.4217307Z  'range_flattens': [None, True],
2026-02-21T10:47:35.4217605Z  'range_multi_buffers': [None, None],
2026-02-21T10:47:35.4217910Z  'range_num_stages': [0, 2],
2026-02-21T10:47:35.4218184Z  'range_unroll_factors': [0, 2],
2026-02-21T10:47:35.4218622Z  'range_warp_specializes': [],
2026-02-21T10:47:35.4218826Z  'waves_per_eu': 3}
2026-02-21T10:47:35.4462042Z [181s] Fitting surrogate: 647 points, 647 targets
2026-02-21T10:47:35.7119063Z [181s] Generation 11 starting: 15 neighbors, 1 active search path(s)
2026-02-21T10:47:38.4365293Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 6.4 configs/s
2026-02-21T10:47:39.5394358Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.7 configs/s
2026-02-21T10:47:40.0720682Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1810.2 configs/s
2026-02-21T10:47:40.3696337Z [185s] Generation 11 complete: 
2026-02-21T10:47:40.3696661Z ok=17
2026-02-21T10:47:40.3696868Z min=0.0196
2026-02-21T10:47:40.3697080Z mid=0.0262
2026-02-21T10:47:40.3697279Z max=0.0739
2026-02-21T10:47:40.3697508Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:47:40.3697933Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:47:40.3698330Z  'l2_groupings': [1],
2026-02-21T10:47:40.3698625Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:47:40.3698947Z  'loop_orders': [[0, 1]],
2026-02-21T10:47:40.3699229Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:47:40.3699495Z  'num_stages': 3,
2026-02-21T10:47:40.3699736Z  'num_warps': 4,
2026-02-21T10:47:40.3699965Z  'pid_type': 'flat',
2026-02-21T10:47:40.3700228Z  'range_flattens': [None, True],
2026-02-21T10:47:40.3700526Z  'range_multi_buffers': [None, None],
2026-02-21T10:47:40.3700833Z  'range_num_stages': [0, 2],
2026-02-21T10:47:40.3701112Z  'range_unroll_factors': [0, 2],
2026-02-21T10:47:40.3701408Z  'range_warp_specializes': [],
2026-02-21T10:47:40.3701683Z  'waves_per_eu': 3}
2026-02-21T10:47:40.3871707Z [185s] Fitting surrogate: 664 points, 664 targets
2026-02-21T10:47:40.6698643Z [186s] Generation 12 starting: 16 neighbors, 1 active search path(s)
2026-02-21T10:47:45.4007810Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 1.8 configs/s
2026-02-21T10:47:46.5325066Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.2 configs/s
2026-02-21T10:47:46.7680589Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 4380.2 configs/s
2026-02-21T10:47:46.9751484Z [192s] Generation 12 complete: 
2026-02-21T10:47:46.9751803Z ok=18
2026-02-21T10:47:46.9752008Z min=0.0184
2026-02-21T10:47:46.9752221Z mid=0.0302
2026-02-21T10:47:46.9752419Z max=0.3199
2026-02-21T10:47:46.9752924Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:47:46.9753325Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:47:46.9753721Z  'l2_groupings': [1],
2026-02-21T10:47:46.9753993Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:47:46.9754314Z  'loop_orders': [[0, 1]],
2026-02-21T10:47:46.9754597Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:47:46.9754865Z  'num_stages': 3,
2026-02-21T10:47:46.9755109Z  'num_warps': 4,
2026-02-21T10:47:46.9755336Z  'pid_type': 'flat',
2026-02-21T10:47:46.9755607Z  'range_flattens': [None, True],
2026-02-21T10:47:46.9755905Z  'range_multi_buffers': [None, None],
2026-02-21T10:47:46.9756216Z  'range_num_stages': [0, 2],
2026-02-21T10:47:46.9756569Z  'range_unroll_factors': [0, 2],
2026-02-21T10:47:46.9756822Z  'range_warp_specializes': [],
2026-02-21T10:47:46.9757060Z  'waves_per_eu': 3}
2026-02-21T10:47:46.9836098Z [192s] Fitting surrogate: 682 points, 682 targets
2026-02-21T10:47:47.2557375Z [192s] Generation 13 starting: 16 neighbors, 1 active search path(s)
2026-02-21T10:47:50.7445678Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 3.8 configs/s
2026-02-21T10:47:51.8598702Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.5 configs/s
2026-02-21T10:47:52.6644778Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1196.2 configs/s
2026-02-21T10:47:52.9635338Z [198s] Generation 13 complete: 
2026-02-21T10:47:52.9635682Z ok=18
2026-02-21T10:47:52.9635892Z min=0.0187
2026-02-21T10:47:52.9636045Z mid=0.0208
2026-02-21T10:47:52.9636251Z max=0.2300
2026-02-21T10:47:52.9636484Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:47:52.9636905Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:47:52.9637295Z  'l2_groupings': [1],
2026-02-21T10:47:52.9637577Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:47:52.9637891Z  'loop_orders': [[0, 1]],
2026-02-21T10:47:52.9638181Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:47:52.9638450Z  'num_stages': 3,
2026-02-21T10:47:52.9638697Z  'num_warps': 4,
2026-02-21T10:47:52.9638923Z  'pid_type': 'flat',
2026-02-21T10:47:52.9639185Z  'range_flattens': [None, True],
2026-02-21T10:47:52.9639494Z  'range_multi_buffers': [None, None],
2026-02-21T10:47:52.9639798Z  'range_num_stages': [0, 2],
2026-02-21T10:47:52.9640081Z  'range_unroll_factors': [0, 2],
2026-02-21T10:47:52.9640372Z  'range_warp_specializes': [],
2026-02-21T10:47:52.9640654Z  'waves_per_eu': 3}
2026-02-21T10:47:52.9832845Z [198s] Fitting surrogate: 700 points, 700 targets
2026-02-21T10:47:53.2558942Z [198s] Generation 14 starting: 15 neighbors, 1 active search path(s)
2026-02-21T10:47:55.8819354Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 9.3 configs/s
2026-02-21T10:47:56.9155065Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 16.9 configs/s
2026-02-21T10:47:57.5754939Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1454.5 configs/s
2026-02-21T10:47:57.8575869Z [203s] Generation 14 complete: 
2026-02-21T10:47:57.8576206Z ok=17
2026-02-21T10:47:57.8576416Z min=0.0187
2026-02-21T10:47:57.8576636Z mid=0.0197
2026-02-21T10:47:57.8576837Z max=0.0698
2026-02-21T10:47:57.8577088Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:47:57.8577494Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:47:57.8578104Z  'l2_groupings': [1],
2026-02-21T10:47:57.8578387Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:47:57.8578702Z  'loop_orders': [[0, 1]],
2026-02-21T10:47:57.8578981Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:47:57.8579261Z  'num_stages': 3,
2026-02-21T10:47:57.8579496Z  'num_warps': 4,
2026-02-21T10:47:57.8579724Z  'pid_type': 'flat',
2026-02-21T10:47:57.8579985Z  'range_flattens': [None, True],
2026-02-21T10:47:57.8580282Z  'range_multi_buffers': [None, None],
2026-02-21T10:47:57.8580598Z  'range_num_stages': [0, 2],
2026-02-21T10:47:57.8580868Z  'range_unroll_factors': [0, 2],
2026-02-21T10:47:57.8581168Z  'range_warp_specializes': [],
2026-02-21T10:47:57.8581447Z  'waves_per_eu': 3}
2026-02-21T10:47:57.8732974Z [203s] Fitting surrogate: 717 points, 717 targets
2026-02-21T10:47:58.1642398Z [203s] Generation 15 starting: 16 neighbors, 1 active search path(s)
2026-02-21T10:48:01.0895327Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 5.7 configs/s
2026-02-21T10:48:02.1855313Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.8 configs/s
2026-02-21T10:48:03.5215395Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1128.3 configs/s
2026-02-21T10:48:03.8196838Z [209s] Generation 15 complete: 
2026-02-21T10:48:03.8197171Z ok=18
2026-02-21T10:48:03.8197380Z min=0.0187
2026-02-21T10:48:03.8197596Z mid=0.0201
2026-02-21T10:48:03.8197815Z max=0.0727
2026-02-21T10:48:03.8198058Z best={'block_sizes': [1, 64, 64],
2026-02-21T10:48:03.8198470Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:48:03.8198859Z  'l2_groupings': [1],
2026-02-21T10:48:03.8199149Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:48:03.8199461Z  'loop_orders': [[0, 1]],
2026-02-21T10:48:03.8199984Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:48:03.8200256Z  'num_stages': 3,
2026-02-21T10:48:03.8200480Z  'num_warps': 4,
2026-02-21T10:48:03.8200730Z  'pid_type': 'flat',
2026-02-21T10:48:03.8200993Z  'range_flattens': [None, True],
2026-02-21T10:48:03.8201296Z  'range_multi_buffers': [None, None],
2026-02-21T10:48:03.8201603Z  'range_num_stages': [0, 2],
2026-02-21T10:48:03.8201888Z  'range_unroll_factors': [0, 2],
2026-02-21T10:48:03.8202175Z  'range_warp_specializes': [],
2026-02-21T10:48:03.8202462Z  'waves_per_eu': 3}
2026-02-21T10:48:03.8443995Z [209s] Fitting surrogate: 735 points, 735 targets
2026-02-21T10:48:03.9792831Z [209s] Autotuning complete in 209.6s after searching 694 configs.
2026-02-21T10:48:03.9793177Z One can hardcode the best config and skip autotuning with:
2026-02-21T10:48:03.9794344Z     @helion.kernel(config=helion.Config(block_sizes=[1, 64, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T10:48:03.9795386Z 
2026-02-21T10:48:03.9795665Z [209s] Code of selected kernel: /tmp/torchinductor_root/yz/cyz7bheexsaqjs64c2glkk7hmjd25c3nikx5d3o6ma22jc6ttott.py
2026-02-21T10:48:04.4578206Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T10:48:04.4579575Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 64, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T10:48:04.4581040Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T10:48:04.4581467Z WARNING:tritonbench.utils.triton_op:Completed input ID 0:
2026-02-21T10:48:04.4581729Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T10:48:04.4581936Z ------------------------------------------
2026-02-21T10:48:04.4582129Z (4, 48, 128, 128, 128)
2026-02-21T10:48:04.4582227Z 
2026-02-21T10:48:04.4587317Z  17%|█▋        | 1/6 [03:38<18:10, 218.07s/it]WARNING:tritonbench.utils.triton_op:Running input ID 1:
2026-02-21T10:48:04.4587606Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T10:48:04.4587791Z ------------------------------------------
2026-02-21T10:48:04.4587962Z (4, 48, 256, 256, 128)
2026-02-21T10:48:04.4589880Z INFO:tritonbench.utils.triton_op:Took 0.07ms to get benchmark function for aten
2026-02-21T10:48:05.7295342Z INFO:tritonbench.utils.triton_op:Took 2.42ms to get benchmark function for flex_attention
2026-02-21T10:48:07.1547255Z WARNING:__main__:Input tensor metadata:
2026-02-21T10:48:07.1547535Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T10:48:07.1547739Z               'dtype': 'torch.bfloat16',
2026-02-21T10:48:07.1547892Z               'shape': (4, 48, 256, 128),
2026-02-21T10:48:07.1548047Z               'stride': (1572864, 32768, 128, 1)},
2026-02-21T10:48:07.1548647Z             { 'device': 'cuda:0',
2026-02-21T10:48:07.1548788Z               'dtype': 'torch.bfloat16',
2026-02-21T10:48:07.1548932Z               'shape': (4, 48, 256, 128),
2026-02-21T10:48:07.1549078Z               'stride': (1572864, 32768, 128, 1)},
2026-02-21T10:48:07.1549237Z             { 'device': 'cuda:0',
2026-02-21T10:48:07.1549369Z               'dtype': 'torch.bfloat16',
2026-02-21T10:48:07.1549507Z               'shape': (4, 48, 256, 128),
2026-02-21T10:48:07.1549653Z               'stride': (1572864, 32768, 128, 1)}),
2026-02-21T10:48:07.1549795Z   'kwargs': {}}
2026-02-21T10:48:07.1569653Z INFO:tritonbench.utils.triton_op:Took 2.78ms to get benchmark function for helion_attention
2026-02-21T10:48:07.4046936Z [0s] Autotune random seed: 2144140282
2026-02-21T10:48:07.5643868Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T10:48:25.4723692Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 3.7 configs/s
2026-02-21T10:48:27.7923620Z /tmp/torchinductor_root/re/creryn3ijcfe5ngnsdypjphn22cnx4fiylxrfemcef4ips7yanzk.py:55:129: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T10:48:27.7936866Z         k = tl.load(k_view + (indices_0[:, None, None] * 32768 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None)
2026-02-21T10:48:27.7937291Z                                                                                                                                 ^
2026-02-21T10:48:27.7938077Z /tmp/torchinductor_root/re/creryn3ijcfe5ngnsdypjphn22cnx4fiylxrfemcef4ips7yanzk.py:57:141: note: - use: %132 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x64xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x64xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>>
2026-02-21T10:48:27.7938757Z 
2026-02-21T10:48:27.7939128Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T10:48:27.7939948Z                                                                                                                                             ^
2026-02-21T10:48:27.7940158Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T10:48:27.7940422Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:48:27.7940781Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:48:27.7941231Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:48:27.7941576Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T10:48:27.7941919Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T10:48:27.7942236Z #blocked5 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}>
2026-02-21T10:48:27.7942539Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}>
2026-02-21T10:48:27.7942843Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:48:27.7943156Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:48:27.7943472Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:48:27.7943850Z #blocked10 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T10:48:27.7944180Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T10:48:27.7944701Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T10:48:27.7945097Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T10:48:27.7945335Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T10:48:27.7945454Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T10:48:27.7945574Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T10:48:27.7945740Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked>
2026-02-21T10:48:27.7945936Z     %cst_0 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked>
2026-02-21T10:48:27.7946120Z     %cst_1 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked>
2026-02-21T10:48:27.7946300Z     %cst_2 = arith.constant dense<256> : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:48:27.7946485Z     %cst_3 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:48:27.7946657Z     %cst_4 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:48:27.7946809Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T10:48:27.7946927Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T10:48:27.7947045Z     %c3072_i32 = arith.constant 3072 : i32
2026-02-21T10:48:27.7947192Z     %cst_5 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked1>
2026-02-21T10:48:27.7947389Z     %cst_6 = arith.constant dense<128> : tensor<1x64x1xi32, #blocked2>
2026-02-21T10:48:27.7947579Z     %cst_7 = arith.constant dense<0.127517432> : tensor<1x2x64xf32, #blocked>
2026-02-21T10:48:27.7947778Z     %cst_8 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7947973Z     %cst_9 = arith.constant dense<0.000000e+00> : tensor<2x64xf32, #blocked4>
2026-02-21T10:48:27.7948160Z     %cst_10 = arith.constant dense<128> : tensor<1x1x64xi32, #blocked>
2026-02-21T10:48:27.7948331Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T10:48:27.7948486Z     %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked>
2026-02-21T10:48:27.7948687Z     %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7948880Z     %cst_13 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7949038Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T10:48:27.7949153Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T10:48:27.7949266Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T10:48:27.7949406Z     %0 = tt.get_program_id x : i32
2026-02-21T10:48:27.7949518Z     %1 = arith.divsi %0, %c3072_i32 : i32
2026-02-21T10:48:27.7949635Z     %2 = arith.muli %1, %c16_i32 : i32
2026-02-21T10:48:27.7949746Z     %3 = arith.subi %c128_i32, %2 : i32
2026-02-21T10:48:27.7949861Z     %4 = arith.minsi %3, %c16_i32 : i32
2026-02-21T10:48:27.7949974Z     %5 = arith.remsi %0, %c3072_i32 : i32
2026-02-21T10:48:27.7950089Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T10:48:27.7950201Z     %7 = arith.addi %2, %6 : i32
2026-02-21T10:48:27.7950306Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T10:48:27.7950417Z     %9 = arith.muli %7, %c2_i32 : i32
2026-02-21T10:48:27.7950571Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked5>
2026-02-21T10:48:27.7950753Z     %11 = tt.splat %9 : i32 -> tensor<2xi32, #blocked5>
2026-02-21T10:48:27.7950903Z     %12 = arith.addi %11, %10 : tensor<2xi32, #blocked5>
2026-02-21T10:48:27.7951078Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked5>
2026-02-21T10:48:27.7951248Z     %14 = arith.extsi %8 : i32 to i64
2026-02-21T10:48:27.7951357Z     %15 = arith.extsi %9 : i32 to i64
2026-02-21T10:48:27.7951532Z     %16 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:48:27.7951698Z     %17 = arith.muli %14, %c32768_i64 : i64
2026-02-21T10:48:27.7951838Z     %18 = tt.splat %17 : i64 -> tensor<1x2x128xi64, #blocked>
2026-02-21T10:48:27.7951990Z     %19 = tt.splat %15 : i64 -> tensor<2xi64, #blocked5>
2026-02-21T10:48:27.7952163Z     %20 = arith.extsi %10 : tensor<2xi32, #blocked5> to tensor<2xi64, #blocked5>
2026-02-21T10:48:27.7952341Z     %21 = arith.addi %19, %20 : tensor<2xi64, #blocked5>
2026-02-21T10:48:27.7952585Z     %22 = ttg.convert_layout %21 : tensor<2xi64, #blocked5> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:48:27.7952908Z     %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi64, #blocked6>
2026-02-21T10:48:27.7953187Z     %24 = ttg.convert_layout %23 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #blocked3>
2026-02-21T10:48:27.7953468Z     %25 = ttg.convert_layout %24 : tensor<1x2xi64, #blocked3> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T10:48:27.7953802Z     %26 = tt.expand_dims %25 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi64, #blocked7>
2026-02-21T10:48:27.7954097Z     %27 = ttg.convert_layout %26 : tensor<1x2x1xi64, #blocked7> -> tensor<1x2x1xi64, #blocked1>
2026-02-21T10:48:27.7954306Z     %28 = arith.muli %27, %cst_4 : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:48:27.7954501Z     %29 = tt.broadcast %28 : tensor<1x2x1xi64, #blocked1> -> tensor<1x2x128xi64, #blocked1>
2026-02-21T10:48:27.7954745Z     %30 = ttg.convert_layout %29 : tensor<1x2x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked>
2026-02-21T10:48:27.7954976Z     %31 = arith.extsi %13 : tensor<128xi32, #blocked5> to tensor<128xi64, #blocked5>
2026-02-21T10:48:27.7955243Z     %32 = ttg.convert_layout %31 : tensor<128xi64, #blocked5> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:48:27.7955568Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi64, #blocked6>
2026-02-21T10:48:27.7955852Z     %34 = ttg.convert_layout %33 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #blocked4>
2026-02-21T10:48:27.7956150Z     %35 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked4> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T10:48:27.7956490Z     %36 = tt.expand_dims %35 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi64, #blocked8>
2026-02-21T10:48:27.7956787Z     %37 = ttg.convert_layout %36 : tensor<1x1x128xi64, #blocked8> -> tensor<1x1x128xi64, #blocked>
2026-02-21T10:48:27.7957028Z     %38 = tt.broadcast %37 : tensor<1x1x128xi64, #blocked> -> tensor<1x2x128xi64, #blocked>
2026-02-21T10:48:27.7957238Z     %39 = arith.addi %30, %38 : tensor<1x2x128xi64, #blocked>
2026-02-21T10:48:27.7957395Z     %40 = arith.addi %18, %39 : tensor<1x2x128xi64, #blocked>
2026-02-21T10:48:27.7957598Z     %41 = tt.addptr %16, %40 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>, tensor<1x2x128xi64, #blocked>
2026-02-21T10:48:27.7957786Z     %42 = arith.cmpi sge, %14, %c0_i64 : i64
2026-02-21T10:48:27.7957915Z     %43 = arith.cmpi slt, %14, %c192_i64 : i64
2026-02-21T10:48:27.7958036Z     %44 = arith.andi %42, %43 : i1
2026-02-21T10:48:27.7958177Z     %45 = arith.cmpi sge, %27, %cst_3 : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:48:27.7958347Z     %46 = arith.cmpi slt, %27, %cst_2 : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:48:27.7958512Z     %47 = arith.andi %45, %46 : tensor<1x2x1xi1, #blocked1>
2026-02-21T10:48:27.7958666Z     %48 = tt.splat %44 : i1 -> tensor<1x2x1xi1, #blocked1>
2026-02-21T10:48:27.7958815Z     %49 = arith.andi %48, %47 : tensor<1x2x1xi1, #blocked1>
2026-02-21T10:48:27.7959005Z     %50 = tt.broadcast %49 : tensor<1x2x1xi1, #blocked1> -> tensor<1x2x128xi1, #blocked1>
2026-02-21T10:48:27.7959257Z     %51 = ttg.convert_layout %50 : tensor<1x2x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked>
2026-02-21T10:48:27.7959468Z     %52 = arith.cmpi sge, %37, %cst_1 : tensor<1x1x128xi64, #blocked>
2026-02-21T10:48:27.7959638Z     %53 = arith.cmpi slt, %37, %cst_0 : tensor<1x1x128xi64, #blocked>
2026-02-21T10:48:27.7959801Z     %54 = arith.andi %52, %53 : tensor<1x1x128xi1, #blocked>
2026-02-21T10:48:27.7959987Z     %55 = tt.broadcast %54 : tensor<1x1x128xi1, #blocked> -> tensor<1x2x128xi1, #blocked>
2026-02-21T10:48:27.7960180Z     %56 = arith.andi %51, %55 : tensor<1x2x128xi1, #blocked>
2026-02-21T10:48:27.7960343Z     %57 = tt.load %41, %56, %cst : tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:48:27.7960554Z     %58 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #blocked5>
2026-02-21T10:48:27.7960723Z     %59 = arith.muli %8, %c32768_i32 : i32
2026-02-21T10:48:27.7960943Z     %60 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:48:27.7961270Z     %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6>
2026-02-21T10:48:27.7961558Z     %62 = ttg.convert_layout %61 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4>
2026-02-21T10:48:27.7961840Z     %63 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T10:48:27.7962177Z     %64 = tt.expand_dims %63 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x128x1xi32, #blocked9>
2026-02-21T10:48:27.7962480Z     %65 = ttg.convert_layout %64 : tensor<1x128x1xi32, #blocked9> -> tensor<1x128x1xi32, #blocked2>
2026-02-21T10:48:27.7962752Z     %66 = tt.splat %59 : i32 -> tensor<1x128x1xi32, #blocked2>
2026-02-21T10:48:27.7962915Z     %67 = arith.addi %66, %65 : tensor<1x128x1xi32, #blocked2>
2026-02-21T10:48:27.7963115Z     %68 = tt.broadcast %67 : tensor<1x128x1xi32, #blocked2> -> tensor<1x128x64xi32, #blocked2>
2026-02-21T10:48:27.7963365Z     %69 = ttg.convert_layout %68 : tensor<1x128x64xi32, #blocked2> -> tensor<1x128x64xi32, #blocked>
2026-02-21T10:48:27.7963598Z     %70 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x64x!tt.ptr<bf16>, #blocked>
2026-02-21T10:48:27.7963832Z     %71 = tt.reshape %57 : tensor<1x2x128xbf16, #blocked> -> tensor<2x128xbf16, #blocked4>
2026-02-21T10:48:27.7964026Z     %72 = tt.splat %59 : i32 -> tensor<1x64x1xi32, #blocked2>
2026-02-21T10:48:27.7964265Z     %73 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T10:48:27.7964602Z     %74 = tt.expand_dims %73 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8>
2026-02-21T10:48:27.7964920Z     %75 = ttg.convert_layout %74 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked>
2026-02-21T10:48:27.7965158Z     %76 = tt.broadcast %75 : tensor<1x1x128xi32, #blocked> -> tensor<1x64x128xi32, #blocked>
2026-02-21T10:48:27.7965376Z     %77 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x64x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:48:27.7965757Z     %78:3 = scf.for %arg4 = %c0_i32 to %c256_i32 step %c64_i32 iter_args(%arg5 = %cst_13, %arg6 = %cst_12, %arg7 = %cst_11) -> (tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>)  : i32 {
2026-02-21T10:48:27.7966115Z       %108 = tt.splat %arg4 : i32 -> tensor<64xi32, #blocked5>
2026-02-21T10:48:27.7966273Z       %109 = arith.addi %108, %58 : tensor<64xi32, #blocked5>
2026-02-21T10:48:27.7966510Z       %110 = ttg.convert_layout %109 : tensor<64xi32, #blocked5> -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:48:27.7966840Z       %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x64xi32, #blocked6>
2026-02-21T10:48:27.7967132Z       %112 = ttg.convert_layout %111 : tensor<1x64xi32, #blocked6> -> tensor<1x64xi32, #blocked4>
2026-02-21T10:48:27.7967447Z       %113 = ttg.convert_layout %112 : tensor<1x64xi32, #blocked4> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T10:48:27.7967791Z       %114 = tt.expand_dims %113 {axis = 1 : i32} : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x64xi32, #blocked8>
2026-02-21T10:48:27.7968090Z       %115 = ttg.convert_layout %114 : tensor<1x1x64xi32, #blocked8> -> tensor<1x1x64xi32, #blocked>
2026-02-21T10:48:27.7968305Z       %116 = arith.muli %115, %cst_10 : tensor<1x1x64xi32, #blocked>
2026-02-21T10:48:27.7968505Z       %117 = tt.broadcast %116 : tensor<1x1x64xi32, #blocked> -> tensor<1x128x64xi32, #blocked>
2026-02-21T10:48:27.7968729Z       %118 = arith.addi %69, %117 : tensor<1x128x64xi32, #blocked>
2026-02-21T10:48:27.7968944Z       %119 = tt.addptr %70, %118 : tensor<1x128x64x!tt.ptr<bf16>, #blocked>, tensor<1x128x64xi32, #blocked>
2026-02-21T10:48:27.7969161Z       %120 = tt.load %119 : tensor<1x128x64x!tt.ptr<bf16>, #blocked>
2026-02-21T10:48:27.7969366Z       %121 = tt.reshape %120 : tensor<1x128x64xbf16, #blocked> -> tensor<128x64xbf16, #blocked4>
2026-02-21T10:48:27.7969663Z       %122 = ttg.convert_layout %71 : tensor<2x128xbf16, #blocked4> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked4}>>
2026-02-21T10:48:27.7970020Z       %123 = ttg.convert_layout %121 : tensor<128x64xbf16, #blocked4> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked4}>>
2026-02-21T10:48:27.7970324Z       %124 = ttg.convert_layout %cst_9 : tensor<2x64xf32, #blocked4> -> tensor<2x64xf32, #blocked4>
2026-02-21T10:48:27.7970735Z       %125 = tt.dot %122, %123, %124, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked4}>> * tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked4}>> -> tensor<2x64xf32, #blocked4>
2026-02-21T10:48:27.7971126Z       %126 = tt.reshape %125 : tensor<2x64xf32, #blocked4> -> tensor<1x2x64xf32, #blocked>
2026-02-21T10:48:27.7971358Z       %127 = arith.truncf %126 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked>
2026-02-21T10:48:27.7971591Z       %128 = arith.extf %127 : tensor<1x2x64xbf16, #blocked> to tensor<1x2x64xf32, #blocked>
2026-02-21T10:48:27.7971781Z       %129 = "tt.reduce"(%128) <{axis = 2 : i32}> ({
2026-02-21T10:48:27.7971922Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T10:48:27.7972046Z         %182 = arith.maxnumf %arg8, %arg9 : f32
2026-02-21T10:48:27.7972169Z         tt.reduce.return %182 : f32
2026-02-21T10:48:27.7972358Z       }) : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T10:48:27.7972649Z       %130 = ttg.convert_layout %129 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7972919Z       %131 = arith.truncf %130 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3>
2026-02-21T10:48:27.7973161Z       %132 = arith.extf %131 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7973354Z       %133 = arith.mulf %132, %cst_8 : tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7973545Z       %134 = arith.truncf %133 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3>
2026-02-21T10:48:27.7973759Z       %135 = arith.extf %134 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7973957Z       %136 = arith.cmpf ogt, %arg5, %135 : tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7974134Z       %137 = arith.cmpf une, %arg5, %arg5 : tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7974296Z       %138 = arith.ori %136, %137 : tensor<1x2xi1, #blocked3>
2026-02-21T10:48:27.7974494Z       %139 = arith.select %138, %arg5, %135 : tensor<1x2xi1, #blocked3>, tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7974699Z       %140 = arith.mulf %128, %cst_7 : tensor<1x2x64xf32, #blocked>
2026-02-21T10:48:27.7974901Z       %141 = arith.truncf %140 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked>
2026-02-21T10:48:27.7975197Z       %142 = ttg.convert_layout %139 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T10:48:27.7975530Z       %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T10:48:27.7975827Z       %144 = ttg.convert_layout %143 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T10:48:27.7976066Z       %145 = arith.extf %141 : tensor<1x2x64xbf16, #blocked> to tensor<1x2x64xf32, #blocked>
2026-02-21T10:48:27.7976300Z       %146 = tt.broadcast %144 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x64xf32, #blocked1>
2026-02-21T10:48:27.7976559Z       %147 = ttg.convert_layout %146 : tensor<1x2x64xf32, #blocked1> -> tensor<1x2x64xf32, #blocked>
2026-02-21T10:48:27.7976767Z       %148 = arith.subf %145, %147 : tensor<1x2x64xf32, #blocked>
2026-02-21T10:48:27.7977070Z       %149 = tt.extern_elementwise %148 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2x64xf32, #blocked>
2026-02-21T10:48:27.7977354Z       %150 = "tt.reduce"(%149) <{axis = 2 : i32}> ({
2026-02-21T10:48:27.7977482Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T10:48:27.7977599Z         %182 = arith.addf %arg8, %arg9 : f32
2026-02-21T10:48:27.7977718Z         tt.reduce.return %182 : f32
2026-02-21T10:48:27.7977902Z       }) : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T10:48:27.7978187Z       %151 = ttg.convert_layout %150 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7978430Z       %152 = arith.subf %arg5, %139 : tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7978716Z       %153 = tt.extern_elementwise %152 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked3>) -> tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7979006Z       %154 = arith.mulf %arg6, %153 : tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7979168Z       %155 = arith.addf %154, %151 : tensor<1x2xf32, #blocked3>
2026-02-21T10:48:27.7979407Z       %156 = ttg.convert_layout %153 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T10:48:27.7979744Z       %157 = tt.expand_dims %156 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T10:48:27.7980071Z       %158 = ttg.convert_layout %157 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T10:48:27.7980314Z       %159 = tt.broadcast %158 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1>
2026-02-21T10:48:27.7980562Z       %160 = ttg.convert_layout %159 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked>
2026-02-21T10:48:27.7980776Z       %161 = arith.mulf %arg7, %160 : tensor<1x2x128xf32, #blocked>
2026-02-21T10:48:27.7981039Z       %162 = ttg.convert_layout %112 : tensor<1x64xi32, #blocked4> -> tensor<1x64xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T10:48:27.7981379Z       %163 = tt.expand_dims %162 {axis = 2 : i32} : tensor<1x64xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x64x1xi32, #blocked9>
2026-02-21T10:48:27.7981679Z       %164 = ttg.convert_layout %163 : tensor<1x64x1xi32, #blocked9> -> tensor<1x64x1xi32, #blocked2>
2026-02-21T10:48:27.7981893Z       %165 = arith.muli %164, %cst_6 : tensor<1x64x1xi32, #blocked2>
2026-02-21T10:48:27.7982061Z       %166 = arith.addi %72, %165 : tensor<1x64x1xi32, #blocked2>
2026-02-21T10:48:27.7982263Z       %167 = tt.broadcast %166 : tensor<1x64x1xi32, #blocked2> -> tensor<1x64x128xi32, #blocked2>
2026-02-21T10:48:27.7982520Z       %168 = ttg.convert_layout %167 : tensor<1x64x128xi32, #blocked2> -> tensor<1x64x128xi32, #blocked>
2026-02-21T10:48:27.7982732Z       %169 = arith.addi %168, %76 : tensor<1x64x128xi32, #blocked>
2026-02-21T10:48:27.7982948Z       %170 = tt.addptr %77, %169 : tensor<1x64x128x!tt.ptr<bf16>, #blocked>, tensor<1x64x128xi32, #blocked>
2026-02-21T10:48:27.7983178Z       %171 = tt.load %170 : tensor<1x64x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:48:27.7983377Z       %172 = arith.truncf %149 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked>
2026-02-21T10:48:27.7983610Z       %173 = tt.reshape %161 : tensor<1x2x128xf32, #blocked> -> tensor<2x128xf32, #blocked4>
2026-02-21T10:48:27.7983835Z       %174 = tt.reshape %172 : tensor<1x2x64xbf16, #blocked> -> tensor<2x64xbf16, #blocked4>
2026-02-21T10:48:27.7984066Z       %175 = tt.reshape %171 : tensor<1x64x128xbf16, #blocked> -> tensor<64x128xbf16, #blocked4>
2026-02-21T10:48:27.7984379Z       %176 = ttg.convert_layout %174 : tensor<2x64xbf16, #blocked4> -> tensor<2x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>>
2026-02-21T10:48:27.7984734Z       %177 = ttg.convert_layout %175 : tensor<64x128xbf16, #blocked4> -> tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>>
2026-02-21T10:48:27.7985044Z       %178 = ttg.convert_layout %173 : tensor<2x128xf32, #blocked4> -> tensor<2x128xf32, #blocked10>
2026-02-21T10:48:27.7985454Z       %179 = tt.dot %176, %177, %178, inputPrecision = tf32 : tensor<2x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x128xf32, #blocked10>
2026-02-21T10:48:27.7985861Z       %180 = ttg.convert_layout %179 : tensor<2x128xf32, #blocked10> -> tensor<2x128xf32, #blocked4>
2026-02-21T10:48:27.7986100Z       %181 = tt.reshape %180 : tensor<2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked>
2026-02-21T10:48:27.7986361Z       scf.yield %139, %155, %181 : tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>
2026-02-21T10:48:27.7986615Z     } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32}
2026-02-21T10:48:27.7986874Z     %79 = ttg.convert_layout %78#1 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T10:48:27.7987203Z     %80 = tt.expand_dims %79 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T10:48:27.7987492Z     %81 = ttg.convert_layout %80 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T10:48:27.7987725Z     %82 = tt.broadcast %81 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1>
2026-02-21T10:48:27.7987979Z     %83 = ttg.convert_layout %82 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked>
2026-02-21T10:48:27.7988183Z     %84 = arith.divf %78#2, %83 : tensor<1x2x128xf32, #blocked>
2026-02-21T10:48:27.7988377Z     %85 = arith.truncf %84 : tensor<1x2x128xf32, #blocked> to tensor<1x2x128xbf16, #blocked>
2026-02-21T10:48:27.7988560Z     %86 = arith.muli %8, %c32768_i32 : i32
2026-02-21T10:48:27.7988771Z     %87 = ttg.convert_layout %12 : tensor<2xi32, #blocked5> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:48:27.7989105Z     %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi32, #blocked6>
2026-02-21T10:48:27.7989379Z     %89 = ttg.convert_layout %88 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #blocked3>
2026-02-21T10:48:27.7989653Z     %90 = ttg.convert_layout %89 : tensor<1x2xi32, #blocked3> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T10:48:27.7989977Z     %91 = tt.expand_dims %90 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi32, #blocked7>
2026-02-21T10:48:27.7990261Z     %92 = ttg.convert_layout %91 : tensor<1x2x1xi32, #blocked7> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T10:48:27.7990478Z     %93 = arith.muli %92, %cst_5 : tensor<1x2x1xi32, #blocked1>
2026-02-21T10:48:27.7990637Z     %94 = tt.splat %86 : i32 -> tensor<1x2x1xi32, #blocked1>
2026-02-21T10:48:27.7990788Z     %95 = arith.addi %94, %93 : tensor<1x2x1xi32, #blocked1>
2026-02-21T10:48:27.7991026Z     %96 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:48:27.8001508Z     %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6>
2026-02-21T10:48:27.8001810Z     %98 = ttg.convert_layout %97 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4>
2026-02-21T10:48:27.8002100Z     %99 = ttg.convert_layout %98 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T10:48:27.8002436Z     %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8>
2026-02-21T10:48:27.8002801Z     %101 = ttg.convert_layout %100 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked>
2026-02-21T10:48:27.8003042Z     %102 = tt.broadcast %95 : tensor<1x2x1xi32, #blocked1> -> tensor<1x2x128xi32, #blocked1>
2026-02-21T10:48:27.8003285Z     %103 = ttg.convert_layout %102 : tensor<1x2x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked>
2026-02-21T10:48:27.8003530Z     %104 = tt.broadcast %101 : tensor<1x1x128xi32, #blocked> -> tensor<1x2x128xi32, #blocked>
2026-02-21T10:48:27.8003729Z     %105 = arith.addi %103, %104 : tensor<1x2x128xi32, #blocked>
2026-02-21T10:48:27.8003915Z     %106 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:48:27.8004148Z     %107 = tt.addptr %106, %105 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>, tensor<1x2x128xi32, #blocked>
2026-02-21T10:48:27.8004363Z     tt.store %107, %85 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:48:27.8004499Z     tt.return
2026-02-21T10:48:27.8004580Z   }
2026-02-21T10:48:27.8004659Z }
2026-02-21T10:48:27.8004701Z 
2026-02-21T10:48:27.8004735Z {-#
2026-02-21T10:48:27.8004817Z   external_resources: {
2026-02-21T10:48:27.8004916Z     mlir_reproducer: {
2026-02-21T10:48:27.8007164Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T10:48:27.8009498Z       disable_threading: false,
2026-02-21T10:48:27.8009607Z       verify_each: true
2026-02-21T10:48:27.8009697Z     }
2026-02-21T10:48:27.8009772Z   }
2026-02-21T10:48:27.8009840Z #-}
2026-02-21T10:48:27.8010121Z /tmp/torchinductor_root/re/creryn3ijcfe5ngnsdypjphn22cnx4fiylxrfemcef4ips7yanzk.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T10:48:27.8010844Z /tmp/torchinductor_root/re/creryn3ijcfe5ngnsdypjphn22cnx4fiylxrfemcef4ips7yanzk.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T10:48:27.8011393Z [20s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T10:48:27.8012158Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T10:48:27.8012823Z Error: RuntimeError: PassManager::run failed
2026-02-21T10:48:27.8012991Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T10:48:32.8813311Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.7 configs/s
2026-02-21T10:48:32.8820561Z [25s] Adaptive compile timeout: 30s (90% percentile=11.8s, bounds=[30.0s, 30s])
2026-02-21T10:48:33.4971720Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1335.4 configs/s
2026-02-21T10:48:33.9296650Z [26s] Initial random population of 100, 5 starting points: 
2026-02-21T10:48:33.9298007Z error=12
2026-02-21T10:48:33.9298363Z ok=88
2026-02-21T10:48:33.9298570Z min=0.0564
2026-02-21T10:48:33.9298779Z mid=0.3850
2026-02-21T10:48:33.9298986Z max=42.0966
2026-02-21T10:48:33.9299232Z best={'block_sizes': [1, 32, 16],
2026-02-21T10:48:33.9299729Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T10:48:33.9300141Z  'l2_groupings': [8],
2026-02-21T10:48:33.9300425Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:48:33.9300692Z  'loop_orders': [[0, 1]],
2026-02-21T10:48:33.9300864Z  'matrix_instr_nonkdim': 0,
2026-02-21T10:48:33.9301029Z  'num_stages': 2,
2026-02-21T10:48:33.9301173Z  'num_warps': 1,
2026-02-21T10:48:33.9301322Z  'pid_type': 'flat',
2026-02-21T10:48:33.9301487Z  'range_flattens': [None, None],
2026-02-21T10:48:33.9301696Z  'range_multi_buffers': [None, True],
2026-02-21T10:48:33.9301878Z  'range_num_stages': [0, 3],
2026-02-21T10:48:33.9302050Z  'range_unroll_factors': [0, 2],
2026-02-21T10:48:33.9302225Z  'range_warp_specializes': [],
2026-02-21T10:48:33.9302395Z  'waves_per_eu': 2}
2026-02-21T10:48:33.9387998Z [26s] Fitting surrogate: 100 points, 100 targets
2026-02-21T10:48:34.7814813Z [27s] Generation 1 starting: 76 neighbors, 5 active search path(s)
2026-02-21T10:48:46.5151894Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 10.2 configs/s
2026-02-21T10:48:51.9384620Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 14.9 configs/s
2026-02-21T10:48:55.6808394Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 258.3 configs/s
2026-02-21T10:48:56.3403130Z [48s] Generation 1 complete: 
2026-02-21T10:48:56.3403345Z ok=81
2026-02-21T10:48:56.3403429Z min=0.0473
2026-02-21T10:48:56.3403913Z mid=0.0755
2026-02-21T10:48:56.3403987Z max=0.4422
2026-02-21T10:48:56.3404077Z best={'block_sizes': [1, 64, 16],
2026-02-21T10:48:56.3404345Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T10:48:56.3404500Z  'l2_groupings': [8],
2026-02-21T10:48:56.3404606Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:48:56.3404721Z  'loop_orders': [[0, 1]],
2026-02-21T10:48:56.3404824Z  'matrix_instr_nonkdim': 0,
2026-02-21T10:48:56.3404931Z  'num_stages': 2,
2026-02-21T10:48:56.3405020Z  'num_warps': 4,
2026-02-21T10:48:56.3405104Z  'pid_type': 'flat',
2026-02-21T10:48:56.3405201Z  'range_flattens': [None, None],
2026-02-21T10:48:56.3405311Z  'range_multi_buffers': [None, True],
2026-02-21T10:48:56.3405424Z  'range_num_stages': [0, 3],
2026-02-21T10:48:56.3405525Z  'range_unroll_factors': [0, 2],
2026-02-21T10:48:56.3405636Z  'range_warp_specializes': [],
2026-02-21T10:48:56.3405741Z  'waves_per_eu': 2}
2026-02-21T10:48:56.3435537Z [48s] Fitting surrogate: 181 points, 181 targets
2026-02-21T10:48:57.1274346Z [49s] Generation 2 starting: 70 neighbors, 5 active search path(s)
2026-02-21T10:49:37.1954001Z [89s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=1, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T10:49:37.6513434Z [90s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=1, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T10:49:37.6531652Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.5 configs/s
2026-02-21T10:49:41.9222804Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 17.0 configs/s
2026-02-21T10:49:44.3870490Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 384.7 configs/s
2026-02-21T10:49:44.9136845Z [97s] Generation 2 complete: 
2026-02-21T10:49:44.9137064Z timeout=2
2026-02-21T10:49:44.9137186Z ok=73
2026-02-21T10:49:44.9137310Z min=0.0454
2026-02-21T10:49:44.9137435Z mid=0.0787
2026-02-21T10:49:44.9137558Z max=4.1786
2026-02-21T10:49:44.9137701Z best={'block_sizes': [1, 64, 16],
2026-02-21T10:49:44.9137961Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:49:44.9138230Z  'l2_groupings': [4],
2026-02-21T10:49:44.9138404Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:49:44.9138598Z  'loop_orders': [[0, 1]],
2026-02-21T10:49:44.9139102Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:49:44.9139271Z  'num_stages': 2,
2026-02-21T10:49:44.9139413Z  'num_warps': 4,
2026-02-21T10:49:44.9139548Z  'pid_type': 'flat',
2026-02-21T10:49:44.9139738Z  'range_flattens': [None, False],
2026-02-21T10:49:44.9139922Z  'range_multi_buffers': [None, None],
2026-02-21T10:49:44.9140119Z  'range_num_stages': [0, 1],
2026-02-21T10:49:44.9140296Z  'range_unroll_factors': [0, 4],
2026-02-21T10:49:44.9140592Z  'range_warp_specializes': [],
2026-02-21T10:49:44.9140769Z  'waves_per_eu': 3}
2026-02-21T10:49:44.9572646Z [97s] Fitting surrogate: 256 points, 256 targets
2026-02-21T10:49:45.6595051Z [98s] Generation 3 starting: 70 neighbors, 5 active search path(s)
2026-02-21T10:50:09.9018475Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.9 configs/s
2026-02-21T10:50:14.2968292Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 71/71 16.7 configs/s
2026-02-21T10:50:18.3720923Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 238.8 configs/s
2026-02-21T10:50:19.0258222Z [131s] Generation 3 complete: 
2026-02-21T10:50:19.0258852Z ok=75
2026-02-21T10:50:19.0259085Z min=0.0464
2026-02-21T10:50:19.0259299Z mid=0.0616
2026-02-21T10:50:19.0259498Z max=0.5650
2026-02-21T10:50:19.0259728Z best={'block_sizes': [1, 64, 16],
2026-02-21T10:50:19.0260137Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:50:19.0260544Z  'l2_groupings': [64],
2026-02-21T10:50:19.0260820Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:50:19.0261137Z  'loop_orders': [[0, 1]],
2026-02-21T10:50:19.0261413Z  'matrix_instr_nonkdim': 0,
2026-02-21T10:50:19.0261695Z  'num_stages': 2,
2026-02-21T10:50:19.0261933Z  'num_warps': 4,
2026-02-21T10:50:19.0262312Z  'pid_type': 'flat',
2026-02-21T10:50:19.0262580Z  'range_flattens': [None, True],
2026-02-21T10:50:19.0262884Z  'range_multi_buffers': [None, None],
2026-02-21T10:50:19.0263209Z  'range_num_stages': [0, 2],
2026-02-21T10:50:19.0263487Z  'range_unroll_factors': [0, 2],
2026-02-21T10:50:19.0263789Z  'range_warp_specializes': [],
2026-02-21T10:50:19.0264077Z  'waves_per_eu': 3}
2026-02-21T10:50:19.0308180Z [131s] Fitting surrogate: 331 points, 331 targets
2026-02-21T10:50:19.7209877Z [132s] Generation 4 starting: 67 neighbors, 5 active search path(s)
2026-02-21T10:50:30.0333337Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 11.5 configs/s
2026-02-21T10:50:34.3240095Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 16.4 configs/s
2026-02-21T10:50:36.9623667Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 364.3 configs/s
2026-02-21T10:50:37.5116138Z [149s] Generation 4 complete: 
2026-02-21T10:50:37.5116622Z ok=72
2026-02-21T10:50:37.5116871Z min=0.0457
2026-02-21T10:50:37.5117124Z mid=0.0794
2026-02-21T10:50:37.5117361Z max=1.5306
2026-02-21T10:50:37.5117605Z best={'block_sizes': [1, 64, 16],
2026-02-21T10:50:37.5118040Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:50:37.5118462Z  'l2_groupings': [64],
2026-02-21T10:50:37.5118827Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:50:37.5119141Z  'loop_orders': [[0, 1]],
2026-02-21T10:50:37.5119423Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:50:37.5119691Z  'num_stages': 2,
2026-02-21T10:50:37.5119939Z  'num_warps': 4,
2026-02-21T10:50:37.5120168Z  'pid_type': 'flat',
2026-02-21T10:50:37.5120441Z  'range_flattens': [None, True],
2026-02-21T10:50:37.5120762Z  'range_multi_buffers': [None, None],
2026-02-21T10:50:37.5121082Z  'range_num_stages': [0, 3],
2026-02-21T10:50:37.5121361Z  'range_unroll_factors': [0, 2],
2026-02-21T10:50:37.5121653Z  'range_warp_specializes': [],
2026-02-21T10:50:37.5121964Z  'waves_per_eu': 3}
2026-02-21T10:50:37.5668208Z [150s] Fitting surrogate: 403 points, 403 targets
2026-02-21T10:50:38.1967704Z [150s] Generation 5 starting: 47 neighbors, 4 active search path(s)
2026-02-21T10:50:46.5331541Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 6.1 configs/s
2026-02-21T10:50:49.5730429Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 16.6 configs/s
2026-02-21T10:50:51.9346843Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 403.8 configs/s
2026-02-21T10:50:52.3957783Z [164s] Generation 5 complete: 
2026-02-21T10:50:52.3958148Z ok=51
2026-02-21T10:50:52.3958359Z min=0.0432
2026-02-21T10:50:52.3958583Z mid=0.0655
2026-02-21T10:50:52.3958782Z max=0.5865
2026-02-21T10:50:52.3959020Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:50:52.3959427Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:50:52.3959853Z  'l2_groupings': [64],
2026-02-21T10:50:52.3960133Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:50:52.3960477Z  'loop_orders': [[0, 1]],
2026-02-21T10:50:52.3960758Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:50:52.3961030Z  'num_stages': 2,
2026-02-21T10:50:52.3961270Z  'num_warps': 4,
2026-02-21T10:50:52.3961741Z  'pid_type': 'flat',
2026-02-21T10:50:52.3962020Z  'range_flattens': [None, None],
2026-02-21T10:50:52.3962319Z  'range_multi_buffers': [None, None],
2026-02-21T10:50:52.3962708Z  'range_num_stages': [0, 3],
2026-02-21T10:50:52.3962981Z  'range_unroll_factors': [0, 2],
2026-02-21T10:50:52.3963289Z  'range_warp_specializes': [],
2026-02-21T10:50:52.3963566Z  'waves_per_eu': 3}
2026-02-21T10:50:52.4393336Z [164s] Fitting surrogate: 454 points, 454 targets
2026-02-21T10:50:52.9443205Z [165s] Generation 6 starting: 46 neighbors, 3 active search path(s)
2026-02-21T10:50:59.6263284Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 7.7 configs/s
2026-02-21T10:51:02.4794636Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 16.4 configs/s
2026-02-21T10:51:05.9416561Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 315.4 configs/s
2026-02-21T10:51:06.4360354Z [178s] Generation 6 complete: 
2026-02-21T10:51:06.4360774Z error=1
2026-02-21T10:51:06.4360981Z ok=48
2026-02-21T10:51:06.4361188Z min=0.0440
2026-02-21T10:51:06.4361429Z mid=0.0566
2026-02-21T10:51:06.4361633Z max=0.3078
2026-02-21T10:51:06.4361865Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:51:06.4362300Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:51:06.4362781Z  'l2_groupings': [64],
2026-02-21T10:51:06.4363061Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:51:06.4363385Z  'loop_orders': [[0, 1]],
2026-02-21T10:51:06.4363666Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:51:06.4363988Z  'num_stages': 2,
2026-02-21T10:51:06.4364212Z  'num_warps': 4,
2026-02-21T10:51:06.4364437Z  'pid_type': 'flat',
2026-02-21T10:51:06.4364682Z  'range_flattens': [None, False],
2026-02-21T10:51:06.4364990Z  'range_multi_buffers': [None, None],
2026-02-21T10:51:06.4365274Z  'range_num_stages': [0, 3],
2026-02-21T10:51:06.4365544Z  'range_unroll_factors': [0, 2],
2026-02-21T10:51:06.4365830Z  'range_warp_specializes': [],
2026-02-21T10:51:06.4366089Z  'waves_per_eu': 3}
2026-02-21T10:51:06.4934727Z [178s] Fitting surrogate: 503 points, 503 targets
2026-02-21T10:51:06.9958383Z [179s] Generation 7 starting: 46 neighbors, 3 active search path(s)
2026-02-21T10:51:12.1266454Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 5.0 configs/s
2026-02-21T10:51:15.0842403Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 16.3 configs/s
2026-02-21T10:51:17.9734047Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 332.4 configs/s
2026-02-21T10:51:18.4661306Z [190s] Generation 7 complete: 
2026-02-21T10:51:18.4662059Z ok=49
2026-02-21T10:51:18.4662178Z min=0.0438
2026-02-21T10:51:18.4662296Z mid=0.0504
2026-02-21T10:51:18.4662416Z max=0.5898
2026-02-21T10:51:18.4662553Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:51:18.4662805Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:51:18.4663033Z  'l2_groupings': [64],
2026-02-21T10:51:18.4663178Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:51:18.4663356Z  'loop_orders': [[0, 1]],
2026-02-21T10:51:18.4663527Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:51:18.4663671Z  'num_stages': 2,
2026-02-21T10:51:18.4663791Z  'num_warps': 4,
2026-02-21T10:51:18.4663922Z  'pid_type': 'flat',
2026-02-21T10:51:18.4664060Z  'range_flattens': [None, False],
2026-02-21T10:51:18.4664234Z  'range_multi_buffers': [None, None],
2026-02-21T10:51:18.4664402Z  'range_num_stages': [0, 2],
2026-02-21T10:51:18.4664556Z  'range_unroll_factors': [0, 2],
2026-02-21T10:51:18.4664727Z  'range_warp_specializes': [],
2026-02-21T10:51:18.4664880Z  'waves_per_eu': 3}
2026-02-21T10:51:18.5183047Z [190s] Fitting surrogate: 552 points, 552 targets
2026-02-21T10:51:18.8846008Z [191s] Generation 8 starting: 30 neighbors, 2 active search path(s)
2026-02-21T10:51:24.0184231Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 3.6 configs/s
2026-02-21T10:51:25.9578666Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 16.7 configs/s
2026-02-21T10:51:27.9151553Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 482.1 configs/s
2026-02-21T10:51:28.4024813Z [200s] Generation 8 complete: 
2026-02-21T10:51:28.4025264Z ok=32
2026-02-21T10:51:28.4025483Z min=0.0445
2026-02-21T10:51:28.4025693Z mid=0.0485
2026-02-21T10:51:28.4025888Z max=0.2253
2026-02-21T10:51:28.4026118Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:51:28.4027027Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:51:28.4027433Z  'l2_groupings': [64],
2026-02-21T10:51:28.4027725Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:51:28.4028037Z  'loop_orders': [[0, 1]],
2026-02-21T10:51:28.4028314Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:51:28.4028586Z  'num_stages': 2,
2026-02-21T10:51:28.4028814Z  'num_warps': 4,
2026-02-21T10:51:28.4029040Z  'pid_type': 'flat',
2026-02-21T10:51:28.4029297Z  'range_flattens': [None, False],
2026-02-21T10:51:28.4029599Z  'range_multi_buffers': [None, None],
2026-02-21T10:51:28.4029912Z  'range_num_stages': [0, 2],
2026-02-21T10:51:28.4030184Z  'range_unroll_factors': [0, 2],
2026-02-21T10:51:28.4030477Z  'range_warp_specializes': [],
2026-02-21T10:51:28.4030752Z  'waves_per_eu': 3}
2026-02-21T10:51:28.4397976Z [200s] Fitting surrogate: 584 points, 584 targets
2026-02-21T10:51:28.8049017Z [201s] Generation 9 starting: 26 neighbors, 2 active search path(s)
2026-02-21T10:51:32.7662638Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 8.6 configs/s
2026-02-21T10:51:34.4326976Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 17.1 configs/s
2026-02-21T10:51:35.7920876Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 678.1 configs/s
2026-02-21T10:51:36.1775769Z [208s] Generation 9 complete: 
2026-02-21T10:51:36.1775965Z ok=28
2026-02-21T10:51:36.1776089Z min=0.0438
2026-02-21T10:51:36.1776226Z mid=0.0494
2026-02-21T10:51:36.1776739Z max=0.1405
2026-02-21T10:51:36.1776931Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:51:36.1777203Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:51:36.1777430Z  'l2_groupings': [64],
2026-02-21T10:51:36.1777620Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:51:36.1777801Z  'loop_orders': [[0, 1]],
2026-02-21T10:51:36.1777957Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:51:36.1778120Z  'num_stages': 2,
2026-02-21T10:51:36.1778252Z  'num_warps': 4,
2026-02-21T10:51:36.1778388Z  'pid_type': 'flat',
2026-02-21T10:51:36.1778669Z  'range_flattens': [None, False],
2026-02-21T10:51:36.1778846Z  'range_multi_buffers': [None, None],
2026-02-21T10:51:36.1779023Z  'range_num_stages': [0, 2],
2026-02-21T10:51:36.1779274Z  'range_unroll_factors': [0, 2],
2026-02-21T10:51:36.1779446Z  'range_warp_specializes': [],
2026-02-21T10:51:36.1779607Z  'waves_per_eu': 3}
2026-02-21T10:51:36.2034404Z [208s] Fitting surrogate: 612 points, 612 targets
2026-02-21T10:51:36.4555068Z [208s] Generation 10 starting: 15 neighbors, 1 active search path(s)
2026-02-21T10:51:38.8496231Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 11.9 configs/s
2026-02-21T10:51:39.8522716Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.5 configs/s
2026-02-21T10:51:40.2518201Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2073.1 configs/s
2026-02-21T10:51:40.5965754Z [213s] Generation 10 complete: 
2026-02-21T10:51:40.5966143Z ok=17
2026-02-21T10:51:40.5966357Z min=0.0463
2026-02-21T10:51:40.5966567Z mid=0.0741
2026-02-21T10:51:40.5966767Z max=0.3287
2026-02-21T10:51:40.5967006Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:51:40.5967409Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:51:40.5967807Z  'l2_groupings': [64],
2026-02-21T10:51:40.5968085Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:51:40.5968396Z  'loop_orders': [[0, 1]],
2026-02-21T10:51:40.5968682Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:51:40.5968950Z  'num_stages': 2,
2026-02-21T10:51:40.5969179Z  'num_warps': 4,
2026-02-21T10:51:40.5969393Z  'pid_type': 'flat',
2026-02-21T10:51:40.5969588Z  'range_flattens': [None, False],
2026-02-21T10:51:40.5969837Z  'range_multi_buffers': [None, None],
2026-02-21T10:51:40.5970084Z  'range_num_stages': [0, 2],
2026-02-21T10:51:40.5970629Z  'range_unroll_factors': [0, 2],
2026-02-21T10:51:40.5970876Z  'range_warp_specializes': [],
2026-02-21T10:51:40.5971105Z  'waves_per_eu': 3}
2026-02-21T10:51:40.6064897Z [213s] Fitting surrogate: 629 points, 629 targets
2026-02-21T10:51:40.8322830Z [213s] Generation 11 starting: 12 neighbors, 1 active search path(s)
2026-02-21T10:51:49.7859169Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13/13 0.4 configs/s
2026-02-21T10:51:50.6653611Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 13/13 17.7 configs/s
2026-02-21T10:51:50.9235575Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3621.8 configs/s
2026-02-21T10:51:51.2695882Z [223s] Generation 11 complete: 
2026-02-21T10:51:51.2696085Z ok=14
2026-02-21T10:51:51.2696176Z min=0.0438
2026-02-21T10:51:51.2696276Z mid=0.0750
2026-02-21T10:51:51.2696359Z max=1.5105
2026-02-21T10:51:51.2696471Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:51:51.2696636Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:51:51.2697040Z  'l2_groupings': [64],
2026-02-21T10:51:51.2697154Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:51:51.2697289Z  'loop_orders': [[0, 1]],
2026-02-21T10:51:51.2697412Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:51:51.2697520Z  'num_stages': 2,
2026-02-21T10:51:51.2697615Z  'num_warps': 4,
2026-02-21T10:51:51.2697722Z  'pid_type': 'flat',
2026-02-21T10:51:51.2697831Z  'range_flattens': [None, False],
2026-02-21T10:51:51.2697957Z  'range_multi_buffers': [None, None],
2026-02-21T10:51:51.2698161Z  'range_num_stages': [0, 2],
2026-02-21T10:51:51.2698273Z  'range_unroll_factors': [0, 2],
2026-02-21T10:51:51.2698395Z  'range_warp_specializes': [],
2026-02-21T10:51:51.2698514Z  'waves_per_eu': 3}
2026-02-21T10:51:51.2779001Z [223s] Fitting surrogate: 643 points, 643 targets
2026-02-21T10:51:51.5379279Z [223s] Generation 12 starting: 14 neighbors, 1 active search path(s)
2026-02-21T10:51:54.1176352Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 4.6 configs/s
2026-02-21T10:51:55.0721862Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.3 configs/s
2026-02-21T10:51:55.6385944Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1522.8 configs/s
2026-02-21T10:51:56.0219947Z [228s] Generation 12 complete: 
2026-02-21T10:51:56.0220367Z ok=16
2026-02-21T10:51:56.0220576Z min=0.0460
2026-02-21T10:51:56.0220819Z mid=0.0604
2026-02-21T10:51:56.0221023Z max=0.3355
2026-02-21T10:51:56.0221247Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:51:56.0221649Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:51:56.0222038Z  'l2_groupings': [64],
2026-02-21T10:51:56.0222316Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:51:56.0222631Z  'loop_orders': [[0, 1]],
2026-02-21T10:51:56.0223375Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:51:56.0223644Z  'num_stages': 2,
2026-02-21T10:51:56.0223871Z  'num_warps': 4,
2026-02-21T10:51:56.0224119Z  'pid_type': 'flat',
2026-02-21T10:51:56.0224379Z  'range_flattens': [None, False],
2026-02-21T10:51:56.0224683Z  'range_multi_buffers': [None, None],
2026-02-21T10:51:56.0225003Z  'range_num_stages': [0, 2],
2026-02-21T10:51:56.0225277Z  'range_unroll_factors': [0, 2],
2026-02-21T10:51:56.0225574Z  'range_warp_specializes': [],
2026-02-21T10:51:56.0225849Z  'waves_per_eu': 3}
2026-02-21T10:51:56.0326785Z [228s] Fitting surrogate: 659 points, 659 targets
2026-02-21T10:51:56.2767777Z [228s] Generation 13 starting: 15 neighbors, 1 active search path(s)
2026-02-21T10:52:11.4792348Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 0.5 configs/s
2026-02-21T10:52:12.5563541Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.2 configs/s
2026-02-21T10:52:12.7349677Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 5882.9 configs/s
2026-02-21T10:52:13.0676954Z [245s] Generation 13 complete: 
2026-02-21T10:52:13.0677207Z ok=17
2026-02-21T10:52:13.0677389Z min=0.0430
2026-02-21T10:52:13.0677549Z mid=0.0954
2026-02-21T10:52:13.0677707Z max=1.6010
2026-02-21T10:52:13.0677903Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:52:13.0678207Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:52:13.0678508Z  'l2_groupings': [64],
2026-02-21T10:52:13.0678718Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:52:13.0679130Z  'loop_orders': [[0, 1]],
2026-02-21T10:52:13.0679340Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:52:13.0679545Z  'num_stages': 2,
2026-02-21T10:52:13.0679715Z  'num_warps': 4,
2026-02-21T10:52:13.0679892Z  'pid_type': 'flat',
2026-02-21T10:52:13.0680087Z  'range_flattens': [None, False],
2026-02-21T10:52:13.0680318Z  'range_multi_buffers': [None, None],
2026-02-21T10:52:13.0680554Z  'range_num_stages': [0, 2],
2026-02-21T10:52:13.0680762Z  'range_unroll_factors': [0, 2],
2026-02-21T10:52:13.0681101Z  'range_warp_specializes': [],
2026-02-21T10:52:13.0681310Z  'waves_per_eu': 3}
2026-02-21T10:52:13.0732897Z [245s] Fitting surrogate: 676 points, 676 targets
2026-02-21T10:52:13.3204512Z [245s] Generation 14 starting: 15 neighbors, 1 active search path(s)
2026-02-21T10:52:15.6815931Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 12.5 configs/s
2026-02-21T10:52:16.6953462Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.3 configs/s
2026-02-21T10:52:17.5781567Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1002.7 configs/s
2026-02-21T10:52:17.9821695Z [250s] Generation 14 complete: 
2026-02-21T10:52:17.9822044Z ok=17
2026-02-21T10:52:17.9822237Z min=0.0443
2026-02-21T10:52:17.9822400Z mid=0.0488
2026-02-21T10:52:17.9822583Z max=0.3780
2026-02-21T10:52:17.9822757Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:52:17.9823060Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:52:17.9823379Z  'l2_groupings': [64],
2026-02-21T10:52:17.9823589Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:52:17.9823825Z  'loop_orders': [[0, 1]],
2026-02-21T10:52:17.9824468Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:52:17.9824665Z  'num_stages': 2,
2026-02-21T10:52:17.9835269Z  'num_warps': 4,
2026-02-21T10:52:17.9835417Z  'pid_type': 'flat',
2026-02-21T10:52:17.9835557Z  'range_flattens': [None, False],
2026-02-21T10:52:17.9835733Z  'range_multi_buffers': [None, None],
2026-02-21T10:52:17.9835935Z  'range_num_stages': [0, 2],
2026-02-21T10:52:17.9836086Z  'range_unroll_factors': [0, 2],
2026-02-21T10:52:17.9836245Z  'range_warp_specializes': [],
2026-02-21T10:52:17.9836395Z  'waves_per_eu': 3}
2026-02-21T10:52:17.9971933Z [250s] Fitting surrogate: 693 points, 693 targets
2026-02-21T10:52:18.2232127Z [250s] Generation 15 starting: 11 neighbors, 1 active search path(s)
2026-02-21T10:52:19.5324311Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 17.6 configs/s
2026-02-21T10:52:20.2249028Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 16.9 configs/s
2026-02-21T10:52:20.6159682Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2146.6 configs/s
2026-02-21T10:52:20.9605837Z [253s] Generation 15 complete: 
2026-02-21T10:52:20.9606034Z ok=13
2026-02-21T10:52:20.9606168Z min=0.0448
2026-02-21T10:52:20.9606284Z mid=0.0626
2026-02-21T10:52:20.9606387Z max=0.2893
2026-02-21T10:52:20.9606494Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:52:20.9606694Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:52:20.9606923Z  'l2_groupings': [64],
2026-02-21T10:52:20.9607065Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:52:20.9607234Z  'loop_orders': [[0, 1]],
2026-02-21T10:52:20.9607374Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:52:20.9607522Z  'num_stages': 2,
2026-02-21T10:52:20.9607650Z  'num_warps': 4,
2026-02-21T10:52:20.9607774Z  'pid_type': 'flat',
2026-02-21T10:52:20.9607911Z  'range_flattens': [None, False],
2026-02-21T10:52:20.9608074Z  'range_multi_buffers': [None, None],
2026-02-21T10:52:20.9608206Z  'range_num_stages': [0, 2],
2026-02-21T10:52:20.9608331Z  'range_unroll_factors': [0, 2],
2026-02-21T10:52:20.9608496Z  'range_warp_specializes': [],
2026-02-21T10:52:20.9608653Z  'waves_per_eu': 3}
2026-02-21T10:52:20.9702825Z [253s] Fitting surrogate: 706 points, 706 targets
2026-02-21T10:52:21.2268650Z [253s] Generation 16 starting: 16 neighbors, 1 active search path(s)
2026-02-21T10:52:34.0645851Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.7 configs/s
2026-02-21T10:52:35.2125121Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.0 configs/s
2026-02-21T10:52:35.5734579Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2381.1 configs/s
2026-02-21T10:52:35.9575131Z [268s] Generation 16 complete: 
2026-02-21T10:52:35.9575515Z ok=18
2026-02-21T10:52:35.9575728Z min=0.0458
2026-02-21T10:52:35.9575960Z mid=0.0769
2026-02-21T10:52:35.9576162Z max=0.8985
2026-02-21T10:52:35.9576387Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:52:35.9576797Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:52:35.9577191Z  'l2_groupings': [64],
2026-02-21T10:52:35.9577490Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:52:35.9577801Z  'loop_orders': [[0, 1]],
2026-02-21T10:52:35.9578077Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:52:35.9578343Z  'num_stages': 2,
2026-02-21T10:52:35.9578570Z  'num_warps': 4,
2026-02-21T10:52:35.9578805Z  'pid_type': 'flat',
2026-02-21T10:52:35.9579064Z  'range_flattens': [None, False],
2026-02-21T10:52:35.9579377Z  'range_multi_buffers': [None, None],
2026-02-21T10:52:35.9579678Z  'range_num_stages': [0, 2],
2026-02-21T10:52:35.9579958Z  'range_unroll_factors': [0, 2],
2026-02-21T10:52:35.9580261Z  'range_warp_specializes': [],
2026-02-21T10:52:35.9580545Z  'waves_per_eu': 3}
2026-02-21T10:52:35.9669292Z [268s] Fitting surrogate: 724 points, 724 targets
2026-02-21T10:52:36.2212562Z [268s] Generation 17 starting: 15 neighbors, 1 active search path(s)
2026-02-21T10:52:40.5651375Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.3 configs/s
2026-02-21T10:52:41.6353507Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.3 configs/s
2026-02-21T10:52:41.8205600Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 5614.8 configs/s
2026-02-21T10:52:42.1576344Z [274s] Generation 17 complete: 
2026-02-21T10:52:42.1576505Z ok=17
2026-02-21T10:52:42.1576593Z min=0.0446
2026-02-21T10:52:42.1576911Z mid=0.0981
2026-02-21T10:52:42.1576990Z max=0.7139
2026-02-21T10:52:42.1577076Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:52:42.1577242Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:52:42.1577385Z  'l2_groupings': [64],
2026-02-21T10:52:42.1577490Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:52:42.1577615Z  'loop_orders': [[0, 1]],
2026-02-21T10:52:42.1577719Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:52:42.1577823Z  'num_stages': 2,
2026-02-21T10:52:42.1577907Z  'num_warps': 4,
2026-02-21T10:52:42.1577993Z  'pid_type': 'flat',
2026-02-21T10:52:42.1578096Z  'range_flattens': [None, False],
2026-02-21T10:52:42.1578211Z  'range_multi_buffers': [None, None],
2026-02-21T10:52:42.1578321Z  'range_num_stages': [0, 2],
2026-02-21T10:52:42.1578424Z  'range_unroll_factors': [0, 2],
2026-02-21T10:52:42.1578534Z  'range_warp_specializes': [],
2026-02-21T10:52:42.1578639Z  'waves_per_eu': 3}
2026-02-21T10:52:42.1654762Z [274s] Fitting surrogate: 741 points, 741 targets
2026-02-21T10:52:42.3914239Z [274s] Generation 18 starting: 15 neighbors, 1 active search path(s)
2026-02-21T10:52:45.6797432Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 3.3 configs/s
2026-02-21T10:52:46.7518548Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.3 configs/s
2026-02-21T10:52:47.1386732Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2177.7 configs/s
2026-02-21T10:52:47.5039484Z [279s] Generation 18 complete: 
2026-02-21T10:52:47.5040568Z ok=17
2026-02-21T10:52:47.5040789Z min=0.0464
2026-02-21T10:52:47.5041004Z mid=0.0657
2026-02-21T10:52:47.5041200Z max=0.5624
2026-02-21T10:52:47.5041429Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:52:47.5041787Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:52:47.5041978Z  'l2_groupings': [64],
2026-02-21T10:52:47.5042112Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:52:47.5042262Z  'loop_orders': [[0, 1]],
2026-02-21T10:52:47.5042403Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:52:47.5042698Z  'num_stages': 2,
2026-02-21T10:52:47.5042809Z  'num_warps': 4,
2026-02-21T10:52:47.5042915Z  'pid_type': 'flat',
2026-02-21T10:52:47.5043036Z  'range_flattens': [None, False],
2026-02-21T10:52:47.5043266Z  'range_multi_buffers': [None, None],
2026-02-21T10:52:47.5043408Z  'range_num_stages': [0, 2],
2026-02-21T10:52:47.5043534Z  'range_unroll_factors': [0, 2],
2026-02-21T10:52:47.5043676Z  'range_warp_specializes': [],
2026-02-21T10:52:47.5043804Z  'waves_per_eu': 3}
2026-02-21T10:52:47.5132540Z [279s] Fitting surrogate: 758 points, 758 targets
2026-02-21T10:52:47.7486770Z [280s] Generation 19 starting: 13 neighbors, 1 active search path(s)
2026-02-21T10:52:51.1074113Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 3.6 configs/s
2026-02-21T10:52:52.0629321Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.3 configs/s
2026-02-21T10:52:52.3276045Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3447.3 configs/s
2026-02-21T10:52:52.6914692Z [285s] Generation 19 complete: 
2026-02-21T10:52:52.6914946Z ok=15
2026-02-21T10:52:52.6915219Z min=0.0437
2026-02-21T10:52:52.6915401Z mid=0.0743
2026-02-21T10:52:52.6915551Z max=0.4740
2026-02-21T10:52:52.6915722Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:52:52.6916027Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:52:52.6916337Z  'l2_groupings': [64],
2026-02-21T10:52:52.6916546Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:52:52.6916780Z  'loop_orders': [[0, 1]],
2026-02-21T10:52:52.6916988Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:52:52.6917185Z  'num_stages': 2,
2026-02-21T10:52:52.6917359Z  'num_warps': 4,
2026-02-21T10:52:52.6917531Z  'pid_type': 'flat',
2026-02-21T10:52:52.6917992Z  'range_flattens': [None, False],
2026-02-21T10:52:52.6918225Z  'range_multi_buffers': [None, None],
2026-02-21T10:52:52.6918453Z  'range_num_stages': [0, 2],
2026-02-21T10:52:52.6918676Z  'range_unroll_factors': [0, 2],
2026-02-21T10:52:52.6918898Z  'range_warp_specializes': [],
2026-02-21T10:52:52.6919112Z  'waves_per_eu': 3}
2026-02-21T10:52:52.6998866Z [285s] Fitting surrogate: 773 points, 773 targets
2026-02-21T10:52:52.9402011Z [285s] Generation 20 starting: 15 neighbors, 1 active search path(s)
2026-02-21T10:52:55.6447196Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 8.0 configs/s
2026-02-21T10:52:56.6679945Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.1 configs/s
2026-02-21T10:52:57.4320939Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1148.3 configs/s
2026-02-21T10:52:57.8050628Z [290s] Generation 20 complete: 
2026-02-21T10:52:57.8051032Z ok=17
2026-02-21T10:52:57.8051280Z min=0.0434
2026-02-21T10:52:57.8051493Z mid=0.0549
2026-02-21T10:52:57.8051689Z max=0.1516
2026-02-21T10:52:57.8052423Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:52:57.8052821Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:52:57.8053210Z  'l2_groupings': [64],
2026-02-21T10:52:57.8053499Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:52:57.8053816Z  'loop_orders': [[0, 1]],
2026-02-21T10:52:57.8054089Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:52:57.8054361Z  'num_stages': 2,
2026-02-21T10:52:57.8054621Z  'num_warps': 4,
2026-02-21T10:52:57.8054977Z  'pid_type': 'flat',
2026-02-21T10:52:57.8055233Z  'range_flattens': [None, False],
2026-02-21T10:52:57.8055536Z  'range_multi_buffers': [None, None],
2026-02-21T10:52:57.8055837Z  'range_num_stages': [0, 2],
2026-02-21T10:52:57.8056115Z  'range_unroll_factors': [0, 2],
2026-02-21T10:52:57.8056409Z  'range_warp_specializes': [],
2026-02-21T10:52:57.8056682Z  'waves_per_eu': 3}
2026-02-21T10:52:57.8201435Z [290s] Fitting surrogate: 790 points, 790 targets
2026-02-21T10:52:58.3780971Z [290s] Autotuning complete in 290.8s after searching 746 configs.
2026-02-21T10:52:58.3781489Z One can hardcode the best config and skip autotuning with:
2026-02-21T10:52:58.3783426Z     @helion.kernel(config=helion.Config(block_sizes=[1, 64, 32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T10:52:58.3784812Z 
2026-02-21T10:52:58.3785174Z [290s] Code of selected kernel: /tmp/torchinductor_root/4t/c4tnig7zzinrar4lpsnkttfd4dxmfj7alahjvfll6sl6hpe4kmtn.py
2026-02-21T10:52:58.9684747Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T10:52:58.9687585Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 64, 32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T10:52:58.9689721Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T10:52:58.9690100Z WARNING:tritonbench.utils.triton_op:Completed input ID 1:
2026-02-21T10:52:58.9690380Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T10:52:58.9690600Z ------------------------------------------
2026-02-21T10:52:58.9690810Z (4, 48, 256, 256, 128)
2026-02-21T10:52:58.9690922Z 
2026-02-21T10:52:58.9691522Z  33%|███▎      | 2/6 [08:32<17:32, 263.04s/it]WARNING:tritonbench.utils.triton_op:Running input ID 2:
2026-02-21T10:52:58.9691875Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T10:52:58.9692095Z ------------------------------------------
2026-02-21T10:52:58.9692294Z (4, 48, 512, 512, 128)
2026-02-21T10:52:58.9692571Z INFO:tritonbench.utils.triton_op:Took 0.06ms to get benchmark function for aten
2026-02-21T10:53:00.0227282Z INFO:tritonbench.utils.triton_op:Took 1.62ms to get benchmark function for flex_attention
2026-02-21T10:53:01.6032561Z WARNING:__main__:Input tensor metadata:
2026-02-21T10:53:01.6033022Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T10:53:01.6033400Z               'dtype': 'torch.bfloat16',
2026-02-21T10:53:01.6033720Z               'shape': (4, 48, 512, 128),
2026-02-21T10:53:01.6034048Z               'stride': (3145728, 65536, 128, 1)},
2026-02-21T10:53:01.6034375Z             { 'device': 'cuda:0',
2026-02-21T10:53:01.6034673Z               'dtype': 'torch.bfloat16',
2026-02-21T10:53:01.6034981Z               'shape': (4, 48, 512, 128),
2026-02-21T10:53:01.6035299Z               'stride': (3145728, 65536, 128, 1)},
2026-02-21T10:53:01.6035610Z             { 'device': 'cuda:0',
2026-02-21T10:53:01.6036372Z               'dtype': 'torch.bfloat16',
2026-02-21T10:53:01.6036676Z               'shape': (4, 48, 512, 128),
2026-02-21T10:53:01.6036996Z               'stride': (3145728, 65536, 128, 1)}),
2026-02-21T10:53:01.6037312Z   'kwargs': {}}
2026-02-21T10:53:01.6055037Z INFO:tritonbench.utils.triton_op:Took 2.79ms to get benchmark function for helion_attention
2026-02-21T10:53:01.8484858Z [0s] Autotune random seed: 2144140282
2026-02-21T10:53:02.0022476Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T10:53:42.4272510Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 8, 512], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[1, 2], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T10:53:42.4289103Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.6 configs/s
2026-02-21T10:53:44.1017487Z /tmp/torchinductor_root/rd/crdzuare6tius3irgn2fl3moepu6my2a3zv53zxi3c2hi2j3pq4r.py:57:24: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T10:53:44.1018034Z             k = tl.load(tl.make_block_ptr(k_view, [192, 128, 512], [65536, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T10:53:44.1018348Z                        ^
2026-02-21T10:53:44.1019081Z /tmp/torchinductor_root/rd/crdzuare6tius3irgn2fl3moepu6my2a3zv53zxi3c2hi2j3pq4r.py:59:145: note: - use: %143 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x128xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 8], order = [1, 0, 2]}>>) -> tensor<128x128xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>>
2026-02-21T10:53:44.1019695Z 
2026-02-21T10:53:44.1020032Z             qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T10:53:44.1020488Z                                                                                                                                                 ^
2026-02-21T10:53:44.1020678Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T10:53:44.1020957Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}>
2026-02-21T10:53:44.1021275Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}>
2026-02-21T10:53:44.1021636Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}>
2026-02-21T10:53:44.1021968Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}>
2026-02-21T10:53:44.1022283Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:53:44.1022589Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T10:53:44.1022883Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}>
2026-02-21T10:53:44.1023171Z #blocked7 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}>
2026-02-21T10:53:44.1023456Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}>
2026-02-21T10:53:44.1023848Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}>
2026-02-21T10:53:44.1024162Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}>
2026-02-21T10:53:44.1024474Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:53:44.1024872Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T10:53:44.1025395Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T10:53:44.1025804Z     %c65536_i32 = arith.constant 65536 : i32
2026-02-21T10:53:44.1025934Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T10:53:44.1026053Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T10:53:44.1026170Z     %c65536_i64 = arith.constant 65536 : i64
2026-02-21T10:53:44.1026347Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x128x128xbf16, #blocked>
2026-02-21T10:53:44.1026555Z     %cst_0 = arith.constant dense<512> : tensor<1x1x128xi64, #blocked1>
2026-02-21T10:53:44.1026738Z     %cst_1 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:53:44.1026916Z     %cst_2 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:53:44.1027109Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked3>
2026-02-21T10:53:44.1027300Z     %cst_4 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked1>
2026-02-21T10:53:44.1027496Z     %cst_5 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked1>
2026-02-21T10:53:44.1027667Z     %cst_6 = arith.constant dense<512> : tensor<1x2x1xi64, #blocked4>
2026-02-21T10:53:44.1027845Z     %cst_7 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked4>
2026-02-21T10:53:44.1028017Z     %cst_8 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked4>
2026-02-21T10:53:44.1028162Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T10:53:44.1028278Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T10:53:44.1028390Z     %c304_i32 = arith.constant 304 : i32
2026-02-21T10:53:44.1028529Z     %cst_9 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked4>
2026-02-21T10:53:44.1028704Z     %cst_10 = arith.constant dense<128> : tensor<1x128x1xi32, #blocked2>
2026-02-21T10:53:44.1028896Z     %cst_11 = arith.constant dense<0.127517432> : tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1029093Z     %cst_12 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1029288Z     %cst_13 = arith.constant dense<0.000000e+00> : tensor<2x128xf32, #blocked6>
2026-02-21T10:53:44.1029447Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T10:53:44.1029601Z     %cst_14 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1029798Z     %cst_15 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1029989Z     %cst_16 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1030144Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T10:53:44.1030258Z     %c192_i32 = arith.constant 192 : i32
2026-02-21T10:53:44.1030372Z     %c49152_i32 = arith.constant 49152 : i32
2026-02-21T10:53:44.1030493Z     %0 = tt.get_program_id x : i32
2026-02-21T10:53:44.1030642Z     %1 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked7>
2026-02-21T10:53:44.1030843Z     %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked7>
2026-02-21T10:53:44.1031049Z     %3 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:53:44.1031253Z     %4 = arith.extsi %1 : tensor<2xi32, #blocked7> to tensor<2xi64, #blocked7>
2026-02-21T10:53:44.1031475Z     %5 = arith.extsi %2 : tensor<128xi32, #blocked7> to tensor<128xi64, #blocked7>
2026-02-21T10:53:44.1031734Z     %6 = ttg.convert_layout %5 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:53:44.1032059Z     %7 = tt.expand_dims %6 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8>
2026-02-21T10:53:44.1032341Z     %8 = ttg.convert_layout %7 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked6>
2026-02-21T10:53:44.1032634Z     %9 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T10:53:44.1032966Z     %10 = tt.expand_dims %9 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi64, #blocked9>
2026-02-21T10:53:44.1033261Z     %11 = ttg.convert_layout %10 : tensor<1x1x128xi64, #blocked9> -> tensor<1x1x128xi64, #blocked1>
2026-02-21T10:53:44.1033506Z     %12 = tt.broadcast %11 : tensor<1x1x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked1>
2026-02-21T10:53:44.1033763Z     %13 = ttg.convert_layout %12 : tensor<1x2x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked3>
2026-02-21T10:53:44.1033977Z     %14 = arith.cmpi sge, %11, %cst_5 : tensor<1x1x128xi64, #blocked1>
2026-02-21T10:53:44.1034149Z     %15 = arith.cmpi slt, %11, %cst_4 : tensor<1x1x128xi64, #blocked1>
2026-02-21T10:53:44.1034311Z     %16 = arith.andi %14, %15 : tensor<1x1x128xi1, #blocked1>
2026-02-21T10:53:44.1034502Z     %17 = tt.broadcast %16 : tensor<1x1x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked1>
2026-02-21T10:53:44.1034737Z     %18 = ttg.convert_layout %17 : tensor<1x2x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked3>
2026-02-21T10:53:44.1034965Z     %19 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.1035254Z     %20 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked10}>>
2026-02-21T10:53:44.1035598Z     %21 = tt.expand_dims %20 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x128x1xi64, #blocked10>
2026-02-21T10:53:44.1035901Z     %22 = ttg.convert_layout %21 : tensor<1x128x1xi64, #blocked10> -> tensor<1x128x1xi64, #blocked2>
2026-02-21T10:53:44.1036166Z     %23 = tt.broadcast %22 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x128xi64, #blocked2>
2026-02-21T10:53:44.1036417Z     %24 = ttg.convert_layout %23 : tensor<1x128x128xi64, #blocked2> -> tensor<1x128x128xi64, #blocked>
2026-02-21T10:53:44.1036636Z     %25 = arith.cmpi sge, %22, %cst_2 : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:53:44.1036805Z     %26 = arith.cmpi slt, %22, %cst_1 : tensor<1x128x1xi64, #blocked2>
2026-02-21T10:53:44.1036964Z     %27 = arith.andi %25, %26 : tensor<1x128x1xi1, #blocked2>
2026-02-21T10:53:44.1037200Z     %28 = ttg.convert_layout %2 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:53:44.1037524Z     %29 = tt.expand_dims %28 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8>
2026-02-21T10:53:44.1037812Z     %30 = ttg.convert_layout %29 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked6>
2026-02-21T10:53:44.1038092Z     %31 = ttg.convert_layout %30 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T10:53:44.1038426Z     %32 = tt.expand_dims %31 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi32, #blocked9>
2026-02-21T10:53:44.1038722Z     %33 = ttg.convert_layout %32 : tensor<1x1x128xi32, #blocked9> -> tensor<1x1x128xi32, #blocked1>
2026-02-21T10:53:44.1038968Z     %34 = tt.broadcast %33 : tensor<1x1x128xi32, #blocked1> -> tensor<1x128x128xi32, #blocked1>
2026-02-21T10:53:44.1039219Z     %35 = ttg.convert_layout %34 : tensor<1x128x128xi32, #blocked1> -> tensor<1x128x128xi32, #blocked>
2026-02-21T10:53:44.1039473Z     %36 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x128x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.1039734Z     %37 = ttg.convert_layout %2 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:53:44.1040054Z     %38 = tt.expand_dims %37 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8>
2026-02-21T10:53:44.1040334Z     %39 = ttg.convert_layout %38 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked6>
2026-02-21T10:53:44.1040635Z     %40 = ttg.convert_layout %39 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T10:53:44.1040967Z     %41 = tt.expand_dims %40 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi32, #blocked9>
2026-02-21T10:53:44.1041260Z     %42 = ttg.convert_layout %41 : tensor<1x1x128xi32, #blocked9> -> tensor<1x1x128xi32, #blocked1>
2026-02-21T10:53:44.1041502Z     %43 = tt.broadcast %42 : tensor<1x1x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked1>
2026-02-21T10:53:44.1041737Z     %44 = ttg.convert_layout %43 : tensor<1x2x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked3>
2026-02-21T10:53:44.1041981Z     %45 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:53:44.1042173Z     scf.for %arg4 = %0 to %c49152_i32 step %c304_i32  : i32 {
2026-02-21T10:53:44.1042315Z       %46 = arith.remsi %arg4, %c192_i32 : i32
2026-02-21T10:53:44.1042442Z       %47 = arith.divsi %arg4, %c192_i32 : i32
2026-02-21T10:53:44.1042663Z       %48 = arith.muli %47, %c2_i32 : i32
2026-02-21T10:53:44.1042802Z       %49 = tt.splat %48 : i32 -> tensor<2xi32, #blocked7>
2026-02-21T10:53:44.1042949Z       %50 = arith.addi %49, %1 : tensor<2xi32, #blocked7>
2026-02-21T10:53:44.1043083Z       %51 = arith.extsi %46 : i32 to i64
2026-02-21T10:53:44.1043224Z       %52 = arith.extsi %48 : i32 to i64
2026-02-21T10:53:44.1043340Z       %53 = arith.muli %51, %c65536_i64 : i64
2026-02-21T10:53:44.1043483Z       %54 = tt.splat %53 : i64 -> tensor<1x2x128xi64, #blocked3>
2026-02-21T10:53:44.1043634Z       %55 = tt.splat %52 : i64 -> tensor<2xi64, #blocked7>
2026-02-21T10:53:44.1043780Z       %56 = arith.addi %55, %4 : tensor<2xi64, #blocked7>
2026-02-21T10:53:44.1044008Z       %57 = ttg.convert_layout %56 : tensor<2xi64, #blocked7> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:53:44.1044330Z       %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x2xi64, #blocked8>
2026-02-21T10:53:44.1044610Z       %59 = ttg.convert_layout %58 : tensor<1x2xi64, #blocked8> -> tensor<1x2xi64, #blocked5>
2026-02-21T10:53:44.1044885Z       %60 = ttg.convert_layout %59 : tensor<1x2xi64, #blocked5> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked11}>>
2026-02-21T10:53:44.1045215Z       %61 = tt.expand_dims %60 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x2x1xi64, #blocked11>
2026-02-21T10:53:44.1045510Z       %62 = ttg.convert_layout %61 : tensor<1x2x1xi64, #blocked11> -> tensor<1x2x1xi64, #blocked4>
2026-02-21T10:53:44.1045714Z       %63 = arith.muli %62, %cst_8 : tensor<1x2x1xi64, #blocked4>
2026-02-21T10:53:44.1045909Z       %64 = tt.broadcast %63 : tensor<1x2x1xi64, #blocked4> -> tensor<1x2x128xi64, #blocked4>
2026-02-21T10:53:44.1046144Z       %65 = ttg.convert_layout %64 : tensor<1x2x128xi64, #blocked4> -> tensor<1x2x128xi64, #blocked3>
2026-02-21T10:53:44.1046354Z       %66 = arith.addi %65, %13 : tensor<1x2x128xi64, #blocked3>
2026-02-21T10:53:44.1046508Z       %67 = arith.addi %54, %66 : tensor<1x2x128xi64, #blocked3>
2026-02-21T10:53:44.1046711Z       %68 = tt.addptr %3, %67 : tensor<1x2x128x!tt.ptr<bf16>, #blocked3>, tensor<1x2x128xi64, #blocked3>
2026-02-21T10:53:44.1046905Z       %69 = arith.cmpi sge, %51, %c0_i64 : i64
2026-02-21T10:53:44.1047027Z       %70 = arith.cmpi slt, %51, %c192_i64 : i64
2026-02-21T10:53:44.1047150Z       %71 = arith.andi %69, %70 : i1
2026-02-21T10:53:44.1047323Z       %72 = arith.cmpi sge, %62, %cst_7 : tensor<1x2x1xi64, #blocked4>
2026-02-21T10:53:44.1047495Z       %73 = arith.cmpi slt, %62, %cst_6 : tensor<1x2x1xi64, #blocked4>
2026-02-21T10:53:44.1047656Z       %74 = arith.andi %72, %73 : tensor<1x2x1xi1, #blocked4>
2026-02-21T10:53:44.1047833Z       %75 = tt.splat %71 : i1 -> tensor<1x2x1xi1, #blocked4>
2026-02-21T10:53:44.1047986Z       %76 = arith.andi %75, %74 : tensor<1x2x1xi1, #blocked4>
2026-02-21T10:53:44.1048189Z       %77 = tt.broadcast %76 : tensor<1x2x1xi1, #blocked4> -> tensor<1x2x128xi1, #blocked4>
2026-02-21T10:53:44.1048428Z       %78 = ttg.convert_layout %77 : tensor<1x2x128xi1, #blocked4> -> tensor<1x2x128xi1, #blocked3>
2026-02-21T10:53:44.1048633Z       %79 = arith.andi %78, %18 : tensor<1x2x128xi1, #blocked3>
2026-02-21T10:53:44.1048801Z       %80 = tt.load %68, %79, %cst_3 : tensor<1x2x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:53:44.1048976Z       %81 = tt.splat %53 : i64 -> tensor<1x128x128xi64, #blocked>
2026-02-21T10:53:44.1049134Z       %82 = tt.splat %71 : i1 -> tensor<1x128x1xi1, #blocked2>
2026-02-21T10:53:44.1049287Z       %83 = arith.andi %82, %27 : tensor<1x128x1xi1, #blocked2>
2026-02-21T10:53:44.1049497Z       %84 = tt.broadcast %83 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x128xi1, #blocked2>
2026-02-21T10:53:44.1049742Z       %85 = ttg.convert_layout %84 : tensor<1x128x128xi1, #blocked2> -> tensor<1x128x128xi1, #blocked>
2026-02-21T10:53:44.1049984Z       %86 = tt.reshape %80 : tensor<1x2x128xbf16, #blocked3> -> tensor<2x128xbf16, #blocked6>
2026-02-21T10:53:44.1050162Z       %87 = arith.muli %46, %c65536_i32 : i32
2026-02-21T10:53:44.1050299Z       %88 = tt.splat %87 : i32 -> tensor<1x128x1xi32, #blocked2>
2026-02-21T10:53:44.1050703Z       %89:3 = scf.for %arg5 = %c0_i32 to %c512_i32 step %c128_i32 iter_args(%arg6 = %cst_16, %arg7 = %cst_15, %arg8 = %cst_14) -> (tensor<1x2xf32, #blocked5>, tensor<1x2xf32, #blocked5>, tensor<1x2x128xf32, #blocked3>)  : i32 {
2026-02-21T10:53:44.1051060Z         %111 = tt.splat %arg5 : i32 -> tensor<128xi32, #blocked7>
2026-02-21T10:53:44.1051221Z         %112 = arith.addi %111, %2 : tensor<128xi32, #blocked7>
2026-02-21T10:53:44.1051361Z         %113 = arith.extsi %arg5 : i32 to i64
2026-02-21T10:53:44.1051500Z         %114 = tt.splat %113 : i64 -> tensor<128xi64, #blocked7>
2026-02-21T10:53:44.1051653Z         %115 = arith.addi %114, %5 : tensor<128xi64, #blocked7>
2026-02-21T10:53:44.1051892Z         %116 = ttg.convert_layout %115 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:53:44.1052232Z         %117 = tt.expand_dims %116 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8>
2026-02-21T10:53:44.1052529Z         %118 = ttg.convert_layout %117 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked6>
2026-02-21T10:53:44.1052821Z         %119 = ttg.convert_layout %118 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T10:53:44.1053169Z         %120 = tt.expand_dims %119 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi64, #blocked9>
2026-02-21T10:53:44.1053477Z         %121 = ttg.convert_layout %120 : tensor<1x1x128xi64, #blocked9> -> tensor<1x1x128xi64, #blocked1>
2026-02-21T10:53:44.1053695Z         %122 = arith.muli %121, %cst_4 : tensor<1x1x128xi64, #blocked1>
2026-02-21T10:53:44.1053903Z         %123 = tt.broadcast %122 : tensor<1x1x128xi64, #blocked1> -> tensor<1x128x128xi64, #blocked1>
2026-02-21T10:53:44.1054160Z         %124 = ttg.convert_layout %123 : tensor<1x128x128xi64, #blocked1> -> tensor<1x128x128xi64, #blocked>
2026-02-21T10:53:44.1054381Z         %125 = arith.addi %24, %124 : tensor<1x128x128xi64, #blocked>
2026-02-21T10:53:44.1054543Z         %126 = arith.addi %81, %125 : tensor<1x128x128xi64, #blocked>
2026-02-21T10:53:44.1054758Z         %127 = tt.addptr %19, %126 : tensor<1x128x128x!tt.ptr<bf16>, #blocked>, tensor<1x128x128xi64, #blocked>
2026-02-21T10:53:44.1055002Z         %128 = arith.cmpi sge, %121, %cst_5 : tensor<1x1x128xi64, #blocked1>
2026-02-21T10:53:44.1055182Z         %129 = arith.cmpi slt, %121, %cst_0 : tensor<1x1x128xi64, #blocked1>
2026-02-21T10:53:44.1055357Z         %130 = arith.andi %128, %129 : tensor<1x1x128xi1, #blocked1>
2026-02-21T10:53:44.1055557Z         %131 = tt.broadcast %130 : tensor<1x1x128xi1, #blocked1> -> tensor<1x128x128xi1, #blocked1>
2026-02-21T10:53:44.1055826Z         %132 = ttg.convert_layout %131 : tensor<1x128x128xi1, #blocked1> -> tensor<1x128x128xi1, #blocked>
2026-02-21T10:53:44.1056044Z         %133 = arith.andi %85, %132 : tensor<1x128x128xi1, #blocked>
2026-02-21T10:53:44.1056221Z         %134 = tt.load %127, %133, %cst : tensor<1x128x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.1056441Z         %135 = tt.reshape %134 : tensor<1x128x128xbf16, #blocked> -> tensor<128x128xbf16, #blocked6>
2026-02-21T10:53:44.1056738Z         %136 = ttg.convert_layout %86 : tensor<2x128xbf16, #blocked6> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>>
2026-02-21T10:53:44.1057113Z         %137 = ttg.convert_layout %135 : tensor<128x128xbf16, #blocked6> -> tensor<128x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>>
2026-02-21T10:53:44.1057422Z         %138 = ttg.convert_layout %cst_13 : tensor<2x128xf32, #blocked6> -> tensor<2x128xf32, #blocked6>
2026-02-21T10:53:44.1057846Z         %139 = tt.dot %136, %137, %138, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * tensor<128x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<2x128xf32, #blocked6>
2026-02-21T10:53:44.1058251Z         %140 = tt.reshape %139 : tensor<2x128xf32, #blocked6> -> tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1058509Z         %141 = arith.truncf %140 : tensor<1x2x128xf32, #blocked3> to tensor<1x2x128xbf16, #blocked3>
2026-02-21T10:53:44.1058760Z         %142 = arith.extf %141 : tensor<1x2x128xbf16, #blocked3> to tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1058963Z         %143 = "tt.reduce"(%142) <{axis = 2 : i32}> ({
2026-02-21T10:53:44.1059095Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T10:53:44.1059225Z           %198 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T10:53:44.1059354Z           tt.reduce.return %198 : f32
2026-02-21T10:53:44.1059550Z         }) : (tensor<1x2x128xf32, #blocked3>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked3}>>
2026-02-21T10:53:44.1059851Z         %144 = ttg.convert_layout %143 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1060130Z         %145 = arith.truncf %144 : tensor<1x2xf32, #blocked5> to tensor<1x2xbf16, #blocked5>
2026-02-21T10:53:44.1060357Z         %146 = arith.extf %145 : tensor<1x2xbf16, #blocked5> to tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1060552Z         %147 = arith.mulf %146, %cst_12 : tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1060750Z         %148 = arith.truncf %147 : tensor<1x2xf32, #blocked5> to tensor<1x2xbf16, #blocked5>
2026-02-21T10:53:44.1060971Z         %149 = arith.extf %148 : tensor<1x2xbf16, #blocked5> to tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1061175Z         %150 = arith.cmpf ogt, %arg6, %149 : tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1061355Z         %151 = arith.cmpf une, %arg6, %arg6 : tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1061525Z         %152 = arith.ori %150, %151 : tensor<1x2xi1, #blocked5>
2026-02-21T10:53:44.1061725Z         %153 = arith.select %152, %arg6, %149 : tensor<1x2xi1, #blocked5>, tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1061932Z         %154 = arith.mulf %142, %cst_11 : tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1062145Z         %155 = arith.truncf %154 : tensor<1x2x128xf32, #blocked3> to tensor<1x2x128xbf16, #blocked3>
2026-02-21T10:53:44.1062441Z         %156 = ttg.convert_layout %153 : tensor<1x2xf32, #blocked5> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>>
2026-02-21T10:53:44.1062806Z         %157 = tt.expand_dims %156 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x2x1xf32, #blocked11>
2026-02-21T10:53:44.1063115Z         %158 = ttg.convert_layout %157 : tensor<1x2x1xf32, #blocked11> -> tensor<1x2x1xf32, #blocked4>
2026-02-21T10:53:44.1063363Z         %159 = arith.extf %155 : tensor<1x2x128xbf16, #blocked3> to tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1063610Z         %160 = tt.broadcast %158 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4>
2026-02-21T10:53:44.1063879Z         %161 = ttg.convert_layout %160 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1064095Z         %162 = arith.subf %159, %161 : tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1064412Z         %163 = tt.extern_elementwise %162 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x128xf32, #blocked3>) -> tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1064704Z         %164 = "tt.reduce"(%163) <{axis = 2 : i32}> ({
2026-02-21T10:53:44.1064841Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T10:53:44.1064963Z           %198 = arith.addf %arg9, %arg10 : f32
2026-02-21T10:53:44.1065114Z           tt.reduce.return %198 : f32
2026-02-21T10:53:44.1065310Z         }) : (tensor<1x2x128xf32, #blocked3>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked3}>>
2026-02-21T10:53:44.1065604Z         %165 = ttg.convert_layout %164 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1065853Z         %166 = arith.subf %arg6, %153 : tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1066146Z         %167 = tt.extern_elementwise %166 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked5>) -> tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1066457Z         %168 = arith.mulf %arg7, %167 : tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1066623Z         %169 = arith.addf %168, %165 : tensor<1x2xf32, #blocked5>
2026-02-21T10:53:44.1066871Z         %170 = ttg.convert_layout %167 : tensor<1x2xf32, #blocked5> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>>
2026-02-21T10:53:44.1067218Z         %171 = tt.expand_dims %170 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x2x1xf32, #blocked11>
2026-02-21T10:53:44.1067519Z         %172 = ttg.convert_layout %171 : tensor<1x2x1xf32, #blocked11> -> tensor<1x2x1xf32, #blocked4>
2026-02-21T10:53:44.1067771Z         %173 = tt.broadcast %172 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4>
2026-02-21T10:53:44.1068024Z         %174 = ttg.convert_layout %173 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1068242Z         %175 = arith.mulf %arg8, %174 : tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1068495Z         %176 = ttg.convert_layout %112 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:53:44.1068831Z         %177 = tt.expand_dims %176 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8>
2026-02-21T10:53:44.1069132Z         %178 = ttg.convert_layout %177 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked6>
2026-02-21T10:53:44.1069430Z         %179 = ttg.convert_layout %178 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked10}>>
2026-02-21T10:53:44.1069780Z         %180 = tt.expand_dims %179 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x128x1xi32, #blocked10>
2026-02-21T10:53:44.1070097Z         %181 = ttg.convert_layout %180 : tensor<1x128x1xi32, #blocked10> -> tensor<1x128x1xi32, #blocked2>
2026-02-21T10:53:44.1070321Z         %182 = arith.muli %181, %cst_10 : tensor<1x128x1xi32, #blocked2>
2026-02-21T10:53:44.1070493Z         %183 = arith.addi %88, %182 : tensor<1x128x1xi32, #blocked2>
2026-02-21T10:53:44.1070704Z         %184 = tt.broadcast %183 : tensor<1x128x1xi32, #blocked2> -> tensor<1x128x128xi32, #blocked2>
2026-02-21T10:53:44.1070983Z         %185 = ttg.convert_layout %184 : tensor<1x128x128xi32, #blocked2> -> tensor<1x128x128xi32, #blocked>
2026-02-21T10:53:44.1071208Z         %186 = arith.addi %185, %35 : tensor<1x128x128xi32, #blocked>
2026-02-21T10:53:44.1071430Z         %187 = tt.addptr %36, %186 : tensor<1x128x128x!tt.ptr<bf16>, #blocked>, tensor<1x128x128xi32, #blocked>
2026-02-21T10:53:44.1071653Z         %188 = tt.load %187 : tensor<1x128x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.1071880Z         %189 = arith.truncf %163 : tensor<1x2x128xf32, #blocked3> to tensor<1x2x128xbf16, #blocked3>
2026-02-21T10:53:44.1072124Z         %190 = tt.reshape %175 : tensor<1x2x128xf32, #blocked3> -> tensor<2x128xf32, #blocked6>
2026-02-21T10:53:44.1072366Z         %191 = tt.reshape %189 : tensor<1x2x128xbf16, #blocked3> -> tensor<2x128xbf16, #blocked6>
2026-02-21T10:53:44.1072614Z         %192 = tt.reshape %188 : tensor<1x128x128xbf16, #blocked> -> tensor<128x128xbf16, #blocked6>
2026-02-21T10:53:44.1072921Z         %193 = ttg.convert_layout %191 : tensor<2x128xbf16, #blocked6> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>>
2026-02-21T10:53:44.1073303Z         %194 = ttg.convert_layout %192 : tensor<128x128xbf16, #blocked6> -> tensor<128x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>>
2026-02-21T10:53:44.1073608Z         %195 = ttg.convert_layout %190 : tensor<2x128xf32, #blocked6> -> tensor<2x128xf32, #blocked6>
2026-02-21T10:53:44.1074023Z         %196 = tt.dot %193, %194, %195, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * tensor<128x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<2x128xf32, #blocked6>
2026-02-21T10:53:44.1074426Z         %197 = tt.reshape %196 : tensor<2x128xf32, #blocked6> -> tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1074719Z         scf.yield %153, %169, %197 : tensor<1x2xf32, #blocked5>, tensor<1x2xf32, #blocked5>, tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1074933Z       }
2026-02-21T10:53:44.1075126Z       %90 = ttg.convert_layout %89#1 : tensor<1x2xf32, #blocked5> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>>
2026-02-21T10:53:44.1075472Z       %91 = tt.expand_dims %90 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x2x1xf32, #blocked11>
2026-02-21T10:53:44.1075772Z       %92 = ttg.convert_layout %91 : tensor<1x2x1xf32, #blocked11> -> tensor<1x2x1xf32, #blocked4>
2026-02-21T10:53:44.1076027Z       %93 = tt.broadcast %92 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4>
2026-02-21T10:53:44.1076272Z       %94 = ttg.convert_layout %93 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1076490Z       %95 = arith.divf %89#2, %94 : tensor<1x2x128xf32, #blocked3>
2026-02-21T10:53:44.1076694Z       %96 = arith.truncf %95 : tensor<1x2x128xf32, #blocked3> to tensor<1x2x128xbf16, #blocked3>
2026-02-21T10:53:44.1076885Z       %97 = arith.muli %46, %c65536_i32 : i32
2026-02-21T10:53:44.1077102Z       %98 = ttg.convert_layout %50 : tensor<2xi32, #blocked7> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T10:53:44.1077422Z       %99 = tt.expand_dims %98 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x2xi32, #blocked8>
2026-02-21T10:53:44.1077710Z       %100 = ttg.convert_layout %99 : tensor<1x2xi32, #blocked8> -> tensor<1x2xi32, #blocked5>
2026-02-21T10:53:44.1077996Z       %101 = ttg.convert_layout %100 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked11}>>
2026-02-21T10:53:44.1078337Z       %102 = tt.expand_dims %101 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked11}>> -> tensor<1x2x1xi32, #blocked11>
2026-02-21T10:53:44.1078637Z       %103 = ttg.convert_layout %102 : tensor<1x2x1xi32, #blocked11> -> tensor<1x2x1xi32, #blocked4>
2026-02-21T10:53:44.1078857Z       %104 = arith.muli %103, %cst_9 : tensor<1x2x1xi32, #blocked4>
2026-02-21T10:53:44.1079045Z       %105 = tt.splat %97 : i32 -> tensor<1x2x1xi32, #blocked4>
2026-02-21T10:53:44.1079206Z       %106 = arith.addi %105, %104 : tensor<1x2x1xi32, #blocked4>
2026-02-21T10:53:44.1079409Z       %107 = tt.broadcast %106 : tensor<1x2x1xi32, #blocked4> -> tensor<1x2x128xi32, #blocked4>
2026-02-21T10:53:44.1079657Z       %108 = ttg.convert_layout %107 : tensor<1x2x128xi32, #blocked4> -> tensor<1x2x128xi32, #blocked3>
2026-02-21T10:53:44.1079892Z       %109 = arith.addi %108, %44 : tensor<1x2x128xi32, #blocked3>
2026-02-21T10:53:44.1080107Z       %110 = tt.addptr %45, %109 : tensor<1x2x128x!tt.ptr<bf16>, #blocked3>, tensor<1x2x128xi32, #blocked3>
2026-02-21T10:53:44.1080325Z       tt.store %110, %96 : tensor<1x2x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T10:53:44.1080477Z     } {tt.loop_unroll_factor = 1 : i32}
2026-02-21T10:53:44.1080591Z     tt.return
2026-02-21T10:53:44.1080678Z   }
2026-02-21T10:53:44.1080757Z }
2026-02-21T10:53:44.1080804Z 
2026-02-21T10:53:44.1080841Z {-#
2026-02-21T10:53:44.1080922Z   external_resources: {
2026-02-21T10:53:44.1081029Z     mlir_reproducer: {
2026-02-21T10:53:44.1083433Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T10:53:44.1085725Z       disable_threading: false,
2026-02-21T10:53:44.1085840Z       verify_each: true
2026-02-21T10:53:44.1085932Z     }
2026-02-21T10:53:44.1086015Z   }
2026-02-21T10:53:44.1086087Z #-}
2026-02-21T10:53:44.1086367Z /tmp/torchinductor_root/rd/crdzuare6tius3irgn2fl3moepu6my2a3zv53zxi3c2hi2j3pq4r.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T10:53:44.1087076Z /tmp/torchinductor_root/rd/crdzuare6tius3irgn2fl3moepu6my2a3zv53zxi3c2hi2j3pq4r.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T10:53:44.1087633Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T10:53:44.1088449Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[0, 0], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T10:53:44.1089185Z Error: RuntimeError: PassManager::run failed
2026-02-21T10:53:44.1089356Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T10:53:44.8628517Z /tmp/torchinductor_root/6v/c6vbchyecgbseqxf7v66tjcrvldnfmtcbunqfzmowyfinztxvrmq.py:55:129: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T10:53:44.8629765Z         k = tl.load(k_view + (indices_0[:, None, None] * 65536 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None)
2026-02-21T10:53:44.8630394Z                                                                                                                                 ^
2026-02-21T10:53:44.8632047Z /tmp/torchinductor_root/6v/c6vbchyecgbseqxf7v66tjcrvldnfmtcbunqfzmowyfinztxvrmq.py:57:141: note: - use: %132 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x64xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x64xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>>
2026-02-21T10:53:44.8633443Z 
2026-02-21T10:53:44.8634252Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T10:53:44.8635399Z                                                                                                                                             ^
2026-02-21T10:53:44.8635800Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T10:53:44.8636309Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:53:44.8636956Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:53:44.8637652Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T10:53:44.8638279Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T10:53:44.8638882Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T10:53:44.8639457Z #blocked5 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}>
2026-02-21T10:53:44.8640032Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}>
2026-02-21T10:53:44.8640652Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:53:44.8641293Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:53:44.8641929Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T10:53:44.8642549Z #blocked10 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T10:53:44.8643327Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T10:53:44.8644373Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T10:53:44.8645201Z     %c65536_i32 = arith.constant 65536 : i32
2026-02-21T10:53:44.8645460Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T10:53:44.8645678Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T10:53:44.8645855Z     %c65536_i64 = arith.constant 65536 : i64
2026-02-21T10:53:44.8646090Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked>
2026-02-21T10:53:44.8646380Z     %cst_0 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked>
2026-02-21T10:53:44.8646672Z     %cst_1 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked>
2026-02-21T10:53:44.8646925Z     %cst_2 = arith.constant dense<512> : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:53:44.8647181Z     %cst_3 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:53:44.8647430Z     %cst_4 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:53:44.8647640Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T10:53:44.8647834Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T10:53:44.8648003Z     %c3072_i32 = arith.constant 3072 : i32
2026-02-21T10:53:44.8648210Z     %cst_5 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked1>
2026-02-21T10:53:44.8648462Z     %cst_6 = arith.constant dense<128> : tensor<1x64x1xi32, #blocked2>
2026-02-21T10:53:44.8648734Z     %cst_7 = arith.constant dense<0.127517432> : tensor<1x2x64xf32, #blocked>
2026-02-21T10:53:44.8649012Z     %cst_8 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8649289Z     %cst_9 = arith.constant dense<0.000000e+00> : tensor<2x64xf32, #blocked4>
2026-02-21T10:53:44.8649558Z     %cst_10 = arith.constant dense<128> : tensor<1x1x64xi32, #blocked>
2026-02-21T10:53:44.8649797Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T10:53:44.8650017Z     %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked>
2026-02-21T10:53:44.8650298Z     %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8650576Z     %cst_13 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8650799Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T10:53:44.8650959Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T10:53:44.8651123Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T10:53:44.8651291Z     %0 = tt.get_program_id x : i32
2026-02-21T10:53:44.8651483Z     %1 = arith.divsi %0, %c3072_i32 : i32
2026-02-21T10:53:44.8651646Z     %2 = arith.muli %1, %c16_i32 : i32
2026-02-21T10:53:44.8651810Z     %3 = arith.subi %c256_i32, %2 : i32
2026-02-21T10:53:44.8651969Z     %4 = arith.minsi %3, %c16_i32 : i32
2026-02-21T10:53:44.8652131Z     %5 = arith.remsi %0, %c3072_i32 : i32
2026-02-21T10:53:44.8652302Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T10:53:44.8652455Z     %7 = arith.addi %2, %6 : i32
2026-02-21T10:53:44.8652607Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T10:53:44.8652759Z     %9 = arith.muli %7, %c2_i32 : i32
2026-02-21T10:53:44.8652979Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked5>
2026-02-21T10:53:44.8653236Z     %11 = tt.splat %9 : i32 -> tensor<2xi32, #blocked5>
2026-02-21T10:53:44.8653449Z     %12 = arith.addi %11, %10 : tensor<2xi32, #blocked5>
2026-02-21T10:53:44.8653698Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked5>
2026-02-21T10:53:44.8653936Z     %14 = arith.extsi %8 : i32 to i64
2026-02-21T10:53:44.8654099Z     %15 = arith.extsi %9 : i32 to i64
2026-02-21T10:53:44.8654320Z     %16 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.8654559Z     %17 = arith.muli %14, %c65536_i64 : i64
2026-02-21T10:53:44.8654754Z     %18 = tt.splat %17 : i64 -> tensor<1x2x128xi64, #blocked>
2026-02-21T10:53:44.8654973Z     %19 = tt.splat %15 : i64 -> tensor<2xi64, #blocked5>
2026-02-21T10:53:44.8655216Z     %20 = arith.extsi %10 : tensor<2xi32, #blocked5> to tensor<2xi64, #blocked5>
2026-02-21T10:53:44.8655464Z     %21 = arith.addi %19, %20 : tensor<2xi64, #blocked5>
2026-02-21T10:53:44.8655774Z     %22 = ttg.convert_layout %21 : tensor<2xi64, #blocked5> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:53:44.8656136Z     %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi64, #blocked6>
2026-02-21T10:53:44.8656444Z     %24 = ttg.convert_layout %23 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #blocked3>
2026-02-21T10:53:44.8656751Z     %25 = ttg.convert_layout %24 : tensor<1x2xi64, #blocked3> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T10:53:44.8657137Z     %26 = tt.expand_dims %25 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi64, #blocked7>
2026-02-21T10:53:44.8657457Z     %27 = ttg.convert_layout %26 : tensor<1x2x1xi64, #blocked7> -> tensor<1x2x1xi64, #blocked1>
2026-02-21T10:53:44.8657685Z     %28 = arith.muli %27, %cst_4 : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:53:44.8657899Z     %29 = tt.broadcast %28 : tensor<1x2x1xi64, #blocked1> -> tensor<1x2x128xi64, #blocked1>
2026-02-21T10:53:44.8658182Z     %30 = ttg.convert_layout %29 : tensor<1x2x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked>
2026-02-21T10:53:44.8658441Z     %31 = arith.extsi %13 : tensor<128xi32, #blocked5> to tensor<128xi64, #blocked5>
2026-02-21T10:53:44.8658735Z     %32 = ttg.convert_layout %31 : tensor<128xi64, #blocked5> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:53:44.8659096Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi64, #blocked6>
2026-02-21T10:53:44.8659433Z     %34 = ttg.convert_layout %33 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #blocked4>
2026-02-21T10:53:44.8659750Z     %35 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked4> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T10:53:44.8660125Z     %36 = tt.expand_dims %35 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi64, #blocked8>
2026-02-21T10:53:44.8660458Z     %37 = ttg.convert_layout %36 : tensor<1x1x128xi64, #blocked8> -> tensor<1x1x128xi64, #blocked>
2026-02-21T10:53:44.8660726Z     %38 = tt.broadcast %37 : tensor<1x1x128xi64, #blocked> -> tensor<1x2x128xi64, #blocked>
2026-02-21T10:53:44.8660963Z     %39 = arith.addi %30, %38 : tensor<1x2x128xi64, #blocked>
2026-02-21T10:53:44.8661140Z     %40 = arith.addi %18, %39 : tensor<1x2x128xi64, #blocked>
2026-02-21T10:53:44.8661367Z     %41 = tt.addptr %16, %40 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>, tensor<1x2x128xi64, #blocked>
2026-02-21T10:53:44.8661581Z     %42 = arith.cmpi sge, %14, %c0_i64 : i64
2026-02-21T10:53:44.8661723Z     %43 = arith.cmpi slt, %14, %c192_i64 : i64
2026-02-21T10:53:44.8661856Z     %44 = arith.andi %42, %43 : i1
2026-02-21T10:53:44.8662013Z     %45 = arith.cmpi sge, %27, %cst_3 : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:53:44.8662205Z     %46 = arith.cmpi slt, %27, %cst_2 : tensor<1x2x1xi64, #blocked1>
2026-02-21T10:53:44.8662386Z     %47 = arith.andi %45, %46 : tensor<1x2x1xi1, #blocked1>
2026-02-21T10:53:44.8662556Z     %48 = tt.splat %44 : i1 -> tensor<1x2x1xi1, #blocked1>
2026-02-21T10:53:44.8662720Z     %49 = arith.andi %48, %47 : tensor<1x2x1xi1, #blocked1>
2026-02-21T10:53:44.8662926Z     %50 = tt.broadcast %49 : tensor<1x2x1xi1, #blocked1> -> tensor<1x2x128xi1, #blocked1>
2026-02-21T10:53:44.8663189Z     %51 = ttg.convert_layout %50 : tensor<1x2x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked>
2026-02-21T10:53:44.8663428Z     %52 = arith.cmpi sge, %37, %cst_1 : tensor<1x1x128xi64, #blocked>
2026-02-21T10:53:44.8663619Z     %53 = arith.cmpi slt, %37, %cst_0 : tensor<1x1x128xi64, #blocked>
2026-02-21T10:53:44.8663798Z     %54 = arith.andi %52, %53 : tensor<1x1x128xi1, #blocked>
2026-02-21T10:53:44.8664004Z     %55 = tt.broadcast %54 : tensor<1x1x128xi1, #blocked> -> tensor<1x2x128xi1, #blocked>
2026-02-21T10:53:44.8664211Z     %56 = arith.andi %51, %55 : tensor<1x2x128xi1, #blocked>
2026-02-21T10:53:44.8664393Z     %57 = tt.load %41, %56, %cst : tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.8664602Z     %58 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #blocked5>
2026-02-21T10:53:44.8664788Z     %59 = arith.muli %8, %c65536_i32 : i32
2026-02-21T10:53:44.8665032Z     %60 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:53:44.8665396Z     %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6>
2026-02-21T10:53:44.8665718Z     %62 = ttg.convert_layout %61 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4>
2026-02-21T10:53:44.8665997Z     %63 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T10:53:44.8666335Z     %64 = tt.expand_dims %63 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x128x1xi32, #blocked9>
2026-02-21T10:53:44.8666658Z     %65 = ttg.convert_layout %64 : tensor<1x128x1xi32, #blocked9> -> tensor<1x128x1xi32, #blocked2>
2026-02-21T10:53:44.8666861Z     %66 = tt.splat %59 : i32 -> tensor<1x128x1xi32, #blocked2>
2026-02-21T10:53:44.8667016Z     %67 = arith.addi %66, %65 : tensor<1x128x1xi32, #blocked2>
2026-02-21T10:53:44.8667207Z     %68 = tt.broadcast %67 : tensor<1x128x1xi32, #blocked2> -> tensor<1x128x64xi32, #blocked2>
2026-02-21T10:53:44.8667448Z     %69 = ttg.convert_layout %68 : tensor<1x128x64xi32, #blocked2> -> tensor<1x128x64xi32, #blocked>
2026-02-21T10:53:44.8667680Z     %70 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x64x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.8667917Z     %71 = tt.reshape %57 : tensor<1x2x128xbf16, #blocked> -> tensor<2x128xbf16, #blocked4>
2026-02-21T10:53:44.8668108Z     %72 = tt.splat %59 : i32 -> tensor<1x64x1xi32, #blocked2>
2026-02-21T10:53:44.8668342Z     %73 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T10:53:44.8668674Z     %74 = tt.expand_dims %73 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8>
2026-02-21T10:53:44.8668967Z     %75 = ttg.convert_layout %74 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked>
2026-02-21T10:53:44.8669220Z     %76 = tt.broadcast %75 : tensor<1x1x128xi32, #blocked> -> tensor<1x64x128xi32, #blocked>
2026-02-21T10:53:44.8669440Z     %77 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x64x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.8669816Z     %78:3 = scf.for %arg4 = %c0_i32 to %c512_i32 step %c64_i32 iter_args(%arg5 = %cst_13, %arg6 = %cst_12, %arg7 = %cst_11) -> (tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>)  : i32 {
2026-02-21T10:53:44.8670165Z       %108 = tt.splat %arg4 : i32 -> tensor<64xi32, #blocked5>
2026-02-21T10:53:44.8670324Z       %109 = arith.addi %108, %58 : tensor<64xi32, #blocked5>
2026-02-21T10:53:44.8670557Z       %110 = ttg.convert_layout %109 : tensor<64xi32, #blocked5> -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:53:44.8670884Z       %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x64xi32, #blocked6>
2026-02-21T10:53:44.8671177Z       %112 = ttg.convert_layout %111 : tensor<1x64xi32, #blocked6> -> tensor<1x64xi32, #blocked4>
2026-02-21T10:53:44.8671458Z       %113 = ttg.convert_layout %112 : tensor<1x64xi32, #blocked4> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T10:53:44.8671797Z       %114 = tt.expand_dims %113 {axis = 1 : i32} : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x64xi32, #blocked8>
2026-02-21T10:53:44.8672095Z       %115 = ttg.convert_layout %114 : tensor<1x1x64xi32, #blocked8> -> tensor<1x1x64xi32, #blocked>
2026-02-21T10:53:44.8672309Z       %116 = arith.muli %115, %cst_10 : tensor<1x1x64xi32, #blocked>
2026-02-21T10:53:44.8672510Z       %117 = tt.broadcast %116 : tensor<1x1x64xi32, #blocked> -> tensor<1x128x64xi32, #blocked>
2026-02-21T10:53:44.8672708Z       %118 = arith.addi %69, %117 : tensor<1x128x64xi32, #blocked>
2026-02-21T10:53:44.8672922Z       %119 = tt.addptr %70, %118 : tensor<1x128x64x!tt.ptr<bf16>, #blocked>, tensor<1x128x64xi32, #blocked>
2026-02-21T10:53:44.8673134Z       %120 = tt.load %119 : tensor<1x128x64x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.8673337Z       %121 = tt.reshape %120 : tensor<1x128x64xbf16, #blocked> -> tensor<128x64xbf16, #blocked4>
2026-02-21T10:53:44.8673645Z       %122 = ttg.convert_layout %71 : tensor<2x128xbf16, #blocked4> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked4}>>
2026-02-21T10:53:44.8673996Z       %123 = ttg.convert_layout %121 : tensor<128x64xbf16, #blocked4> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked4}>>
2026-02-21T10:53:44.8674294Z       %124 = ttg.convert_layout %cst_9 : tensor<2x64xf32, #blocked4> -> tensor<2x64xf32, #blocked4>
2026-02-21T10:53:44.8674711Z       %125 = tt.dot %122, %123, %124, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked4}>> * tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked4}>> -> tensor<2x64xf32, #blocked4>
2026-02-21T10:53:44.8675094Z       %126 = tt.reshape %125 : tensor<2x64xf32, #blocked4> -> tensor<1x2x64xf32, #blocked>
2026-02-21T10:53:44.8675325Z       %127 = arith.truncf %126 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked>
2026-02-21T10:53:44.8675555Z       %128 = arith.extf %127 : tensor<1x2x64xbf16, #blocked> to tensor<1x2x64xf32, #blocked>
2026-02-21T10:53:44.8675740Z       %129 = "tt.reduce"(%128) <{axis = 2 : i32}> ({
2026-02-21T10:53:44.8675880Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T10:53:44.8676004Z         %182 = arith.maxnumf %arg8, %arg9 : f32
2026-02-21T10:53:44.8676124Z         tt.reduce.return %182 : f32
2026-02-21T10:53:44.8676307Z       }) : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T10:53:44.8676593Z       %130 = ttg.convert_layout %129 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8676860Z       %131 = arith.truncf %130 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3>
2026-02-21T10:53:44.8677094Z       %132 = arith.extf %131 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8677285Z       %133 = arith.mulf %132, %cst_8 : tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8677475Z       %134 = arith.truncf %133 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3>
2026-02-21T10:53:44.8677691Z       %135 = arith.extf %134 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8677885Z       %136 = arith.cmpf ogt, %arg5, %135 : tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8678057Z       %137 = arith.cmpf une, %arg5, %arg5 : tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8678218Z       %138 = arith.ori %136, %137 : tensor<1x2xi1, #blocked3>
2026-02-21T10:53:44.8678413Z       %139 = arith.select %138, %arg5, %135 : tensor<1x2xi1, #blocked3>, tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8678619Z       %140 = arith.mulf %128, %cst_7 : tensor<1x2x64xf32, #blocked>
2026-02-21T10:53:44.8678815Z       %141 = arith.truncf %140 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked>
2026-02-21T10:53:44.8679098Z       %142 = ttg.convert_layout %139 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T10:53:44.8679430Z       %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T10:53:44.8679724Z       %144 = ttg.convert_layout %143 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T10:53:44.8679960Z       %145 = arith.extf %141 : tensor<1x2x64xbf16, #blocked> to tensor<1x2x64xf32, #blocked>
2026-02-21T10:53:44.8680188Z       %146 = tt.broadcast %144 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x64xf32, #blocked1>
2026-02-21T10:53:44.8680427Z       %147 = ttg.convert_layout %146 : tensor<1x2x64xf32, #blocked1> -> tensor<1x2x64xf32, #blocked>
2026-02-21T10:53:44.8680633Z       %148 = arith.subf %145, %147 : tensor<1x2x64xf32, #blocked>
2026-02-21T10:53:44.8680928Z       %149 = tt.extern_elementwise %148 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2x64xf32, #blocked>
2026-02-21T10:53:44.8681227Z       %150 = "tt.reduce"(%149) <{axis = 2 : i32}> ({
2026-02-21T10:53:44.8681351Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T10:53:44.8681468Z         %182 = arith.addf %arg8, %arg9 : f32
2026-02-21T10:53:44.8681585Z         tt.reduce.return %182 : f32
2026-02-21T10:53:44.8681765Z       }) : (tensor<1x2x64xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T10:53:44.8682045Z       %151 = ttg.convert_layout %150 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8682301Z       %152 = arith.subf %arg5, %139 : tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8682634Z       %153 = tt.extern_elementwise %152 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked3>) -> tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8682919Z       %154 = arith.mulf %arg6, %153 : tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8683077Z       %155 = arith.addf %154, %151 : tensor<1x2xf32, #blocked3>
2026-02-21T10:53:44.8683315Z       %156 = ttg.convert_layout %153 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T10:53:44.8683665Z       %157 = tt.expand_dims %156 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T10:53:44.8683958Z       %158 = ttg.convert_layout %157 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T10:53:44.8684198Z       %159 = tt.broadcast %158 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1>
2026-02-21T10:53:44.8684440Z       %160 = ttg.convert_layout %159 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked>
2026-02-21T10:53:44.8684651Z       %161 = arith.mulf %arg7, %160 : tensor<1x2x128xf32, #blocked>
2026-02-21T10:53:44.8684918Z       %162 = ttg.convert_layout %112 : tensor<1x64xi32, #blocked4> -> tensor<1x64xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T10:53:44.8685256Z       %163 = tt.expand_dims %162 {axis = 2 : i32} : tensor<1x64xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x64x1xi32, #blocked9>
2026-02-21T10:53:44.8685554Z       %164 = ttg.convert_layout %163 : tensor<1x64x1xi32, #blocked9> -> tensor<1x64x1xi32, #blocked2>
2026-02-21T10:53:44.8685766Z       %165 = arith.muli %164, %cst_6 : tensor<1x64x1xi32, #blocked2>
2026-02-21T10:53:44.8685927Z       %166 = arith.addi %72, %165 : tensor<1x64x1xi32, #blocked2>
2026-02-21T10:53:44.8686126Z       %167 = tt.broadcast %166 : tensor<1x64x1xi32, #blocked2> -> tensor<1x64x128xi32, #blocked2>
2026-02-21T10:53:44.8686383Z       %168 = ttg.convert_layout %167 : tensor<1x64x128xi32, #blocked2> -> tensor<1x64x128xi32, #blocked>
2026-02-21T10:53:44.8686591Z       %169 = arith.addi %168, %76 : tensor<1x64x128xi32, #blocked>
2026-02-21T10:53:44.8686808Z       %170 = tt.addptr %77, %169 : tensor<1x64x128x!tt.ptr<bf16>, #blocked>, tensor<1x64x128xi32, #blocked>
2026-02-21T10:53:44.8687019Z       %171 = tt.load %170 : tensor<1x64x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.8687219Z       %172 = arith.truncf %149 : tensor<1x2x64xf32, #blocked> to tensor<1x2x64xbf16, #blocked>
2026-02-21T10:53:44.8687450Z       %173 = tt.reshape %161 : tensor<1x2x128xf32, #blocked> -> tensor<2x128xf32, #blocked4>
2026-02-21T10:53:44.8687674Z       %174 = tt.reshape %172 : tensor<1x2x64xbf16, #blocked> -> tensor<2x64xbf16, #blocked4>
2026-02-21T10:53:44.8687903Z       %175 = tt.reshape %171 : tensor<1x64x128xbf16, #blocked> -> tensor<64x128xbf16, #blocked4>
2026-02-21T10:53:44.8688195Z       %176 = ttg.convert_layout %174 : tensor<2x64xbf16, #blocked4> -> tensor<2x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>>
2026-02-21T10:53:44.8688548Z       %177 = ttg.convert_layout %175 : tensor<64x128xbf16, #blocked4> -> tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>>
2026-02-21T10:53:44.8688848Z       %178 = ttg.convert_layout %173 : tensor<2x128xf32, #blocked4> -> tensor<2x128xf32, #blocked10>
2026-02-21T10:53:44.8689254Z       %179 = tt.dot %176, %177, %178, inputPrecision = tf32 : tensor<2x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x128xf32, #blocked10>
2026-02-21T10:53:44.8689677Z       %180 = ttg.convert_layout %179 : tensor<2x128xf32, #blocked10> -> tensor<2x128xf32, #blocked4>
2026-02-21T10:53:44.8689909Z       %181 = tt.reshape %180 : tensor<2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked>
2026-02-21T10:53:44.8690170Z       scf.yield %139, %155, %181 : tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>
2026-02-21T10:53:44.8690436Z     } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32}
2026-02-21T10:53:44.8690692Z     %79 = ttg.convert_layout %78#1 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T10:53:44.8691017Z     %80 = tt.expand_dims %79 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T10:53:44.8691305Z     %81 = ttg.convert_layout %80 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T10:53:44.8691552Z     %82 = tt.broadcast %81 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1>
2026-02-21T10:53:44.8691788Z     %83 = ttg.convert_layout %82 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked>
2026-02-21T10:53:44.8691991Z     %84 = arith.divf %78#2, %83 : tensor<1x2x128xf32, #blocked>
2026-02-21T10:53:44.8692184Z     %85 = arith.truncf %84 : tensor<1x2x128xf32, #blocked> to tensor<1x2x128xbf16, #blocked>
2026-02-21T10:53:44.8692358Z     %86 = arith.muli %8, %c65536_i32 : i32
2026-02-21T10:53:44.8692568Z     %87 = ttg.convert_layout %12 : tensor<2xi32, #blocked5> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:53:44.8692895Z     %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi32, #blocked6>
2026-02-21T10:53:44.8693166Z     %89 = ttg.convert_layout %88 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #blocked3>
2026-02-21T10:53:44.8693451Z     %90 = ttg.convert_layout %89 : tensor<1x2xi32, #blocked3> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T10:53:44.8693785Z     %91 = tt.expand_dims %90 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi32, #blocked7>
2026-02-21T10:53:44.8694070Z     %92 = ttg.convert_layout %91 : tensor<1x2x1xi32, #blocked7> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T10:53:44.8694274Z     %93 = arith.muli %92, %cst_5 : tensor<1x2x1xi32, #blocked1>
2026-02-21T10:53:44.8694429Z     %94 = tt.splat %86 : i32 -> tensor<1x2x1xi32, #blocked1>
2026-02-21T10:53:44.8694579Z     %95 = arith.addi %94, %93 : tensor<1x2x1xi32, #blocked1>
2026-02-21T10:53:44.8694810Z     %96 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T10:53:44.8695128Z     %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6>
2026-02-21T10:53:44.8695410Z     %98 = ttg.convert_layout %97 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4>
2026-02-21T10:53:44.8695688Z     %99 = ttg.convert_layout %98 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T10:53:44.8696021Z     %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8>
2026-02-21T10:53:44.8703777Z     %101 = ttg.convert_layout %100 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked>
2026-02-21T10:53:44.8704026Z     %102 = tt.broadcast %95 : tensor<1x2x1xi32, #blocked1> -> tensor<1x2x128xi32, #blocked1>
2026-02-21T10:53:44.8704269Z     %103 = ttg.convert_layout %102 : tensor<1x2x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked>
2026-02-21T10:53:44.8704511Z     %104 = tt.broadcast %101 : tensor<1x1x128xi32, #blocked> -> tensor<1x2x128xi32, #blocked>
2026-02-21T10:53:44.8704756Z     %105 = arith.addi %103, %104 : tensor<1x2x128xi32, #blocked>
2026-02-21T10:53:44.8704944Z     %106 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.8705178Z     %107 = tt.addptr %106, %105 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>, tensor<1x2x128xi32, #blocked>
2026-02-21T10:53:44.8705388Z     tt.store %107, %85 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T10:53:44.8705542Z     tt.return
2026-02-21T10:53:44.8705620Z   }
2026-02-21T10:53:44.8705693Z }
2026-02-21T10:53:44.8705735Z 
2026-02-21T10:53:44.8705765Z {-#
2026-02-21T10:53:44.8705847Z   external_resources: {
2026-02-21T10:53:44.8705944Z     mlir_reproducer: {
2026-02-21T10:53:44.8708200Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T10:53:44.8710458Z       disable_threading: false,
2026-02-21T10:53:44.8710564Z       verify_each: true
2026-02-21T10:53:44.8710655Z     }
2026-02-21T10:53:44.8710727Z   }
2026-02-21T10:53:44.8710794Z #-}
2026-02-21T10:53:44.8711073Z /tmp/torchinductor_root/6v/c6vbchyecgbseqxf7v66tjcrvldnfmtcbunqfzmowyfinztxvrmq.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T10:53:44.8711752Z /tmp/torchinductor_root/6v/c6vbchyecgbseqxf7v66tjcrvldnfmtcbunqfzmowyfinztxvrmq.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T10:53:44.8712298Z [42s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T10:53:44.8713019Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T10:53:44.8713682Z Error: RuntimeError: PassManager::run failed
2026-02-21T10:53:44.8713847Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T10:53:53.7777171Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 8.8 configs/s
2026-02-21T10:53:53.7787270Z [51s] Adaptive compile timeout: 30s (90% percentile=12.2s, bounds=[30.0s, 30s])
2026-02-21T10:53:54.9898733Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━ 950/950 564.1 configs/s
2026-02-21T10:53:56.1324351Z [54s] Initial random population of 100, 5 starting points: 
2026-02-21T10:53:56.1324849Z error=15
2026-02-21T10:53:56.1325532Z timeout=1
2026-02-21T10:53:56.1325731Z ok=84
2026-02-21T10:53:56.1325933Z min=0.2113
2026-02-21T10:53:56.1326135Z mid=1.1127
2026-02-21T10:53:56.1326338Z max=198.0058
2026-02-21T10:53:56.1326601Z best={'block_sizes': [1, 128, 16],
2026-02-21T10:53:56.1326986Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T10:53:56.1327366Z  'l2_groupings': [64],
2026-02-21T10:53:56.1327635Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:53:56.1328110Z  'loop_orders': [[1, 0]],
2026-02-21T10:53:56.1328384Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:53:56.1328651Z  'num_stages': 2,
2026-02-21T10:53:56.1328880Z  'num_warps': 4,
2026-02-21T10:53:56.1329114Z  'pid_type': 'flat',
2026-02-21T10:53:56.1329371Z  'range_flattens': [None, None],
2026-02-21T10:53:56.1329678Z  'range_multi_buffers': [None, False],
2026-02-21T10:53:56.1329985Z  'range_num_stages': [0, 2],
2026-02-21T10:53:56.1330266Z  'range_unroll_factors': [0, 2],
2026-02-21T10:53:56.1330563Z  'range_warp_specializes': [],
2026-02-21T10:53:56.1330843Z  'waves_per_eu': 3}
2026-02-21T10:53:56.1412265Z [54s] Fitting surrogate: 100 points, 100 targets
2026-02-21T10:53:57.6060384Z [55s] Generation 1 starting: 80 neighbors, 5 active search path(s)
2026-02-21T10:54:13.6395690Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 2.0 configs/s
2026-02-21T10:54:18.4118388Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 17.5 configs/s
2026-02-21T10:54:22.3176041Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 229.1 configs/s
2026-02-21T10:54:23.2246921Z [81s] Generation 1 complete: 
2026-02-21T10:54:23.2247235Z error=5
2026-02-21T10:54:23.2247414Z ok=80
2026-02-21T10:54:23.2247550Z min=0.1385
2026-02-21T10:54:23.2247690Z mid=0.2859
2026-02-21T10:54:23.2248350Z max=2.8840
2026-02-21T10:54:23.2248504Z best={'block_sizes': [1, 32, 32],
2026-02-21T10:54:23.2248773Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T10:54:23.2249062Z  'l2_groupings': [32],
2026-02-21T10:54:23.2249311Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:54:23.2249614Z  'loop_orders': [[0, 1]],
2026-02-21T10:54:23.2249810Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:54:23.2249997Z  'num_sm_multiplier': 16,
2026-02-21T10:54:23.2250173Z  'num_stages': 1,
2026-02-21T10:54:23.2250314Z  'num_warps': 2,
2026-02-21T10:54:23.2250475Z  'pid_type': 'persistent_blocked',
2026-02-21T10:54:23.2250674Z  'range_flattens': [False, None],
2026-02-21T10:54:23.2250868Z  'range_multi_buffers': [False, None],
2026-02-21T10:54:23.2251058Z  'range_num_stages': [1, 1],
2026-02-21T10:54:23.2251228Z  'range_unroll_factors': [1, 1],
2026-02-21T10:54:23.2251408Z  'range_warp_specializes': [],
2026-02-21T10:54:23.2251579Z  'waves_per_eu': 3}
2026-02-21T10:54:23.2562381Z [81s] Fitting surrogate: 185 points, 185 targets
2026-02-21T10:54:24.7520145Z [82s] Generation 2 starting: 82 neighbors, 5 active search path(s)
2026-02-21T10:54:47.5777593Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 1.2 configs/s
2026-02-21T10:54:52.5185359Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 17.1 configs/s
2026-02-21T10:55:00.8856978Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 114.0 configs/s
2026-02-21T10:55:01.9512934Z [119s] Generation 2 complete: 
2026-02-21T10:55:01.9513743Z error=3
2026-02-21T10:55:01.9513950Z ok=84
2026-02-21T10:55:01.9514156Z min=0.1375
2026-02-21T10:55:01.9514370Z mid=0.2113
2026-02-21T10:55:01.9514567Z max=1.7679
2026-02-21T10:55:01.9514797Z best={'block_sizes': [1, 32, 32],
2026-02-21T10:55:01.9517042Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T10:55:01.9517480Z  'l2_groupings': [32],
2026-02-21T10:55:01.9517752Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:55:01.9518058Z  'loop_orders': [[0, 1]],
2026-02-21T10:55:01.9518508Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:55:01.9518774Z  'num_sm_multiplier': 16,
2026-02-21T10:55:01.9519025Z  'num_stages': 2,
2026-02-21T10:55:01.9519240Z  'num_warps': 2,
2026-02-21T10:55:01.9519496Z  'pid_type': 'persistent_blocked',
2026-02-21T10:55:01.9519789Z  'range_flattens': [False, None],
2026-02-21T10:55:01.9520078Z  'range_multi_buffers': [False, False],
2026-02-21T10:55:01.9520367Z  'range_num_stages': [1, 1],
2026-02-21T10:55:01.9520636Z  'range_unroll_factors': [1, 1],
2026-02-21T10:55:01.9520914Z  'range_warp_specializes': [],
2026-02-21T10:55:01.9521172Z  'waves_per_eu': 3}
2026-02-21T10:55:02.0301545Z [120s] Fitting surrogate: 272 points, 272 targets
2026-02-21T10:55:03.4968133Z [121s] Generation 3 starting: 76 neighbors, 5 active search path(s)
2026-02-21T10:55:18.3028075Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 5.9 configs/s
2026-02-21T10:55:23.0221876Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 16.8 configs/s
2026-02-21T10:55:29.7796487Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 139.8 configs/s
2026-02-21T10:55:30.7754550Z [148s] Generation 3 complete: 
2026-02-21T10:55:30.7754974Z error=2
2026-02-21T10:55:30.7755186Z ok=79
2026-02-21T10:55:30.7755390Z min=0.1389
2026-02-21T10:55:30.7755592Z mid=0.2425
2026-02-21T10:55:30.7755822Z max=2.3012
2026-02-21T10:55:30.7756096Z best={'block_sizes': [1, 32, 32],
2026-02-21T10:55:30.7756506Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T10:55:30.7756918Z  'l2_groupings': [32],
2026-02-21T10:55:30.7757203Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:55:30.7757516Z  'loop_orders': [[0, 1]],
2026-02-21T10:55:30.7757794Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:55:30.7758648Z  'num_sm_multiplier': 16,
2026-02-21T10:55:30.7758911Z  'num_stages': 2,
2026-02-21T10:55:30.7759137Z  'num_warps': 2,
2026-02-21T10:55:30.7759415Z  'pid_type': 'persistent_blocked',
2026-02-21T10:55:30.7759724Z  'range_flattens': [False, None],
2026-02-21T10:55:30.7760037Z  'range_multi_buffers': [False, False],
2026-02-21T10:55:30.7760353Z  'range_num_stages': [1, 1],
2026-02-21T10:55:30.7760637Z  'range_unroll_factors': [1, 1],
2026-02-21T10:55:30.7760928Z  'range_warp_specializes': [],
2026-02-21T10:55:30.7761216Z  'waves_per_eu': 3}
2026-02-21T10:55:30.8424205Z [148s] Fitting surrogate: 353 points, 353 targets
2026-02-21T10:55:32.2848976Z [150s] Generation 4 starting: 72 neighbors, 5 active search path(s)
2026-02-21T10:55:53.5839298Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.7 configs/s
2026-02-21T10:55:57.9652762Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 16.8 configs/s
2026-02-21T10:56:05.9377987Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 119.6 configs/s
2026-02-21T10:56:06.9468399Z [184s] Generation 4 complete: 
2026-02-21T10:56:06.9468838Z error=3
2026-02-21T10:56:06.9469041Z ok=74
2026-02-21T10:56:06.9469243Z min=0.1404
2026-02-21T10:56:06.9469443Z mid=0.2116
2026-02-21T10:56:06.9469697Z max=1.0073
2026-02-21T10:56:06.9469928Z best={'block_sizes': [1, 32, 32],
2026-02-21T10:56:06.9470340Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T10:56:06.9470751Z  'l2_groupings': [32],
2026-02-21T10:56:06.9471484Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:56:06.9471802Z  'loop_orders': [[0, 1]],
2026-02-21T10:56:06.9472083Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:56:06.9472367Z  'num_sm_multiplier': 16,
2026-02-21T10:56:06.9472621Z  'num_stages': 2,
2026-02-21T10:56:06.9472852Z  'num_warps': 2,
2026-02-21T10:56:06.9473108Z  'pid_type': 'persistent_blocked',
2026-02-21T10:56:06.9473428Z  'range_flattens': [None, None],
2026-02-21T10:56:06.9473729Z  'range_multi_buffers': [False, False],
2026-02-21T10:56:06.9474212Z  'range_num_stages': [1, 1],
2026-02-21T10:56:06.9474487Z  'range_unroll_factors': [1, 0],
2026-02-21T10:56:06.9474774Z  'range_warp_specializes': [],
2026-02-21T10:56:06.9475048Z  'waves_per_eu': 3}
2026-02-21T10:56:06.9523434Z [184s] Fitting surrogate: 430 points, 430 targets
2026-02-21T10:56:08.4891290Z [186s] Generation 5 starting: 78 neighbors, 5 active search path(s)
2026-02-21T10:56:19.1979255Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 3.5 configs/s
2026-02-21T10:56:24.1400736Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 16.5 configs/s
2026-02-21T10:56:35.7480274Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 88.7 configs/s
2026-02-21T10:56:36.8165697Z [214s] Generation 5 complete: 
2026-02-21T10:56:36.8165912Z ok=83
2026-02-21T10:56:36.8166104Z min=0.1409
2026-02-21T10:56:36.8166183Z mid=0.1550
2026-02-21T10:56:36.8166371Z max=0.8546
2026-02-21T10:56:36.8166475Z best={'block_sizes': [1, 32, 64],
2026-02-21T10:56:36.8166759Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T10:56:36.8166967Z  'l2_groupings': [1],
2026-02-21T10:56:36.8167086Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:56:36.8167629Z  'loop_orders': [[0, 1]],
2026-02-21T10:56:36.8167762Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:56:36.8167866Z  'num_stages': 3,
2026-02-21T10:56:36.8167950Z  'num_warps': 2,
2026-02-21T10:56:36.8168038Z  'pid_type': 'flat',
2026-02-21T10:56:36.8168135Z  'range_flattens': [None, None],
2026-02-21T10:56:36.8168255Z  'range_multi_buffers': [None, None],
2026-02-21T10:56:36.8168368Z  'range_num_stages': [0, 1],
2026-02-21T10:56:36.8168473Z  'range_unroll_factors': [0, 4],
2026-02-21T10:56:36.8168582Z  'range_warp_specializes': [],
2026-02-21T10:56:36.8168683Z  'waves_per_eu': 2}
2026-02-21T10:56:36.8228051Z [214s] Fitting surrogate: 513 points, 513 targets
2026-02-21T10:56:37.7042749Z [215s] Generation 6 starting: 78 neighbors, 5 active search path(s)
2026-02-21T10:56:48.2146882Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 5.2 configs/s
2026-02-21T10:56:52.9952495Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 16.8 configs/s
2026-02-21T10:57:00.6670952Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 123.8 configs/s
2026-02-21T10:57:01.6273414Z [239s] Generation 6 complete: 
2026-02-21T10:57:01.6273583Z error=1
2026-02-21T10:57:01.6273673Z ok=82
2026-02-21T10:57:01.6273751Z min=0.1435
2026-02-21T10:57:01.6273836Z mid=0.2180
2026-02-21T10:57:01.6273910Z max=1.2749
2026-02-21T10:57:01.6273999Z best={'block_sizes': [1, 32, 64],
2026-02-21T10:57:01.6274150Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T10:57:01.6274304Z  'l2_groupings': [1],
2026-02-21T10:57:01.6274414Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:57:01.6274533Z  'loop_orders': [[0, 1]],
2026-02-21T10:57:01.6274643Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:57:01.6275089Z  'num_stages': 3,
2026-02-21T10:57:01.6275176Z  'num_warps': 2,
2026-02-21T10:57:01.6275261Z  'pid_type': 'flat',
2026-02-21T10:57:01.6275359Z  'range_flattens': [None, None],
2026-02-21T10:57:01.6275480Z  'range_multi_buffers': [None, None],
2026-02-21T10:57:01.6275595Z  'range_num_stages': [0, 1],
2026-02-21T10:57:01.6275695Z  'range_unroll_factors': [0, 4],
2026-02-21T10:57:01.6275807Z  'range_warp_specializes': [],
2026-02-21T10:57:01.6275991Z  'waves_per_eu': 2}
2026-02-21T10:57:01.6990391Z [239s] Fitting surrogate: 596 points, 596 targets
2026-02-21T10:57:02.3741448Z [240s] Generation 7 starting: 56 neighbors, 4 active search path(s)
2026-02-21T10:57:16.4482098Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 1.7 configs/s
2026-02-21T10:57:19.9949921Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 16.8 configs/s
2026-02-21T10:57:25.1609815Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 179.7 configs/s
2026-02-21T10:57:26.0538872Z [264s] Generation 7 complete: 
2026-02-21T10:57:26.0539251Z ok=60
2026-02-21T10:57:26.0539922Z min=0.1418
2026-02-21T10:57:26.0540139Z mid=0.2171
2026-02-21T10:57:26.0540344Z max=1.1309
2026-02-21T10:57:26.0540583Z best={'block_sizes': [1, 32, 64],
2026-02-21T10:57:26.0540998Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T10:57:26.0541421Z  'l2_groupings': [1],
2026-02-21T10:57:26.0541703Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:57:26.0542021Z  'loop_orders': [[0, 1]],
2026-02-21T10:57:26.0542305Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:57:26.0542572Z  'num_stages': 3,
2026-02-21T10:57:26.0542803Z  'num_warps': 2,
2026-02-21T10:57:26.0543040Z  'pid_type': 'flat',
2026-02-21T10:57:26.0543465Z  'range_flattens': [None, None],
2026-02-21T10:57:26.0543773Z  'range_multi_buffers': [None, False],
2026-02-21T10:57:26.0544085Z  'range_num_stages': [0, 1],
2026-02-21T10:57:26.0544385Z  'range_unroll_factors': [0, 4],
2026-02-21T10:57:26.0544682Z  'range_warp_specializes': [],
2026-02-21T10:57:26.0544959Z  'waves_per_eu': 2}
2026-02-21T10:57:26.1035053Z [264s] Fitting surrogate: 656 points, 656 targets
2026-02-21T10:57:26.7988658Z [264s] Generation 8 starting: 58 neighbors, 4 active search path(s)
2026-02-21T10:57:37.5316202Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59/59 6.7 configs/s
2026-02-21T10:57:41.8531113Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 59/59 14.1 configs/s
2026-02-21T10:57:48.9309989Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 133.5 configs/s
2026-02-21T10:57:49.8040177Z [287s] Generation 8 complete: 
2026-02-21T10:57:49.8040591Z ok=62
2026-02-21T10:57:49.8040844Z min=0.1401
2026-02-21T10:57:49.8041057Z mid=0.1890
2026-02-21T10:57:49.8041254Z max=0.8055
2026-02-21T10:57:49.8041504Z best={'block_sizes': [1, 32, 64],
2026-02-21T10:57:49.8041911Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T10:57:49.8042316Z  'l2_groupings': [64],
2026-02-21T10:57:49.8042697Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:57:49.8043011Z  'loop_orders': [[0, 1]],
2026-02-21T10:57:49.8043286Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:57:49.8043552Z  'num_stages': 2,
2026-02-21T10:57:49.8043783Z  'num_warps': 2,
2026-02-21T10:57:49.8044026Z  'pid_type': 'flat',
2026-02-21T10:57:49.8044291Z  'range_flattens': [None, None],
2026-02-21T10:57:49.8044589Z  'range_multi_buffers': [None, True],
2026-02-21T10:57:49.8044896Z  'range_num_stages': [0, 3],
2026-02-21T10:57:49.8045170Z  'range_unroll_factors': [0, 3],
2026-02-21T10:57:49.8045467Z  'range_warp_specializes': [],
2026-02-21T10:57:49.8045744Z  'waves_per_eu': 2}
2026-02-21T10:57:49.8684473Z [287s] Fitting surrogate: 718 points, 718 targets
2026-02-21T10:57:50.5844213Z [288s] Generation 9 starting: 57 neighbors, 4 active search path(s)
2026-02-21T10:58:02.8387535Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 2.6 configs/s
2026-02-21T10:58:06.4366982Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 16.5 configs/s
2026-02-21T10:58:12.7315455Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 149.8 configs/s
2026-02-21T10:58:13.5868498Z [311s] Generation 9 complete: 
2026-02-21T10:58:13.5868870Z ok=61
2026-02-21T10:58:13.5869077Z min=0.1381
2026-02-21T10:58:13.5869290Z mid=0.1736
2026-02-21T10:58:13.5869482Z max=1.2565
2026-02-21T10:58:13.5869712Z best={'block_sizes': [1, 64, 32],
2026-02-21T10:58:13.5870119Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T10:58:13.5870516Z  'l2_groupings': [64],
2026-02-21T10:58:13.5870838Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:58:13.5871153Z  'loop_orders': [[0, 1]],
2026-02-21T10:58:13.5871454Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:58:13.5871716Z  'num_stages': 3,
2026-02-21T10:58:13.5871953Z  'num_warps': 4,
2026-02-21T10:58:13.5872181Z  'pid_type': 'flat',
2026-02-21T10:58:13.5872810Z  'range_flattens': [None, None],
2026-02-21T10:58:13.5873111Z  'range_multi_buffers': [None, True],
2026-02-21T10:58:13.5873399Z  'range_num_stages': [0, 3],
2026-02-21T10:58:13.5873556Z  'range_unroll_factors': [0, 4],
2026-02-21T10:58:13.5873685Z  'range_warp_specializes': [],
2026-02-21T10:58:13.5873812Z  'waves_per_eu': 3}
2026-02-21T10:58:13.6486489Z [311s] Fitting surrogate: 779 points, 779 targets
2026-02-21T10:58:14.9119835Z [312s] Generation 10 starting: 44 neighbors, 3 active search path(s)
2026-02-21T10:58:26.0065098Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 1.2 configs/s
2026-02-21T10:58:28.8610866Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 44/44 16.2 configs/s
2026-02-21T10:58:35.2023198Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 148.6 configs/s
2026-02-21T10:58:36.0354576Z [334s] Generation 10 complete: 
2026-02-21T10:58:36.0354949Z ok=47
2026-02-21T10:58:36.0355166Z min=0.1344
2026-02-21T10:58:36.0355386Z mid=0.1558
2026-02-21T10:58:36.0355590Z max=1.8760
2026-02-21T10:58:36.0355826Z best={'block_sizes': [1, 32, 64],
2026-02-21T10:58:36.0356233Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T10:58:36.0356653Z  'l2_groupings': [64],
2026-02-21T10:58:36.0356934Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:58:36.0357259Z  'loop_orders': [[0, 1]],
2026-02-21T10:58:36.0357542Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:58:36.0357810Z  'num_stages': 3,
2026-02-21T10:58:36.0358045Z  'num_warps': 2,
2026-02-21T10:58:36.0358279Z  'pid_type': 'flat',
2026-02-21T10:58:36.0358560Z  'range_flattens': [None, None],
2026-02-21T10:58:36.0358862Z  'range_multi_buffers': [None, True],
2026-02-21T10:58:36.0359186Z  'range_num_stages': [0, 3],
2026-02-21T10:58:36.0359464Z  'range_unroll_factors': [0, 4],
2026-02-21T10:58:36.0359766Z  'range_warp_specializes': [],
2026-02-21T10:58:36.0360047Z  'waves_per_eu': 2}
2026-02-21T10:58:36.0960214Z [334s] Fitting surrogate: 826 points, 826 targets
2026-02-21T10:58:36.6420562Z [334s] Generation 11 starting: 44 neighbors, 3 active search path(s)
2026-02-21T10:58:44.7437966Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44/44 4.7 configs/s
2026-02-21T10:58:47.5023516Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 44/44 16.8 configs/s
2026-02-21T10:58:52.1580901Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 226.4 configs/s
2026-02-21T10:58:52.9638082Z [350s] Generation 11 complete: 
2026-02-21T10:58:52.9638544Z ok=47
2026-02-21T10:58:52.9638776Z min=0.1328
2026-02-21T10:58:52.9638991Z mid=0.2040
2026-02-21T10:58:52.9639651Z max=0.6957
2026-02-21T10:58:52.9639874Z best={'block_sizes': [1, 32, 64],
2026-02-21T10:58:52.9640289Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T10:58:52.9640705Z  'l2_groupings': [64],
2026-02-21T10:58:52.9640993Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:58:52.9641319Z  'loop_orders': [[0, 1]],
2026-02-21T10:58:52.9641598Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:58:52.9641869Z  'num_stages': 3,
2026-02-21T10:58:52.9642115Z  'num_warps': 2,
2026-02-21T10:58:52.9642349Z  'pid_type': 'flat',
2026-02-21T10:58:52.9642714Z  'range_flattens': [None, None],
2026-02-21T10:58:52.9643028Z  'range_multi_buffers': [None, True],
2026-02-21T10:58:52.9643336Z  'range_num_stages': [0, 4],
2026-02-21T10:58:52.9643622Z  'range_unroll_factors': [0, 4],
2026-02-21T10:58:52.9643920Z  'range_warp_specializes': [],
2026-02-21T10:58:52.9644218Z  'waves_per_eu': 2}
2026-02-21T10:58:53.0055994Z [351s] Fitting surrogate: 873 points, 873 targets
2026-02-21T10:58:53.4319505Z [351s] Generation 12 starting: 27 neighbors, 2 active search path(s)
2026-02-21T10:59:01.4739846Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 2.3 configs/s
2026-02-21T10:59:03.2155784Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 27/27 16.9 configs/s
2026-02-21T10:59:05.5341044Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 371.5 configs/s
2026-02-21T10:59:06.3742973Z [364s] Generation 12 complete: 
2026-02-21T10:59:06.3743392Z ok=29
2026-02-21T10:59:06.3743606Z min=0.1352
2026-02-21T10:59:06.3743825Z mid=0.1922
2026-02-21T10:59:06.3744028Z max=1.5350
2026-02-21T10:59:06.3744280Z best={'block_sizes': [1, 32, 64],
2026-02-21T10:59:06.3744689Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T10:59:06.3745795Z  'l2_groupings': [64],
2026-02-21T10:59:06.3746081Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:59:06.3746429Z  'loop_orders': [[0, 1]],
2026-02-21T10:59:06.3746714Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:59:06.3746979Z  'num_stages': 3,
2026-02-21T10:59:06.3747227Z  'num_warps': 2,
2026-02-21T10:59:06.3747407Z  'pid_type': 'flat',
2026-02-21T10:59:06.3747602Z  'range_flattens': [None, None],
2026-02-21T10:59:06.3747819Z  'range_multi_buffers': [None, True],
2026-02-21T10:59:06.3748042Z  'range_num_stages': [0, 4],
2026-02-21T10:59:06.3748238Z  'range_unroll_factors': [0, 4],
2026-02-21T10:59:06.3748458Z  'range_warp_specializes': [],
2026-02-21T10:59:06.3748662Z  'waves_per_eu': 2}
2026-02-21T10:59:06.4011677Z [364s] Fitting surrogate: 902 points, 902 targets
2026-02-21T10:59:06.6481895Z [364s] Generation 13 starting: 16 neighbors, 1 active search path(s)
2026-02-21T10:59:10.8132902Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 3.6 configs/s
2026-02-21T10:59:11.8910801Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.2 configs/s
2026-02-21T10:59:13.3878064Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 544.4 configs/s
2026-02-21T10:59:14.2338011Z [372s] Generation 13 complete: 
2026-02-21T10:59:14.2338360Z ok=18
2026-02-21T10:59:14.2338574Z min=0.1360
2026-02-21T10:59:14.2338798Z mid=0.1694
2026-02-21T10:59:14.2339015Z max=0.7277
2026-02-21T10:59:14.2339244Z best={'block_sizes': [1, 32, 64],
2026-02-21T10:59:14.2340002Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T10:59:14.2340404Z  'l2_groupings': [64],
2026-02-21T10:59:14.2340692Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:59:14.2341010Z  'loop_orders': [[0, 1]],
2026-02-21T10:59:14.2341298Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:59:14.2341566Z  'num_stages': 3,
2026-02-21T10:59:14.2341808Z  'num_warps': 2,
2026-02-21T10:59:14.2342055Z  'pid_type': 'flat',
2026-02-21T10:59:14.2342325Z  'range_flattens': [None, None],
2026-02-21T10:59:14.2342797Z  'range_multi_buffers': [None, True],
2026-02-21T10:59:14.2343107Z  'range_num_stages': [0, 4],
2026-02-21T10:59:14.2343389Z  'range_unroll_factors': [0, 4],
2026-02-21T10:59:14.2343687Z  'range_warp_specializes': [],
2026-02-21T10:59:14.2343979Z  'waves_per_eu': 2}
2026-02-21T10:59:14.2506849Z [372s] Fitting surrogate: 920 points, 920 targets
2026-02-21T10:59:14.4831704Z [372s] Generation 14 starting: 10 neighbors, 1 active search path(s)
2026-02-21T10:59:17.6516203Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10/10 4.2 configs/s
2026-02-21T10:59:18.3477899Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 10/10 18.2 configs/s
2026-02-21T10:59:18.8194826Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1486.0 configs/s
2026-02-21T10:59:19.5924968Z [377s] Generation 14 complete: 
2026-02-21T10:59:19.5925231Z ok=12
2026-02-21T10:59:19.5925360Z min=0.1354
2026-02-21T10:59:19.5925505Z mid=0.2067
2026-02-21T10:59:19.5925628Z max=0.7286
2026-02-21T10:59:19.5925766Z best={'block_sizes': [1, 32, 64],
2026-02-21T10:59:19.5926286Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T10:59:19.5926522Z  'l2_groupings': [64],
2026-02-21T10:59:19.5926694Z  'load_eviction_policies': ['', '', ''],
2026-02-21T10:59:19.5926886Z  'loop_orders': [[0, 1]],
2026-02-21T10:59:19.5927060Z  'matrix_instr_nonkdim': 16,
2026-02-21T10:59:19.5927229Z  'num_stages': 3,
2026-02-21T10:59:19.5927376Z  'num_warps': 2,
2026-02-21T10:59:19.5927519Z  'pid_type': 'flat',
2026-02-21T10:59:19.5927675Z  'range_flattens': [None, None],
2026-02-21T10:59:19.5927861Z  'range_multi_buffers': [None, True],
2026-02-21T10:59:19.5928046Z  'range_num_stages': [0, 4],
2026-02-21T10:59:19.5928214Z  'range_unroll_factors': [0, 4],
2026-02-21T10:59:19.5928542Z  'range_warp_specializes': [],
2026-02-21T10:59:19.5928714Z  'waves_per_eu': 2}
2026-02-21T10:59:19.6025899Z [377s] Fitting surrogate: 932 points, 932 targets
2026-02-21T10:59:19.7303787Z [377s] Autotuning complete in 377.7s after searching 883 configs.
2026-02-21T10:59:19.7303995Z One can hardcode the best config and skip autotuning with:
2026-02-21T10:59:19.7304694Z     @helion.kernel(config=helion.Config(block_sizes=[1, 32, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T10:59:19.7305320Z 
2026-02-21T10:59:19.7305484Z [377s] Code of selected kernel: /tmp/torchinductor_root/dg/cdgqamsh5f3665xiepsy3mv7a5juf25esbrhspkp7ydkdbuw4q2z.py
2026-02-21T10:59:20.6969789Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T10:59:20.6970662Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 32, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T10:59:20.6971855Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T10:59:20.6972029Z WARNING:tritonbench.utils.triton_op:Completed input ID 2:
2026-02-21T10:59:20.6972185Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T10:59:20.6972310Z ------------------------------------------
2026-02-21T10:59:20.6972430Z (4, 48, 512, 512, 128)
2026-02-21T10:59:20.6972503Z 
2026-02-21T10:59:20.6976939Z  50%|█████     | 3/6 [14:54<15:51, 317.23s/it]
2026-02-21T10:59:20.6976939Z WARNING:tritonbench.utils.triton_op:Running input ID 4:
2026-02-21T10:59:20.6977245Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T10:59:20.6977366Z ------------------------------------------
2026-02-21T10:59:20.6977487Z (4, 48, 2048, 2048, 128)
2026-02-21T10:59:20.6978778Z INFO:tritonbench.utils.triton_op:Took 0.06ms to get benchmark function for aten
2026-02-21T10:59:21.6966569Z INFO:tritonbench.utils.triton_op:Took 1.44ms to get benchmark function for flex_attention
2026-02-21T10:59:23.2232115Z WARNING:__main__:Input tensor metadata:
2026-02-21T10:59:23.2232389Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T10:59:23.2232528Z               'dtype': 'torch.bfloat16',
2026-02-21T10:59:23.2232661Z               'shape': (4, 48, 2048, 128),
2026-02-21T10:59:23.2232801Z               'stride': (12582912, 262144, 128, 1)},
2026-02-21T10:59:23.2232935Z             { 'device': 'cuda:0',
2026-02-21T10:59:23.2233060Z               'dtype': 'torch.bfloat16',
2026-02-21T10:59:23.2233182Z               'shape': (4, 48, 2048, 128),
2026-02-21T10:59:23.2233321Z               'stride': (12582912, 262144, 128, 1)},
2026-02-21T10:59:23.2233449Z             { 'device': 'cuda:0',
2026-02-21T10:59:23.2233567Z               'dtype': 'torch.bfloat16',
2026-02-21T10:59:23.2233693Z               'shape': (4, 48, 2048, 128),
2026-02-21T10:59:23.2233825Z               'stride': (12582912, 262144, 128, 1)}),
2026-02-21T10:59:23.2233952Z   'kwargs': {}}
2026-02-21T10:59:23.2268584Z INFO:tritonbench.utils.triton_op:Took 4.12ms to get benchmark function for helion_attention
2026-02-21T10:59:23.4694352Z [0s] Autotune random seed: 2144140282
2026-02-21T10:59:23.6291759Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T10:59:57.0940209Z [33s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[3, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T10:59:57.6167359Z [33s] Timeout after 30s compiling Config(block_sizes=[1, 1, 2048], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:00:02.0213910Z [38s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:00:03.2498343Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:00:03.6512042Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:00:03.8180487Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:00:05.0804633Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:00:05.3701283Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 8, 1024], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:00:05.5019365Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 16, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:00:05.6578850Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 16, 256], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:00:06.4707463Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[4, 3], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:00:07.6472346Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 32, 1024], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=16, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[3, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:00:08.2504612Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 4, 2048], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:00:08.7194187Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 2, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:00:08.9374382Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[2, 0], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:00:09.0858573Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 1, 1024], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, False], range_num_stages=[1, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:00:09.2385131Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:00:09.4728696Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 4, 256], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[4, 1], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:00:10.1980685Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:00:10.1998707Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.2 configs/s
2026-02-21T11:00:14.5738300Z /tmp/torchinductor_root/ry/cryea2bdhdua425js6x6ixzjrokt533cicfucyrleb5lw6rth7os.py:57:24: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T11:00:14.5740401Z             k = tl.load(tl.make_block_ptr(k_view, [192, 128, 2048], [262144, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T11:00:14.5741268Z                        ^
2026-02-21T11:00:14.5742998Z /tmp/torchinductor_root/ry/cryea2bdhdua425js6x6ixzjrokt533cicfucyrleb5lw6rth7os.py:59:145: note: - use: %144 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x1024xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 8], order = [1, 0, 2]}>>) -> tensor<128x1024xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>>
2026-02-21T11:00:14.5744752Z 
2026-02-21T11:00:14.5745696Z             qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T11:00:14.5747072Z                                                                                                                                                 ^
2026-02-21T11:00:14.5747577Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T11:00:14.5748255Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}>
2026-02-21T11:00:14.5749135Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}>
2026-02-21T11:00:14.5749727Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}>
2026-02-21T11:00:14.5750265Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}>
2026-02-21T11:00:14.5750794Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:00:14.5751333Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [2, 1, 0]}>
2026-02-21T11:00:14.5751855Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T11:00:14.5752352Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}>
2026-02-21T11:00:14.5752840Z #blocked8 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}>
2026-02-21T11:00:14.5753318Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}>
2026-02-21T11:00:14.5753960Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}>
2026-02-21T11:00:14.5754494Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}>
2026-02-21T11:00:14.5755042Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}>
2026-02-21T11:00:14.5755579Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}>
2026-02-21T11:00:14.5756116Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}>
2026-02-21T11:00:14.5756646Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [0, 1, 2]}>
2026-02-21T11:00:14.5757172Z #blocked16 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}>
2026-02-21T11:00:14.5757689Z #blocked17 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [0, 1, 2]}>
2026-02-21T11:00:14.5758304Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T11:00:14.5759192Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T11:00:14.5759805Z     %c262144_i32 = arith.constant 262144 : i32
2026-02-21T11:00:14.5759969Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T11:00:14.5760125Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T11:00:14.5760280Z     %c262144_i64 = arith.constant 262144 : i64
2026-02-21T11:00:14.5760503Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x128x1024xbf16, #blocked>
2026-02-21T11:00:14.5760766Z     %cst_0 = arith.constant dense<2048> : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:00:14.5761000Z     %cst_1 = arith.constant dense<0> : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:00:14.5761238Z     %cst_2 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked1>
2026-02-21T11:00:14.5761489Z     %cst_3 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked1>
2026-02-21T11:00:14.5761713Z     %cst_4 = arith.constant dense<128> : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:00:14.5761958Z     %cst_5 = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked2>
2026-02-21T11:00:14.5762209Z     %cst_6 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked3>
2026-02-21T11:00:14.5762430Z     %cst_7 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked3>
2026-02-21T11:00:14.5762741Z     %cst_8 = arith.constant dense<2048> : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:00:14.5762969Z     %cst_9 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:00:14.5763195Z     %cst_10 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:00:14.5763398Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T11:00:14.5763550Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T11:00:14.5763704Z     %c304_i32 = arith.constant 304 : i32
2026-02-21T11:00:14.5763889Z     %cst_11 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked4>
2026-02-21T11:00:14.5764118Z     %cst_12 = arith.constant dense<128> : tensor<1x1024x1xi32, #blocked5>
2026-02-21T11:00:14.5764367Z     %cst_13 = arith.constant dense<0.127517432> : tensor<1x2x1024xf32, #blocked>
2026-02-21T11:00:14.5764626Z     %cst_14 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5764881Z     %cst_15 = arith.constant dense<0.000000e+00> : tensor<2x1024xf32, #blocked7>
2026-02-21T11:00:14.5765086Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T11:00:14.5765289Z     %cst_16 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked2>
2026-02-21T11:00:14.5765573Z     %cst_17 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5765849Z     %cst_18 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5766048Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T11:00:14.5766192Z     %c192_i32 = arith.constant 192 : i32
2026-02-21T11:00:14.5766347Z     %c196608_i32 = arith.constant 196608 : i32
2026-02-21T11:00:14.5766505Z     %0 = tt.get_program_id x : i32
2026-02-21T11:00:14.5766698Z     %1 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked8>
2026-02-21T11:00:14.5766962Z     %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked8>
2026-02-21T11:00:14.5767233Z     %3 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T11:00:14.5767492Z     %4 = arith.extsi %1 : tensor<2xi32, #blocked8> to tensor<2xi64, #blocked8>
2026-02-21T11:00:14.5767750Z     %5 = arith.extsi %2 : tensor<128xi32, #blocked8> to tensor<128xi64, #blocked8>
2026-02-21T11:00:14.5768086Z     %6 = ttg.convert_layout %5 : tensor<128xi64, #blocked8> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:00:14.5768531Z     %7 = tt.expand_dims %6 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi64, #blocked9>
2026-02-21T11:00:14.5768902Z     %8 = ttg.convert_layout %7 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #blocked10>
2026-02-21T11:00:14.5769304Z     %9 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>>
2026-02-21T11:00:14.5769749Z     %10 = tt.expand_dims %9 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi64, #blocked11>
2026-02-21T11:00:14.5770062Z     %11 = ttg.convert_layout %10 : tensor<1x1x128xi64, #blocked11> -> tensor<1x1x128xi64, #blocked3>
2026-02-21T11:00:14.5770318Z     %12 = tt.broadcast %11 : tensor<1x1x128xi64, #blocked3> -> tensor<1x2x128xi64, #blocked3>
2026-02-21T11:00:14.5770566Z     %13 = ttg.convert_layout %12 : tensor<1x2x128xi64, #blocked3> -> tensor<1x2x128xi64, #blocked2>
2026-02-21T11:00:14.5770791Z     %14 = arith.cmpi sge, %11, %cst_7 : tensor<1x1x128xi64, #blocked3>
2026-02-21T11:00:14.5770968Z     %15 = arith.cmpi slt, %11, %cst_6 : tensor<1x1x128xi64, #blocked3>
2026-02-21T11:00:14.5771157Z     %16 = arith.andi %14, %15 : tensor<1x1x128xi1, #blocked3>
2026-02-21T11:00:14.5771353Z     %17 = tt.broadcast %16 : tensor<1x1x128xi1, #blocked3> -> tensor<1x2x128xi1, #blocked3>
2026-02-21T11:00:14.5771600Z     %18 = ttg.convert_layout %17 : tensor<1x2x128xi1, #blocked3> -> tensor<1x2x128xi1, #blocked2>
2026-02-21T11:00:14.5771839Z     %19 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked8>
2026-02-21T11:00:14.5772054Z     %20 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x1024x!tt.ptr<bf16>, #blocked>
2026-02-21T11:00:14.5772338Z     %21 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked12}>>
2026-02-21T11:00:14.5772687Z     %22 = tt.expand_dims %21 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x128x1xi64, #blocked12>
2026-02-21T11:00:14.5773004Z     %23 = ttg.convert_layout %22 : tensor<1x128x1xi64, #blocked12> -> tensor<1x128x1xi64, #blocked1>
2026-02-21T11:00:14.5773263Z     %24 = tt.broadcast %23 : tensor<1x128x1xi64, #blocked1> -> tensor<1x128x1024xi64, #blocked1>
2026-02-21T11:00:14.5781699Z     %25 = ttg.convert_layout %24 : tensor<1x128x1024xi64, #blocked1> -> tensor<1x128x1024xi64, #blocked>
2026-02-21T11:00:14.5781971Z     %26 = arith.extsi %19 : tensor<1024xi32, #blocked8> to tensor<1024xi64, #blocked8>
2026-02-21T11:00:14.5782171Z     %27 = arith.cmpi sge, %23, %cst_3 : tensor<1x128x1xi64, #blocked1>
2026-02-21T11:00:14.5782347Z     %28 = arith.cmpi slt, %23, %cst_2 : tensor<1x128x1xi64, #blocked1>
2026-02-21T11:00:14.5782515Z     %29 = arith.andi %27, %28 : tensor<1x128x1xi1, #blocked1>
2026-02-21T11:00:14.5782798Z     %30 = ttg.convert_layout %2 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:00:14.5783126Z     %31 = tt.expand_dims %30 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9>
2026-02-21T11:00:14.5783416Z     %32 = ttg.convert_layout %31 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked10>
2026-02-21T11:00:14.5783708Z     %33 = ttg.convert_layout %32 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>>
2026-02-21T11:00:14.5784050Z     %34 = tt.expand_dims %33 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11>
2026-02-21T11:00:14.5784348Z     %35 = ttg.convert_layout %34 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked3>
2026-02-21T11:00:14.5784601Z     %36 = tt.broadcast %35 : tensor<1x1x128xi32, #blocked3> -> tensor<1x1024x128xi32, #blocked3>
2026-02-21T11:00:14.5784859Z     %37 = ttg.convert_layout %36 : tensor<1x1024x128xi32, #blocked3> -> tensor<1x1024x128xi32, #blocked13>
2026-02-21T11:00:14.5785121Z     %38 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x1024x128x!tt.ptr<bf16>, #blocked13>
2026-02-21T11:00:14.5785388Z     %39 = ttg.convert_layout %2 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:00:14.5785708Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9>
2026-02-21T11:00:14.5786011Z     %41 = ttg.convert_layout %40 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked10>
2026-02-21T11:00:14.5786296Z     %42 = ttg.convert_layout %41 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>>
2026-02-21T11:00:14.5786637Z     %43 = tt.expand_dims %42 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11>
2026-02-21T11:00:14.5786936Z     %44 = ttg.convert_layout %43 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked3>
2026-02-21T11:00:14.5787178Z     %45 = tt.broadcast %44 : tensor<1x1x128xi32, #blocked3> -> tensor<1x2x128xi32, #blocked3>
2026-02-21T11:00:14.5787449Z     %46 = ttg.convert_layout %45 : tensor<1x2x128xi32, #blocked3> -> tensor<1x2x128xi32, #blocked2>
2026-02-21T11:00:14.5787679Z     %47 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T11:00:14.5787865Z     scf.for %arg4 = %0 to %c196608_i32 step %c304_i32  : i32 {
2026-02-21T11:00:14.5788014Z       %48 = arith.remsi %arg4, %c192_i32 : i32
2026-02-21T11:00:14.5788138Z       %49 = arith.divsi %arg4, %c192_i32 : i32
2026-02-21T11:00:14.5788261Z       %50 = arith.muli %49, %c2_i32 : i32
2026-02-21T11:00:14.5788395Z       %51 = tt.splat %50 : i32 -> tensor<2xi32, #blocked8>
2026-02-21T11:00:14.5788544Z       %52 = arith.addi %51, %1 : tensor<2xi32, #blocked8>
2026-02-21T11:00:14.5788678Z       %53 = arith.extsi %48 : i32 to i64
2026-02-21T11:00:14.5788793Z       %54 = arith.extsi %50 : i32 to i64
2026-02-21T11:00:14.5788914Z       %55 = arith.muli %53, %c262144_i64 : i64
2026-02-21T11:00:14.5789053Z       %56 = tt.splat %55 : i64 -> tensor<1x2x128xi64, #blocked2>
2026-02-21T11:00:14.5789205Z       %57 = tt.splat %54 : i64 -> tensor<2xi64, #blocked8>
2026-02-21T11:00:14.5789348Z       %58 = arith.addi %57, %4 : tensor<2xi64, #blocked8>
2026-02-21T11:00:14.5789572Z       %59 = ttg.convert_layout %58 : tensor<2xi64, #blocked8> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:00:14.5789890Z       %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x2xi64, #blocked9>
2026-02-21T11:00:14.5790168Z       %61 = ttg.convert_layout %60 : tensor<1x2xi64, #blocked9> -> tensor<1x2xi64, #blocked6>
2026-02-21T11:00:14.5790466Z       %62 = ttg.convert_layout %61 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T11:00:14.5790795Z       %63 = tt.expand_dims %62 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xi64, #blocked14>
2026-02-21T11:00:14.5791089Z       %64 = ttg.convert_layout %63 : tensor<1x2x1xi64, #blocked14> -> tensor<1x2x1xi64, #blocked4>
2026-02-21T11:00:14.5791298Z       %65 = arith.muli %64, %cst_10 : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:00:14.5791490Z       %66 = tt.broadcast %65 : tensor<1x2x1xi64, #blocked4> -> tensor<1x2x128xi64, #blocked4>
2026-02-21T11:00:14.5791730Z       %67 = ttg.convert_layout %66 : tensor<1x2x128xi64, #blocked4> -> tensor<1x2x128xi64, #blocked2>
2026-02-21T11:00:14.5791936Z       %68 = arith.addi %67, %13 : tensor<1x2x128xi64, #blocked2>
2026-02-21T11:00:14.5792093Z       %69 = arith.addi %56, %68 : tensor<1x2x128xi64, #blocked2>
2026-02-21T11:00:14.5792296Z       %70 = tt.addptr %3, %69 : tensor<1x2x128x!tt.ptr<bf16>, #blocked2>, tensor<1x2x128xi64, #blocked2>
2026-02-21T11:00:14.5792487Z       %71 = arith.cmpi sge, %53, %c0_i64 : i64
2026-02-21T11:00:14.5792616Z       %72 = arith.cmpi slt, %53, %c192_i64 : i64
2026-02-21T11:00:14.5792759Z       %73 = arith.andi %71, %72 : i1
2026-02-21T11:00:14.5792903Z       %74 = arith.cmpi sge, %64, %cst_9 : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:00:14.5793073Z       %75 = arith.cmpi slt, %64, %cst_8 : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:00:14.5793237Z       %76 = arith.andi %74, %75 : tensor<1x2x1xi1, #blocked4>
2026-02-21T11:00:14.5793394Z       %77 = tt.splat %73 : i1 -> tensor<1x2x1xi1, #blocked4>
2026-02-21T11:00:14.5793556Z       %78 = arith.andi %77, %76 : tensor<1x2x1xi1, #blocked4>
2026-02-21T11:00:14.5793747Z       %79 = tt.broadcast %78 : tensor<1x2x1xi1, #blocked4> -> tensor<1x2x128xi1, #blocked4>
2026-02-21T11:00:14.5793986Z       %80 = ttg.convert_layout %79 : tensor<1x2x128xi1, #blocked4> -> tensor<1x2x128xi1, #blocked2>
2026-02-21T11:00:14.5794194Z       %81 = arith.andi %80, %18 : tensor<1x2x128xi1, #blocked2>
2026-02-21T11:00:14.5794366Z       %82 = tt.load %70, %81, %cst_5 : tensor<1x2x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T11:00:14.5794544Z       %83 = tt.splat %55 : i64 -> tensor<1x128x1024xi64, #blocked>
2026-02-21T11:00:14.5794703Z       %84 = tt.splat %73 : i1 -> tensor<1x128x1xi1, #blocked1>
2026-02-21T11:00:14.5794869Z       %85 = arith.andi %84, %29 : tensor<1x128x1xi1, #blocked1>
2026-02-21T11:00:14.5795065Z       %86 = tt.broadcast %85 : tensor<1x128x1xi1, #blocked1> -> tensor<1x128x1024xi1, #blocked1>
2026-02-21T11:00:14.5795313Z       %87 = ttg.convert_layout %86 : tensor<1x128x1024xi1, #blocked1> -> tensor<1x128x1024xi1, #blocked>
2026-02-21T11:00:14.5795561Z       %88 = tt.reshape %82 : tensor<1x2x128xbf16, #blocked2> -> tensor<2x128xbf16, #blocked10>
2026-02-21T11:00:14.5795743Z       %89 = arith.muli %48, %c262144_i32 : i32
2026-02-21T11:00:14.5795881Z       %90 = tt.splat %89 : i32 -> tensor<1x1024x1xi32, #blocked5>
2026-02-21T11:00:14.5796247Z       %91:3 = scf.for %arg5 = %c0_i32 to %c2048_i32 step %c1024_i32 iter_args(%arg6 = %cst_18, %arg7 = %cst_17, %arg8 = %cst_16) -> (tensor<1x2xf32, #blocked6>, tensor<1x2xf32, #blocked6>, tensor<1x2x128xf32, #blocked2>)  : i32 {
2026-02-21T11:00:14.5796605Z         %113 = tt.splat %arg5 : i32 -> tensor<1024xi32, #blocked8>
2026-02-21T11:00:14.5796768Z         %114 = arith.addi %113, %19 : tensor<1024xi32, #blocked8>
2026-02-21T11:00:14.5796911Z         %115 = arith.extsi %arg5 : i32 to i64
2026-02-21T11:00:14.5797048Z         %116 = tt.splat %115 : i64 -> tensor<1024xi64, #blocked8>
2026-02-21T11:00:14.5797204Z         %117 = arith.addi %116, %26 : tensor<1024xi64, #blocked8>
2026-02-21T11:00:14.5797449Z         %118 = ttg.convert_layout %117 : tensor<1024xi64, #blocked8> -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:00:14.5797792Z         %119 = tt.expand_dims %118 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x1024xi64, #blocked9>
2026-02-21T11:00:14.5798116Z         %120 = ttg.convert_layout %119 : tensor<1x1024xi64, #blocked9> -> tensor<1x1024xi64, #blocked7>
2026-02-21T11:00:14.5798414Z         %121 = ttg.convert_layout %120 : tensor<1x1024xi64, #blocked7> -> tensor<1x1024xi64, #ttg.slice<{dim = 1, parent = #blocked15}>>
2026-02-21T11:00:14.5798772Z         %122 = tt.expand_dims %121 {axis = 1 : i32} : tensor<1x1024xi64, #ttg.slice<{dim = 1, parent = #blocked15}>> -> tensor<1x1x1024xi64, #blocked15>
2026-02-21T11:00:14.5799089Z         %123 = ttg.convert_layout %122 : tensor<1x1x1024xi64, #blocked15> -> tensor<1x1x1024xi64, #blocked>
2026-02-21T11:00:14.5799314Z         %124 = arith.muli %123, %cst_4 : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:00:14.5799525Z         %125 = tt.broadcast %124 : tensor<1x1x1024xi64, #blocked> -> tensor<1x128x1024xi64, #blocked>
2026-02-21T11:00:14.5799731Z         %126 = arith.addi %25, %125 : tensor<1x128x1024xi64, #blocked>
2026-02-21T11:00:14.5799897Z         %127 = arith.addi %83, %126 : tensor<1x128x1024xi64, #blocked>
2026-02-21T11:00:14.5800112Z         %128 = tt.addptr %20, %127 : tensor<1x128x1024x!tt.ptr<bf16>, #blocked>, tensor<1x128x1024xi64, #blocked>
2026-02-21T11:00:14.5800360Z         %129 = arith.cmpi sge, %123, %cst_1 : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:00:14.5800543Z         %130 = arith.cmpi slt, %123, %cst_0 : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:00:14.5800715Z         %131 = arith.andi %129, %130 : tensor<1x1x1024xi1, #blocked>
2026-02-21T11:00:14.5800919Z         %132 = tt.broadcast %131 : tensor<1x1x1024xi1, #blocked> -> tensor<1x128x1024xi1, #blocked>
2026-02-21T11:00:14.5801122Z         %133 = arith.andi %87, %132 : tensor<1x128x1024xi1, #blocked>
2026-02-21T11:00:14.5801317Z         %134 = tt.load %128, %133, %cst : tensor<1x128x1024x!tt.ptr<bf16>, #blocked>
2026-02-21T11:00:14.5801539Z         %135 = tt.reshape %134 : tensor<1x128x1024xbf16, #blocked> -> tensor<128x1024xbf16, #blocked7>
2026-02-21T11:00:14.5801847Z         %136 = ttg.convert_layout %88 : tensor<2x128xbf16, #blocked10> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>>
2026-02-21T11:00:14.5802215Z         %137 = ttg.convert_layout %135 : tensor<128x1024xbf16, #blocked7> -> tensor<128x1024xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>>
2026-02-21T11:00:14.5802549Z         %138 = ttg.convert_layout %cst_15 : tensor<2x1024xf32, #blocked7> -> tensor<2x1024xf32, #blocked16>
2026-02-21T11:00:14.5803027Z         %139 = tt.dot %136, %137, %138, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>> * tensor<128x1024xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>> -> tensor<2x1024xf32, #blocked16>
2026-02-21T11:00:14.5803445Z         %140 = ttg.convert_layout %139 : tensor<2x1024xf32, #blocked16> -> tensor<2x1024xf32, #blocked7>
2026-02-21T11:00:14.5803688Z         %141 = tt.reshape %140 : tensor<2x1024xf32, #blocked7> -> tensor<1x2x1024xf32, #blocked>
2026-02-21T11:00:14.5803930Z         %142 = arith.truncf %141 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked>
2026-02-21T11:00:14.5804172Z         %143 = arith.extf %142 : tensor<1x2x1024xbf16, #blocked> to tensor<1x2x1024xf32, #blocked>
2026-02-21T11:00:14.5804367Z         %144 = "tt.reduce"(%143) <{axis = 2 : i32}> ({
2026-02-21T11:00:14.5804497Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T11:00:14.5804619Z           %199 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T11:00:14.5804748Z           tt.reduce.return %199 : f32
2026-02-21T11:00:14.5804936Z         }) : (tensor<1x2x1024xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T11:00:14.5805227Z         %145 = ttg.convert_layout %144 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5805497Z         %146 = arith.truncf %145 : tensor<1x2xf32, #blocked6> to tensor<1x2xbf16, #blocked6>
2026-02-21T11:00:14.5805721Z         %147 = arith.extf %146 : tensor<1x2xbf16, #blocked6> to tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5806086Z         %148 = arith.mulf %147, %cst_14 : tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5806275Z         %149 = arith.truncf %148 : tensor<1x2xf32, #blocked6> to tensor<1x2xbf16, #blocked6>
2026-02-21T11:00:14.5806495Z         %150 = arith.extf %149 : tensor<1x2xbf16, #blocked6> to tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5806689Z         %151 = arith.cmpf ogt, %arg6, %150 : tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5806863Z         %152 = arith.cmpf une, %arg6, %arg6 : tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5807030Z         %153 = arith.ori %151, %152 : tensor<1x2xi1, #blocked6>
2026-02-21T11:00:14.5807222Z         %154 = arith.select %153, %arg6, %150 : tensor<1x2xi1, #blocked6>, tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5807429Z         %155 = arith.mulf %143, %cst_13 : tensor<1x2x1024xf32, #blocked>
2026-02-21T11:00:14.5807634Z         %156 = arith.truncf %155 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked>
2026-02-21T11:00:14.5807925Z         %157 = ttg.convert_layout %154 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T11:00:14.5808267Z         %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14>
2026-02-21T11:00:14.5808586Z         %159 = ttg.convert_layout %158 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4>
2026-02-21T11:00:14.5808831Z         %160 = arith.extf %156 : tensor<1x2x1024xbf16, #blocked> to tensor<1x2x1024xf32, #blocked>
2026-02-21T11:00:14.5809068Z         %161 = tt.broadcast %159 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x1024xf32, #blocked4>
2026-02-21T11:00:14.5809343Z         %162 = ttg.convert_layout %161 : tensor<1x2x1024xf32, #blocked4> -> tensor<1x2x1024xf32, #blocked>
2026-02-21T11:00:14.5809560Z         %163 = arith.subf %160, %162 : tensor<1x2x1024xf32, #blocked>
2026-02-21T11:00:14.5809865Z         %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x1024xf32, #blocked>) -> tensor<1x2x1024xf32, #blocked>
2026-02-21T11:00:14.5810155Z         %165 = "tt.reduce"(%164) <{axis = 2 : i32}> ({
2026-02-21T11:00:14.5810285Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T11:00:14.5810406Z           %199 = arith.addf %arg9, %arg10 : f32
2026-02-21T11:00:14.5810529Z           tt.reduce.return %199 : f32
2026-02-21T11:00:14.5810735Z         }) : (tensor<1x2x1024xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T11:00:14.5811026Z         %166 = ttg.convert_layout %165 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5811266Z         %167 = arith.subf %arg6, %154 : tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5811554Z         %168 = tt.extern_elementwise %167 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked6>) -> tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5811842Z         %169 = arith.mulf %arg7, %168 : tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5811998Z         %170 = arith.addf %169, %166 : tensor<1x2xf32, #blocked6>
2026-02-21T11:00:14.5812243Z         %171 = ttg.convert_layout %168 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T11:00:14.5812578Z         %172 = tt.expand_dims %171 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14>
2026-02-21T11:00:14.5812880Z         %173 = ttg.convert_layout %172 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4>
2026-02-21T11:00:14.5813126Z         %174 = tt.broadcast %173 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4>
2026-02-21T11:00:14.5813375Z         %175 = ttg.convert_layout %174 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked2>
2026-02-21T11:00:14.5813590Z         %176 = arith.mulf %arg8, %175 : tensor<1x2x128xf32, #blocked2>
2026-02-21T11:00:14.5813850Z         %177 = ttg.convert_layout %114 : tensor<1024xi32, #blocked8> -> tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:00:14.5814189Z         %178 = tt.expand_dims %177 {axis = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x1024xi32, #blocked9>
2026-02-21T11:00:14.5814492Z         %179 = ttg.convert_layout %178 : tensor<1x1024xi32, #blocked9> -> tensor<1x1024xi32, #blocked7>
2026-02-21T11:00:14.5814787Z         %180 = ttg.convert_layout %179 : tensor<1x1024xi32, #blocked7> -> tensor<1x1024xi32, #ttg.slice<{dim = 2, parent = #blocked17}>>
2026-02-21T11:00:14.5815139Z         %181 = tt.expand_dims %180 {axis = 2 : i32} : tensor<1x1024xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> -> tensor<1x1024x1xi32, #blocked17>
2026-02-21T11:00:14.5815455Z         %182 = ttg.convert_layout %181 : tensor<1x1024x1xi32, #blocked17> -> tensor<1x1024x1xi32, #blocked5>
2026-02-21T11:00:14.5815681Z         %183 = arith.muli %182, %cst_12 : tensor<1x1024x1xi32, #blocked5>
2026-02-21T11:00:14.5815852Z         %184 = arith.addi %90, %183 : tensor<1x1024x1xi32, #blocked5>
2026-02-21T11:00:14.5816057Z         %185 = tt.broadcast %184 : tensor<1x1024x1xi32, #blocked5> -> tensor<1x1024x128xi32, #blocked5>
2026-02-21T11:00:14.5816345Z         %186 = ttg.convert_layout %185 : tensor<1x1024x128xi32, #blocked5> -> tensor<1x1024x128xi32, #blocked13>
2026-02-21T11:00:14.5816569Z         %187 = arith.addi %186, %37 : tensor<1x1024x128xi32, #blocked13>
2026-02-21T11:00:14.5816800Z         %188 = tt.addptr %38, %187 : tensor<1x1024x128x!tt.ptr<bf16>, #blocked13>, tensor<1x1024x128xi32, #blocked13>
2026-02-21T11:00:14.5817032Z         %189 = tt.load %188 : tensor<1x1024x128x!tt.ptr<bf16>, #blocked13>
2026-02-21T11:00:14.5817264Z         %190 = arith.truncf %164 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked>
2026-02-21T11:00:14.5817510Z         %191 = tt.reshape %176 : tensor<1x2x128xf32, #blocked2> -> tensor<2x128xf32, #blocked10>
2026-02-21T11:00:14.5817745Z         %192 = tt.reshape %190 : tensor<1x2x1024xbf16, #blocked> -> tensor<2x1024xbf16, #blocked7>
2026-02-21T11:00:14.5817995Z         %193 = tt.reshape %189 : tensor<1x1024x128xbf16, #blocked13> -> tensor<1024x128xbf16, #blocked10>
2026-02-21T11:00:14.5818309Z         %194 = ttg.convert_layout %192 : tensor<2x1024xbf16, #blocked7> -> tensor<2x1024xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>>
2026-02-21T11:00:14.5818691Z         %195 = ttg.convert_layout %193 : tensor<1024x128xbf16, #blocked10> -> tensor<1024x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>>
2026-02-21T11:00:14.5819002Z         %196 = ttg.convert_layout %191 : tensor<2x128xf32, #blocked10> -> tensor<2x128xf32, #blocked10>
2026-02-21T11:00:14.5819416Z         %197 = tt.dot %194, %195, %196, inputPrecision = tf32 : tensor<2x1024xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<1024x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x128xf32, #blocked10>
2026-02-21T11:00:14.5819822Z         %198 = tt.reshape %197 : tensor<2x128xf32, #blocked10> -> tensor<1x2x128xf32, #blocked2>
2026-02-21T11:00:14.5820093Z         scf.yield %154, %170, %198 : tensor<1x2xf32, #blocked6>, tensor<1x2xf32, #blocked6>, tensor<1x2x128xf32, #blocked2>
2026-02-21T11:00:14.5820300Z       }
2026-02-21T11:00:14.5820490Z       %92 = ttg.convert_layout %91#1 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T11:00:14.5820826Z       %93 = tt.expand_dims %92 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14>
2026-02-21T11:00:14.5821121Z       %94 = ttg.convert_layout %93 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4>
2026-02-21T11:00:14.5821360Z       %95 = tt.broadcast %94 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4>
2026-02-21T11:00:14.5821598Z       %96 = ttg.convert_layout %95 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked2>
2026-02-21T11:00:14.5821809Z       %97 = arith.divf %91#2, %96 : tensor<1x2x128xf32, #blocked2>
2026-02-21T11:00:14.5822025Z       %98 = arith.truncf %97 : tensor<1x2x128xf32, #blocked2> to tensor<1x2x128xbf16, #blocked2>
2026-02-21T11:00:14.5822214Z       %99 = arith.muli %48, %c262144_i32 : i32
2026-02-21T11:00:14.5822432Z       %100 = ttg.convert_layout %52 : tensor<2xi32, #blocked8> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:00:14.5822751Z       %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x2xi32, #blocked9>
2026-02-21T11:00:14.5823039Z       %102 = ttg.convert_layout %101 : tensor<1x2xi32, #blocked9> -> tensor<1x2xi32, #blocked6>
2026-02-21T11:00:14.5823321Z       %103 = ttg.convert_layout %102 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T11:00:14.5823658Z       %104 = tt.expand_dims %103 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xi32, #blocked14>
2026-02-21T11:00:14.5823959Z       %105 = ttg.convert_layout %104 : tensor<1x2x1xi32, #blocked14> -> tensor<1x2x1xi32, #blocked4>
2026-02-21T11:00:14.5824168Z       %106 = arith.muli %105, %cst_11 : tensor<1x2x1xi32, #blocked4>
2026-02-21T11:00:14.5824351Z       %107 = tt.splat %99 : i32 -> tensor<1x2x1xi32, #blocked4>
2026-02-21T11:00:14.5824511Z       %108 = arith.addi %107, %106 : tensor<1x2x1xi32, #blocked4>
2026-02-21T11:00:14.5824706Z       %109 = tt.broadcast %108 : tensor<1x2x1xi32, #blocked4> -> tensor<1x2x128xi32, #blocked4>
2026-02-21T11:00:14.5824951Z       %110 = ttg.convert_layout %109 : tensor<1x2x128xi32, #blocked4> -> tensor<1x2x128xi32, #blocked2>
2026-02-21T11:00:14.5825174Z       %111 = arith.addi %110, %46 : tensor<1x2x128xi32, #blocked2>
2026-02-21T11:00:14.5825386Z       %112 = tt.addptr %47, %111 : tensor<1x2x128x!tt.ptr<bf16>, #blocked2>, tensor<1x2x128xi32, #blocked2>
2026-02-21T11:00:14.5825605Z       tt.store %112, %98 : tensor<1x2x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T11:00:14.5825750Z     } {tt.loop_unroll_factor = 1 : i32}
2026-02-21T11:00:14.5825863Z     tt.return
2026-02-21T11:00:14.5825944Z   }
2026-02-21T11:00:14.5826021Z }
2026-02-21T11:00:14.5826068Z 
2026-02-21T11:00:14.5826098Z {-#
2026-02-21T11:00:14.5826182Z   external_resources: {
2026-02-21T11:00:14.5826281Z     mlir_reproducer: {
2026-02-21T11:00:14.5828521Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T11:00:14.5830804Z       disable_threading: false,
2026-02-21T11:00:14.5830909Z       verify_each: true
2026-02-21T11:00:14.5831000Z     }
2026-02-21T11:00:14.5831070Z   }
2026-02-21T11:00:14.5831140Z #-}
2026-02-21T11:00:14.5831431Z /tmp/torchinductor_root/ry/cryea2bdhdua425js6x6ixzjrokt533cicfucyrleb5lw6rth7os.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T11:00:14.5832135Z /tmp/torchinductor_root/ry/cryea2bdhdua425js6x6ixzjrokt533cicfucyrleb5lw6rth7os.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T11:00:14.5832681Z [50s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T11:00:14.5833477Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 1024], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[0, 0], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T11:00:14.5834203Z Error: RuntimeError: PassManager::run failed
2026-02-21T11:00:14.5834369Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T11:00:45.3004339Z /tmp/torchinductor_root/5g/c5giwctfjccaue6bd75t5sbqdhwhf2ay7njm5q53dk3emny6uyli.py:55:130: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T11:00:45.3005023Z         k = tl.load(k_view + (indices_0[:, None, None] * 262144 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None)
2026-02-21T11:00:45.3005415Z                                                                                                                                  ^
2026-02-21T11:00:45.3006846Z /tmp/torchinductor_root/5g/c5giwctfjccaue6bd75t5sbqdhwhf2ay7njm5q53dk3emny6uyli.py:57:141: note: - use: %132 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x512xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x512xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>>
2026-02-21T11:00:45.3007741Z 
2026-02-21T11:00:45.3008317Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T11:00:45.3008990Z                                                                                                                                             ^
2026-02-21T11:00:45.3009252Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T11:00:45.3010609Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:00:45.3011090Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:00:45.3011550Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:00:45.3011988Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T11:00:45.3012424Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T11:00:45.3012832Z #blocked5 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}>
2026-02-21T11:00:45.3013244Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}>
2026-02-21T11:00:45.3013699Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T11:00:45.3014159Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T11:00:45.3014734Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T11:00:45.3015186Z #blocked10 = #ttg.blocked<{sizePerThread = [2, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T11:00:45.3015624Z #blocked11 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T11:00:45.3016061Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T11:00:45.3016685Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T11:00:45.3017148Z     %c262144_i32 = arith.constant 262144 : i32
2026-02-21T11:00:45.3017299Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T11:00:45.3017442Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T11:00:45.3017579Z     %c262144_i64 = arith.constant 262144 : i64
2026-02-21T11:00:45.3017865Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked>
2026-02-21T11:00:45.3018092Z     %cst_0 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked>
2026-02-21T11:00:45.3018306Z     %cst_1 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked>
2026-02-21T11:00:45.3018509Z     %cst_2 = arith.constant dense<2048> : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:00:45.3018720Z     %cst_3 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:00:45.3018946Z     %cst_4 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:00:45.3019120Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T11:00:45.3019258Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T11:00:45.3019394Z     %c3072_i32 = arith.constant 3072 : i32
2026-02-21T11:00:45.3019558Z     %cst_5 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked1>
2026-02-21T11:00:45.3019768Z     %cst_6 = arith.constant dense<128> : tensor<1x512x1xi32, #blocked2>
2026-02-21T11:00:45.3019983Z     %cst_7 = arith.constant dense<0.127517432> : tensor<1x2x512xf32, #blocked>
2026-02-21T11:00:45.3020216Z     %cst_8 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3020462Z     %cst_9 = arith.constant dense<0.000000e+00> : tensor<2x512xf32, #blocked4>
2026-02-21T11:00:45.3020681Z     %cst_10 = arith.constant dense<128> : tensor<1x1x512xi32, #blocked>
2026-02-21T11:00:45.3020851Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T11:00:45.3021032Z     %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked>
2026-02-21T11:00:45.3021256Z     %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3021485Z     %cst_13 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3021662Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T11:00:45.3021795Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T11:00:45.3021926Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T11:00:45.3022062Z     %0 = tt.get_program_id x : i32
2026-02-21T11:00:45.3022198Z     %1 = arith.divsi %0, %c3072_i32 : i32
2026-02-21T11:00:45.3022327Z     %2 = arith.muli %1, %c16_i32 : i32
2026-02-21T11:00:45.3022460Z     %3 = arith.subi %c1024_i32, %2 : i32
2026-02-21T11:00:45.3022587Z     %4 = arith.minsi %3, %c16_i32 : i32
2026-02-21T11:00:45.3022722Z     %5 = arith.remsi %0, %c3072_i32 : i32
2026-02-21T11:00:45.3022851Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T11:00:45.3022978Z     %7 = arith.addi %2, %6 : i32
2026-02-21T11:00:45.3023103Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T11:00:45.3023229Z     %9 = arith.muli %7, %c2_i32 : i32
2026-02-21T11:00:45.3023405Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked5>
2026-02-21T11:00:45.3023611Z     %11 = tt.splat %9 : i32 -> tensor<2xi32, #blocked5>
2026-02-21T11:00:45.3023811Z     %12 = arith.addi %11, %10 : tensor<2xi32, #blocked5>
2026-02-21T11:00:45.3024013Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked5>
2026-02-21T11:00:45.3024206Z     %14 = arith.extsi %8 : i32 to i64
2026-02-21T11:00:45.3024334Z     %15 = arith.extsi %9 : i32 to i64
2026-02-21T11:00:45.3024518Z     %16 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:00:45.3024712Z     %17 = arith.muli %14, %c262144_i64 : i64
2026-02-21T11:00:45.3024871Z     %18 = tt.splat %17 : i64 -> tensor<1x2x128xi64, #blocked>
2026-02-21T11:00:45.3025043Z     %19 = tt.splat %15 : i64 -> tensor<2xi64, #blocked5>
2026-02-21T11:00:45.3025239Z     %20 = arith.extsi %10 : tensor<2xi32, #blocked5> to tensor<2xi64, #blocked5>
2026-02-21T11:00:45.3025438Z     %21 = arith.addi %19, %20 : tensor<2xi64, #blocked5>
2026-02-21T11:00:45.3025699Z     %22 = ttg.convert_layout %21 : tensor<2xi64, #blocked5> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:00:45.3026022Z     %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi64, #blocked6>
2026-02-21T11:00:45.3026319Z     %24 = ttg.convert_layout %23 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #blocked3>
2026-02-21T11:00:45.3026596Z     %25 = ttg.convert_layout %24 : tensor<1x2xi64, #blocked3> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T11:00:45.3026922Z     %26 = tt.expand_dims %25 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi64, #blocked7>
2026-02-21T11:00:45.3027214Z     %27 = ttg.convert_layout %26 : tensor<1x2x1xi64, #blocked7> -> tensor<1x2x1xi64, #blocked1>
2026-02-21T11:00:45.3027433Z     %28 = arith.muli %27, %cst_4 : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:00:45.3027627Z     %29 = tt.broadcast %28 : tensor<1x2x1xi64, #blocked1> -> tensor<1x2x128xi64, #blocked1>
2026-02-21T11:00:45.3027864Z     %30 = ttg.convert_layout %29 : tensor<1x2x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked>
2026-02-21T11:00:45.3028095Z     %31 = arith.extsi %13 : tensor<128xi32, #blocked5> to tensor<128xi64, #blocked5>
2026-02-21T11:00:45.3028363Z     %32 = ttg.convert_layout %31 : tensor<128xi64, #blocked5> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:00:45.3028702Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi64, #blocked6>
2026-02-21T11:00:45.3028986Z     %34 = ttg.convert_layout %33 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #blocked4>
2026-02-21T11:00:45.3029269Z     %35 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked4> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T11:00:45.3029603Z     %36 = tt.expand_dims %35 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi64, #blocked8>
2026-02-21T11:00:45.3029900Z     %37 = ttg.convert_layout %36 : tensor<1x1x128xi64, #blocked8> -> tensor<1x1x128xi64, #blocked>
2026-02-21T11:00:45.3030136Z     %38 = tt.broadcast %37 : tensor<1x1x128xi64, #blocked> -> tensor<1x2x128xi64, #blocked>
2026-02-21T11:00:45.3030332Z     %39 = arith.addi %30, %38 : tensor<1x2x128xi64, #blocked>
2026-02-21T11:00:45.3030482Z     %40 = arith.addi %18, %39 : tensor<1x2x128xi64, #blocked>
2026-02-21T11:00:45.3030679Z     %41 = tt.addptr %16, %40 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>, tensor<1x2x128xi64, #blocked>
2026-02-21T11:00:45.3030871Z     %42 = arith.cmpi sge, %14, %c0_i64 : i64
2026-02-21T11:00:45.3030993Z     %43 = arith.cmpi slt, %14, %c192_i64 : i64
2026-02-21T11:00:45.3031112Z     %44 = arith.andi %42, %43 : i1
2026-02-21T11:00:45.3031250Z     %45 = arith.cmpi sge, %27, %cst_3 : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:00:45.3031423Z     %46 = arith.cmpi slt, %27, %cst_2 : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:00:45.3031585Z     %47 = arith.andi %45, %46 : tensor<1x2x1xi1, #blocked1>
2026-02-21T11:00:45.3031733Z     %48 = tt.splat %44 : i1 -> tensor<1x2x1xi1, #blocked1>
2026-02-21T11:00:45.3031900Z     %49 = arith.andi %48, %47 : tensor<1x2x1xi1, #blocked1>
2026-02-21T11:00:45.3032082Z     %50 = tt.broadcast %49 : tensor<1x2x1xi1, #blocked1> -> tensor<1x2x128xi1, #blocked1>
2026-02-21T11:00:45.3032319Z     %51 = ttg.convert_layout %50 : tensor<1x2x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked>
2026-02-21T11:00:45.3032526Z     %52 = arith.cmpi sge, %37, %cst_1 : tensor<1x1x128xi64, #blocked>
2026-02-21T11:00:45.3032696Z     %53 = arith.cmpi slt, %37, %cst_0 : tensor<1x1x128xi64, #blocked>
2026-02-21T11:00:45.3032857Z     %54 = arith.andi %52, %53 : tensor<1x1x128xi1, #blocked>
2026-02-21T11:00:45.3033039Z     %55 = tt.broadcast %54 : tensor<1x1x128xi1, #blocked> -> tensor<1x2x128xi1, #blocked>
2026-02-21T11:00:45.3033226Z     %56 = arith.andi %51, %55 : tensor<1x2x128xi1, #blocked>
2026-02-21T11:00:45.3033386Z     %57 = tt.load %41, %56, %cst : tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:00:45.3033579Z     %58 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked5>
2026-02-21T11:00:45.3033745Z     %59 = arith.muli %8, %c262144_i32 : i32
2026-02-21T11:00:45.3033963Z     %60 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:00:45.3034309Z     %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6>
2026-02-21T11:00:45.3034595Z     %62 = ttg.convert_layout %61 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4>
2026-02-21T11:00:45.3034877Z     %63 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T11:00:45.3035229Z     %64 = tt.expand_dims %63 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x128x1xi32, #blocked9>
2026-02-21T11:00:45.3035525Z     %65 = ttg.convert_layout %64 : tensor<1x128x1xi32, #blocked9> -> tensor<1x128x1xi32, #blocked2>
2026-02-21T11:00:45.3035733Z     %66 = tt.splat %59 : i32 -> tensor<1x128x1xi32, #blocked2>
2026-02-21T11:00:45.3035888Z     %67 = arith.addi %66, %65 : tensor<1x128x1xi32, #blocked2>
2026-02-21T11:00:45.3036087Z     %68 = tt.broadcast %67 : tensor<1x128x1xi32, #blocked2> -> tensor<1x128x512xi32, #blocked2>
2026-02-21T11:00:45.3036352Z     %69 = ttg.convert_layout %68 : tensor<1x128x512xi32, #blocked2> -> tensor<1x128x512xi32, #blocked>
2026-02-21T11:00:45.3036588Z     %70 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x512x!tt.ptr<bf16>, #blocked>
2026-02-21T11:00:45.3036806Z     %71 = tt.reshape %57 : tensor<1x2x128xbf16, #blocked> -> tensor<2x128xbf16, #blocked4>
2026-02-21T11:00:45.3036999Z     %72 = tt.splat %59 : i32 -> tensor<1x512x1xi32, #blocked2>
2026-02-21T11:00:45.3037239Z     %73 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T11:00:45.3037571Z     %74 = tt.expand_dims %73 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8>
2026-02-21T11:00:45.3037871Z     %75 = ttg.convert_layout %74 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked>
2026-02-21T11:00:45.3038115Z     %76 = tt.broadcast %75 : tensor<1x1x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked>
2026-02-21T11:00:45.3038333Z     %77 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x512x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:00:45.3038720Z     %78:3 = scf.for %arg4 = %c0_i32 to %c2048_i32 step %c512_i32 iter_args(%arg5 = %cst_13, %arg6 = %cst_12, %arg7 = %cst_11) -> (tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>)  : i32 {
2026-02-21T11:00:45.3039084Z       %108 = tt.splat %arg4 : i32 -> tensor<512xi32, #blocked5>
2026-02-21T11:00:45.3039242Z       %109 = arith.addi %108, %58 : tensor<512xi32, #blocked5>
2026-02-21T11:00:45.3039482Z       %110 = ttg.convert_layout %109 : tensor<512xi32, #blocked5> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:00:45.3039850Z       %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6>
2026-02-21T11:00:45.3040148Z       %112 = ttg.convert_layout %111 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked4>
2026-02-21T11:00:45.3040441Z       %113 = ttg.convert_layout %112 : tensor<1x512xi32, #blocked4> -> tensor<1x512xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T11:00:45.3040782Z       %114 = tt.expand_dims %113 {axis = 1 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x512xi32, #blocked8>
2026-02-21T11:00:45.3041088Z       %115 = ttg.convert_layout %114 : tensor<1x1x512xi32, #blocked8> -> tensor<1x1x512xi32, #blocked>
2026-02-21T11:00:45.3041300Z       %116 = arith.muli %115, %cst_10 : tensor<1x1x512xi32, #blocked>
2026-02-21T11:00:45.3041506Z       %117 = tt.broadcast %116 : tensor<1x1x512xi32, #blocked> -> tensor<1x128x512xi32, #blocked>
2026-02-21T11:00:45.3041717Z       %118 = arith.addi %69, %117 : tensor<1x128x512xi32, #blocked>
2026-02-21T11:00:45.3041928Z       %119 = tt.addptr %70, %118 : tensor<1x128x512x!tt.ptr<bf16>, #blocked>, tensor<1x128x512xi32, #blocked>
2026-02-21T11:00:45.3042167Z       %120 = tt.load %119 : tensor<1x128x512x!tt.ptr<bf16>, #blocked>
2026-02-21T11:00:45.3042372Z       %121 = tt.reshape %120 : tensor<1x128x512xbf16, #blocked> -> tensor<128x512xbf16, #blocked4>
2026-02-21T11:00:45.3042729Z       %122 = ttg.convert_layout %71 : tensor<2x128xbf16, #blocked4> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>>
2026-02-21T11:00:45.3043089Z       %123 = ttg.convert_layout %121 : tensor<128x512xbf16, #blocked4> -> tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>>
2026-02-21T11:00:45.3043414Z       %124 = ttg.convert_layout %cst_9 : tensor<2x512xf32, #blocked4> -> tensor<2x512xf32, #blocked10>
2026-02-21T11:00:45.3043837Z       %125 = tt.dot %122, %123, %124, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x512xf32, #blocked10>
2026-02-21T11:00:45.3044247Z       %126 = ttg.convert_layout %125 : tensor<2x512xf32, #blocked10> -> tensor<2x512xf32, #blocked4>
2026-02-21T11:00:45.3044508Z       %127 = tt.reshape %126 : tensor<2x512xf32, #blocked4> -> tensor<1x2x512xf32, #blocked>
2026-02-21T11:00:45.3044769Z       %128 = arith.truncf %127 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked>
2026-02-21T11:00:45.3045005Z       %129 = arith.extf %128 : tensor<1x2x512xbf16, #blocked> to tensor<1x2x512xf32, #blocked>
2026-02-21T11:00:45.3045194Z       %130 = "tt.reduce"(%129) <{axis = 2 : i32}> ({
2026-02-21T11:00:45.3045324Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T11:00:45.3045446Z         %183 = arith.maxnumf %arg8, %arg9 : f32
2026-02-21T11:00:45.3045570Z         tt.reduce.return %183 : f32
2026-02-21T11:00:45.3045754Z       }) : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T11:00:45.3046042Z       %131 = ttg.convert_layout %130 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3046314Z       %132 = arith.truncf %131 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3>
2026-02-21T11:00:45.3046532Z       %133 = arith.extf %132 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3046724Z       %134 = arith.mulf %133, %cst_8 : tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3046911Z       %135 = arith.truncf %134 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3>
2026-02-21T11:00:45.3047127Z       %136 = arith.extf %135 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3047318Z       %137 = arith.cmpf ogt, %arg5, %136 : tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3047490Z       %138 = arith.cmpf une, %arg5, %arg5 : tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3047675Z       %139 = arith.ori %137, %138 : tensor<1x2xi1, #blocked3>
2026-02-21T11:00:45.3047867Z       %140 = arith.select %139, %arg5, %136 : tensor<1x2xi1, #blocked3>, tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3048074Z       %141 = arith.mulf %129, %cst_7 : tensor<1x2x512xf32, #blocked>
2026-02-21T11:00:45.3048276Z       %142 = arith.truncf %141 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked>
2026-02-21T11:00:45.3048559Z       %143 = ttg.convert_layout %140 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T11:00:45.3048893Z       %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T11:00:45.3049187Z       %145 = ttg.convert_layout %144 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T11:00:45.3049428Z       %146 = arith.extf %142 : tensor<1x2x512xbf16, #blocked> to tensor<1x2x512xf32, #blocked>
2026-02-21T11:00:45.3049678Z       %147 = tt.broadcast %145 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x512xf32, #blocked1>
2026-02-21T11:00:45.3049920Z       %148 = ttg.convert_layout %147 : tensor<1x2x512xf32, #blocked1> -> tensor<1x2x512xf32, #blocked>
2026-02-21T11:00:45.3050151Z       %149 = arith.subf %146, %148 : tensor<1x2x512xf32, #blocked>
2026-02-21T11:00:45.3050450Z       %150 = tt.extern_elementwise %149 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2x512xf32, #blocked>
2026-02-21T11:00:45.3050736Z       %151 = "tt.reduce"(%150) <{axis = 2 : i32}> ({
2026-02-21T11:00:45.3050861Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T11:00:45.3050993Z         %183 = arith.addf %arg8, %arg9 : f32
2026-02-21T11:00:45.3051113Z         tt.reduce.return %183 : f32
2026-02-21T11:00:45.3051292Z       }) : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T11:00:45.3051581Z       %152 = ttg.convert_layout %151 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3051824Z       %153 = arith.subf %arg5, %140 : tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3052109Z       %154 = tt.extern_elementwise %153 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked3>) -> tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3052409Z       %155 = arith.mulf %arg6, %154 : tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3052565Z       %156 = arith.addf %155, %152 : tensor<1x2xf32, #blocked3>
2026-02-21T11:00:45.3052802Z       %157 = ttg.convert_layout %154 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T11:00:45.3053136Z       %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T11:00:45.3053428Z       %159 = ttg.convert_layout %158 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T11:00:45.3053671Z       %160 = tt.broadcast %159 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1>
2026-02-21T11:00:45.3053914Z       %161 = ttg.convert_layout %160 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked>
2026-02-21T11:00:45.3054135Z       %162 = arith.mulf %arg7, %161 : tensor<1x2x128xf32, #blocked>
2026-02-21T11:00:45.3054394Z       %163 = ttg.convert_layout %112 : tensor<1x512xi32, #blocked4> -> tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T11:00:45.3054739Z       %164 = tt.expand_dims %163 {axis = 2 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x512x1xi32, #blocked9>
2026-02-21T11:00:45.3055052Z       %165 = ttg.convert_layout %164 : tensor<1x512x1xi32, #blocked9> -> tensor<1x512x1xi32, #blocked2>
2026-02-21T11:00:45.3055268Z       %166 = arith.muli %165, %cst_6 : tensor<1x512x1xi32, #blocked2>
2026-02-21T11:00:45.3055443Z       %167 = arith.addi %72, %166 : tensor<1x512x1xi32, #blocked2>
2026-02-21T11:00:45.3055674Z       %168 = tt.broadcast %167 : tensor<1x512x1xi32, #blocked2> -> tensor<1x512x128xi32, #blocked2>
2026-02-21T11:00:45.3055935Z       %169 = ttg.convert_layout %168 : tensor<1x512x128xi32, #blocked2> -> tensor<1x512x128xi32, #blocked>
2026-02-21T11:00:45.3056160Z       %170 = arith.addi %169, %76 : tensor<1x512x128xi32, #blocked>
2026-02-21T11:00:45.3056377Z       %171 = tt.addptr %77, %170 : tensor<1x512x128x!tt.ptr<bf16>, #blocked>, tensor<1x512x128xi32, #blocked>
2026-02-21T11:00:45.3056601Z       %172 = tt.load %171 : tensor<1x512x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:00:45.3056810Z       %173 = arith.truncf %150 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked>
2026-02-21T11:00:45.3057048Z       %174 = tt.reshape %162 : tensor<1x2x128xf32, #blocked> -> tensor<2x128xf32, #blocked4>
2026-02-21T11:00:45.3057286Z       %175 = tt.reshape %173 : tensor<1x2x512xbf16, #blocked> -> tensor<2x512xbf16, #blocked4>
2026-02-21T11:00:45.3057530Z       %176 = tt.reshape %172 : tensor<1x512x128xbf16, #blocked> -> tensor<512x128xbf16, #blocked4>
2026-02-21T11:00:45.3057840Z       %177 = ttg.convert_layout %175 : tensor<2x512xbf16, #blocked4> -> tensor<2x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked11}>>
2026-02-21T11:00:45.3058224Z       %178 = ttg.convert_layout %176 : tensor<512x128xbf16, #blocked4> -> tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked11}>>
2026-02-21T11:00:45.3058532Z       %179 = ttg.convert_layout %174 : tensor<2x128xf32, #blocked4> -> tensor<2x128xf32, #blocked11>
2026-02-21T11:00:45.3058950Z       %180 = tt.dot %177, %178, %179, inputPrecision = tf32 : tensor<2x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked11}>> * tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked11}>> -> tensor<2x128xf32, #blocked11>
2026-02-21T11:00:45.3059377Z       %181 = ttg.convert_layout %180 : tensor<2x128xf32, #blocked11> -> tensor<2x128xf32, #blocked4>
2026-02-21T11:00:45.3059624Z       %182 = tt.reshape %181 : tensor<2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked>
2026-02-21T11:00:45.3059899Z       scf.yield %140, %156, %182 : tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>
2026-02-21T11:00:45.3060157Z     } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32}
2026-02-21T11:00:45.3060441Z     %79 = ttg.convert_layout %78#1 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T11:00:45.3060772Z     %80 = tt.expand_dims %79 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T11:00:45.3061069Z     %81 = ttg.convert_layout %80 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T11:00:45.3061311Z     %82 = tt.broadcast %81 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1>
2026-02-21T11:00:45.3061550Z     %83 = ttg.convert_layout %82 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked>
2026-02-21T11:00:45.3061763Z     %84 = arith.divf %78#2, %83 : tensor<1x2x128xf32, #blocked>
2026-02-21T11:00:45.3061961Z     %85 = arith.truncf %84 : tensor<1x2x128xf32, #blocked> to tensor<1x2x128xbf16, #blocked>
2026-02-21T11:00:45.3062148Z     %86 = arith.muli %8, %c262144_i32 : i32
2026-02-21T11:00:45.3062368Z     %87 = ttg.convert_layout %12 : tensor<2xi32, #blocked5> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:00:45.3062683Z     %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi32, #blocked6>
2026-02-21T11:00:45.3062964Z     %89 = ttg.convert_layout %88 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #blocked3>
2026-02-21T11:00:45.3063240Z     %90 = ttg.convert_layout %89 : tensor<1x2xi32, #blocked3> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T11:00:45.3063569Z     %91 = tt.expand_dims %90 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi32, #blocked7>
2026-02-21T11:00:45.3063879Z     %92 = ttg.convert_layout %91 : tensor<1x2x1xi32, #blocked7> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T11:00:45.3064085Z     %93 = arith.muli %92, %cst_5 : tensor<1x2x1xi32, #blocked1>
2026-02-21T11:00:45.3064250Z     %94 = tt.splat %86 : i32 -> tensor<1x2x1xi32, #blocked1>
2026-02-21T11:00:45.3064404Z     %95 = arith.addi %94, %93 : tensor<1x2x1xi32, #blocked1>
2026-02-21T11:00:45.3064642Z     %96 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:00:45.3064969Z     %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6>
2026-02-21T11:00:45.3065256Z     %98 = ttg.convert_layout %97 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4>
2026-02-21T11:00:45.3065542Z     %99 = ttg.convert_layout %98 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T11:00:45.3065880Z     %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8>
2026-02-21T11:00:45.3066186Z     %101 = ttg.convert_layout %100 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked>
2026-02-21T11:00:45.3066462Z     %102 = tt.broadcast %95 : tensor<1x2x1xi32, #blocked1> -> tensor<1x2x128xi32, #blocked1>
2026-02-21T11:00:45.3066705Z     %103 = ttg.convert_layout %102 : tensor<1x2x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked>
2026-02-21T11:00:45.3066954Z     %104 = tt.broadcast %101 : tensor<1x1x128xi32, #blocked> -> tensor<1x2x128xi32, #blocked>
2026-02-21T11:00:45.3067157Z     %105 = arith.addi %103, %104 : tensor<1x2x128xi32, #blocked>
2026-02-21T11:00:45.3067366Z     %106 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:00:45.3067607Z     %107 = tt.addptr %106, %105 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>, tensor<1x2x128xi32, #blocked>
2026-02-21T11:00:45.3067820Z     tt.store %107, %85 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:00:45.3067964Z     tt.return
2026-02-21T11:00:45.3068048Z   }
2026-02-21T11:00:45.3068132Z }
2026-02-21T11:00:45.3068179Z 
2026-02-21T11:00:45.3068214Z {-#
2026-02-21T11:00:45.3068302Z   external_resources: {
2026-02-21T11:00:45.3068409Z     mlir_reproducer: {
2026-02-21T11:00:45.3070654Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T11:00:45.3072948Z       disable_threading: false,
2026-02-21T11:00:45.3073060Z       verify_each: true
2026-02-21T11:00:45.3073159Z     }
2026-02-21T11:00:45.3073234Z   }
2026-02-21T11:00:45.3073313Z #-}
2026-02-21T11:00:45.3073619Z /tmp/torchinductor_root/5g/c5giwctfjccaue6bd75t5sbqdhwhf2ay7njm5q53dk3emny6uyli.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T11:00:45.3074322Z /tmp/torchinductor_root/5g/c5giwctfjccaue6bd75t5sbqdhwhf2ay7njm5q53dk3emny6uyli.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T11:00:45.3074878Z [81s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T11:00:45.3075617Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 512], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T11:00:45.3076280Z Error: RuntimeError: PassManager::run failed
2026-02-21T11:00:45.3076457Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T11:01:22.7217953Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 2.5 configs/s
2026-02-21T11:01:22.7227683Z [119s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s])
2026-02-21T11:01:22.7508858Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 - configs/s
2026-02-21T11:01:23.7567826Z [120s] Initial random population of 100, 5 starting points: 
2026-02-21T11:01:23.7568291Z error=20
2026-02-21T11:01:23.7568513Z timeout=19
2026-02-21T11:01:23.7568725Z ok=61
2026-02-21T11:01:23.7569356Z min=2.2194
2026-02-21T11:01:23.7569563Z mid=14.6173
2026-02-21T11:01:23.7569765Z max=3143.3940
2026-02-21T11:01:23.7570019Z best={'block_sizes': [1, 256, 16],
2026-02-21T11:01:23.7570430Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T11:01:23.7570826Z  'l2_groupings': [8],
2026-02-21T11:01:23.7571101Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:01:23.7571438Z  'loop_orders': [[0, 1]],
2026-02-21T11:01:23.7571672Z  'matrix_instr_nonkdim': 16,
2026-02-21T11:01:23.7571936Z  'num_sm_multiplier': 16,
2026-02-21T11:01:23.7572163Z  'num_stages': 4,
2026-02-21T11:01:23.7572355Z  'num_warps': 16,
2026-02-21T11:01:23.7572572Z  'pid_type': 'persistent_blocked',
2026-02-21T11:01:23.7572951Z  'range_flattens': [False, True],
2026-02-21T11:01:23.7573216Z  'range_multi_buffers': [None, False],
2026-02-21T11:01:23.7573486Z  'range_num_stages': [1, 3],
2026-02-21T11:01:23.7573721Z  'range_unroll_factors': [2, 3],
2026-02-21T11:01:23.7573974Z  'range_warp_specializes': [],
2026-02-21T11:01:23.7574216Z  'waves_per_eu': 1}
2026-02-21T11:01:23.7584301Z [120s] Fitting surrogate: 100 points, 100 targets
2026-02-21T11:01:24.5109528Z [120s] Generation 1 starting: 77 neighbors, 5 active search path(s)
2026-02-21T11:02:01.9538464Z [158s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:02:01.9559233Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 0.6 configs/s
2026-02-21T11:02:09.4118078Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 10.9 configs/s
2026-02-21T11:02:09.5643567Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━ 115/115 2103.5 configs/s
2026-02-21T11:02:15.2693629Z [171s] Generation 1 complete: 
2026-02-21T11:02:15.2694065Z error=2
2026-02-21T11:02:15.2694287Z timeout=1
2026-02-21T11:02:15.2694496Z ok=80
2026-02-21T11:02:15.2694705Z min=1.7978
2026-02-21T11:02:15.2694909Z mid=3.4986
2026-02-21T11:02:15.2695147Z max=20.8857
2026-02-21T11:02:15.2695427Z best={'block_sizes': [1, 64, 16],
2026-02-21T11:02:15.2695839Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:02:15.2696352Z  'l2_groupings': [4],
2026-02-21T11:02:15.2696632Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:02:15.2696953Z  'loop_orders': [[0, 1]],
2026-02-21T11:02:15.2697241Z  'matrix_instr_nonkdim': 16,
2026-02-21T11:02:15.2697511Z  'num_stages': 2,
2026-02-21T11:02:15.2697739Z  'num_warps': 4,
2026-02-21T11:02:15.2697974Z  'pid_type': 'flat',
2026-02-21T11:02:15.2698239Z  'range_flattens': [None, False],
2026-02-21T11:02:15.2698557Z  'range_multi_buffers': [None, True],
2026-02-21T11:02:15.2698865Z  'range_num_stages': [0, 1],
2026-02-21T11:02:15.2699144Z  'range_unroll_factors': [0, 4],
2026-02-21T11:02:15.2699442Z  'range_warp_specializes': [],
2026-02-21T11:02:15.2699717Z  'waves_per_eu': 4}
2026-02-21T11:02:15.2723165Z [171s] Fitting surrogate: 183 points, 183 targets
2026-02-21T11:02:16.0794248Z [172s] Generation 2 starting: 79 neighbors, 5 active search path(s)
2026-02-21T11:02:48.3935689Z [204s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[1, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:02:48.3948830Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 0.7 configs/s
2026-02-21T11:02:55.4598086Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 11.6 configs/s
2026-02-21T11:02:56.1408772Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 129/129 69.1 configs/s
2026-02-21T11:03:02.9796517Z [219s] Generation 2 complete: 
2026-02-21T11:03:02.9796874Z error=3
2026-02-21T11:03:02.9797030Z timeout=1
2026-02-21T11:03:02.9797215Z ok=80
2026-02-21T11:03:02.9797366Z min=1.6125
2026-02-21T11:03:02.9797507Z mid=3.4795
2026-02-21T11:03:02.9797657Z max=19.0848
2026-02-21T11:03:02.9797851Z best={'block_sizes': [1, 64, 16],
2026-02-21T11:03:02.9798163Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:03:02.9798461Z  'l2_groupings': [4],
2026-02-21T11:03:02.9799023Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:03:02.9799262Z  'loop_orders': [[0, 1]],
2026-02-21T11:03:02.9799461Z  'matrix_instr_nonkdim': 16,
2026-02-21T11:03:02.9799656Z  'num_stages': 1,
2026-02-21T11:03:02.9799832Z  'num_warps': 4,
2026-02-21T11:03:02.9800057Z  'pid_type': 'flat',
2026-02-21T11:03:02.9800241Z  'range_flattens': [None, False],
2026-02-21T11:03:02.9800459Z  'range_multi_buffers': [None, None],
2026-02-21T11:03:02.9800674Z  'range_num_stages': [0, 1],
2026-02-21T11:03:02.9800871Z  'range_unroll_factors': [0, 4],
2026-02-21T11:03:02.9801089Z  'range_warp_specializes': [],
2026-02-21T11:03:02.9801285Z  'waves_per_eu': 4}
2026-02-21T11:03:02.9845051Z [219s] Fitting surrogate: 267 points, 267 targets
2026-02-21T11:03:03.8130008Z [220s] Generation 3 starting: 79 neighbors, 5 active search path(s)
2026-02-21T11:03:21.8827993Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 1.5 configs/s
2026-02-21T11:03:27.8371828Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 13.6 configs/s
2026-02-21T11:03:33.6406421Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 129/129 13.4 configs/s
2026-02-21T11:03:39.8989175Z [256s] Generation 3 complete: 
2026-02-21T11:03:39.8991481Z error=6
2026-02-21T11:03:39.8991820Z ok=78
2026-02-21T11:03:39.8992037Z min=1.5153
2026-02-21T11:03:39.8992252Z mid=2.1004
2026-02-21T11:03:39.8992449Z max=22.4364
2026-02-21T11:03:39.8992688Z best={'block_sizes': [1, 64, 64],
2026-02-21T11:03:39.8993126Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:03:39.8993536Z  'l2_groupings': [4],
2026-02-21T11:03:39.8993820Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:03:39.8994150Z  'loop_orders': [[0, 1]],
2026-02-21T11:03:39.8994427Z  'matrix_instr_nonkdim': 16,
2026-02-21T11:03:39.8994715Z  'num_stages': 1,
2026-02-21T11:03:39.8994946Z  'num_warps': 4,
2026-02-21T11:03:39.8995176Z  'pid_type': 'flat',
2026-02-21T11:03:39.8995404Z  'range_flattens': [None, False],
2026-02-21T11:03:39.8995524Z  'range_multi_buffers': [None, None],
2026-02-21T11:03:39.8995640Z  'range_num_stages': [0, 1],
2026-02-21T11:03:39.8995744Z  'range_unroll_factors': [0, 4],
2026-02-21T11:03:39.8995859Z  'range_warp_specializes': [],
2026-02-21T11:03:39.8995963Z  'waves_per_eu': 4}
2026-02-21T11:03:39.9066071Z [256s] Fitting surrogate: 351 points, 351 targets
2026-02-21T11:03:40.7297183Z [257s] Generation 4 starting: 73 neighbors, 5 active search path(s)
2026-02-21T11:04:13.0906899Z [289s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:04:13.0921397Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.9 configs/s
2026-02-21T11:04:19.4633426Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 11.5 configs/s
2026-02-21T11:04:23.5479363Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 131/131 19.0 configs/s
2026-02-21T11:04:31.0596406Z [307s] Generation 4 complete: 
2026-02-21T11:04:31.0598917Z error=3
2026-02-21T11:04:31.0599283Z timeout=1
2026-02-21T11:04:31.0599380Z ok=74
2026-02-21T11:04:31.0599460Z min=1.3760
2026-02-21T11:04:31.0599543Z mid=2.0470
2026-02-21T11:04:31.0599623Z max=18.5830
2026-02-21T11:04:31.0599721Z best={'block_sizes': [1, 64, 64],
2026-02-21T11:04:31.0599913Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:04:31.0600084Z  'l2_groupings': [4],
2026-02-21T11:04:31.0600187Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:04:31.0601569Z  'loop_orders': [[0, 1]],
2026-02-21T11:04:31.0601677Z  'matrix_instr_nonkdim': 16,
2026-02-21T11:04:31.0601784Z  'num_stages': 1,
2026-02-21T11:04:31.0601875Z  'num_warps': 4,
2026-02-21T11:04:31.0602241Z  'pid_type': 'flat',
2026-02-21T11:04:31.0602348Z  'range_flattens': [None, False],
2026-02-21T11:04:31.0602460Z  'range_multi_buffers': [None, None],
2026-02-21T11:04:31.0602631Z  'range_num_stages': [0, 1],
2026-02-21T11:04:31.0602742Z  'range_unroll_factors': [0, 4],
2026-02-21T11:04:31.0602852Z  'range_warp_specializes': [],
2026-02-21T11:04:31.0602957Z  'waves_per_eu': 4}
2026-02-21T11:04:31.0682816Z [307s] Fitting surrogate: 429 points, 429 targets
2026-02-21T11:04:31.7913124Z [308s] Generation 5 starting: 57 neighbors, 4 active search path(s)
2026-02-21T11:05:03.5082754Z [339s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[2, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:05:04.4310195Z [340s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:05:04.4328957Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 0.7 configs/s
2026-02-21T11:05:08.4304826Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 14.4 configs/s
2026-02-21T11:05:10.1507248Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 145/145 40.4 configs/s
2026-02-21T11:05:16.8629954Z [353s] Generation 5 complete: 
2026-02-21T11:05:16.8630379Z error=3
2026-02-21T11:05:16.8632923Z timeout=2
2026-02-21T11:05:16.8633231Z ok=56
2026-02-21T11:05:16.8633433Z min=1.4232
2026-02-21T11:05:16.8633639Z mid=1.8395
2026-02-21T11:05:16.8633841Z max=22.4022
2026-02-21T11:05:16.8634077Z best={'block_sizes': [1, 64, 64],
2026-02-21T11:05:16.8634451Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:05:16.8635269Z  'l2_groupings': [4],
2026-02-21T11:05:16.8635486Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:05:16.8635733Z  'loop_orders': [[0, 1]],
2026-02-21T11:05:16.8635948Z  'matrix_instr_nonkdim': 16,
2026-02-21T11:05:16.8636161Z  'num_stages': 1,
2026-02-21T11:05:16.8636339Z  'num_warps': 4,
2026-02-21T11:05:16.8636557Z  'pid_type': 'flat',
2026-02-21T11:05:16.8636760Z  'range_flattens': [None, False],
2026-02-21T11:05:16.8637004Z  'range_multi_buffers': [None, None],
2026-02-21T11:05:16.8637394Z  'range_num_stages': [0, 1],
2026-02-21T11:05:16.8637608Z  'range_unroll_factors': [0, 4],
2026-02-21T11:05:16.8637837Z  'range_warp_specializes': [],
2026-02-21T11:05:16.8638058Z  'waves_per_eu': 4}
2026-02-21T11:05:16.8676229Z [353s] Fitting surrogate: 490 points, 490 targets
2026-02-21T11:05:17.4756648Z [353s] Generation 6 starting: 52 neighbors, 3 active search path(s)
2026-02-21T11:05:50.3436903Z [386s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:05:50.3460168Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 0.7 configs/s
2026-02-21T11:05:54.1070619Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 13.9 configs/s
2026-02-21T11:05:56.3837781Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 145/145 31.7 configs/s
2026-02-21T11:06:02.4587875Z [398s] Generation 6 complete: 
2026-02-21T11:06:02.4588279Z error=3
2026-02-21T11:06:02.4588477Z timeout=1
2026-02-21T11:06:02.4588662Z ok=52
2026-02-21T11:06:02.4588840Z min=1.3725
2026-02-21T11:06:02.4589029Z mid=1.9183
2026-02-21T11:06:02.4589229Z max=18.5029
2026-02-21T11:06:02.4589447Z best={'block_sizes': [1, 64, 64],
2026-02-21T11:06:02.4589820Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:06:02.4590199Z  'l2_groupings': [4],
2026-02-21T11:06:02.4590454Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:06:02.4590740Z  'loop_orders': [[0, 1]],
2026-02-21T11:06:02.4591163Z  'matrix_instr_nonkdim': 16,
2026-02-21T11:06:02.4591405Z  'num_stages': 1,
2026-02-21T11:06:02.4591618Z  'num_warps': 4,
2026-02-21T11:06:02.4591853Z  'pid_type': 'flat',
2026-02-21T11:06:02.4592097Z  'range_flattens': [None, False],
2026-02-21T11:06:02.4592372Z  'range_multi_buffers': [None, None],
2026-02-21T11:06:02.4592654Z  'range_num_stages': [0, 1],
2026-02-21T11:06:02.4592921Z  'range_unroll_factors': [0, 4],
2026-02-21T11:06:02.4593191Z  'range_warp_specializes': [],
2026-02-21T11:06:02.4593449Z  'waves_per_eu': 4}
2026-02-21T11:06:02.4640379Z [398s] Fitting surrogate: 546 points, 546 targets
2026-02-21T11:06:03.6799416Z [400s] Generation 7 starting: 51 neighbors, 3 active search path(s)
2026-02-21T11:06:38.2377137Z [434s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[2, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:06:38.2396480Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 0.8 configs/s
2026-02-21T11:06:41.5318491Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 15.7 configs/s
2026-02-21T11:06:41.7265896Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━ 145/145 1016.5 configs/s
2026-02-21T11:06:47.8761539Z [444s] Generation 7 complete: 
2026-02-21T11:06:47.8761810Z error=7
2026-02-21T11:06:47.8761894Z timeout=1
2026-02-21T11:06:47.8761978Z ok=47
2026-02-21T11:06:47.8762053Z min=1.3521
2026-02-21T11:06:47.8762135Z mid=1.9459
2026-02-21T11:06:47.8762209Z max=9.9115
2026-02-21T11:06:47.8762297Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:06:47.8762483Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:06:47.8762720Z  'l2_groupings': [4],
2026-02-21T11:06:47.8762826Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:06:47.8763269Z  'loop_orders': [[0, 1]],
2026-02-21T11:06:47.8763373Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:06:47.8763476Z  'num_stages': 3,
2026-02-21T11:06:47.8763572Z  'num_warps': 4,
2026-02-21T11:06:47.8766442Z  'pid_type': 'flat',
2026-02-21T11:06:47.8766663Z  'range_flattens': [None, None],
2026-02-21T11:06:47.8766829Z  'range_multi_buffers': [None, None],
2026-02-21T11:06:47.8766964Z  'range_num_stages': [0, 1],
2026-02-21T11:06:47.8767088Z  'range_unroll_factors': [0, 4],
2026-02-21T11:06:47.8767208Z  'range_warp_specializes': [],
2026-02-21T11:06:47.8767320Z  'waves_per_eu': 2}
2026-02-21T11:06:47.8823622Z [444s] Fitting surrogate: 601 points, 601 targets
2026-02-21T11:06:48.3618929Z [444s] Generation 8 starting: 37 neighbors, 2 active search path(s)
2026-02-21T11:07:20.3064509Z [476s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:07:24.1572503Z [480s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:07:24.1592501Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 0.9 configs/s
2026-02-21T11:07:26.5222693Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 15.9 configs/s
2026-02-21T11:07:27.8311182Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 147/147 38.7 configs/s
2026-02-21T11:07:33.0606900Z [489s] Generation 8 complete: 
2026-02-21T11:07:33.0607337Z error=3
2026-02-21T11:07:33.0607699Z timeout=2
2026-02-21T11:07:33.0608048Z ok=35
2026-02-21T11:07:33.0608390Z min=1.4081
2026-02-21T11:07:33.0608602Z mid=1.6308
2026-02-21T11:07:33.0608802Z max=6.2155
2026-02-21T11:07:33.0609035Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:07:33.0609467Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:07:33.0609904Z  'l2_groupings': [4],
2026-02-21T11:07:33.0610191Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:07:33.0610524Z  'loop_orders': [[0, 1]],
2026-02-21T11:07:33.0610780Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:07:33.0611024Z  'num_stages': 3,
2026-02-21T11:07:33.0611238Z  'num_warps': 4,
2026-02-21T11:07:33.0611459Z  'pid_type': 'flat',
2026-02-21T11:07:33.0611704Z  'range_flattens': [None, None],
2026-02-21T11:07:33.0611973Z  'range_multi_buffers': [None, None],
2026-02-21T11:07:33.0612893Z  'range_num_stages': [0, 1],
2026-02-21T11:07:33.0613149Z  'range_unroll_factors': [0, 4],
2026-02-21T11:07:33.0613412Z  'range_warp_specializes': [],
2026-02-21T11:07:33.0613699Z  'waves_per_eu': 2}
2026-02-21T11:07:33.0658400Z [489s] Fitting surrogate: 641 points, 641 targets
2026-02-21T11:07:33.5645699Z [489s] Generation 9 starting: 39 neighbors, 2 active search path(s)
2026-02-21T11:08:06.7941657Z [523s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:08:09.3092780Z [525s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=2, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:08:09.8105734Z [526s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:08:09.8128102Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 1.0 configs/s
2026-02-21T11:08:12.8933018Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 12.8 configs/s
2026-02-21T11:08:13.0865179Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━ 147/147 1117.7 configs/s
2026-02-21T11:08:18.9014017Z [535s] Generation 9 complete: 
2026-02-21T11:08:18.9014303Z error=4
2026-02-21T11:08:18.9014407Z timeout=3
2026-02-21T11:08:18.9014507Z ok=35
2026-02-21T11:08:18.9014646Z min=1.3740
2026-02-21T11:08:18.9014747Z mid=1.6504
2026-02-21T11:08:18.9014835Z max=6.6216
2026-02-21T11:08:18.9014956Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:08:18.9015213Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:08:18.9015405Z  'l2_groupings': [4],
2026-02-21T11:08:18.9015537Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:08:18.9015695Z  'loop_orders': [[0, 1]],
2026-02-21T11:08:18.9015836Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:08:18.9015976Z  'num_stages': 3,
2026-02-21T11:08:18.9016087Z  'num_warps': 4,
2026-02-21T11:08:18.9016219Z  'pid_type': 'flat',
2026-02-21T11:08:18.9016345Z  'range_flattens': [None, None],
2026-02-21T11:08:18.9016498Z  'range_multi_buffers': [None, None],
2026-02-21T11:08:18.9016646Z  'range_num_stages': [0, 1],
2026-02-21T11:08:18.9016768Z  'range_unroll_factors': [0, 4],
2026-02-21T11:08:18.9016914Z  'range_warp_specializes': [],
2026-02-21T11:08:18.9017041Z  'waves_per_eu': 2}
2026-02-21T11:08:18.9060030Z [535s] Fitting surrogate: 683 points, 683 targets
2026-02-21T11:08:19.3837491Z [535s] Generation 10 starting: 38 neighbors, 2 active search path(s)
2026-02-21T11:08:52.2365533Z [568s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:08:52.8284371Z [569s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:08:55.2354206Z [571s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:08:55.2368411Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 1.1 configs/s
2026-02-21T11:08:57.7107490Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 38/38 15.6 configs/s
2026-02-21T11:08:59.1580440Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━ 147/147 39.3 configs/s
2026-02-21T11:09:04.4554510Z [580s] Generation 10 complete: 
2026-02-21T11:09:04.4554995Z error=4
2026-02-21T11:09:04.4555214Z timeout=3
2026-02-21T11:09:04.4555423Z ok=34
2026-02-21T11:09:04.4555631Z min=1.4854
2026-02-21T11:09:04.4555841Z mid=1.6330
2026-02-21T11:09:04.4556039Z max=26.4248
2026-02-21T11:09:04.4556279Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:09:04.4556702Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:09:04.4557720Z  'l2_groupings': [4],
2026-02-21T11:09:04.4558001Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:09:04.4558345Z  'loop_orders': [[0, 1]],
2026-02-21T11:09:04.4558629Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:09:04.4558904Z  'num_stages': 3,
2026-02-21T11:09:04.4559135Z  'num_warps': 4,
2026-02-21T11:09:04.4559375Z  'pid_type': 'flat',
2026-02-21T11:09:04.4559656Z  'range_flattens': [None, None],
2026-02-21T11:09:04.4559959Z  'range_multi_buffers': [None, None],
2026-02-21T11:09:04.4560275Z  'range_num_stages': [0, 1],
2026-02-21T11:09:04.4560558Z  'range_unroll_factors': [0, 4],
2026-02-21T11:09:04.4560858Z  'range_warp_specializes': [],
2026-02-21T11:09:04.4561135Z  'waves_per_eu': 2}
2026-02-21T11:09:04.4627339Z [580s] Fitting surrogate: 724 points, 724 targets
2026-02-21T11:09:04.9131000Z [581s] Generation 11 starting: 41 neighbors, 2 active search path(s)
2026-02-21T11:09:36.4360246Z [612s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[2, 3], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:09:38.1870911Z [614s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:09:40.9085631Z [617s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:09:41.6874195Z [618s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:09:41.6889230Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 1.1 configs/s
2026-02-21T11:09:44.3449669Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 41/41 15.7 configs/s
2026-02-21T11:09:44.4798149Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━ 147/147 1633.3 configs/s
2026-02-21T11:09:48.8052195Z [625s] Generation 11 complete: 
2026-02-21T11:09:48.8052572Z error=5
2026-02-21T11:09:48.8053215Z timeout=4
2026-02-21T11:09:48.8053428Z ok=35
2026-02-21T11:09:48.8053632Z min=1.3459
2026-02-21T11:09:48.8053846Z mid=2.0912
2026-02-21T11:09:48.8054050Z max=8.5581
2026-02-21T11:09:48.8054285Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:09:48.8054729Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:09:48.8055128Z  'l2_groupings': [4],
2026-02-21T11:09:48.8055450Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:09:48.8055751Z  'loop_orders': [[0, 1]],
2026-02-21T11:09:48.8056030Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:09:48.8056289Z  'num_stages': 3,
2026-02-21T11:09:48.8056516Z  'num_warps': 4,
2026-02-21T11:09:48.8056917Z  'pid_type': 'flat',
2026-02-21T11:09:48.8057174Z  'range_flattens': [None, None],
2026-02-21T11:09:48.8057485Z  'range_multi_buffers': [None, None],
2026-02-21T11:09:48.8057781Z  'range_num_stages': [0, 1],
2026-02-21T11:09:48.8058052Z  'range_unroll_factors': [0, 4],
2026-02-21T11:09:48.8058337Z  'range_warp_specializes': [],
2026-02-21T11:09:48.8058608Z  'waves_per_eu': 2}
2026-02-21T11:09:48.8091620Z [625s] Fitting surrogate: 768 points, 768 targets
2026-02-21T11:09:49.2550956Z [625s] Generation 12 starting: 37 neighbors, 2 active search path(s)
2026-02-21T11:10:21.7341762Z [658s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:10:22.3214801Z [658s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:10:24.1114793Z [660s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:10:24.1131417Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 1.1 configs/s
2026-02-21T11:10:26.6452782Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 14.8 configs/s
2026-02-21T11:10:26.7980533Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━ 148/148 1386.2 configs/s
2026-02-21T11:10:31.6333489Z [668s] Generation 12 complete: 
2026-02-21T11:10:31.6333705Z error=2
2026-02-21T11:10:31.6333788Z timeout=3
2026-02-21T11:10:31.6333928Z ok=35
2026-02-21T11:10:31.6334004Z min=1.3744
2026-02-21T11:10:31.6334089Z mid=1.5638
2026-02-21T11:10:31.6334174Z max=13.1922
2026-02-21T11:10:31.6334272Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:10:31.6334437Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:10:31.6334593Z  'l2_groupings': [4],
2026-02-21T11:10:31.6334705Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:10:31.6334840Z  'loop_orders': [[0, 1]],
2026-02-21T11:10:31.6334951Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:10:31.6335063Z  'num_stages': 3,
2026-02-21T11:10:31.6335159Z  'num_warps': 4,
2026-02-21T11:10:31.6335249Z  'pid_type': 'flat',
2026-02-21T11:10:31.6335357Z  'range_flattens': [None, None],
2026-02-21T11:10:31.6335914Z  'range_multi_buffers': [None, None],
2026-02-21T11:10:31.6336030Z  'range_num_stages': [0, 1],
2026-02-21T11:10:31.6336141Z  'range_unroll_factors': [0, 4],
2026-02-21T11:10:31.6336257Z  'range_warp_specializes': [],
2026-02-21T11:10:31.6336367Z  'waves_per_eu': 2}
2026-02-21T11:10:31.6396255Z [668s] Fitting surrogate: 808 points, 808 targets
2026-02-21T11:10:32.1045775Z [668s] Generation 13 starting: 40 neighbors, 2 active search path(s)
2026-02-21T11:11:06.2862938Z [702s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:11:08.4065608Z [704s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:11:08.4088932Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 1.3 configs/s
2026-02-21T11:11:11.1514844Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 40/40 14.8 configs/s
2026-02-21T11:11:11.5335271Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━ 148/148 171.4 configs/s
2026-02-21T11:11:17.2540730Z [713s] Generation 13 complete: 
2026-02-21T11:11:17.2540986Z error=3
2026-02-21T11:11:17.2541089Z timeout=2
2026-02-21T11:11:17.2541172Z ok=38
2026-02-21T11:11:17.2541253Z min=1.2978
2026-02-21T11:11:17.2541338Z mid=1.6165
2026-02-21T11:11:17.2541433Z max=11.5613
2026-02-21T11:11:17.2541532Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:11:17.2541692Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:11:17.2541850Z  'l2_groupings': [4],
2026-02-21T11:11:17.2541956Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:11:17.2542656Z  'loop_orders': [[0, 1]],
2026-02-21T11:11:17.2542796Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:11:17.2542907Z  'num_stages': 3,
2026-02-21T11:11:17.2543004Z  'num_warps': 4,
2026-02-21T11:11:17.2543093Z  'pid_type': 'flat',
2026-02-21T11:11:17.2543198Z  'range_flattens': [None, None],
2026-02-21T11:11:17.2543314Z  'range_multi_buffers': [None, None],
2026-02-21T11:11:17.2543444Z  'range_num_stages': [0, 1],
2026-02-21T11:11:17.2543550Z  'range_unroll_factors': [0, 4],
2026-02-21T11:11:17.2543764Z  'range_warp_specializes': [],
2026-02-21T11:11:17.2543874Z  'waves_per_eu': 2}
2026-02-21T11:11:17.2606023Z [713s] Fitting surrogate: 851 points, 851 targets
2026-02-21T11:11:17.7351644Z [714s] Generation 14 starting: 37 neighbors, 2 active search path(s)
2026-02-21T11:11:49.1114612Z [745s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:11:51.2721583Z [747s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[None, None], range_num_stages=[4, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:11:51.9902690Z [748s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[4, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:11:51.9922260Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 1.1 configs/s
2026-02-21T11:11:54.4943594Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 15.2 configs/s
2026-02-21T11:11:54.6420388Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━ 154/154 1466.7 configs/s
2026-02-21T11:11:59.4695959Z [755s] Generation 14 complete: 
2026-02-21T11:11:59.4696367Z error=3
2026-02-21T11:11:59.4696574Z timeout=3
2026-02-21T11:11:59.4696780Z ok=34
2026-02-21T11:11:59.4696975Z min=1.3496
2026-02-21T11:11:59.4697185Z mid=1.5922
2026-02-21T11:11:59.4697388Z max=11.9982
2026-02-21T11:11:59.4697623Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:11:59.4698065Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:11:59.4698473Z  'l2_groupings': [4],
2026-02-21T11:11:59.4698762Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:11:59.4699082Z  'loop_orders': [[0, 1]],
2026-02-21T11:11:59.4699365Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:11:59.4699634Z  'num_stages': 3,
2026-02-21T11:11:59.4699881Z  'num_warps': 4,
2026-02-21T11:11:59.4700111Z  'pid_type': 'flat',
2026-02-21T11:11:59.4700375Z  'range_flattens': [None, None],
2026-02-21T11:11:59.4700686Z  'range_multi_buffers': [None, None],
2026-02-21T11:11:59.4700957Z  'range_num_stages': [0, 1],
2026-02-21T11:11:59.4701205Z  'range_unroll_factors': [0, 4],
2026-02-21T11:11:59.4701470Z  'range_warp_specializes': [],
2026-02-21T11:11:59.4701724Z  'waves_per_eu': 2}
2026-02-21T11:11:59.4744013Z [755s] Fitting surrogate: 891 points, 891 targets
2026-02-21T11:11:59.9976189Z [756s] Generation 15 starting: 41 neighbors, 2 active search path(s)
2026-02-21T11:12:32.6529140Z [789s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:12:32.8436577Z [789s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, True], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:12:33.0215778Z [789s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[2, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:12:34.3421795Z [790s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:12:34.5693137Z [790s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:12:35.0233572Z [791s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, None], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:12:35.0254647Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 1.3 configs/s
2026-02-21T11:12:36.9964820Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 41/41 21.4 configs/s
2026-02-21T11:12:37.1443741Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━ 154/154 1499.0 configs/s
2026-02-21T11:12:41.9987658Z [798s] Generation 15 complete: 
2026-02-21T11:12:41.9988097Z error=10
2026-02-21T11:12:41.9988307Z timeout=6
2026-02-21T11:12:41.9988512Z ok=28
2026-02-21T11:12:41.9988733Z min=1.3623
2026-02-21T11:12:41.9988932Z mid=1.5670
2026-02-21T11:12:41.9989134Z max=11.7756
2026-02-21T11:12:41.9989367Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:12:41.9989804Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:12:41.9990208Z  'l2_groupings': [4],
2026-02-21T11:12:41.9990487Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:12:41.9990801Z  'loop_orders': [[0, 1]],
2026-02-21T11:12:41.9991083Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:12:41.9992086Z  'num_stages': 3,
2026-02-21T11:12:41.9992318Z  'num_warps': 4,
2026-02-21T11:12:41.9992576Z  'pid_type': 'flat',
2026-02-21T11:12:41.9992832Z  'range_flattens': [None, None],
2026-02-21T11:12:41.9993136Z  'range_multi_buffers': [None, None],
2026-02-21T11:12:41.9993442Z  'range_num_stages': [0, 1],
2026-02-21T11:12:41.9993718Z  'range_unroll_factors': [0, 4],
2026-02-21T11:12:41.9994018Z  'range_warp_specializes': [],
2026-02-21T11:12:41.9994301Z  'waves_per_eu': 2}
2026-02-21T11:12:42.0035207Z [798s] Fitting surrogate: 935 points, 935 targets
2026-02-21T11:12:42.4793653Z [798s] Generation 16 starting: 37 neighbors, 2 active search path(s)
2026-02-21T11:13:13.9223528Z [830s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[3, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:13:16.1248794Z [832s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, True], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:13:16.1266460Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 1.2 configs/s
2026-02-21T11:13:18.7862298Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 14.1 configs/s
2026-02-21T11:13:20.0479952Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━ 154/154 41.2 configs/s
2026-02-21T11:13:25.2452005Z [841s] Generation 16 complete: 
2026-02-21T11:13:25.2452443Z error=1
2026-02-21T11:13:25.2452799Z timeout=2
2026-02-21T11:13:25.2453096Z ok=37
2026-02-21T11:13:25.2453331Z min=1.3121
2026-02-21T11:13:25.2453679Z mid=1.5823
2026-02-21T11:13:25.2453902Z max=8.6099
2026-02-21T11:13:25.2454210Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:13:25.2455384Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:13:25.2455841Z  'l2_groupings': [4],
2026-02-21T11:13:25.2456135Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:13:25.2456470Z  'loop_orders': [[0, 1]],
2026-02-21T11:13:25.2456762Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:13:25.2457029Z  'num_stages': 3,
2026-02-21T11:13:25.2457269Z  'num_warps': 4,
2026-02-21T11:13:25.2457516Z  'pid_type': 'flat',
2026-02-21T11:13:25.2457788Z  'range_flattens': [None, None],
2026-02-21T11:13:25.2458095Z  'range_multi_buffers': [None, None],
2026-02-21T11:13:25.2458404Z  'range_num_stages': [0, 1],
2026-02-21T11:13:25.2458704Z  'range_unroll_factors': [0, 4],
2026-02-21T11:13:25.2458998Z  'range_warp_specializes': [],
2026-02-21T11:13:25.2459277Z  'waves_per_eu': 2}
2026-02-21T11:13:25.2513058Z [841s] Fitting surrogate: 975 points, 975 targets
2026-02-21T11:13:25.7328589Z [842s] Generation 17 starting: 38 neighbors, 2 active search path(s)
2026-02-21T11:14:00.7923925Z [877s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[2, 2], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:14:04.0228500Z [880s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, False], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:14:04.0252234Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 1.1 configs/s
2026-02-21T11:14:06.7681413Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 38/38 14.0 configs/s
2026-02-21T11:14:06.9332343Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━ 154/154 1329.3 configs/s
2026-02-21T11:14:12.3437838Z [888s] Generation 17 complete: 
2026-02-21T11:14:12.3438257Z error=3
2026-02-21T11:14:12.3438467Z timeout=2
2026-02-21T11:14:12.3438672Z ok=36
2026-02-21T11:14:12.3438874Z min=1.3965
2026-02-21T11:14:12.3439116Z mid=1.6284
2026-02-21T11:14:12.3439324Z max=11.5837
2026-02-21T11:14:12.3439580Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:14:12.3440009Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:14:12.3440413Z  'l2_groupings': [4],
2026-02-21T11:14:12.3440694Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:14:12.3441015Z  'loop_orders': [[0, 1]],
2026-02-21T11:14:12.3441302Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:14:12.3441733Z  'num_stages': 3,
2026-02-21T11:14:12.3441972Z  'num_warps': 4,
2026-02-21T11:14:12.3442206Z  'pid_type': 'flat',
2026-02-21T11:14:12.3442491Z  'range_flattens': [None, None],
2026-02-21T11:14:12.3442912Z  'range_multi_buffers': [None, None],
2026-02-21T11:14:12.3443218Z  'range_num_stages': [0, 1],
2026-02-21T11:14:12.3443508Z  'range_unroll_factors': [0, 4],
2026-02-21T11:14:12.3443810Z  'range_warp_specializes': [],
2026-02-21T11:14:12.3444092Z  'waves_per_eu': 2}
2026-02-21T11:14:12.3511113Z [888s] Fitting surrogate: 1016 points, 1016 targets
2026-02-21T11:14:12.8333573Z [889s] Generation 18 starting: 39 neighbors, 2 active search path(s)
2026-02-21T11:14:45.9809781Z [922s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[3, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:14:45.9830061Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 1.2 configs/s
2026-02-21T11:14:48.9256532Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 39/39 13.4 configs/s
2026-02-21T11:14:50.1775665Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━ 154/154 53.4 configs/s
2026-02-21T11:14:55.8575378Z [932s] Generation 18 complete: 
2026-02-21T11:14:55.8575804Z error=1
2026-02-21T11:14:55.8576024Z timeout=1
2026-02-21T11:14:55.8576282Z ok=40
2026-02-21T11:14:55.8576490Z min=1.3889
2026-02-21T11:14:55.8576693Z mid=1.6112
2026-02-21T11:14:55.8576899Z max=23.5224
2026-02-21T11:14:55.8577147Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:14:55.8577596Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:14:55.8578010Z  'l2_groupings': [4],
2026-02-21T11:14:55.8578306Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:14:55.8578632Z  'loop_orders': [[0, 1]],
2026-02-21T11:14:55.8578916Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:14:55.8579696Z  'num_stages': 3,
2026-02-21T11:14:55.8579920Z  'num_warps': 4,
2026-02-21T11:14:55.8580165Z  'pid_type': 'flat',
2026-02-21T11:14:55.8580425Z  'range_flattens': [None, None],
2026-02-21T11:14:55.8580774Z  'range_multi_buffers': [None, None],
2026-02-21T11:14:55.8581088Z  'range_num_stages': [0, 1],
2026-02-21T11:14:55.8581333Z  'range_unroll_factors': [0, 4],
2026-02-21T11:14:55.8581597Z  'range_warp_specializes': [],
2026-02-21T11:14:55.8581845Z  'waves_per_eu': 2}
2026-02-21T11:14:55.8645734Z [932s] Fitting surrogate: 1058 points, 1058 targets
2026-02-21T11:14:56.3091417Z [932s] Generation 19 starting: 34 neighbors, 2 active search path(s)
2026-02-21T11:15:29.3802160Z [965s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:15:31.8707484Z [968s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[2, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:15:32.0876205Z [968s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:15:32.0894568Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 0.7 configs/s
2026-02-21T11:15:34.3455969Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 15.3 configs/s
2026-02-21T11:15:34.5112645Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━ 154/154 1301.6 configs/s
2026-02-21T11:15:39.9258108Z [976s] Generation 19 complete: 
2026-02-21T11:15:39.9258493Z error=1
2026-02-21T11:15:39.9258714Z timeout=3
2026-02-21T11:15:39.9258923Z ok=33
2026-02-21T11:15:39.9259124Z min=1.3823
2026-02-21T11:15:39.9259338Z mid=1.5900
2026-02-21T11:15:39.9259562Z max=8.6143
2026-02-21T11:15:39.9259833Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:15:39.9260254Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:15:39.9260695Z  'l2_groupings': [4],
2026-02-21T11:15:39.9260982Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:15:39.9261320Z  'loop_orders': [[0, 1]],
2026-02-21T11:15:39.9261637Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:15:39.9261940Z  'num_stages': 3,
2026-02-21T11:15:39.9262188Z  'num_warps': 4,
2026-02-21T11:15:39.9262425Z  'pid_type': 'flat',
2026-02-21T11:15:39.9262698Z  'range_flattens': [None, None],
2026-02-21T11:15:39.9263008Z  'range_multi_buffers': [None, None],
2026-02-21T11:15:39.9263713Z  'range_num_stages': [0, 1],
2026-02-21T11:15:39.9263990Z  'range_unroll_factors': [0, 4],
2026-02-21T11:15:39.9264268Z  'range_warp_specializes': [],
2026-02-21T11:15:39.9264377Z  'waves_per_eu': 2}
2026-02-21T11:15:39.9342566Z [976s] Fitting surrogate: 1095 points, 1095 targets
2026-02-21T11:15:40.3601168Z [976s] Generation 20 starting: 30 neighbors, 2 active search path(s)
2026-02-21T11:16:04.1225009Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 0.5 configs/s
2026-02-21T11:16:06.4191955Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 30/30 13.4 configs/s
2026-02-21T11:16:06.5441065Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 154/154 - configs/s
2026-02-21T11:16:10.6239567Z [1006s] Generation 20 complete: 
2026-02-21T11:16:10.6242735Z error=4
2026-02-21T11:16:10.6243042Z ok=29
2026-02-21T11:16:10.6243245Z min=1.3522
2026-02-21T11:16:10.6243451Z mid=1.5795
2026-02-21T11:16:10.6243649Z max=25.1943
2026-02-21T11:16:10.6243944Z best={'block_sizes': [1, 128, 64],
2026-02-21T11:16:10.6244400Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:16:10.6244800Z  'l2_groupings': [4],
2026-02-21T11:16:10.6245069Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:16:10.6245366Z  'loop_orders': [[0, 1]],
2026-02-21T11:16:10.6245638Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:16:10.6245902Z  'num_stages': 3,
2026-02-21T11:16:10.6246138Z  'num_warps': 4,
2026-02-21T11:16:10.6246367Z  'pid_type': 'flat',
2026-02-21T11:16:10.6246641Z  'range_flattens': [None, None],
2026-02-21T11:16:10.6246937Z  'range_multi_buffers': [None, None],
2026-02-21T11:16:10.6247237Z  'range_num_stages': [0, 1],
2026-02-21T11:16:10.6247501Z  'range_unroll_factors': [0, 4],
2026-02-21T11:16:10.6248297Z  'range_warp_specializes': [],
2026-02-21T11:16:10.6248567Z  'waves_per_eu': 2}
2026-02-21T11:16:10.6291005Z [1007s] Fitting surrogate: 1128 points, 1128 targets
2026-02-21T11:16:10.7655015Z [1007s] Autotuning complete in 1007.1s after searching 1003 configs.
2026-02-21T11:16:10.7655584Z One can hardcode the best config and skip autotuning with:
2026-02-21T11:16:10.7657762Z     @helion.kernel(config=helion.Config(block_sizes=[1, 128, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T11:16:10.7659320Z 
2026-02-21T11:16:10.7659618Z [1007s] Code of selected kernel: /tmp/torchinductor_root/bl/cbldhhithzna2aj5wivr6b6fmcidpk7j44eu3rdkbssuax7iw2z4.py
2026-02-21T11:16:11.6798235Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T11:16:11.6799154Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T11:16:11.6800015Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T11:16:11.6800247Z WARNING:tritonbench.utils.triton_op:Completed input ID 4:
2026-02-21T11:16:11.6800416Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T11:16:11.6800551Z ------------------------------------------
2026-02-21T11:16:11.6800696Z (4, 48, 2048, 2048, 128)
2026-02-21T11:16:11.6800763Z 
2026-02-21T11:16:11.6808589Z  67%|██████▋   | 4/6 [31:45<19:42, 591.12s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5:
2026-02-21T11:16:11.6808862Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T11:16:11.6809366Z ------------------------------------------
2026-02-21T11:16:11.6809504Z (4, 48, 4096, 4096, 128)
2026-02-21T11:16:11.6810273Z INFO:tritonbench.utils.triton_op:Took 0.09ms to get benchmark function for aten
2026-02-21T11:16:12.7003577Z INFO:tritonbench.utils.triton_op:Took 1.67ms to get benchmark function for flex_attention
2026-02-21T11:16:14.2678211Z WARNING:__main__:Input tensor metadata:
2026-02-21T11:16:14.2678508Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T11:16:14.2678725Z               'dtype': 'torch.bfloat16',
2026-02-21T11:16:14.2679263Z               'shape': (4, 48, 4096, 128),
2026-02-21T11:16:14.2679452Z               'stride': (25165824, 524288, 128, 1)},
2026-02-21T11:16:14.2679638Z             { 'device': 'cuda:0',
2026-02-21T11:16:14.2679825Z               'dtype': 'torch.bfloat16',
2026-02-21T11:16:14.2679995Z               'shape': (4, 48, 4096, 128),
2026-02-21T11:16:14.2680182Z               'stride': (25165824, 524288, 128, 1)},
2026-02-21T11:16:14.2680363Z             { 'device': 'cuda:0',
2026-02-21T11:16:14.2680541Z               'dtype': 'torch.bfloat16',
2026-02-21T11:16:14.2680704Z               'shape': (4, 48, 4096, 128),
2026-02-21T11:16:14.2680884Z               'stride': (25165824, 524288, 128, 1)}),
2026-02-21T11:16:14.2681058Z   'kwargs': {}}
2026-02-21T11:16:14.2722647Z INFO:tritonbench.utils.triton_op:Took 4.83ms to get benchmark function for helion_attention
2026-02-21T11:16:14.5166678Z [0s] Autotune random seed: 2144140282
2026-02-21T11:16:15.4864810Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T11:16:48.7144387Z [33s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[3, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:16:50.7280141Z [35s] Timeout after 30s compiling Config(block_sizes=[1, 8, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:16:56.1670290Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 1, 4096], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:16:56.7732820Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:16:57.1268103Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:16:59.2854632Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:16:59.7674687Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 8, 1024], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:16:59.9613169Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 16, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:17:00.1503382Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 16, 256], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:17:00.4396133Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 8, 4096], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:17:01.0962181Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[4, 3], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:17:02.3044985Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 32, 1024], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=16, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[3, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:17:03.0777761Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 4, 2048], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:17:03.6832502Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 2, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:17:04.0959230Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[2, 0], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:17:04.3093106Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 1, 1024], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, False], range_num_stages=[1, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:17:04.4961889Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:17:05.4779296Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 32, 1024], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:17:05.8860676Z [50s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 32], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:17:05.8884149Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.3 configs/s
2026-02-21T11:19:18.4715345Z /tmp/torchinductor_root/qx/cqxoflqc4pbusmmbqhmay5me5afqx2e4hofhburtffelleatvdmk.py:57:24: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T11:19:18.4716469Z             k = tl.load(tl.make_block_ptr(k_view, [192, 128, 4096], [524288, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T11:19:18.4717120Z                        ^
2026-02-21T11:19:18.4718410Z /tmp/torchinductor_root/qx/cqxoflqc4pbusmmbqhmay5me5afqx2e4hofhburtffelleatvdmk.py:59:145: note: - use: %144 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x1024xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 8], order = [1, 0, 2]}>>) -> tensor<128x1024xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>>
2026-02-21T11:19:18.4719643Z 
2026-02-21T11:19:18.4720378Z             qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T11:19:18.4721226Z                                                                                                                                                 ^
2026-02-21T11:19:18.4722145Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T11:19:18.4722704Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}>
2026-02-21T11:19:18.4723309Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}>
2026-02-21T11:19:18.4724011Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 2, 2], order = [2, 1, 0]}>
2026-02-21T11:19:18.4724597Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}>
2026-02-21T11:19:18.4725182Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:19:18.4725767Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [2, 1, 0]}>
2026-02-21T11:19:18.4726340Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T11:19:18.4726893Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}>
2026-02-21T11:19:18.4727437Z #blocked8 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}>
2026-02-21T11:19:18.4727972Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}>
2026-02-21T11:19:18.4728600Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}>
2026-02-21T11:19:18.4729179Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}>
2026-02-21T11:19:18.4729769Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}>
2026-02-21T11:19:18.4730312Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}>
2026-02-21T11:19:18.4730849Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}>
2026-02-21T11:19:18.4731274Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [0, 1, 2]}>
2026-02-21T11:19:18.4731705Z #blocked16 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}>
2026-02-21T11:19:18.4732133Z #blocked17 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [0, 1, 2]}>
2026-02-21T11:19:18.4732604Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T11:19:18.4733326Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T11:19:18.4733880Z     %c524288_i32 = arith.constant 524288 : i32
2026-02-21T11:19:18.4734069Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T11:19:18.4734233Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T11:19:18.4734406Z     %c524288_i64 = arith.constant 524288 : i64
2026-02-21T11:19:18.4734641Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x128x1024xbf16, #blocked>
2026-02-21T11:19:18.4734924Z     %cst_0 = arith.constant dense<4096> : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:19:18.4735180Z     %cst_1 = arith.constant dense<0> : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:19:18.4735428Z     %cst_2 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked1>
2026-02-21T11:19:18.4735700Z     %cst_3 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked1>
2026-02-21T11:19:18.4735944Z     %cst_4 = arith.constant dense<128> : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:19:18.4736205Z     %cst_5 = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked2>
2026-02-21T11:19:18.4736475Z     %cst_6 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked3>
2026-02-21T11:19:18.4736778Z     %cst_7 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked3>
2026-02-21T11:19:18.4737041Z     %cst_8 = arith.constant dense<4096> : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:19:18.4737288Z     %cst_9 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:19:18.4737531Z     %cst_10 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:19:18.4737740Z     %c1024_i32 = arith.constant 1024 : i32
2026-02-21T11:19:18.4737905Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T11:19:18.4738065Z     %c304_i32 = arith.constant 304 : i32
2026-02-21T11:19:18.4738269Z     %cst_11 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked4>
2026-02-21T11:19:18.4738519Z     %cst_12 = arith.constant dense<128> : tensor<1x1024x1xi32, #blocked5>
2026-02-21T11:19:18.4738790Z     %cst_13 = arith.constant dense<0.127517432> : tensor<1x2x1024xf32, #blocked>
2026-02-21T11:19:18.4739064Z     %cst_14 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4739332Z     %cst_15 = arith.constant dense<0.000000e+00> : tensor<2x1024xf32, #blocked7>
2026-02-21T11:19:18.4739560Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T11:19:18.4739777Z     %cst_16 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked2>
2026-02-21T11:19:18.4740073Z     %cst_17 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4740314Z     %cst_18 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4740488Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T11:19:18.4740617Z     %c192_i32 = arith.constant 192 : i32
2026-02-21T11:19:18.4740751Z     %c393216_i32 = arith.constant 393216 : i32
2026-02-21T11:19:18.4740888Z     %0 = tt.get_program_id x : i32
2026-02-21T11:19:18.4741056Z     %1 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked8>
2026-02-21T11:19:18.4741284Z     %2 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked8>
2026-02-21T11:19:18.4741530Z     %3 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T11:19:18.4741755Z     %4 = arith.extsi %1 : tensor<2xi32, #blocked8> to tensor<2xi64, #blocked8>
2026-02-21T11:19:18.4741981Z     %5 = arith.extsi %2 : tensor<128xi32, #blocked8> to tensor<128xi64, #blocked8>
2026-02-21T11:19:18.4742270Z     %6 = ttg.convert_layout %5 : tensor<128xi64, #blocked8> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:19:18.4742628Z     %7 = tt.expand_dims %6 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi64, #blocked9>
2026-02-21T11:19:18.4742945Z     %8 = ttg.convert_layout %7 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #blocked10>
2026-02-21T11:19:18.4743263Z     %9 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>>
2026-02-21T11:19:18.4743645Z     %10 = tt.expand_dims %9 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi64, #blocked11>
2026-02-21T11:19:18.4743978Z     %11 = ttg.convert_layout %10 : tensor<1x1x128xi64, #blocked11> -> tensor<1x1x128xi64, #blocked3>
2026-02-21T11:19:18.4744256Z     %12 = tt.broadcast %11 : tensor<1x1x128xi64, #blocked3> -> tensor<1x2x128xi64, #blocked3>
2026-02-21T11:19:18.4744521Z     %13 = ttg.convert_layout %12 : tensor<1x2x128xi64, #blocked3> -> tensor<1x2x128xi64, #blocked2>
2026-02-21T11:19:18.4744762Z     %14 = arith.cmpi sge, %11, %cst_7 : tensor<1x1x128xi64, #blocked3>
2026-02-21T11:19:18.4744954Z     %15 = arith.cmpi slt, %11, %cst_6 : tensor<1x1x128xi64, #blocked3>
2026-02-21T11:19:18.4745148Z     %16 = arith.andi %14, %15 : tensor<1x1x128xi1, #blocked3>
2026-02-21T11:19:18.4745361Z     %17 = tt.broadcast %16 : tensor<1x1x128xi1, #blocked3> -> tensor<1x2x128xi1, #blocked3>
2026-02-21T11:19:18.4745620Z     %18 = ttg.convert_layout %17 : tensor<1x2x128xi1, #blocked3> -> tensor<1x2x128xi1, #blocked2>
2026-02-21T11:19:18.4745878Z     %19 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked8>
2026-02-21T11:19:18.4746117Z     %20 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x1024x!tt.ptr<bf16>, #blocked>
2026-02-21T11:19:18.4746431Z     %21 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked12}>>
2026-02-21T11:19:18.4746807Z     %22 = tt.expand_dims %21 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x128x1xi64, #blocked12>
2026-02-21T11:19:18.4747136Z     %23 = ttg.convert_layout %22 : tensor<1x128x1xi64, #blocked12> -> tensor<1x128x1xi64, #blocked1>
2026-02-21T11:19:18.4747416Z     %24 = tt.broadcast %23 : tensor<1x128x1xi64, #blocked1> -> tensor<1x128x1024xi64, #blocked1>
2026-02-21T11:19:18.4747696Z     %25 = ttg.convert_layout %24 : tensor<1x128x1024xi64, #blocked1> -> tensor<1x128x1024xi64, #blocked>
2026-02-21T11:19:18.4747960Z     %26 = arith.extsi %19 : tensor<1024xi32, #blocked8> to tensor<1024xi64, #blocked8>
2026-02-21T11:19:18.4748177Z     %27 = arith.cmpi sge, %23, %cst_3 : tensor<1x128x1xi64, #blocked1>
2026-02-21T11:19:18.4748367Z     %28 = arith.cmpi slt, %23, %cst_2 : tensor<1x128x1xi64, #blocked1>
2026-02-21T11:19:18.4748551Z     %29 = arith.andi %27, %28 : tensor<1x128x1xi1, #blocked1>
2026-02-21T11:19:18.4748830Z     %30 = ttg.convert_layout %2 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:19:18.4749179Z     %31 = tt.expand_dims %30 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9>
2026-02-21T11:19:18.4749495Z     %32 = ttg.convert_layout %31 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked10>
2026-02-21T11:19:18.4749805Z     %33 = ttg.convert_layout %32 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>>
2026-02-21T11:19:18.4750194Z     %34 = tt.expand_dims %33 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11>
2026-02-21T11:19:18.4750529Z     %35 = ttg.convert_layout %34 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked3>
2026-02-21T11:19:18.4750791Z     %36 = tt.broadcast %35 : tensor<1x1x128xi32, #blocked3> -> tensor<1x1024x128xi32, #blocked3>
2026-02-21T11:19:18.4751064Z     %37 = ttg.convert_layout %36 : tensor<1x1024x128xi32, #blocked3> -> tensor<1x1024x128xi32, #blocked13>
2026-02-21T11:19:18.4751324Z     %38 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x1024x128x!tt.ptr<bf16>, #blocked13>
2026-02-21T11:19:18.4751601Z     %39 = ttg.convert_layout %2 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:19:18.4751929Z     %40 = tt.expand_dims %39 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9>
2026-02-21T11:19:18.4752223Z     %41 = ttg.convert_layout %40 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked10>
2026-02-21T11:19:18.4752517Z     %42 = ttg.convert_layout %41 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>>
2026-02-21T11:19:18.4752864Z     %43 = tt.expand_dims %42 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11>
2026-02-21T11:19:18.4753178Z     %44 = ttg.convert_layout %43 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked3>
2026-02-21T11:19:18.4753422Z     %45 = tt.broadcast %44 : tensor<1x1x128xi32, #blocked3> -> tensor<1x2x128xi32, #blocked3>
2026-02-21T11:19:18.4753689Z     %46 = ttg.convert_layout %45 : tensor<1x2x128xi32, #blocked3> -> tensor<1x2x128xi32, #blocked2>
2026-02-21T11:19:18.4753921Z     %47 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T11:19:18.4754117Z     scf.for %arg4 = %0 to %c393216_i32 step %c304_i32  : i32 {
2026-02-21T11:19:18.4754272Z       %48 = arith.remsi %arg4, %c192_i32 : i32
2026-02-21T11:19:18.4754402Z       %49 = arith.divsi %arg4, %c192_i32 : i32
2026-02-21T11:19:18.4754531Z       %50 = arith.muli %49, %c2_i32 : i32
2026-02-21T11:19:18.4754700Z       %51 = tt.splat %50 : i32 -> tensor<2xi32, #blocked8>
2026-02-21T11:19:18.4754855Z       %52 = arith.addi %51, %1 : tensor<2xi32, #blocked8>
2026-02-21T11:19:18.4754996Z       %53 = arith.extsi %48 : i32 to i64
2026-02-21T11:19:18.4755121Z       %54 = arith.extsi %50 : i32 to i64
2026-02-21T11:19:18.4755244Z       %55 = arith.muli %53, %c524288_i64 : i64
2026-02-21T11:19:18.4755394Z       %56 = tt.splat %55 : i64 -> tensor<1x2x128xi64, #blocked2>
2026-02-21T11:19:18.4755553Z       %57 = tt.splat %54 : i64 -> tensor<2xi64, #blocked8>
2026-02-21T11:19:18.4755701Z       %58 = arith.addi %57, %4 : tensor<2xi64, #blocked8>
2026-02-21T11:19:18.4755937Z       %59 = ttg.convert_layout %58 : tensor<2xi64, #blocked8> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:19:18.4756255Z       %60 = tt.expand_dims %59 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x2xi64, #blocked9>
2026-02-21T11:19:18.4756542Z       %61 = ttg.convert_layout %60 : tensor<1x2xi64, #blocked9> -> tensor<1x2xi64, #blocked6>
2026-02-21T11:19:18.4756830Z       %62 = ttg.convert_layout %61 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T11:19:18.4757182Z       %63 = tt.expand_dims %62 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xi64, #blocked14>
2026-02-21T11:19:18.4757486Z       %64 = ttg.convert_layout %63 : tensor<1x2x1xi64, #blocked14> -> tensor<1x2x1xi64, #blocked4>
2026-02-21T11:19:18.4757700Z       %65 = arith.muli %64, %cst_10 : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:19:18.4757901Z       %66 = tt.broadcast %65 : tensor<1x2x1xi64, #blocked4> -> tensor<1x2x128xi64, #blocked4>
2026-02-21T11:19:18.4758151Z       %67 = ttg.convert_layout %66 : tensor<1x2x128xi64, #blocked4> -> tensor<1x2x128xi64, #blocked2>
2026-02-21T11:19:18.4758381Z       %68 = arith.addi %67, %13 : tensor<1x2x128xi64, #blocked2>
2026-02-21T11:19:18.4758546Z       %69 = arith.addi %56, %68 : tensor<1x2x128xi64, #blocked2>
2026-02-21T11:19:18.4758756Z       %70 = tt.addptr %3, %69 : tensor<1x2x128x!tt.ptr<bf16>, #blocked2>, tensor<1x2x128xi64, #blocked2>
2026-02-21T11:19:18.4758957Z       %71 = arith.cmpi sge, %53, %c0_i64 : i64
2026-02-21T11:19:18.4759095Z       %72 = arith.cmpi slt, %53, %c192_i64 : i64
2026-02-21T11:19:18.4759222Z       %73 = arith.andi %71, %72 : i1
2026-02-21T11:19:18.4759368Z       %74 = arith.cmpi sge, %64, %cst_9 : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:19:18.4759542Z       %75 = arith.cmpi slt, %64, %cst_8 : tensor<1x2x1xi64, #blocked4>
2026-02-21T11:19:18.4759711Z       %76 = arith.andi %74, %75 : tensor<1x2x1xi1, #blocked4>
2026-02-21T11:19:18.4759866Z       %77 = tt.splat %73 : i1 -> tensor<1x2x1xi1, #blocked4>
2026-02-21T11:19:18.4760024Z       %78 = arith.andi %77, %76 : tensor<1x2x1xi1, #blocked4>
2026-02-21T11:19:18.4760213Z       %79 = tt.broadcast %78 : tensor<1x2x1xi1, #blocked4> -> tensor<1x2x128xi1, #blocked4>
2026-02-21T11:19:18.4760461Z       %80 = ttg.convert_layout %79 : tensor<1x2x128xi1, #blocked4> -> tensor<1x2x128xi1, #blocked2>
2026-02-21T11:19:18.4760675Z       %81 = arith.andi %80, %18 : tensor<1x2x128xi1, #blocked2>
2026-02-21T11:19:18.4760849Z       %82 = tt.load %70, %81, %cst_5 : tensor<1x2x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T11:19:18.4761030Z       %83 = tt.splat %55 : i64 -> tensor<1x128x1024xi64, #blocked>
2026-02-21T11:19:18.4761189Z       %84 = tt.splat %73 : i1 -> tensor<1x128x1xi1, #blocked1>
2026-02-21T11:19:18.4761389Z       %85 = arith.andi %84, %29 : tensor<1x128x1xi1, #blocked1>
2026-02-21T11:19:18.4761594Z       %86 = tt.broadcast %85 : tensor<1x128x1xi1, #blocked1> -> tensor<1x128x1024xi1, #blocked1>
2026-02-21T11:19:18.4761852Z       %87 = ttg.convert_layout %86 : tensor<1x128x1024xi1, #blocked1> -> tensor<1x128x1024xi1, #blocked>
2026-02-21T11:19:18.4762109Z       %88 = tt.reshape %82 : tensor<1x2x128xbf16, #blocked2> -> tensor<2x128xbf16, #blocked10>
2026-02-21T11:19:18.4762293Z       %89 = arith.muli %48, %c524288_i32 : i32
2026-02-21T11:19:18.4762458Z       %90 = tt.splat %89 : i32 -> tensor<1x1024x1xi32, #blocked5>
2026-02-21T11:19:18.4762880Z       %91:3 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c1024_i32 iter_args(%arg6 = %cst_18, %arg7 = %cst_17, %arg8 = %cst_16) -> (tensor<1x2xf32, #blocked6>, tensor<1x2xf32, #blocked6>, tensor<1x2x128xf32, #blocked2>)  : i32 {
2026-02-21T11:19:18.4763242Z         %113 = tt.splat %arg5 : i32 -> tensor<1024xi32, #blocked8>
2026-02-21T11:19:18.4763413Z         %114 = arith.addi %113, %19 : tensor<1024xi32, #blocked8>
2026-02-21T11:19:18.4763558Z         %115 = arith.extsi %arg5 : i32 to i64
2026-02-21T11:19:18.4763709Z         %116 = tt.splat %115 : i64 -> tensor<1024xi64, #blocked8>
2026-02-21T11:19:18.4763869Z         %117 = arith.addi %116, %26 : tensor<1024xi64, #blocked8>
2026-02-21T11:19:18.4764127Z         %118 = ttg.convert_layout %117 : tensor<1024xi64, #blocked8> -> tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:19:18.4764484Z         %119 = tt.expand_dims %118 {axis = 0 : i32} : tensor<1024xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x1024xi64, #blocked9>
2026-02-21T11:19:18.4764817Z         %120 = ttg.convert_layout %119 : tensor<1x1024xi64, #blocked9> -> tensor<1x1024xi64, #blocked7>
2026-02-21T11:19:18.4765124Z         %121 = ttg.convert_layout %120 : tensor<1x1024xi64, #blocked7> -> tensor<1x1024xi64, #ttg.slice<{dim = 1, parent = #blocked15}>>
2026-02-21T11:19:18.4765487Z         %122 = tt.expand_dims %121 {axis = 1 : i32} : tensor<1x1024xi64, #ttg.slice<{dim = 1, parent = #blocked15}>> -> tensor<1x1x1024xi64, #blocked15>
2026-02-21T11:19:18.4765811Z         %123 = ttg.convert_layout %122 : tensor<1x1x1024xi64, #blocked15> -> tensor<1x1x1024xi64, #blocked>
2026-02-21T11:19:18.4766044Z         %124 = arith.muli %123, %cst_4 : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:19:18.4766282Z         %125 = tt.broadcast %124 : tensor<1x1x1024xi64, #blocked> -> tensor<1x128x1024xi64, #blocked>
2026-02-21T11:19:18.4766501Z         %126 = arith.addi %25, %125 : tensor<1x128x1024xi64, #blocked>
2026-02-21T11:19:18.4766674Z         %127 = arith.addi %83, %126 : tensor<1x128x1024xi64, #blocked>
2026-02-21T11:19:18.4766896Z         %128 = tt.addptr %20, %127 : tensor<1x128x1024x!tt.ptr<bf16>, #blocked>, tensor<1x128x1024xi64, #blocked>
2026-02-21T11:19:18.4767133Z         %129 = arith.cmpi sge, %123, %cst_1 : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:19:18.4767316Z         %130 = arith.cmpi slt, %123, %cst_0 : tensor<1x1x1024xi64, #blocked>
2026-02-21T11:19:18.4767499Z         %131 = arith.andi %129, %130 : tensor<1x1x1024xi1, #blocked>
2026-02-21T11:19:18.4767705Z         %132 = tt.broadcast %131 : tensor<1x1x1024xi1, #blocked> -> tensor<1x128x1024xi1, #blocked>
2026-02-21T11:19:18.4767921Z         %133 = arith.andi %87, %132 : tensor<1x128x1024xi1, #blocked>
2026-02-21T11:19:18.4768108Z         %134 = tt.load %128, %133, %cst : tensor<1x128x1024x!tt.ptr<bf16>, #blocked>
2026-02-21T11:19:18.4768335Z         %135 = tt.reshape %134 : tensor<1x128x1024xbf16, #blocked> -> tensor<128x1024xbf16, #blocked7>
2026-02-21T11:19:18.4768654Z         %136 = ttg.convert_layout %88 : tensor<2x128xbf16, #blocked10> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>>
2026-02-21T11:19:18.4769025Z         %137 = ttg.convert_layout %135 : tensor<128x1024xbf16, #blocked7> -> tensor<128x1024xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>>
2026-02-21T11:19:18.4769353Z         %138 = ttg.convert_layout %cst_15 : tensor<2x1024xf32, #blocked7> -> tensor<2x1024xf32, #blocked16>
2026-02-21T11:19:18.4769810Z         %139 = tt.dot %136, %137, %138, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked16}>> * tensor<128x1024xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked16}>> -> tensor<2x1024xf32, #blocked16>
2026-02-21T11:19:18.4770231Z         %140 = ttg.convert_layout %139 : tensor<2x1024xf32, #blocked16> -> tensor<2x1024xf32, #blocked7>
2026-02-21T11:19:18.4770485Z         %141 = tt.reshape %140 : tensor<2x1024xf32, #blocked7> -> tensor<1x2x1024xf32, #blocked>
2026-02-21T11:19:18.4770752Z         %142 = arith.truncf %141 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked>
2026-02-21T11:19:18.4770998Z         %143 = arith.extf %142 : tensor<1x2x1024xbf16, #blocked> to tensor<1x2x1024xf32, #blocked>
2026-02-21T11:19:18.4771196Z         %144 = "tt.reduce"(%143) <{axis = 2 : i32}> ({
2026-02-21T11:19:18.4771328Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T11:19:18.4771457Z           %199 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T11:19:18.4771588Z           tt.reduce.return %199 : f32
2026-02-21T11:19:18.4771785Z         }) : (tensor<1x2x1024xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T11:19:18.4772083Z         %145 = ttg.convert_layout %144 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4772359Z         %146 = arith.truncf %145 : tensor<1x2xf32, #blocked6> to tensor<1x2xbf16, #blocked6>
2026-02-21T11:19:18.4772621Z         %147 = arith.extf %146 : tensor<1x2xbf16, #blocked6> to tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4772824Z         %148 = arith.mulf %147, %cst_14 : tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4773049Z         %149 = arith.truncf %148 : tensor<1x2xf32, #blocked6> to tensor<1x2xbf16, #blocked6>
2026-02-21T11:19:18.4773275Z         %150 = arith.extf %149 : tensor<1x2xbf16, #blocked6> to tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4773474Z         %151 = arith.cmpf ogt, %arg6, %150 : tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4773653Z         %152 = arith.cmpf une, %arg6, %arg6 : tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4773829Z         %153 = arith.ori %151, %152 : tensor<1x2xi1, #blocked6>
2026-02-21T11:19:18.4774030Z         %154 = arith.select %153, %arg6, %150 : tensor<1x2xi1, #blocked6>, tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4774265Z         %155 = arith.mulf %143, %cst_13 : tensor<1x2x1024xf32, #blocked>
2026-02-21T11:19:18.4774477Z         %156 = arith.truncf %155 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked>
2026-02-21T11:19:18.4774779Z         %157 = ttg.convert_layout %154 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T11:19:18.4775130Z         %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14>
2026-02-21T11:19:18.4775436Z         %159 = ttg.convert_layout %158 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4>
2026-02-21T11:19:18.4775692Z         %160 = arith.extf %156 : tensor<1x2x1024xbf16, #blocked> to tensor<1x2x1024xf32, #blocked>
2026-02-21T11:19:18.4775938Z         %161 = tt.broadcast %159 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x1024xf32, #blocked4>
2026-02-21T11:19:18.4776200Z         %162 = ttg.convert_layout %161 : tensor<1x2x1024xf32, #blocked4> -> tensor<1x2x1024xf32, #blocked>
2026-02-21T11:19:18.4776425Z         %163 = arith.subf %160, %162 : tensor<1x2x1024xf32, #blocked>
2026-02-21T11:19:18.4776741Z         %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x1024xf32, #blocked>) -> tensor<1x2x1024xf32, #blocked>
2026-02-21T11:19:18.4777043Z         %165 = "tt.reduce"(%164) <{axis = 2 : i32}> ({
2026-02-21T11:19:18.4777176Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T11:19:18.4777304Z           %199 = arith.addf %arg9, %arg10 : f32
2026-02-21T11:19:18.4777451Z           tt.reduce.return %199 : f32
2026-02-21T11:19:18.4777640Z         }) : (tensor<1x2x1024xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T11:19:18.4777936Z         %166 = ttg.convert_layout %165 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4778182Z         %167 = arith.subf %arg6, %154 : tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4778475Z         %168 = tt.extern_elementwise %167 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked6>) -> tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4778788Z         %169 = arith.mulf %arg7, %168 : tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4778950Z         %170 = arith.addf %169, %166 : tensor<1x2xf32, #blocked6>
2026-02-21T11:19:18.4779202Z         %171 = ttg.convert_layout %168 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T11:19:18.4779544Z         %172 = tt.expand_dims %171 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14>
2026-02-21T11:19:18.4779856Z         %173 = ttg.convert_layout %172 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4>
2026-02-21T11:19:18.4780108Z         %174 = tt.broadcast %173 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4>
2026-02-21T11:19:18.4780365Z         %175 = ttg.convert_layout %174 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked2>
2026-02-21T11:19:18.4780591Z         %176 = arith.mulf %arg8, %175 : tensor<1x2x128xf32, #blocked2>
2026-02-21T11:19:18.4780861Z         %177 = ttg.convert_layout %114 : tensor<1024xi32, #blocked8> -> tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:19:18.4781208Z         %178 = tt.expand_dims %177 {axis = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x1024xi32, #blocked9>
2026-02-21T11:19:18.4781519Z         %179 = ttg.convert_layout %178 : tensor<1x1024xi32, #blocked9> -> tensor<1x1024xi32, #blocked7>
2026-02-21T11:19:18.4781820Z         %180 = ttg.convert_layout %179 : tensor<1x1024xi32, #blocked7> -> tensor<1x1024xi32, #ttg.slice<{dim = 2, parent = #blocked17}>>
2026-02-21T11:19:18.4782202Z         %181 = tt.expand_dims %180 {axis = 2 : i32} : tensor<1x1024xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> -> tensor<1x1024x1xi32, #blocked17>
2026-02-21T11:19:18.4782530Z         %182 = ttg.convert_layout %181 : tensor<1x1024x1xi32, #blocked17> -> tensor<1x1024x1xi32, #blocked5>
2026-02-21T11:19:18.4782761Z         %183 = arith.muli %182, %cst_12 : tensor<1x1024x1xi32, #blocked5>
2026-02-21T11:19:18.4782940Z         %184 = arith.addi %90, %183 : tensor<1x1024x1xi32, #blocked5>
2026-02-21T11:19:18.4783153Z         %185 = tt.broadcast %184 : tensor<1x1024x1xi32, #blocked5> -> tensor<1x1024x128xi32, #blocked5>
2026-02-21T11:19:18.4783430Z         %186 = ttg.convert_layout %185 : tensor<1x1024x128xi32, #blocked5> -> tensor<1x1024x128xi32, #blocked13>
2026-02-21T11:19:18.4783660Z         %187 = arith.addi %186, %37 : tensor<1x1024x128xi32, #blocked13>
2026-02-21T11:19:18.4783896Z         %188 = tt.addptr %38, %187 : tensor<1x1024x128x!tt.ptr<bf16>, #blocked13>, tensor<1x1024x128xi32, #blocked13>
2026-02-21T11:19:18.4784137Z         %189 = tt.load %188 : tensor<1x1024x128x!tt.ptr<bf16>, #blocked13>
2026-02-21T11:19:18.4784352Z         %190 = arith.truncf %164 : tensor<1x2x1024xf32, #blocked> to tensor<1x2x1024xbf16, #blocked>
2026-02-21T11:19:18.4784603Z         %191 = tt.reshape %176 : tensor<1x2x128xf32, #blocked2> -> tensor<2x128xf32, #blocked10>
2026-02-21T11:19:18.4784844Z         %192 = tt.reshape %190 : tensor<1x2x1024xbf16, #blocked> -> tensor<2x1024xbf16, #blocked7>
2026-02-21T11:19:18.4785105Z         %193 = tt.reshape %189 : tensor<1x1024x128xbf16, #blocked13> -> tensor<1024x128xbf16, #blocked10>
2026-02-21T11:19:18.4785425Z         %194 = ttg.convert_layout %192 : tensor<2x1024xbf16, #blocked7> -> tensor<2x1024xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>>
2026-02-21T11:19:18.4785829Z         %195 = ttg.convert_layout %193 : tensor<1024x128xbf16, #blocked10> -> tensor<1024x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>>
2026-02-21T11:19:18.4786148Z         %196 = ttg.convert_layout %191 : tensor<2x128xf32, #blocked10> -> tensor<2x128xf32, #blocked10>
2026-02-21T11:19:18.4786574Z         %197 = tt.dot %194, %195, %196, inputPrecision = tf32 : tensor<2x1024xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<1024x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x128xf32, #blocked10>
2026-02-21T11:19:18.4787001Z         %198 = tt.reshape %197 : tensor<2x128xf32, #blocked10> -> tensor<1x2x128xf32, #blocked2>
2026-02-21T11:19:18.4787279Z         scf.yield %154, %170, %198 : tensor<1x2xf32, #blocked6>, tensor<1x2xf32, #blocked6>, tensor<1x2x128xf32, #blocked2>
2026-02-21T11:19:18.4787489Z       }
2026-02-21T11:19:18.4787688Z       %92 = ttg.convert_layout %91#1 : tensor<1x2xf32, #blocked6> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T11:19:18.4788035Z       %93 = tt.expand_dims %92 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xf32, #blocked14>
2026-02-21T11:19:18.4788333Z       %94 = ttg.convert_layout %93 : tensor<1x2x1xf32, #blocked14> -> tensor<1x2x1xf32, #blocked4>
2026-02-21T11:19:18.4788580Z       %95 = tt.broadcast %94 : tensor<1x2x1xf32, #blocked4> -> tensor<1x2x128xf32, #blocked4>
2026-02-21T11:19:18.4788822Z       %96 = ttg.convert_layout %95 : tensor<1x2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked2>
2026-02-21T11:19:18.4789041Z       %97 = arith.divf %91#2, %96 : tensor<1x2x128xf32, #blocked2>
2026-02-21T11:19:18.4789263Z       %98 = arith.truncf %97 : tensor<1x2x128xf32, #blocked2> to tensor<1x2x128xbf16, #blocked2>
2026-02-21T11:19:18.4789456Z       %99 = arith.muli %48, %c524288_i32 : i32
2026-02-21T11:19:18.4789686Z       %100 = ttg.convert_layout %52 : tensor<2xi32, #blocked8> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:19:18.4790014Z       %101 = tt.expand_dims %100 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x2xi32, #blocked9>
2026-02-21T11:19:18.4790309Z       %102 = ttg.convert_layout %101 : tensor<1x2xi32, #blocked9> -> tensor<1x2xi32, #blocked6>
2026-02-21T11:19:18.4790621Z       %103 = ttg.convert_layout %102 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T11:19:18.4790961Z       %104 = tt.expand_dims %103 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x2x1xi32, #blocked14>
2026-02-21T11:19:18.4791275Z       %105 = ttg.convert_layout %104 : tensor<1x2x1xi32, #blocked14> -> tensor<1x2x1xi32, #blocked4>
2026-02-21T11:19:18.4791491Z       %106 = arith.muli %105, %cst_11 : tensor<1x2x1xi32, #blocked4>
2026-02-21T11:19:18.4791665Z       %107 = tt.splat %99 : i32 -> tensor<1x2x1xi32, #blocked4>
2026-02-21T11:19:18.4791835Z       %108 = arith.addi %107, %106 : tensor<1x2x1xi32, #blocked4>
2026-02-21T11:19:18.4792041Z       %109 = tt.broadcast %108 : tensor<1x2x1xi32, #blocked4> -> tensor<1x2x128xi32, #blocked4>
2026-02-21T11:19:18.4792294Z       %110 = ttg.convert_layout %109 : tensor<1x2x128xi32, #blocked4> -> tensor<1x2x128xi32, #blocked2>
2026-02-21T11:19:18.4792510Z       %111 = arith.addi %110, %46 : tensor<1x2x128xi32, #blocked2>
2026-02-21T11:19:18.4792731Z       %112 = tt.addptr %47, %111 : tensor<1x2x128x!tt.ptr<bf16>, #blocked2>, tensor<1x2x128xi32, #blocked2>
2026-02-21T11:19:18.4792958Z       tt.store %112, %98 : tensor<1x2x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T11:19:18.4793108Z     } {tt.loop_unroll_factor = 1 : i32}
2026-02-21T11:19:18.4793228Z     tt.return
2026-02-21T11:19:18.4793317Z   }
2026-02-21T11:19:18.4793401Z }
2026-02-21T11:19:18.4793447Z 
2026-02-21T11:19:18.4793480Z {-#
2026-02-21T11:19:18.4793568Z   external_resources: {
2026-02-21T11:19:18.4793675Z     mlir_reproducer: {
2026-02-21T11:19:18.4795933Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T11:19:18.4798296Z       disable_threading: false,
2026-02-21T11:19:18.4798406Z       verify_each: true
2026-02-21T11:19:18.4798508Z     }
2026-02-21T11:19:18.4798582Z   }
2026-02-21T11:19:18.4798659Z #-}
2026-02-21T11:19:18.4798939Z /tmp/torchinductor_root/qx/cqxoflqc4pbusmmbqhmay5me5afqx2e4hofhburtffelleatvdmk.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T11:19:18.4799693Z /tmp/torchinductor_root/qx/cqxoflqc4pbusmmbqhmay5me5afqx2e4hofhburtffelleatvdmk.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T11:19:18.4800255Z [182s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T11:19:18.4801077Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 1024], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[0, 0], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T11:19:18.4801823Z Error: RuntimeError: PassManager::run failed
2026-02-21T11:19:18.4802002Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T11:21:15.9693968Z /tmp/torchinductor_root/qe/cqe554l7qr33o6wxmnihp3prhkkxheyptmy4cme6pxp6ul4zcf3u.py:55:130: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T11:21:15.9695144Z         k = tl.load(k_view + (indices_0[:, None, None] * 524288 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None)
2026-02-21T11:21:15.9695826Z                                                                                                                                  ^
2026-02-21T11:21:15.9697617Z /tmp/torchinductor_root/qe/cqe554l7qr33o6wxmnihp3prhkkxheyptmy4cme6pxp6ul4zcf3u.py:57:141: note: - use: %132 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x512xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x512xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>>
2026-02-21T11:21:15.9699170Z 
2026-02-21T11:21:15.9700064Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T11:21:15.9701782Z                                                                                                                                             ^
2026-02-21T11:21:15.9702255Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T11:21:15.9702903Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:21:15.9703733Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:21:15.9704714Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:21:15.9705368Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T11:21:15.9705915Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T11:21:15.9706429Z #blocked5 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}>
2026-02-21T11:21:15.9706944Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}>
2026-02-21T11:21:15.9707499Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T11:21:15.9708073Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T11:21:15.9708718Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T11:21:15.9709268Z #blocked10 = #ttg.blocked<{sizePerThread = [2, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T11:21:15.9709809Z #blocked11 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T11:21:15.9710407Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T11:21:15.9711494Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T11:21:15.9712218Z     %c524288_i32 = arith.constant 524288 : i32
2026-02-21T11:21:15.9712454Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T11:21:15.9712668Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T11:21:15.9712892Z     %c524288_i64 = arith.constant 524288 : i64
2026-02-21T11:21:15.9713191Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked>
2026-02-21T11:21:15.9713551Z     %cst_0 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked>
2026-02-21T11:21:15.9713880Z     %cst_1 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked>
2026-02-21T11:21:15.9714205Z     %cst_2 = arith.constant dense<4096> : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:21:15.9714529Z     %cst_3 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:21:15.9714827Z     %cst_4 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:21:15.9715039Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T11:21:15.9715206Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T11:21:15.9715366Z     %c3072_i32 = arith.constant 3072 : i32
2026-02-21T11:21:15.9715566Z     %cst_5 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked1>
2026-02-21T11:21:15.9715806Z     %cst_6 = arith.constant dense<128> : tensor<1x512x1xi32, #blocked2>
2026-02-21T11:21:15.9716061Z     %cst_7 = arith.constant dense<0.127517432> : tensor<1x2x512xf32, #blocked>
2026-02-21T11:21:15.9716323Z     %cst_8 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9716606Z     %cst_9 = arith.constant dense<0.000000e+00> : tensor<2x512xf32, #blocked4>
2026-02-21T11:21:15.9716863Z     %cst_10 = arith.constant dense<128> : tensor<1x1x512xi32, #blocked>
2026-02-21T11:21:15.9717060Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T11:21:15.9717265Z     %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked>
2026-02-21T11:21:15.9717529Z     %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9717788Z     %cst_13 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9718017Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T11:21:15.9718172Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T11:21:15.9718331Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T11:21:15.9718486Z     %0 = tt.get_program_id x : i32
2026-02-21T11:21:15.9718642Z     %1 = arith.divsi %0, %c3072_i32 : i32
2026-02-21T11:21:15.9718801Z     %2 = arith.muli %1, %c16_i32 : i32
2026-02-21T11:21:15.9718953Z     %3 = arith.subi %c2048_i32, %2 : i32
2026-02-21T11:21:15.9719104Z     %4 = arith.minsi %3, %c16_i32 : i32
2026-02-21T11:21:15.9719258Z     %5 = arith.remsi %0, %c3072_i32 : i32
2026-02-21T11:21:15.9719408Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T11:21:15.9719558Z     %7 = arith.addi %2, %6 : i32
2026-02-21T11:21:15.9719704Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T11:21:15.9719850Z     %9 = arith.muli %7, %c2_i32 : i32
2026-02-21T11:21:15.9720061Z     %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked5>
2026-02-21T11:21:15.9720304Z     %11 = tt.splat %9 : i32 -> tensor<2xi32, #blocked5>
2026-02-21T11:21:15.9720505Z     %12 = arith.addi %11, %10 : tensor<2xi32, #blocked5>
2026-02-21T11:21:15.9720760Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked5>
2026-02-21T11:21:15.9720986Z     %14 = arith.extsi %8 : i32 to i64
2026-02-21T11:21:15.9721140Z     %15 = arith.extsi %9 : i32 to i64
2026-02-21T11:21:15.9721350Z     %16 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:21:15.9721579Z     %17 = arith.muli %14, %c524288_i64 : i64
2026-02-21T11:21:15.9721767Z     %18 = tt.splat %17 : i64 -> tensor<1x2x128xi64, #blocked>
2026-02-21T11:21:15.9721972Z     %19 = tt.splat %15 : i64 -> tensor<2xi64, #blocked5>
2026-02-21T11:21:15.9722222Z     %20 = arith.extsi %10 : tensor<2xi32, #blocked5> to tensor<2xi64, #blocked5>
2026-02-21T11:21:15.9722456Z     %21 = arith.addi %19, %20 : tensor<2xi64, #blocked5>
2026-02-21T11:21:15.9722859Z     %22 = ttg.convert_layout %21 : tensor<2xi64, #blocked5> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:21:15.9723286Z     %23 = tt.expand_dims %22 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi64, #blocked6>
2026-02-21T11:21:15.9723664Z     %24 = ttg.convert_layout %23 : tensor<1x2xi64, #blocked6> -> tensor<1x2xi64, #blocked3>
2026-02-21T11:21:15.9724031Z     %25 = ttg.convert_layout %24 : tensor<1x2xi64, #blocked3> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T11:21:15.9724474Z     %26 = tt.expand_dims %25 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi64, #blocked7>
2026-02-21T11:21:15.9724865Z     %27 = ttg.convert_layout %26 : tensor<1x2x1xi64, #blocked7> -> tensor<1x2x1xi64, #blocked1>
2026-02-21T11:21:15.9725143Z     %28 = arith.muli %27, %cst_4 : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:21:15.9725408Z     %29 = tt.broadcast %28 : tensor<1x2x1xi64, #blocked1> -> tensor<1x2x128xi64, #blocked1>
2026-02-21T11:21:15.9725699Z     %30 = ttg.convert_layout %29 : tensor<1x2x128xi64, #blocked1> -> tensor<1x2x128xi64, #blocked>
2026-02-21T11:21:15.9725950Z     %31 = arith.extsi %13 : tensor<128xi32, #blocked5> to tensor<128xi64, #blocked5>
2026-02-21T11:21:15.9726234Z     %32 = ttg.convert_layout %31 : tensor<128xi64, #blocked5> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:21:15.9726576Z     %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi64, #blocked6>
2026-02-21T11:21:15.9726902Z     %34 = ttg.convert_layout %33 : tensor<1x128xi64, #blocked6> -> tensor<1x128xi64, #blocked4>
2026-02-21T11:21:15.9727201Z     %35 = ttg.convert_layout %34 : tensor<1x128xi64, #blocked4> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T11:21:15.9727563Z     %36 = tt.expand_dims %35 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi64, #blocked8>
2026-02-21T11:21:15.9727902Z     %37 = ttg.convert_layout %36 : tensor<1x1x128xi64, #blocked8> -> tensor<1x1x128xi64, #blocked>
2026-02-21T11:21:15.9728156Z     %38 = tt.broadcast %37 : tensor<1x1x128xi64, #blocked> -> tensor<1x2x128xi64, #blocked>
2026-02-21T11:21:15.9728366Z     %39 = arith.addi %30, %38 : tensor<1x2x128xi64, #blocked>
2026-02-21T11:21:15.9728532Z     %40 = arith.addi %18, %39 : tensor<1x2x128xi64, #blocked>
2026-02-21T11:21:15.9728749Z     %41 = tt.addptr %16, %40 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>, tensor<1x2x128xi64, #blocked>
2026-02-21T11:21:15.9728953Z     %42 = arith.cmpi sge, %14, %c0_i64 : i64
2026-02-21T11:21:15.9729087Z     %43 = arith.cmpi slt, %14, %c192_i64 : i64
2026-02-21T11:21:15.9729219Z     %44 = arith.andi %42, %43 : i1
2026-02-21T11:21:15.9729369Z     %45 = arith.cmpi sge, %27, %cst_3 : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:21:15.9729557Z     %46 = arith.cmpi slt, %27, %cst_2 : tensor<1x2x1xi64, #blocked1>
2026-02-21T11:21:15.9729730Z     %47 = arith.andi %45, %46 : tensor<1x2x1xi1, #blocked1>
2026-02-21T11:21:15.9729896Z     %48 = tt.splat %44 : i1 -> tensor<1x2x1xi1, #blocked1>
2026-02-21T11:21:15.9730077Z     %49 = arith.andi %48, %47 : tensor<1x2x1xi1, #blocked1>
2026-02-21T11:21:15.9730272Z     %50 = tt.broadcast %49 : tensor<1x2x1xi1, #blocked1> -> tensor<1x2x128xi1, #blocked1>
2026-02-21T11:21:15.9730525Z     %51 = ttg.convert_layout %50 : tensor<1x2x128xi1, #blocked1> -> tensor<1x2x128xi1, #blocked>
2026-02-21T11:21:15.9730749Z     %52 = arith.cmpi sge, %37, %cst_1 : tensor<1x1x128xi64, #blocked>
2026-02-21T11:21:15.9730935Z     %53 = arith.cmpi slt, %37, %cst_0 : tensor<1x1x128xi64, #blocked>
2026-02-21T11:21:15.9731107Z     %54 = arith.andi %52, %53 : tensor<1x1x128xi1, #blocked>
2026-02-21T11:21:15.9731329Z     %55 = tt.broadcast %54 : tensor<1x1x128xi1, #blocked> -> tensor<1x2x128xi1, #blocked>
2026-02-21T11:21:15.9731533Z     %56 = arith.andi %51, %55 : tensor<1x2x128xi1, #blocked>
2026-02-21T11:21:15.9731710Z     %57 = tt.load %41, %56, %cst : tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:21:15.9731915Z     %58 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked5>
2026-02-21T11:21:15.9732095Z     %59 = arith.muli %8, %c524288_i32 : i32
2026-02-21T11:21:15.9732330Z     %60 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:21:15.9732675Z     %61 = tt.expand_dims %60 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6>
2026-02-21T11:21:15.9732978Z     %62 = ttg.convert_layout %61 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4>
2026-02-21T11:21:15.9733282Z     %63 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T11:21:15.9733638Z     %64 = tt.expand_dims %63 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x128x1xi32, #blocked9>
2026-02-21T11:21:15.9733957Z     %65 = ttg.convert_layout %64 : tensor<1x128x1xi32, #blocked9> -> tensor<1x128x1xi32, #blocked2>
2026-02-21T11:21:15.9734182Z     %66 = tt.splat %59 : i32 -> tensor<1x128x1xi32, #blocked2>
2026-02-21T11:21:15.9734348Z     %67 = arith.addi %66, %65 : tensor<1x128x1xi32, #blocked2>
2026-02-21T11:21:15.9734561Z     %68 = tt.broadcast %67 : tensor<1x128x1xi32, #blocked2> -> tensor<1x128x512xi32, #blocked2>
2026-02-21T11:21:15.9734850Z     %69 = ttg.convert_layout %68 : tensor<1x128x512xi32, #blocked2> -> tensor<1x128x512xi32, #blocked>
2026-02-21T11:21:15.9735091Z     %70 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x512x!tt.ptr<bf16>, #blocked>
2026-02-21T11:21:15.9735313Z     %71 = tt.reshape %57 : tensor<1x2x128xbf16, #blocked> -> tensor<2x128xbf16, #blocked4>
2026-02-21T11:21:15.9735508Z     %72 = tt.splat %59 : i32 -> tensor<1x512x1xi32, #blocked2>
2026-02-21T11:21:15.9735752Z     %73 = ttg.convert_layout %62 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T11:21:15.9736108Z     %74 = tt.expand_dims %73 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8>
2026-02-21T11:21:15.9736410Z     %75 = ttg.convert_layout %74 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked>
2026-02-21T11:21:15.9736656Z     %76 = tt.broadcast %75 : tensor<1x1x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked>
2026-02-21T11:21:15.9736882Z     %77 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x512x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:21:15.9737278Z     %78:3 = scf.for %arg4 = %c0_i32 to %c4096_i32 step %c512_i32 iter_args(%arg5 = %cst_13, %arg6 = %cst_12, %arg7 = %cst_11) -> (tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>)  : i32 {
2026-02-21T11:21:15.9737643Z       %108 = tt.splat %arg4 : i32 -> tensor<512xi32, #blocked5>
2026-02-21T11:21:15.9737819Z       %109 = arith.addi %108, %58 : tensor<512xi32, #blocked5>
2026-02-21T11:21:15.9738065Z       %110 = ttg.convert_layout %109 : tensor<512xi32, #blocked5> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:21:15.9738427Z       %111 = tt.expand_dims %110 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x512xi32, #blocked6>
2026-02-21T11:21:15.9738725Z       %112 = ttg.convert_layout %111 : tensor<1x512xi32, #blocked6> -> tensor<1x512xi32, #blocked4>
2026-02-21T11:21:15.9739022Z       %113 = ttg.convert_layout %112 : tensor<1x512xi32, #blocked4> -> tensor<1x512xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T11:21:15.9739371Z       %114 = tt.expand_dims %113 {axis = 1 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x512xi32, #blocked8>
2026-02-21T11:21:15.9739702Z       %115 = ttg.convert_layout %114 : tensor<1x1x512xi32, #blocked8> -> tensor<1x1x512xi32, #blocked>
2026-02-21T11:21:15.9739917Z       %116 = arith.muli %115, %cst_10 : tensor<1x1x512xi32, #blocked>
2026-02-21T11:21:15.9740136Z       %117 = tt.broadcast %116 : tensor<1x1x512xi32, #blocked> -> tensor<1x128x512xi32, #blocked>
2026-02-21T11:21:15.9740345Z       %118 = arith.addi %69, %117 : tensor<1x128x512xi32, #blocked>
2026-02-21T11:21:15.9740563Z       %119 = tt.addptr %70, %118 : tensor<1x128x512x!tt.ptr<bf16>, #blocked>, tensor<1x128x512xi32, #blocked>
2026-02-21T11:21:15.9740788Z       %120 = tt.load %119 : tensor<1x128x512x!tt.ptr<bf16>, #blocked>
2026-02-21T11:21:15.9740995Z       %121 = tt.reshape %120 : tensor<1x128x512xbf16, #blocked> -> tensor<128x512xbf16, #blocked4>
2026-02-21T11:21:15.9741298Z       %122 = ttg.convert_layout %71 : tensor<2x128xbf16, #blocked4> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>>
2026-02-21T11:21:15.9741655Z       %123 = ttg.convert_layout %121 : tensor<128x512xbf16, #blocked4> -> tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>>
2026-02-21T11:21:15.9741968Z       %124 = ttg.convert_layout %cst_9 : tensor<2x512xf32, #blocked4> -> tensor<2x512xf32, #blocked10>
2026-02-21T11:21:15.9742390Z       %125 = tt.dot %122, %123, %124, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<2x512xf32, #blocked10>
2026-02-21T11:21:15.9742814Z       %126 = ttg.convert_layout %125 : tensor<2x512xf32, #blocked10> -> tensor<2x512xf32, #blocked4>
2026-02-21T11:21:15.9743051Z       %127 = tt.reshape %126 : tensor<2x512xf32, #blocked4> -> tensor<1x2x512xf32, #blocked>
2026-02-21T11:21:15.9743310Z       %128 = arith.truncf %127 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked>
2026-02-21T11:21:15.9743543Z       %129 = arith.extf %128 : tensor<1x2x512xbf16, #blocked> to tensor<1x2x512xf32, #blocked>
2026-02-21T11:21:15.9743730Z       %130 = "tt.reduce"(%129) <{axis = 2 : i32}> ({
2026-02-21T11:21:15.9743858Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T11:21:15.9743980Z         %183 = arith.maxnumf %arg8, %arg9 : f32
2026-02-21T11:21:15.9744125Z         tt.reduce.return %183 : f32
2026-02-21T11:21:15.9744312Z       }) : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T11:21:15.9744604Z       %131 = ttg.convert_layout %130 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9744873Z       %132 = arith.truncf %131 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3>
2026-02-21T11:21:15.9745097Z       %133 = arith.extf %132 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9745314Z       %134 = arith.mulf %133, %cst_8 : tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9745504Z       %135 = arith.truncf %134 : tensor<1x2xf32, #blocked3> to tensor<1x2xbf16, #blocked3>
2026-02-21T11:21:15.9745722Z       %136 = arith.extf %135 : tensor<1x2xbf16, #blocked3> to tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9745913Z       %137 = arith.cmpf ogt, %arg5, %136 : tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9746090Z       %138 = arith.cmpf une, %arg5, %arg5 : tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9746255Z       %139 = arith.ori %137, %138 : tensor<1x2xi1, #blocked3>
2026-02-21T11:21:15.9746472Z       %140 = arith.select %139, %arg5, %136 : tensor<1x2xi1, #blocked3>, tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9746680Z       %141 = arith.mulf %129, %cst_7 : tensor<1x2x512xf32, #blocked>
2026-02-21T11:21:15.9746880Z       %142 = arith.truncf %141 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked>
2026-02-21T11:21:15.9747167Z       %143 = ttg.convert_layout %140 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T11:21:15.9747504Z       %144 = tt.expand_dims %143 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T11:21:15.9747818Z       %145 = ttg.convert_layout %144 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T11:21:15.9748062Z       %146 = arith.extf %142 : tensor<1x2x512xbf16, #blocked> to tensor<1x2x512xf32, #blocked>
2026-02-21T11:21:15.9748295Z       %147 = tt.broadcast %145 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x512xf32, #blocked1>
2026-02-21T11:21:15.9748538Z       %148 = ttg.convert_layout %147 : tensor<1x2x512xf32, #blocked1> -> tensor<1x2x512xf32, #blocked>
2026-02-21T11:21:15.9748749Z       %149 = arith.subf %146, %148 : tensor<1x2x512xf32, #blocked>
2026-02-21T11:21:15.9749051Z       %150 = tt.extern_elementwise %149 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2x512xf32, #blocked>
2026-02-21T11:21:15.9749348Z       %151 = "tt.reduce"(%150) <{axis = 2 : i32}> ({
2026-02-21T11:21:15.9749472Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T11:21:15.9749590Z         %183 = arith.addf %arg8, %arg9 : f32
2026-02-21T11:21:15.9749708Z         tt.reduce.return %183 : f32
2026-02-21T11:21:15.9749889Z       }) : (tensor<1x2x512xf32, #blocked>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>>
2026-02-21T11:21:15.9750178Z       %152 = ttg.convert_layout %151 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked}>> -> tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9750420Z       %153 = arith.subf %arg5, %140 : tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9750706Z       %154 = tt.extern_elementwise %153 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked3>) -> tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9751006Z       %155 = arith.mulf %arg6, %154 : tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9751164Z       %156 = arith.addf %155, %152 : tensor<1x2xf32, #blocked3>
2026-02-21T11:21:15.9751403Z       %157 = ttg.convert_layout %154 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T11:21:15.9751734Z       %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T11:21:15.9752047Z       %159 = ttg.convert_layout %158 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T11:21:15.9752285Z       %160 = tt.broadcast %159 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1>
2026-02-21T11:21:15.9752527Z       %161 = ttg.convert_layout %160 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked>
2026-02-21T11:21:15.9752738Z       %162 = arith.mulf %arg7, %161 : tensor<1x2x128xf32, #blocked>
2026-02-21T11:21:15.9752984Z       %163 = ttg.convert_layout %112 : tensor<1x512xi32, #blocked4> -> tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T11:21:15.9753326Z       %164 = tt.expand_dims %163 {axis = 2 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x512x1xi32, #blocked9>
2026-02-21T11:21:15.9753629Z       %165 = ttg.convert_layout %164 : tensor<1x512x1xi32, #blocked9> -> tensor<1x512x1xi32, #blocked2>
2026-02-21T11:21:15.9753847Z       %166 = arith.muli %165, %cst_6 : tensor<1x512x1xi32, #blocked2>
2026-02-21T11:21:15.9754016Z       %167 = arith.addi %72, %166 : tensor<1x512x1xi32, #blocked2>
2026-02-21T11:21:15.9754230Z       %168 = tt.broadcast %167 : tensor<1x512x1xi32, #blocked2> -> tensor<1x512x128xi32, #blocked2>
2026-02-21T11:21:15.9754488Z       %169 = ttg.convert_layout %168 : tensor<1x512x128xi32, #blocked2> -> tensor<1x512x128xi32, #blocked>
2026-02-21T11:21:15.9754702Z       %170 = arith.addi %169, %76 : tensor<1x512x128xi32, #blocked>
2026-02-21T11:21:15.9754917Z       %171 = tt.addptr %77, %170 : tensor<1x512x128x!tt.ptr<bf16>, #blocked>, tensor<1x512x128xi32, #blocked>
2026-02-21T11:21:15.9755133Z       %172 = tt.load %171 : tensor<1x512x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:21:15.9755338Z       %173 = arith.truncf %150 : tensor<1x2x512xf32, #blocked> to tensor<1x2x512xbf16, #blocked>
2026-02-21T11:21:15.9755592Z       %174 = tt.reshape %162 : tensor<1x2x128xf32, #blocked> -> tensor<2x128xf32, #blocked4>
2026-02-21T11:21:15.9755819Z       %175 = tt.reshape %173 : tensor<1x2x512xbf16, #blocked> -> tensor<2x512xbf16, #blocked4>
2026-02-21T11:21:15.9756056Z       %176 = tt.reshape %172 : tensor<1x512x128xbf16, #blocked> -> tensor<512x128xbf16, #blocked4>
2026-02-21T11:21:15.9756355Z       %177 = ttg.convert_layout %175 : tensor<2x512xbf16, #blocked4> -> tensor<2x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked11}>>
2026-02-21T11:21:15.9756709Z       %178 = ttg.convert_layout %176 : tensor<512x128xbf16, #blocked4> -> tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked11}>>
2026-02-21T11:21:15.9757015Z       %179 = ttg.convert_layout %174 : tensor<2x128xf32, #blocked4> -> tensor<2x128xf32, #blocked11>
2026-02-21T11:21:15.9757425Z       %180 = tt.dot %177, %178, %179, inputPrecision = tf32 : tensor<2x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked11}>> * tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked11}>> -> tensor<2x128xf32, #blocked11>
2026-02-21T11:21:15.9757833Z       %181 = ttg.convert_layout %180 : tensor<2x128xf32, #blocked11> -> tensor<2x128xf32, #blocked4>
2026-02-21T11:21:15.9758071Z       %182 = tt.reshape %181 : tensor<2x128xf32, #blocked4> -> tensor<1x2x128xf32, #blocked>
2026-02-21T11:21:15.9758335Z       scf.yield %140, %156, %182 : tensor<1x2xf32, #blocked3>, tensor<1x2xf32, #blocked3>, tensor<1x2x128xf32, #blocked>
2026-02-21T11:21:15.9758587Z     } {tt.flatten, tt.loop_unroll_factor = 1 : i32, tt.num_stages = 4 : i32}
2026-02-21T11:21:15.9758853Z     %79 = ttg.convert_layout %78#1 : tensor<1x2xf32, #blocked3> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T11:21:15.9759197Z     %80 = tt.expand_dims %79 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xf32, #blocked7>
2026-02-21T11:21:15.9759489Z     %81 = ttg.convert_layout %80 : tensor<1x2x1xf32, #blocked7> -> tensor<1x2x1xf32, #blocked1>
2026-02-21T11:21:15.9759723Z     %82 = tt.broadcast %81 : tensor<1x2x1xf32, #blocked1> -> tensor<1x2x128xf32, #blocked1>
2026-02-21T11:21:15.9759962Z     %83 = ttg.convert_layout %82 : tensor<1x2x128xf32, #blocked1> -> tensor<1x2x128xf32, #blocked>
2026-02-21T11:21:15.9760186Z     %84 = arith.divf %78#2, %83 : tensor<1x2x128xf32, #blocked>
2026-02-21T11:21:15.9760378Z     %85 = arith.truncf %84 : tensor<1x2x128xf32, #blocked> to tensor<1x2x128xbf16, #blocked>
2026-02-21T11:21:15.9760558Z     %86 = arith.muli %8, %c524288_i32 : i32
2026-02-21T11:21:15.9760770Z     %87 = ttg.convert_layout %12 : tensor<2xi32, #blocked5> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:21:15.9761083Z     %88 = tt.expand_dims %87 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x2xi32, #blocked6>
2026-02-21T11:21:15.9761364Z     %89 = ttg.convert_layout %88 : tensor<1x2xi32, #blocked6> -> tensor<1x2xi32, #blocked3>
2026-02-21T11:21:15.9761636Z     %90 = ttg.convert_layout %89 : tensor<1x2xi32, #blocked3> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T11:21:15.9761960Z     %91 = tt.expand_dims %90 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2x1xi32, #blocked7>
2026-02-21T11:21:15.9762266Z     %92 = ttg.convert_layout %91 : tensor<1x2x1xi32, #blocked7> -> tensor<1x2x1xi32, #blocked1>
2026-02-21T11:21:15.9762469Z     %93 = arith.muli %92, %cst_5 : tensor<1x2x1xi32, #blocked1>
2026-02-21T11:21:15.9762668Z     %94 = tt.splat %86 : i32 -> tensor<1x2x1xi32, #blocked1>
2026-02-21T11:21:15.9762818Z     %95 = arith.addi %94, %93 : tensor<1x2x1xi32, #blocked1>
2026-02-21T11:21:15.9763052Z     %96 = ttg.convert_layout %13 : tensor<128xi32, #blocked5> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>>
2026-02-21T11:21:15.9763370Z     %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked6}>> -> tensor<1x128xi32, #blocked6>
2026-02-21T11:21:15.9763675Z     %98 = ttg.convert_layout %97 : tensor<1x128xi32, #blocked6> -> tensor<1x128xi32, #blocked4>
2026-02-21T11:21:15.9763959Z     %99 = ttg.convert_layout %98 : tensor<1x128xi32, #blocked4> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>>
2026-02-21T11:21:15.9764295Z     %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked8}>> -> tensor<1x1x128xi32, #blocked8>
2026-02-21T11:21:15.9764595Z     %101 = ttg.convert_layout %100 : tensor<1x1x128xi32, #blocked8> -> tensor<1x1x128xi32, #blocked>
2026-02-21T11:21:15.9764836Z     %102 = tt.broadcast %95 : tensor<1x2x1xi32, #blocked1> -> tensor<1x2x128xi32, #blocked1>
2026-02-21T11:21:15.9765079Z     %103 = ttg.convert_layout %102 : tensor<1x2x128xi32, #blocked1> -> tensor<1x2x128xi32, #blocked>
2026-02-21T11:21:15.9765323Z     %104 = tt.broadcast %101 : tensor<1x1x128xi32, #blocked> -> tensor<1x2x128xi32, #blocked>
2026-02-21T11:21:15.9765519Z     %105 = arith.addi %103, %104 : tensor<1x2x128xi32, #blocked>
2026-02-21T11:21:15.9765703Z     %106 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:21:15.9765936Z     %107 = tt.addptr %106, %105 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>, tensor<1x2x128xi32, #blocked>
2026-02-21T11:21:15.9766150Z     tt.store %107, %85 : tensor<1x2x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:21:15.9766282Z     tt.return
2026-02-21T11:21:15.9766366Z   }
2026-02-21T11:21:15.9766441Z }
2026-02-21T11:21:15.9766483Z 
2026-02-21T11:21:15.9766513Z {-#
2026-02-21T11:21:15.9766597Z   external_resources: {
2026-02-21T11:21:15.9766695Z     mlir_reproducer: {
2026-02-21T11:21:15.9768958Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T11:21:15.9782368Z       disable_threading: false,
2026-02-21T11:21:15.9782479Z       verify_each: true
2026-02-21T11:21:15.9782568Z     }
2026-02-21T11:21:15.9782644Z   }
2026-02-21T11:21:15.9782716Z #-}
2026-02-21T11:21:15.9782990Z /tmp/torchinductor_root/qe/cqe554l7qr33o6wxmnihp3prhkkxheyptmy4cme6pxp6ul4zcf3u.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T11:21:15.9783719Z /tmp/torchinductor_root/qe/cqe554l7qr33o6wxmnihp3prhkkxheyptmy4cme6pxp6ul4zcf3u.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T11:21:15.9784272Z [300s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T11:21:15.9785024Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 512], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T11:21:15.9785688Z Error: RuntimeError: PassManager::run failed
2026-02-21T11:21:15.9785855Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T11:24:02.1995452Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 1.1 configs/s
2026-02-21T11:24:02.2005745Z [466s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s])
2026-02-21T11:24:02.2377803Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 - configs/s
2026-02-21T11:24:02.9888146Z [467s] Initial random population of 100, 5 starting points: 
2026-02-21T11:24:02.9888585Z error=23
2026-02-21T11:24:02.9888806Z timeout=19
2026-02-21T11:24:02.9889005Z ok=58
2026-02-21T11:24:02.9889204Z min=8.1410
2026-02-21T11:24:02.9889403Z mid=59.0059
2026-02-21T11:24:02.9889632Z max=12096.0410
2026-02-21T11:24:02.9889878Z best={'block_sizes': [1, 256, 16],
2026-02-21T11:24:02.9890278Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T11:24:02.9890696Z  'l2_groupings': [8],
2026-02-21T11:24:02.9890966Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:24:02.9891306Z  'loop_orders': [[0, 1]],
2026-02-21T11:24:02.9891589Z  'matrix_instr_nonkdim': 16,
2026-02-21T11:24:02.9891899Z  'num_sm_multiplier': 16,
2026-02-21T11:24:02.9892151Z  'num_stages': 4,
2026-02-21T11:24:02.9892379Z  'num_warps': 16,
2026-02-21T11:24:02.9892625Z  'pid_type': 'persistent_blocked',
2026-02-21T11:24:02.9893210Z  'range_flattens': [False, True],
2026-02-21T11:24:02.9893514Z  'range_multi_buffers': [None, False],
2026-02-21T11:24:02.9893821Z  'range_num_stages': [1, 3],
2026-02-21T11:24:02.9894098Z  'range_unroll_factors': [2, 3],
2026-02-21T11:24:02.9894393Z  'range_warp_specializes': [],
2026-02-21T11:24:02.9894631Z  'waves_per_eu': 1}
2026-02-21T11:24:02.9903256Z [467s] Fitting surrogate: 100 points, 100 targets
2026-02-21T11:24:03.8537636Z [468s] Generation 1 starting: 84 neighbors, 5 active search path(s)
2026-02-21T11:24:38.8742162Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 0.5 configs/s
2026-02-21T11:24:59.1466955Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 86/86 4.3 configs/s
2026-02-21T11:24:59.3677996Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 27/27 - configs/s
2026-02-21T11:25:04.5778644Z [529s] Generation 1 complete: 
2026-02-21T11:25:04.5781669Z ok=89
2026-02-21T11:25:04.5781982Z min=7.1705
2026-02-21T11:25:04.5782217Z mid=14.2522
2026-02-21T11:25:04.5782371Z max=192.5479
2026-02-21T11:25:04.5782588Z best={'block_sizes': [1, 256, 32],
2026-02-21T11:25:04.5782925Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T11:25:04.5783240Z  'l2_groupings': [8],
2026-02-21T11:25:04.5783453Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:25:04.5783694Z  'loop_orders': [[0, 1]],
2026-02-21T11:25:04.5783918Z  'matrix_instr_nonkdim': 16,
2026-02-21T11:25:04.5784129Z  'num_sm_multiplier': 8,
2026-02-21T11:25:04.5784341Z  'num_stages': 4,
2026-02-21T11:25:04.5784516Z  'num_warps': 8,
2026-02-21T11:25:04.5784714Z  'pid_type': 'persistent_blocked',
2026-02-21T11:25:04.5784945Z  'range_flattens': [None, True],
2026-02-21T11:25:04.5785548Z  'range_multi_buffers': [None, False],
2026-02-21T11:25:04.5785785Z  'range_num_stages': [1, 3],
2026-02-21T11:25:04.5785993Z  'range_unroll_factors': [2, 3],
2026-02-21T11:25:04.5786215Z  'range_warp_specializes': [],
2026-02-21T11:25:04.5786419Z  'waves_per_eu': 1}
2026-02-21T11:25:04.5801384Z [529s] Fitting surrogate: 189 points, 189 targets
2026-02-21T11:25:05.5203144Z [530s] Generation 2 starting: 88 neighbors, 5 active search path(s)
2026-02-21T11:25:39.6176177Z [564s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[1, 3], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:25:43.6066609Z [568s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:25:43.9458873Z [568s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[1, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:25:44.6402464Z [569s] Timeout after 30s compiling Config(block_sizes=[1, 256, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:25:44.6426611Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 0.8 configs/s
2026-02-21T11:26:01.8498940Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 91/91 5.3 configs/s
2026-02-21T11:26:02.1636623Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 29/29 - configs/s
2026-02-21T11:26:09.9604388Z [594s] Generation 2 complete: 
2026-02-21T11:26:09.9605454Z error=3
2026-02-21T11:26:09.9605658Z timeout=4
2026-02-21T11:26:09.9605861Z ok=86
2026-02-21T11:26:09.9606049Z min=6.7719
2026-02-21T11:26:09.9606256Z mid=12.1696
2026-02-21T11:26:09.9606451Z max=75.6151
2026-02-21T11:26:09.9606836Z best={'block_sizes': [1, 128, 16],
2026-02-21T11:26:09.9607240Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T11:26:09.9607638Z  'l2_groupings': [64],
2026-02-21T11:26:09.9607910Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:26:09.9608236Z  'loop_orders': [[0, 1]],
2026-02-21T11:26:09.9608501Z  'matrix_instr_nonkdim': 16,
2026-02-21T11:26:09.9608757Z  'num_stages': 4,
2026-02-21T11:26:09.9608983Z  'num_warps': 4,
2026-02-21T11:26:09.9609213Z  'pid_type': 'flat',
2026-02-21T11:26:09.9609466Z  'range_flattens': [None, None],
2026-02-21T11:26:09.9609741Z  'range_multi_buffers': [None, True],
2026-02-21T11:26:09.9610052Z  'range_num_stages': [0, 4],
2026-02-21T11:26:09.9610315Z  'range_unroll_factors': [0, 4],
2026-02-21T11:26:09.9610600Z  'range_warp_specializes': [],
2026-02-21T11:26:09.9610870Z  'waves_per_eu': 2}
2026-02-21T11:26:09.9632626Z [594s] Fitting surrogate: 282 points, 282 targets
2026-02-21T11:26:11.5116247Z [596s] Generation 3 starting: 80 neighbors, 5 active search path(s)
2026-02-21T11:26:48.0444715Z [632s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:26:48.2670704Z [632s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:26:49.6951747Z [634s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:26:50.0880143Z [634s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:26:50.0901107Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 0.9 configs/s
2026-02-21T11:27:10.5125322Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 81/81 3.9 configs/s
2026-02-21T11:27:10.8083048Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 30/30 - configs/s
2026-02-21T11:27:18.5333234Z [663s] Generation 3 complete: 
2026-02-21T11:27:18.5336006Z error=6
2026-02-21T11:27:18.5336323Z timeout=4
2026-02-21T11:27:18.5336545Z ok=75
2026-02-21T11:27:18.5336986Z min=6.6709
2026-02-21T11:27:18.5337232Z mid=13.1824
2026-02-21T11:27:18.5337460Z max=163.7117
2026-02-21T11:27:18.5340431Z best={'block_sizes': [1, 256, 32],
2026-02-21T11:27:18.5340954Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T11:27:18.5341368Z  'l2_groupings': [16],
2026-02-21T11:27:18.5342201Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:27:18.5342523Z  'loop_orders': [[0, 1]],
2026-02-21T11:27:18.5342812Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:27:18.5343245Z  'num_sm_multiplier': 8,
2026-02-21T11:27:18.5343515Z  'num_stages': 4,
2026-02-21T11:27:18.5343747Z  'num_warps': 8,
2026-02-21T11:27:18.5344020Z  'pid_type': 'persistent_blocked',
2026-02-21T11:27:18.5344333Z  'range_flattens': [None, True],
2026-02-21T11:27:18.5344590Z  'range_multi_buffers': [None, False],
2026-02-21T11:27:18.5344837Z  'range_num_stages': [1, 3],
2026-02-21T11:27:18.5345057Z  'range_unroll_factors': [3, 3],
2026-02-21T11:27:18.5345291Z  'range_warp_specializes': [],
2026-02-21T11:27:18.5345505Z  'waves_per_eu': 1}
2026-02-21T11:27:18.5360898Z [663s] Fitting surrogate: 367 points, 367 targets
2026-02-21T11:27:19.3621391Z [663s] Generation 4 starting: 72 neighbors, 5 active search path(s)
2026-02-21T11:27:53.2125402Z [697s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:27:55.6738950Z [700s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:27:55.8807216Z [700s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:27:57.2050761Z [701s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:27:58.3562673Z [702s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[1, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:27:58.3587239Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.5 configs/s
2026-02-21T11:28:09.2170982Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 73/73 6.8 configs/s
2026-02-21T11:28:09.5326169Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 33/33 - configs/s
2026-02-21T11:28:18.3384737Z [722s] Generation 4 complete: 
2026-02-21T11:28:18.3385058Z error=2
2026-02-21T11:28:18.3385233Z timeout=5
2026-02-21T11:28:18.3385396Z ok=70
2026-02-21T11:28:18.3385573Z min=6.2016
2026-02-21T11:28:18.3385766Z mid=8.3862
2026-02-21T11:28:18.3385926Z max=75.5474
2026-02-21T11:28:18.3386127Z best={'block_sizes': [1, 256, 32],
2026-02-21T11:28:18.3386501Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T11:28:18.3386855Z  'l2_groupings': [16],
2026-02-21T11:28:18.3387079Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:28:18.3387630Z  'loop_orders': [[0, 1]],
2026-02-21T11:28:18.3387859Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:28:18.3388101Z  'num_sm_multiplier': 8,
2026-02-21T11:28:18.3388315Z  'num_stages': 4,
2026-02-21T11:28:18.3388509Z  'num_warps': 8,
2026-02-21T11:28:18.3388735Z  'pid_type': 'persistent_interleaved',
2026-02-21T11:28:18.3389011Z  'range_flattens': [None, True],
2026-02-21T11:28:18.3389263Z  'range_multi_buffers': [None, False],
2026-02-21T11:28:18.3389522Z  'range_num_stages': [1, 3],
2026-02-21T11:28:18.3389751Z  'range_unroll_factors': [3, 3],
2026-02-21T11:28:18.3389997Z  'range_warp_specializes': [],
2026-02-21T11:28:18.3390372Z  'waves_per_eu': 1}
2026-02-21T11:28:18.3418192Z [722s] Fitting surrogate: 444 points, 444 targets
2026-02-21T11:28:19.1005389Z [723s] Generation 5 starting: 67 neighbors, 4 active search path(s)
2026-02-21T11:28:54.3124062Z [758s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:28:56.0005995Z [760s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:28:56.2370889Z [760s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:28:56.2390292Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 0.7 configs/s
2026-02-21T11:29:07.0291593Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 68/68 6.3 configs/s
2026-02-21T11:29:07.2975880Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s
2026-02-21T11:29:14.9157794Z [779s] Generation 5 complete: 
2026-02-21T11:29:14.9158109Z error=7
2026-02-21T11:29:14.9160380Z timeout=3
2026-02-21T11:29:14.9160906Z ok=61
2026-02-21T11:29:14.9161047Z min=5.8681
2026-02-21T11:29:14.9161196Z mid=8.4334
2026-02-21T11:29:14.9161339Z max=107.0934
2026-02-21T11:29:14.9161519Z best={'block_sizes': [1, 256, 64],
2026-02-21T11:29:14.9161834Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T11:29:14.9162136Z  'l2_groupings': [16],
2026-02-21T11:29:14.9162330Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:29:14.9162559Z  'loop_orders': [[0, 1]],
2026-02-21T11:29:14.9162927Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:29:14.9163131Z  'num_sm_multiplier': 8,
2026-02-21T11:29:14.9163318Z  'num_stages': 4,
2026-02-21T11:29:14.9163481Z  'num_warps': 8,
2026-02-21T11:29:14.9163667Z  'pid_type': 'persistent_interleaved',
2026-02-21T11:29:14.9163900Z  'range_flattens': [None, None],
2026-02-21T11:29:14.9164122Z  'range_multi_buffers': [None, False],
2026-02-21T11:29:14.9164346Z  'range_num_stages': [1, 2],
2026-02-21T11:29:14.9164548Z  'range_unroll_factors': [3, 3],
2026-02-21T11:29:14.9164761Z  'range_warp_specializes': [],
2026-02-21T11:29:14.9164967Z  'waves_per_eu': 1}
2026-02-21T11:29:14.9189807Z [779s] Fitting surrogate: 515 points, 515 targets
2026-02-21T11:29:15.5914235Z [780s] Generation 6 starting: 66 neighbors, 4 active search path(s)
2026-02-21T11:29:50.0773860Z [814s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:29:51.1574239Z [815s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:29:51.1592585Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 0.6 configs/s
2026-02-21T11:30:03.5275135Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 67/67 5.4 configs/s
2026-02-21T11:30:03.7005072Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s
2026-02-21T11:30:08.5285647Z [833s] Generation 6 complete: 
2026-02-21T11:30:08.5285966Z error=4
2026-02-21T11:30:08.5286127Z timeout=2
2026-02-21T11:30:08.5286281Z ok=64
2026-02-21T11:30:08.5286431Z min=5.8744
2026-02-21T11:30:08.5286589Z mid=15.0732
2026-02-21T11:30:08.5286747Z max=136.5611
2026-02-21T11:30:08.5286939Z best={'block_sizes': [1, 256, 64],
2026-02-21T11:30:08.5287285Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T11:30:08.5287608Z  'l2_groupings': [16],
2026-02-21T11:30:08.5287846Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:30:08.5288088Z  'loop_orders': [[0, 1]],
2026-02-21T11:30:08.5288306Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:30:08.5288522Z  'num_sm_multiplier': 8,
2026-02-21T11:30:08.5288759Z  'num_stages': 4,
2026-02-21T11:30:08.5288938Z  'num_warps': 8,
2026-02-21T11:30:08.5289142Z  'pid_type': 'persistent_interleaved',
2026-02-21T11:30:08.5289385Z  'range_flattens': [None, None],
2026-02-21T11:30:08.5289620Z  'range_multi_buffers': [False, False],
2026-02-21T11:30:08.5290142Z  'range_num_stages': [1, 2],
2026-02-21T11:30:08.5290359Z  'range_unroll_factors': [3, 3],
2026-02-21T11:30:08.5290592Z  'range_warp_specializes': [],
2026-02-21T11:30:08.5290804Z  'waves_per_eu': 2}
2026-02-21T11:30:08.5314865Z [833s] Fitting surrogate: 585 points, 585 targets
2026-02-21T11:30:09.1997517Z [833s] Generation 7 starting: 65 neighbors, 4 active search path(s)
2026-02-21T11:30:42.1807940Z [866s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[1, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:30:47.8761511Z [872s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[3, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:30:48.3139412Z [872s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:30:48.3161349Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 0.6 configs/s
2026-02-21T11:31:02.7862714Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 67/67 4.6 configs/s
2026-02-21T11:31:02.9874931Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s
2026-02-21T11:31:08.6643822Z [893s] Generation 7 complete: 
2026-02-21T11:31:08.6644258Z error=3
2026-02-21T11:31:08.6644478Z timeout=3
2026-02-21T11:31:08.6644685Z ok=63
2026-02-21T11:31:08.6645419Z min=5.8937
2026-02-21T11:31:08.6645628Z mid=13.1874
2026-02-21T11:31:08.6645838Z max=110.0586
2026-02-21T11:31:08.6646101Z best={'block_sizes': [1, 256, 64],
2026-02-21T11:31:08.6646632Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T11:31:08.6647094Z  'l2_groupings': [16],
2026-02-21T11:31:08.6647391Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:31:08.6647736Z  'loop_orders': [[0, 1]],
2026-02-21T11:31:08.6648013Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:31:08.6648312Z  'num_sm_multiplier': 8,
2026-02-21T11:31:08.6648546Z  'num_stages': 4,
2026-02-21T11:31:08.6648748Z  'num_warps': 8,
2026-02-21T11:31:08.6648965Z  'pid_type': 'persistent_interleaved',
2026-02-21T11:31:08.6649227Z  'range_flattens': [None, None],
2026-02-21T11:31:08.6649478Z  'range_multi_buffers': [False, False],
2026-02-21T11:31:08.6649733Z  'range_num_stages': [1, 1],
2026-02-21T11:31:08.6649971Z  'range_unroll_factors': [3, 3],
2026-02-21T11:31:08.6650214Z  'range_warp_specializes': [],
2026-02-21T11:31:08.6650454Z  'waves_per_eu': 2}
2026-02-21T11:31:08.6677397Z [893s] Fitting surrogate: 654 points, 654 targets
2026-02-21T11:31:09.3244715Z [893s] Generation 8 starting: 62 neighbors, 4 active search path(s)
2026-02-21T11:31:44.2029485Z [928s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:31:45.2382135Z [929s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 32], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:31:45.6986877Z [930s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:31:46.1649944Z [930s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:31:47.3389427Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64/64 0.6 configs/s
2026-02-21T11:31:59.1618274Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 64/64 5.5 configs/s
2026-02-21T11:31:59.3424833Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s
2026-02-21T11:32:04.4088518Z [948s] Generation 8 complete: 
2026-02-21T11:32:04.4088945Z error=5
2026-02-21T11:32:04.4089222Z timeout=4
2026-02-21T11:32:04.4089433Z ok=57
2026-02-21T11:32:04.4089633Z min=5.8528
2026-02-21T11:32:04.4089840Z mid=12.1601
2026-02-21T11:32:04.4090040Z max=87.8726
2026-02-21T11:32:04.4090276Z best={'block_sizes': [1, 256, 64],
2026-02-21T11:32:04.4090697Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T11:32:04.4091110Z  'l2_groupings': [16],
2026-02-21T11:32:04.4091903Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:32:04.4092236Z  'loop_orders': [[0, 1]],
2026-02-21T11:32:04.4092540Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:32:04.4092823Z  'num_sm_multiplier': 8,
2026-02-21T11:32:04.4093083Z  'num_stages': 4,
2026-02-21T11:32:04.4093316Z  'num_warps': 8,
2026-02-21T11:32:04.4093624Z  'pid_type': 'persistent_interleaved',
2026-02-21T11:32:04.4093905Z  'range_flattens': [False, None],
2026-02-21T11:32:04.4094243Z  'range_multi_buffers': [False, False],
2026-02-21T11:32:04.4094675Z  'range_num_stages': [1, 1],
2026-02-21T11:32:04.4095058Z  'range_unroll_factors': [3, 3],
2026-02-21T11:32:04.4095481Z  'range_warp_specializes': [],
2026-02-21T11:32:04.4095848Z  'waves_per_eu': 2}
2026-02-21T11:32:04.4119471Z [948s] Fitting surrogate: 720 points, 720 targets
2026-02-21T11:32:05.1049211Z [949s] Generation 9 starting: 66 neighbors, 4 active search path(s)
2026-02-21T11:32:39.3461156Z [983s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 32], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:32:39.8180733Z [984s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:32:40.0957633Z [984s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:32:41.3959661Z [985s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:32:41.3979465Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 0.5 configs/s
2026-02-21T11:32:55.5907172Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 68/68 4.8 configs/s
2026-02-21T11:32:55.7418554Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s
2026-02-21T11:32:59.9719309Z [1004s] Generation 9 complete: 
2026-02-21T11:32:59.9719664Z error=3
2026-02-21T11:32:59.9719999Z timeout=4
2026-02-21T11:32:59.9720341Z ok=63
2026-02-21T11:32:59.9721016Z min=5.8623
2026-02-21T11:32:59.9721300Z mid=14.7878
2026-02-21T11:32:59.9721523Z max=171.9621
2026-02-21T11:32:59.9721833Z best={'block_sizes': [1, 256, 64],
2026-02-21T11:32:59.9722263Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T11:32:59.9722774Z  'l2_groupings': [16],
2026-02-21T11:32:59.9723034Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:32:59.9723326Z  'loop_orders': [[0, 1]],
2026-02-21T11:32:59.9723582Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:32:59.9723837Z  'num_sm_multiplier': 8,
2026-02-21T11:32:59.9724078Z  'num_stages': 4,
2026-02-21T11:32:59.9724288Z  'num_warps': 8,
2026-02-21T11:32:59.9724699Z  'pid_type': 'persistent_interleaved',
2026-02-21T11:32:59.9724994Z  'range_flattens': [False, None],
2026-02-21T11:32:59.9725290Z  'range_multi_buffers': [False, False],
2026-02-21T11:32:59.9725586Z  'range_num_stages': [1, 1],
2026-02-21T11:32:59.9725841Z  'range_unroll_factors': [3, 3],
2026-02-21T11:32:59.9726121Z  'range_warp_specializes': [],
2026-02-21T11:32:59.9726365Z  'waves_per_eu': 2}
2026-02-21T11:32:59.9748140Z [1004s] Fitting surrogate: 790 points, 790 targets
2026-02-21T11:33:00.7117366Z [1005s] Generation 10 starting: 73 neighbors, 4 active search path(s)
2026-02-21T11:33:33.9087155Z [1038s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:33:34.1868315Z [1038s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[1, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:33:34.1887764Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 1.0 configs/s
2026-02-21T11:33:47.0945702Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 5.8 configs/s
2026-02-21T11:33:47.3314895Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s
2026-02-21T11:33:54.0437617Z [1058s] Generation 10 complete: 
2026-02-21T11:33:54.0438047Z error=1
2026-02-21T11:33:54.0438333Z timeout=2
2026-02-21T11:33:54.0438538Z ok=74
2026-02-21T11:33:54.0438740Z min=5.7939
2026-02-21T11:33:54.0438946Z mid=10.3790
2026-02-21T11:33:54.0439771Z max=65.0460
2026-02-21T11:33:54.0440047Z best={'block_sizes': [1, 256, 64],
2026-02-21T11:33:54.0440462Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T11:33:54.0441004Z  'l2_groupings': [16],
2026-02-21T11:33:54.0441283Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:33:54.0441601Z  'loop_orders': [[0, 1]],
2026-02-21T11:33:54.0441874Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:33:54.0442153Z  'num_sm_multiplier': 8,
2026-02-21T11:33:54.0442435Z  'num_stages': 4,
2026-02-21T11:33:54.0442794Z  'num_warps': 8,
2026-02-21T11:33:54.0443053Z  'pid_type': 'persistent_interleaved',
2026-02-21T11:33:54.0443379Z  'range_flattens': [False, None],
2026-02-21T11:33:54.0443684Z  'range_multi_buffers': [None, True],
2026-02-21T11:33:54.0443988Z  'range_num_stages': [1, 1],
2026-02-21T11:33:54.0444273Z  'range_unroll_factors': [3, 3],
2026-02-21T11:33:54.0444574Z  'range_warp_specializes': [],
2026-02-21T11:33:54.0444851Z  'waves_per_eu': 2}
2026-02-21T11:33:54.0473073Z [1058s] Fitting surrogate: 867 points, 867 targets
2026-02-21T11:33:54.8541203Z [1059s] Generation 11 starting: 73 neighbors, 4 active search path(s)
2026-02-21T11:34:29.4836223Z [1093s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[1, 1], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:34:32.0071633Z [1096s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:34:33.0563717Z [1097s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:34:34.4338718Z [1098s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:34:34.4360959Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 0.5 configs/s
2026-02-21T11:34:45.4709007Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 6.7 configs/s
2026-02-21T11:34:45.7447264Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s
2026-02-21T11:34:53.6450982Z [1118s] Generation 11 complete: 
2026-02-21T11:34:53.6451648Z error=3
2026-02-21T11:34:53.6451771Z timeout=4
2026-02-21T11:34:53.6452898Z ok=70
2026-02-21T11:34:53.6453955Z min=5.7165
2026-02-21T11:34:53.6454113Z mid=7.4070
2026-02-21T11:34:53.6454199Z max=59.7397
2026-02-21T11:34:53.6454307Z best={'block_sizes': [1, 256, 64],
2026-02-21T11:34:53.6454513Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T11:34:53.6454713Z  'l2_groupings': [16],
2026-02-21T11:34:53.6454819Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:34:53.6454974Z  'loop_orders': [[0, 1]],
2026-02-21T11:34:53.6455078Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:34:53.6455186Z  'num_sm_multiplier': 8,
2026-02-21T11:34:53.6455283Z  'num_stages': 4,
2026-02-21T11:34:53.6455604Z  'num_warps': 8,
2026-02-21T11:34:53.6455708Z  'pid_type': 'persistent_interleaved',
2026-02-21T11:34:53.6455830Z  'range_flattens': [False, None],
2026-02-21T11:34:53.6455948Z  'range_multi_buffers': [None, True],
2026-02-21T11:34:53.6456062Z  'range_num_stages': [2, 1],
2026-02-21T11:34:53.6456184Z  'range_unroll_factors': [3, 2],
2026-02-21T11:34:53.6456331Z  'range_warp_specializes': [],
2026-02-21T11:34:53.6456435Z  'waves_per_eu': 2}
2026-02-21T11:34:53.6495288Z [1118s] Fitting surrogate: 944 points, 944 targets
2026-02-21T11:34:55.2004244Z [1119s] Generation 12 starting: 71 neighbors, 4 active search path(s)
2026-02-21T11:35:36.6685586Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.5 configs/s
2026-02-21T11:35:49.8047757Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 5.5 configs/s
2026-02-21T11:35:50.1326124Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s
2026-02-21T11:35:59.6248198Z [1184s] Generation 12 complete: 
2026-02-21T11:35:59.6248643Z error=4
2026-02-21T11:35:59.6248856Z ok=71
2026-02-21T11:35:59.6249072Z min=5.6851
2026-02-21T11:35:59.6249280Z mid=6.1906
2026-02-21T11:35:59.6249494Z max=139.4411
2026-02-21T11:35:59.6249752Z best={'block_sizes': [1, 256, 64],
2026-02-21T11:35:59.6250201Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T11:35:59.6250612Z  'l2_groupings': [16],
2026-02-21T11:35:59.6250892Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:35:59.6251226Z  'loop_orders': [[0, 1]],
2026-02-21T11:35:59.6251512Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:35:59.6251801Z  'num_sm_multiplier': 8,
2026-02-21T11:35:59.6252080Z  'num_stages': 4,
2026-02-21T11:35:59.6252357Z  'num_warps': 8,
2026-02-21T11:35:59.6252629Z  'pid_type': 'persistent_interleaved',
2026-02-21T11:35:59.6252971Z  'range_flattens': [None, None],
2026-02-21T11:35:59.6253281Z  'range_multi_buffers': [None, True],
2026-02-21T11:35:59.6253588Z  'range_num_stages': [2, 1],
2026-02-21T11:35:59.6253885Z  'range_unroll_factors': [3, 2],
2026-02-21T11:35:59.6254175Z  'range_warp_specializes': [],
2026-02-21T11:35:59.6254461Z  'waves_per_eu': 2}
2026-02-21T11:35:59.6292651Z [1184s] Fitting surrogate: 1019 points, 1019 targets
2026-02-21T11:36:00.4012661Z [1184s] Generation 13 starting: 74 neighbors, 4 active search path(s)
2026-02-21T11:36:23.8029101Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 1.7 configs/s
2026-02-21T11:36:34.0408465Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 7.4 configs/s
2026-02-21T11:36:34.3268795Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s
2026-02-21T11:36:42.6111208Z [1227s] Generation 13 complete: 
2026-02-21T11:36:42.6111533Z error=10
2026-02-21T11:36:42.6111683Z ok=68
2026-02-21T11:36:42.6112257Z min=5.7041
2026-02-21T11:36:42.6112404Z mid=7.1150
2026-02-21T11:36:42.6112555Z max=61.7277
2026-02-21T11:36:42.6112740Z best={'block_sizes': [1, 256, 64],
2026-02-21T11:36:42.6113053Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T11:36:42.6113347Z  'l2_groupings': [16],
2026-02-21T11:36:42.6113544Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:36:42.6113777Z  'loop_orders': [[0, 1]],
2026-02-21T11:36:42.6113976Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:36:42.6114306Z  'num_sm_multiplier': 8,
2026-02-21T11:36:42.6114493Z  'num_stages': 4,
2026-02-21T11:36:42.6114663Z  'num_warps': 8,
2026-02-21T11:36:42.6114848Z  'pid_type': 'persistent_interleaved',
2026-02-21T11:36:42.6115084Z  'range_flattens': [None, None],
2026-02-21T11:36:42.6115306Z  'range_multi_buffers': [None, True],
2026-02-21T11:36:42.6115523Z  'range_num_stages': [2, 1],
2026-02-21T11:36:42.6115744Z  'range_unroll_factors': [4, 2],
2026-02-21T11:36:42.6115956Z  'range_warp_specializes': [],
2026-02-21T11:36:42.6116161Z  'waves_per_eu': 2}
2026-02-21T11:36:42.6152155Z [1227s] Fitting surrogate: 1097 points, 1097 targets
2026-02-21T11:36:43.4076660Z [1227s] Generation 14 starting: 77 neighbors, 4 active search path(s)
2026-02-21T11:37:18.3023313Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 1.0 configs/s
2026-02-21T11:37:30.3250760Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 6.5 configs/s
2026-02-21T11:37:30.6301529Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s
2026-02-21T11:37:39.3687261Z [1283s] Generation 14 complete: 
2026-02-21T11:37:39.3687662Z error=2
2026-02-21T11:37:39.3687881Z ok=79
2026-02-21T11:37:39.3688100Z min=5.6552
2026-02-21T11:37:39.3688309Z mid=7.1259
2026-02-21T11:37:39.3688515Z max=68.7424
2026-02-21T11:37:39.3688748Z best={'block_sizes': [1, 256, 64],
2026-02-21T11:37:39.3689594Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T11:37:39.3690004Z  'l2_groupings': [16],
2026-02-21T11:37:39.3690307Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:37:39.3690622Z  'loop_orders': [[0, 1]],
2026-02-21T11:37:39.3690904Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:37:39.3691184Z  'num_sm_multiplier': 8,
2026-02-21T11:37:39.3691463Z  'num_stages': 4,
2026-02-21T11:37:39.3691704Z  'num_warps': 8,
2026-02-21T11:37:39.3691964Z  'pid_type': 'persistent_interleaved',
2026-02-21T11:37:39.3692295Z  'range_flattens': [None, None],
2026-02-21T11:37:39.3692594Z  'range_multi_buffers': [None, True],
2026-02-21T11:37:39.3692916Z  'range_num_stages': [2, 1],
2026-02-21T11:37:39.3693193Z  'range_unroll_factors': [4, 2],
2026-02-21T11:37:39.3693410Z  'range_warp_specializes': [],
2026-02-21T11:37:39.3693515Z  'waves_per_eu': 2}
2026-02-21T11:37:39.3734387Z [1283s] Fitting surrogate: 1178 points, 1178 targets
2026-02-21T11:37:40.2469318Z [1284s] Generation 15 starting: 76 neighbors, 4 active search path(s)
2026-02-21T11:38:13.4800530Z [1317s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[2, 1], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:38:13.9969503Z [1318s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:38:21.9646644Z [1326s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:38:21.9662646Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 1.0 configs/s
2026-02-21T11:38:31.2676964Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 8.1 configs/s
2026-02-21T11:38:31.5864855Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s
2026-02-21T11:38:40.6860305Z [1345s] Generation 15 complete: 
2026-02-21T11:38:40.6860515Z error=4
2026-02-21T11:38:40.6860609Z timeout=3
2026-02-21T11:38:40.6860690Z ok=73
2026-02-21T11:38:40.6860825Z min=5.5934
2026-02-21T11:38:40.6860904Z mid=5.8480
2026-02-21T11:38:40.6860991Z max=73.9800
2026-02-21T11:38:40.6861094Z best={'block_sizes': [1, 64, 32],
2026-02-21T11:38:40.6861257Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:38:40.6861409Z  'l2_groupings': [8],
2026-02-21T11:38:40.6862079Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:38:40.6862201Z  'loop_orders': [[0, 1]],
2026-02-21T11:38:40.6862313Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:38:40.6862421Z  'num_sm_multiplier': 128,
2026-02-21T11:38:40.6862525Z  'num_stages': 4,
2026-02-21T11:38:40.6862631Z  'num_warps': 2,
2026-02-21T11:38:40.6862728Z  'pid_type': 'persistent_blocked',
2026-02-21T11:38:40.6862850Z  'range_flattens': [None, None],
2026-02-21T11:38:40.6862964Z  'range_multi_buffers': [False, False],
2026-02-21T11:38:40.6863085Z  'range_num_stages': [1, 1],
2026-02-21T11:38:40.6863189Z  'range_unroll_factors': [4, 4],
2026-02-21T11:38:40.6863306Z  'range_warp_specializes': [],
2026-02-21T11:38:40.6863526Z  'waves_per_eu': 2}
2026-02-21T11:38:40.6908653Z [1345s] Fitting surrogate: 1258 points, 1258 targets
2026-02-21T11:38:41.1490241Z [1345s] Generation 16 starting: 39 neighbors, 2 active search path(s)
2026-02-21T11:39:12.2084057Z [1376s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[1, 2], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:39:16.7610149Z [1381s] Timeout after 30s compiling Config(block_sizes=[1, 64, 64], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:39:16.7631322Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 0.9 configs/s
2026-02-21T11:39:22.0606233Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 39/39 7.3 configs/s
2026-02-21T11:39:22.2055874Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s
2026-02-21T11:39:26.2583551Z [1390s] Generation 16 complete: 
2026-02-21T11:39:26.2584321Z error=5
2026-02-21T11:39:26.2584479Z timeout=2
2026-02-21T11:39:26.2584637Z ok=34
2026-02-21T11:39:26.2584787Z min=5.6074
2026-02-21T11:39:26.2584944Z mid=5.8169
2026-02-21T11:39:26.2585094Z max=65.2135
2026-02-21T11:39:26.2585276Z best={'block_sizes': [1, 64, 32],
2026-02-21T11:39:26.2585578Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:39:26.2585894Z  'l2_groupings': [8],
2026-02-21T11:39:26.2586102Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:39:26.2586347Z  'loop_orders': [[0, 1]],
2026-02-21T11:39:26.2586701Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:39:26.2586911Z  'num_sm_multiplier': 128,
2026-02-21T11:39:26.2587115Z  'num_stages': 3,
2026-02-21T11:39:26.2587284Z  'num_warps': 2,
2026-02-21T11:39:26.2587494Z  'pid_type': 'persistent_blocked',
2026-02-21T11:39:26.2587726Z  'range_flattens': [None, None],
2026-02-21T11:39:26.2587953Z  'range_multi_buffers': [False, False],
2026-02-21T11:39:26.2588187Z  'range_num_stages': [1, 1],
2026-02-21T11:39:26.2588400Z  'range_unroll_factors': [4, 4],
2026-02-21T11:39:26.2588619Z  'range_warp_specializes': [],
2026-02-21T11:39:26.2588829Z  'waves_per_eu': 2}
2026-02-21T11:39:26.2625268Z [1390s] Fitting surrogate: 1299 points, 1299 targets
2026-02-21T11:39:26.5620425Z [1391s] Generation 17 starting: 20 neighbors, 1 active search path(s)
2026-02-21T11:39:50.7786813Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 0.9 configs/s
2026-02-21T11:39:54.4052513Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 5.6 configs/s
2026-02-21T11:39:54.4984884Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s
2026-02-21T11:39:57.0845453Z [1421s] Generation 17 complete: 
2026-02-21T11:39:57.0849145Z ok=21
2026-02-21T11:39:57.0850075Z min=5.6250
2026-02-21T11:39:57.0850301Z mid=7.2491
2026-02-21T11:39:57.0850523Z max=66.9534
2026-02-21T11:39:57.0850764Z best={'block_sizes': [1, 64, 32],
2026-02-21T11:39:57.0851222Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:39:57.0851663Z  'l2_groupings': [8],
2026-02-21T11:39:57.0851949Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:39:57.0852273Z  'loop_orders': [[0, 1]],
2026-02-21T11:39:57.0852543Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:39:57.0852831Z  'num_sm_multiplier': 128,
2026-02-21T11:39:57.0853098Z  'num_stages': 3,
2026-02-21T11:39:57.0853321Z  'num_warps': 2,
2026-02-21T11:39:57.0853736Z  'pid_type': 'persistent_blocked',
2026-02-21T11:39:57.0854058Z  'range_flattens': [None, None],
2026-02-21T11:39:57.0854386Z  'range_multi_buffers': [False, None],
2026-02-21T11:39:57.0854697Z  'range_num_stages': [1, 1],
2026-02-21T11:39:57.0854987Z  'range_unroll_factors': [4, 3],
2026-02-21T11:39:57.0855291Z  'range_warp_specializes': [],
2026-02-21T11:39:57.0855563Z  'waves_per_eu': 2}
2026-02-21T11:39:57.0886034Z [1421s] Fitting surrogate: 1320 points, 1320 targets
2026-02-21T11:39:57.4187736Z [1421s] Generation 18 starting: 20 neighbors, 1 active search path(s)
2026-02-21T11:40:10.4873386Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 2.0 configs/s
2026-02-21T11:40:13.7086928Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 6.3 configs/s
2026-02-21T11:40:13.7915489Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s
2026-02-21T11:40:16.1193156Z [1440s] Generation 18 complete: 
2026-02-21T11:40:16.1193617Z ok=21
2026-02-21T11:40:16.1193891Z min=5.5970
2026-02-21T11:40:16.1194099Z mid=7.5278
2026-02-21T11:40:16.1194303Z max=67.0089
2026-02-21T11:40:16.1195141Z best={'block_sizes': [1, 64, 32],
2026-02-21T11:40:16.1195556Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:40:16.1195957Z  'l2_groupings': [8],
2026-02-21T11:40:16.1196252Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:40:16.1196568Z  'loop_orders': [[0, 1]],
2026-02-21T11:40:16.1196846Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:40:16.1197129Z  'num_sm_multiplier': 64,
2026-02-21T11:40:16.1197386Z  'num_stages': 3,
2026-02-21T11:40:16.1197754Z  'num_warps': 2,
2026-02-21T11:40:16.1198013Z  'pid_type': 'persistent_blocked',
2026-02-21T11:40:16.1198329Z  'range_flattens': [None, None],
2026-02-21T11:40:16.1198632Z  'range_multi_buffers': [False, False],
2026-02-21T11:40:16.1198946Z  'range_num_stages': [1, 1],
2026-02-21T11:40:16.1199221Z  'range_unroll_factors': [4, 3],
2026-02-21T11:40:16.1199514Z  'range_warp_specializes': [],
2026-02-21T11:40:16.1199791Z  'waves_per_eu': 2}
2026-02-21T11:40:16.1237604Z [1440s] Fitting surrogate: 1341 points, 1341 targets
2026-02-21T11:40:16.4304717Z [1440s] Generation 19 starting: 18 neighbors, 1 active search path(s)
2026-02-21T11:40:31.3639025Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 1.5 configs/s
2026-02-21T11:40:33.4804988Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 18/18 8.9 configs/s
2026-02-21T11:40:33.5983758Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 - configs/s
2026-02-21T11:40:36.9386664Z [1461s] Generation 19 complete: 
2026-02-21T11:40:36.9389251Z ok=19
2026-02-21T11:40:36.9389563Z min=5.6048
2026-02-21T11:40:36.9390085Z mid=5.6658
2026-02-21T11:40:36.9390349Z max=19.9376
2026-02-21T11:40:36.9390634Z best={'block_sizes': [1, 64, 32],
2026-02-21T11:40:36.9391064Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:40:36.9391466Z  'l2_groupings': [8],
2026-02-21T11:40:36.9392290Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:40:36.9392641Z  'loop_orders': [[0, 1]],
2026-02-21T11:40:36.9392916Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:40:36.9393212Z  'num_sm_multiplier': 64,
2026-02-21T11:40:36.9393479Z  'num_stages': 3,
2026-02-21T11:40:36.9393712Z  'num_warps': 2,
2026-02-21T11:40:36.9393973Z  'pid_type': 'persistent_blocked',
2026-02-21T11:40:36.9394303Z  'range_flattens': [None, None],
2026-02-21T11:40:36.9394606Z  'range_multi_buffers': [False, False],
2026-02-21T11:40:36.9394925Z  'range_num_stages': [1, 1],
2026-02-21T11:40:36.9395210Z  'range_unroll_factors': [3, 2],
2026-02-21T11:40:36.9395516Z  'range_warp_specializes': [],
2026-02-21T11:40:36.9395794Z  'waves_per_eu': 2}
2026-02-21T11:40:36.9431617Z [1461s] Fitting surrogate: 1360 points, 1360 targets
2026-02-21T11:40:37.2567677Z [1461s] Generation 20 starting: 20 neighbors, 1 active search path(s)
2026-02-21T11:40:46.7160710Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 1.9 configs/s
2026-02-21T11:40:49.6016614Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 20/20 7.4 configs/s
2026-02-21T11:40:49.6844841Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 - configs/s
2026-02-21T11:40:52.0010313Z [1476s] Generation 20 complete: 
2026-02-21T11:40:52.0010541Z error=2
2026-02-21T11:40:52.0010657Z ok=19
2026-02-21T11:40:52.0010770Z min=5.5780
2026-02-21T11:40:52.0010847Z mid=5.9763
2026-02-21T11:40:52.0010927Z max=74.4706
2026-02-21T11:40:52.0011017Z best={'block_sizes': [1, 64, 32],
2026-02-21T11:40:52.0011166Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T11:40:52.0011327Z  'l2_groupings': [8],
2026-02-21T11:40:52.0011430Z  'load_eviction_policies': ['', '', ''],
2026-02-21T11:40:52.0011549Z  'loop_orders': [[0, 1]],
2026-02-21T11:40:52.0011651Z  'matrix_instr_nonkdim': 32,
2026-02-21T11:40:52.0011757Z  'num_sm_multiplier': 128,
2026-02-21T11:40:52.0011855Z  'num_stages': 3,
2026-02-21T11:40:52.0011942Z  'num_warps': 2,
2026-02-21T11:40:52.0012044Z  'pid_type': 'persistent_blocked',
2026-02-21T11:40:52.0012159Z  'range_flattens': [None, None],
2026-02-21T11:40:52.0012271Z  'range_multi_buffers': [False, False],
2026-02-21T11:40:52.0012825Z  'range_num_stages': [1, 1],
2026-02-21T11:40:52.0012929Z  'range_unroll_factors': [3, 2],
2026-02-21T11:40:52.0013037Z  'range_warp_specializes': [],
2026-02-21T11:40:52.0013145Z  'waves_per_eu': 2}
2026-02-21T11:40:52.0064919Z [1476s] Fitting surrogate: 1381 points, 1381 targets
2026-02-21T11:40:52.1712119Z [1476s] Autotuning complete in 1476.7s after searching 1273 configs.
2026-02-21T11:40:52.1712328Z One can hardcode the best config and skip autotuning with:
2026-02-21T11:40:52.1713292Z     @helion.kernel(config=helion.Config(block_sizes=[1, 64, 32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T11:40:52.1714003Z 
2026-02-21T11:40:52.1714172Z [1476s] Code of selected kernel: /tmp/torchinductor_root/lq/clqqq6sh25ncvaga6kc7ssmw43ae45674ysypgbyxarb52rfaetm.py
2026-02-21T11:40:53.2268645Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T11:40:53.2271023Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 64, 32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T11:40:53.2273405Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T11:40:53.2273883Z WARNING:tritonbench.utils.triton_op:Completed input ID 5:
2026-02-21T11:40:53.2274303Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T11:40:53.2274631Z ------------------------------------------
2026-02-21T11:40:53.2274946Z (4, 48, 4096, 4096, 128)
2026-02-21T11:40:53.2275125Z 
2026-02-21T11:40:53.2278151Z  83%|████████▎ | 5/6 [56:26<15:12, 912.21s/it]WARNING:tritonbench.utils.triton_op:Running input ID 6:
2026-02-21T11:40:53.2278719Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T11:40:53.2279072Z ------------------------------------------
2026-02-21T11:40:53.2279385Z (4, 48, 8192, 8192, 128)
2026-02-21T11:40:53.2280996Z INFO:tritonbench.utils.triton_op:Took 0.09ms to get benchmark function for aten
2026-02-21T11:40:55.1044292Z INFO:tritonbench.utils.triton_op:Took 1.51ms to get benchmark function for flex_attention
2026-02-21T11:40:56.9080211Z WARNING:__main__:Input tensor metadata:
2026-02-21T11:40:56.9080469Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T11:40:56.9080690Z               'dtype': 'torch.bfloat16',
2026-02-21T11:40:56.9080910Z               'shape': (4, 48, 8192, 128),
2026-02-21T11:40:56.9081139Z               'stride': (50331648, 1048576, 128, 1)},
2026-02-21T11:40:56.9081366Z             { 'device': 'cuda:0',
2026-02-21T11:40:56.9081561Z               'dtype': 'torch.bfloat16',
2026-02-21T11:40:56.9081764Z               'shape': (4, 48, 8192, 128),
2026-02-21T11:40:56.9081974Z               'stride': (50331648, 1048576, 128, 1)},
2026-02-21T11:40:56.9082669Z             { 'device': 'cuda:0',
2026-02-21T11:40:56.9082866Z               'dtype': 'torch.bfloat16',
2026-02-21T11:40:56.9083078Z               'shape': (4, 48, 8192, 128),
2026-02-21T11:40:56.9083289Z               'stride': (50331648, 1048576, 128, 1)}),
2026-02-21T11:40:56.9083499Z   'kwargs': {}}
2026-02-21T11:40:56.9136637Z INFO:tritonbench.utils.triton_op:Took 6.06ms to get benchmark function for helion_attention
2026-02-21T11:40:57.1575150Z [0s] Autotune random seed: 2144140282
2026-02-21T11:40:57.4098462Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T11:41:30.6760529Z [33s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[3, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:41:30.9358227Z [33s] Timeout after 30s compiling Config(block_sizes=[1, 16, 2048], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:41:31.4552729Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 1, 2048], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:41:32.3764060Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 2, 8192], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:41:33.8171506Z [36s] Timeout after 30s compiling Config(block_sizes=[1, 16, 2048], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:41:34.0972910Z [36s] Timeout after 30s compiling Config(block_sizes=[1, 64, 512], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[4, 0], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:41:34.6625390Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 4, 512], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:41:35.0302585Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 1, 4096], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, None], range_num_stages=[0, 2], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:41:37.7973091Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, False], range_num_stages=[3, 4], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:41:38.3975151Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 64, 512], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[2, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T11:41:38.8752831Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:41:40.7618844Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:41:41.2183102Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 8, 1024], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:41:41.4614437Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 16, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, None], range_num_stages=[4, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:41:41.7355548Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 16, 256], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[1, 3], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:41:41.9487626Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 8, 4096], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:41:42.2679522Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 256, 1024], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[4, 3], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:41:43.6329803Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[1, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:41:44.5807226Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 2, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:41:44.8622570Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 8192, 32], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[1, 0], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T11:41:45.1242714Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[2, 0], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:41:45.3997395Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 1, 1024], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[False, False], range_num_stages=[1, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:41:45.6188797Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T11:41:46.1978697Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 16, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:41:46.5733738Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 256, 512], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T11:41:46.5754630Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.5 configs/s
2026-02-21T11:59:18.7145316Z /tmp/torchinductor_root/o6/co6rjuctqpaz75vipxwjv5tecvb22lqonz5hnx6w6pycwovwezru.py:88:135: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T11:59:18.7146501Z             v = tl.load(v_view + (indices_0[:, None, None] * 1048576 + indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None)
2026-02-21T11:59:18.7147935Z                                                                                                                                       ^
2026-02-21T11:59:18.7149936Z /tmp/torchinductor_root/o6/co6rjuctqpaz75vipxwjv5tecvb22lqonz5hnx6w6pycwovwezru.py:92:144: note: - use: %134 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x16x128xbf16, #ttg.blocked<{sizePerThread = [1, 1, 4], threadsPerWarp = [1, 2, 32], warpsPerCTA = [1, 8, 1], order = [2, 0, 1]}>>) -> tensor<16x128xbf16, #ttg.linear<{register = [[0, 1], [0, 2]], lane = [[0, 4], [0, 8], [0, 16], [0, 32], [0, 64], [1, 0]], warp = [[2, 0], [4, 0], [8, 0]], block = []}>>
2026-02-21T11:59:18.7151473Z 
2026-02-21T11:59:18.7152672Z             acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128])
2026-02-21T11:59:18.7153729Z                                                                                                                                                ^
2026-02-21T11:59:18.7154135Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T11:59:18.7154671Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}>
2026-02-21T11:59:18.7155517Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}>
2026-02-21T11:59:18.7156217Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [16, 4, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:59:18.7156921Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:59:18.7157617Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 4, 16], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:59:18.7158305Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T11:59:18.7158986Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T11:59:18.7159674Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T11:59:18.7160336Z #blocked8 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}>
2026-02-21T11:59:18.7160964Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}>
2026-02-21T11:59:18.7161520Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}>
2026-02-21T11:59:18.7162012Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}>
2026-02-21T11:59:18.7162718Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}>
2026-02-21T11:59:18.7163206Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}>
2026-02-21T11:59:18.7163698Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 8, 1], order = [2, 1, 0]}>
2026-02-21T11:59:18.7164189Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [16, 4, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}>
2026-02-21T11:59:18.7164781Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}>
2026-02-21T11:59:18.7165270Z #blocked17 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}>
2026-02-21T11:59:18.7165797Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T11:59:18.7166598Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T11:59:18.7167231Z     %c1048576_i32 = arith.constant 1048576 : i32
2026-02-21T11:59:18.7167443Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T11:59:18.7167640Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T11:59:18.7167836Z     %c1048576_i64 = arith.constant 1048576 : i64
2026-02-21T11:59:18.7168126Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x4x128xbf16, #blocked>
2026-02-21T11:59:18.7168448Z     %cst_0 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked1>
2026-02-21T11:59:18.7168731Z     %cst_1 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked1>
2026-02-21T11:59:18.7169018Z     %cst_2 = arith.constant dense<8192> : tensor<1x4x1xi64, #blocked2>
2026-02-21T11:59:18.7169293Z     %cst_3 = arith.constant dense<0> : tensor<1x4x1xi64, #blocked2>
2026-02-21T11:59:18.7169573Z     %cst_4 = arith.constant dense<128> : tensor<1x4x1xi64, #blocked2>
2026-02-21T11:59:18.7169808Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T11:59:18.7169995Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T11:59:18.7170216Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T11:59:18.7170450Z     %cst_5 = arith.constant dense<128> : tensor<1x16x1xi32, #blocked3>
2026-02-21T11:59:18.7170791Z     %cst_6 = arith.constant dense<0.127517432> : tensor<1x4x16xf32, #blocked4>
2026-02-21T11:59:18.7171075Z     %cst_7 = arith.constant dense<0.127517432> : tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7171309Z     %cst_8 = arith.constant dense<0.000000e+00> : tensor<4x16xf32, #blocked6>
2026-02-21T11:59:18.7171539Z     %cst_9 = arith.constant dense<128> : tensor<1x1x16xi32, #blocked7>
2026-02-21T11:59:18.7171725Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T11:59:18.7171908Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<1x4x128xf32, #blocked>
2026-02-21T11:59:18.7172147Z     %cst_11 = arith.constant dense<1.000000e+00> : tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7172384Z     %cst_12 = arith.constant dense<0xFF800000> : tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7172572Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T11:59:18.7172719Z     %c192_i32 = arith.constant 192 : i32
2026-02-21T11:59:18.7172868Z     %c393216_i32 = arith.constant 393216 : i32
2026-02-21T11:59:18.7173020Z     %c21_i32 = arith.constant 21 : i32
2026-02-21T11:59:18.7173167Z     %0 = tt.get_program_id x : i32
2026-02-21T11:59:18.7173297Z     %1 = arith.muli %0, %c21_i32 : i32
2026-02-21T11:59:18.7173438Z     %2 = arith.addi %1, %c21_i32 : i32
2026-02-21T11:59:18.7173577Z     %3 = arith.minsi %2, %c393216_i32 : i32
2026-02-21T11:59:18.7173777Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked8>
2026-02-21T11:59:18.7174052Z     %5 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x4x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:59:18.7174297Z     %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked8>
2026-02-21T11:59:18.7174533Z     %7 = arith.extsi %6 : tensor<4xi32, #blocked8> to tensor<4xi64, #blocked8>
2026-02-21T11:59:18.7174771Z     %8 = arith.extsi %4 : tensor<128xi32, #blocked8> to tensor<128xi64, #blocked8>
2026-02-21T11:59:18.7175091Z     %9 = ttg.convert_layout %8 : tensor<128xi64, #blocked8> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:59:18.7175510Z     %10 = tt.expand_dims %9 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi64, #blocked9>
2026-02-21T11:59:18.7175875Z     %11 = ttg.convert_layout %10 : tensor<1x128xi64, #blocked9> -> tensor<1x128xi64, #blocked10>
2026-02-21T11:59:18.7176233Z     %12 = ttg.convert_layout %11 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>>
2026-02-21T11:59:18.7176655Z     %13 = tt.expand_dims %12 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi64, #blocked11>
2026-02-21T11:59:18.7177022Z     %14 = ttg.convert_layout %13 : tensor<1x1x128xi64, #blocked11> -> tensor<1x1x128xi64, #blocked1>
2026-02-21T11:59:18.7177323Z     %15 = tt.broadcast %14 : tensor<1x1x128xi64, #blocked1> -> tensor<1x4x128xi64, #blocked1>
2026-02-21T11:59:18.7177618Z     %16 = ttg.convert_layout %15 : tensor<1x4x128xi64, #blocked1> -> tensor<1x4x128xi64, #blocked>
2026-02-21T11:59:18.7177874Z     %17 = arith.cmpi sge, %14, %cst_1 : tensor<1x1x128xi64, #blocked1>
2026-02-21T11:59:18.7178102Z     %18 = arith.cmpi slt, %14, %cst_0 : tensor<1x1x128xi64, #blocked1>
2026-02-21T11:59:18.7178313Z     %19 = arith.andi %17, %18 : tensor<1x1x128xi1, #blocked1>
2026-02-21T11:59:18.7178549Z     %20 = tt.broadcast %19 : tensor<1x1x128xi1, #blocked1> -> tensor<1x4x128xi1, #blocked1>
2026-02-21T11:59:18.7178836Z     %21 = ttg.convert_layout %20 : tensor<1x4x128xi1, #blocked1> -> tensor<1x4x128xi1, #blocked>
2026-02-21T11:59:18.7179109Z     %22 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked8>
2026-02-21T11:59:18.7179422Z     %23 = ttg.convert_layout %4 : tensor<128xi32, #blocked8> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:59:18.7179835Z     %24 = tt.expand_dims %23 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x128xi32, #blocked9>
2026-02-21T11:59:18.7180183Z     %25 = ttg.convert_layout %24 : tensor<1x128xi32, #blocked9> -> tensor<1x128xi32, #blocked10>
2026-02-21T11:59:18.7180540Z     %26 = ttg.convert_layout %25 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked12}>>
2026-02-21T11:59:18.7180921Z     %27 = tt.expand_dims %26 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x128x1xi32, #blocked12>
2026-02-21T11:59:18.7181226Z     %28 = ttg.convert_layout %27 : tensor<1x128x1xi32, #blocked12> -> tensor<1x128x1xi32, #blocked13>
2026-02-21T11:59:18.7181469Z     %29 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x16x!tt.ptr<bf16>, #blocked14>
2026-02-21T11:59:18.7181742Z     %30 = ttg.convert_layout %25 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>>
2026-02-21T11:59:18.7182087Z     %31 = tt.expand_dims %30 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11>
2026-02-21T11:59:18.7182394Z     %32 = ttg.convert_layout %31 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked1>
2026-02-21T11:59:18.7182649Z     %33 = tt.broadcast %32 : tensor<1x1x128xi32, #blocked1> -> tensor<1x16x128xi32, #blocked1>
2026-02-21T11:59:18.7182900Z     %34 = ttg.convert_layout %33 : tensor<1x16x128xi32, #blocked1> -> tensor<1x16x128xi32, #blocked>
2026-02-21T11:59:18.7183132Z     %35 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x16x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:59:18.7183359Z     %36 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x4x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:59:18.7183539Z     scf.for %arg4 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T11:59:18.7183675Z       %37 = arith.remsi %arg4, %c192_i32 : i32
2026-02-21T11:59:18.7183806Z       %38 = arith.divsi %arg4, %c192_i32 : i32
2026-02-21T11:59:18.7183931Z       %39 = arith.muli %38, %c4_i32 : i32
2026-02-21T11:59:18.7184055Z       %40 = arith.extsi %37 : i32 to i64
2026-02-21T11:59:18.7184195Z       %41 = arith.extsi %39 : i32 to i64
2026-02-21T11:59:18.7184320Z       %42 = arith.muli %40, %c1048576_i64 : i64
2026-02-21T11:59:18.7184491Z       %43 = tt.splat %42 : i64 -> tensor<1x4x128xi64, #blocked>
2026-02-21T11:59:18.7184649Z       %44 = tt.splat %41 : i64 -> tensor<4xi64, #blocked8>
2026-02-21T11:59:18.7184802Z       %45 = arith.addi %44, %7 : tensor<4xi64, #blocked8>
2026-02-21T11:59:18.7185032Z       %46 = ttg.convert_layout %45 : tensor<4xi64, #blocked8> -> tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:59:18.7185354Z       %47 = tt.expand_dims %46 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x4xi64, #blocked9>
2026-02-21T11:59:18.7185638Z       %48 = ttg.convert_layout %47 : tensor<1x4xi64, #blocked9> -> tensor<1x4xi64, #blocked5>
2026-02-21T11:59:18.7185918Z       %49 = ttg.convert_layout %48 : tensor<1x4xi64, #blocked5> -> tensor<1x4xi64, #ttg.slice<{dim = 2, parent = #blocked15}>>
2026-02-21T11:59:18.7186254Z       %50 = tt.expand_dims %49 {axis = 2 : i32} : tensor<1x4xi64, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x4x1xi64, #blocked15>
2026-02-21T11:59:18.7186575Z       %51 = ttg.convert_layout %50 : tensor<1x4x1xi64, #blocked15> -> tensor<1x4x1xi64, #blocked2>
2026-02-21T11:59:18.7186785Z       %52 = arith.muli %51, %cst_4 : tensor<1x4x1xi64, #blocked2>
2026-02-21T11:59:18.7195378Z       %53 = tt.broadcast %52 : tensor<1x4x1xi64, #blocked2> -> tensor<1x4x128xi64, #blocked2>
2026-02-21T11:59:18.7195633Z       %54 = ttg.convert_layout %53 : tensor<1x4x128xi64, #blocked2> -> tensor<1x4x128xi64, #blocked>
2026-02-21T11:59:18.7195861Z       %55 = arith.addi %54, %16 : tensor<1x4x128xi64, #blocked>
2026-02-21T11:59:18.7196026Z       %56 = arith.addi %43, %55 : tensor<1x4x128xi64, #blocked>
2026-02-21T11:59:18.7196276Z       %57 = tt.addptr %5, %56 : tensor<1x4x128x!tt.ptr<bf16>, #blocked>, tensor<1x4x128xi64, #blocked>
2026-02-21T11:59:18.7196481Z       %58 = arith.cmpi sge, %40, %c0_i64 : i64
2026-02-21T11:59:18.7196623Z       %59 = arith.cmpi slt, %40, %c192_i64 : i64
2026-02-21T11:59:18.7196747Z       %60 = arith.andi %58, %59 : i1
2026-02-21T11:59:18.7196896Z       %61 = arith.cmpi sge, %51, %cst_3 : tensor<1x4x1xi64, #blocked2>
2026-02-21T11:59:18.7197070Z       %62 = arith.cmpi slt, %51, %cst_2 : tensor<1x4x1xi64, #blocked2>
2026-02-21T11:59:18.7197241Z       %63 = arith.andi %61, %62 : tensor<1x4x1xi1, #blocked2>
2026-02-21T11:59:18.7197397Z       %64 = tt.splat %60 : i1 -> tensor<1x4x1xi1, #blocked2>
2026-02-21T11:59:18.7197555Z       %65 = arith.andi %64, %63 : tensor<1x4x1xi1, #blocked2>
2026-02-21T11:59:18.7197750Z       %66 = tt.broadcast %65 : tensor<1x4x1xi1, #blocked2> -> tensor<1x4x128xi1, #blocked2>
2026-02-21T11:59:18.7197988Z       %67 = ttg.convert_layout %66 : tensor<1x4x128xi1, #blocked2> -> tensor<1x4x128xi1, #blocked>
2026-02-21T11:59:18.7198201Z       %68 = arith.andi %67, %21 : tensor<1x4x128xi1, #blocked>
2026-02-21T11:59:18.7198368Z       %69 = tt.load %57, %68, %cst : tensor<1x4x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:59:18.7198531Z       %70 = arith.muli %37, %c1048576_i32 : i32
2026-02-21T11:59:18.7198677Z       %71 = tt.splat %70 : i32 -> tensor<1x128x1xi32, #blocked13>
2026-02-21T11:59:18.7198848Z       %72 = arith.addi %71, %28 : tensor<1x128x1xi32, #blocked13>
2026-02-21T11:59:18.7199055Z       %73 = tt.broadcast %72 : tensor<1x128x1xi32, #blocked13> -> tensor<1x128x16xi32, #blocked13>
2026-02-21T11:59:18.7199319Z       %74 = ttg.convert_layout %73 : tensor<1x128x16xi32, #blocked13> -> tensor<1x128x16xi32, #blocked14>
2026-02-21T11:59:18.7199595Z       %75 = tt.reshape %69 : tensor<1x4x128xbf16, #blocked> -> tensor<4x128xbf16, #blocked10>
2026-02-21T11:59:18.7199789Z       %76 = tt.splat %70 : i32 -> tensor<1x16x1xi32, #blocked3>
2026-02-21T11:59:18.7200152Z       %77:3 = scf.for %arg5 = %c0_i32 to %c8192_i32 step %c16_i32 iter_args(%arg6 = %cst_12, %arg7 = %cst_11, %arg8 = %cst_10) -> (tensor<1x4xf32, #blocked5>, tensor<1x4xf32, #blocked5>, tensor<1x4x128xf32, #blocked>)  : i32 {
2026-02-21T11:59:18.7200537Z         %86 = tt.splat %arg5 : i32 -> tensor<16xi32, #blocked8>
2026-02-21T11:59:18.7200690Z         %87 = arith.addi %86, %22 : tensor<16xi32, #blocked8>
2026-02-21T11:59:18.7200926Z         %88 = ttg.convert_layout %87 : tensor<16xi32, #blocked8> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked9}>>
2026-02-21T11:59:18.7201253Z         %89 = tt.expand_dims %88 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked9}>> -> tensor<1x16xi32, #blocked9>
2026-02-21T11:59:18.7201545Z         %90 = ttg.convert_layout %89 : tensor<1x16xi32, #blocked9> -> tensor<1x16xi32, #blocked6>
2026-02-21T11:59:18.7201831Z         %91 = ttg.convert_layout %90 : tensor<1x16xi32, #blocked6> -> tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked16}>>
2026-02-21T11:59:18.7202171Z         %92 = tt.expand_dims %91 {axis = 1 : i32} : tensor<1x16xi32, #ttg.slice<{dim = 1, parent = #blocked16}>> -> tensor<1x1x16xi32, #blocked16>
2026-02-21T11:59:18.7202478Z         %93 = ttg.convert_layout %92 : tensor<1x1x16xi32, #blocked16> -> tensor<1x1x16xi32, #blocked7>
2026-02-21T11:59:18.7202758Z         %94 = arith.muli %93, %cst_9 : tensor<1x1x16xi32, #blocked7>
2026-02-21T11:59:18.7202979Z         %95 = tt.broadcast %94 : tensor<1x1x16xi32, #blocked7> -> tensor<1x128x16xi32, #blocked7>
2026-02-21T11:59:18.7203236Z         %96 = ttg.convert_layout %95 : tensor<1x128x16xi32, #blocked7> -> tensor<1x128x16xi32, #blocked14>
2026-02-21T11:59:18.7203455Z         %97 = arith.addi %74, %96 : tensor<1x128x16xi32, #blocked14>
2026-02-21T11:59:18.7203681Z         %98 = tt.addptr %29, %97 : tensor<1x128x16x!tt.ptr<bf16>, #blocked14>, tensor<1x128x16xi32, #blocked14>
2026-02-21T11:59:18.7203903Z         %99 = tt.load %98 : tensor<1x128x16x!tt.ptr<bf16>, #blocked14>
2026-02-21T11:59:18.7204104Z         %100 = tt.reshape %99 : tensor<1x128x16xbf16, #blocked14> -> tensor<128x16xbf16, #blocked6>
2026-02-21T11:59:18.7204433Z         %101 = ttg.convert_layout %75 : tensor<4x128xbf16, #blocked10> -> tensor<4x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>>
2026-02-21T11:59:18.7204790Z         %102 = ttg.convert_layout %100 : tensor<128x16xbf16, #blocked6> -> tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>>
2026-02-21T11:59:18.7205098Z         %103 = ttg.convert_layout %cst_8 : tensor<4x16xf32, #blocked6> -> tensor<4x16xf32, #blocked6>
2026-02-21T11:59:18.7205511Z         %104 = tt.dot %101, %102, %103, inputPrecision = tf32 : tensor<4x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<4x16xf32, #blocked6>
2026-02-21T11:59:18.7205907Z         %105 = tt.reshape %104 : tensor<4x16xf32, #blocked6> -> tensor<1x4x16xf32, #blocked4>
2026-02-21T11:59:18.7206151Z         %106 = arith.truncf %105 : tensor<1x4x16xf32, #blocked4> to tensor<1x4x16xbf16, #blocked4>
2026-02-21T11:59:18.7206397Z         %107 = arith.extf %106 : tensor<1x4x16xbf16, #blocked4> to tensor<1x4x16xf32, #blocked4>
2026-02-21T11:59:18.7206593Z         %108 = "tt.reduce"(%107) <{axis = 2 : i32}> ({
2026-02-21T11:59:18.7206729Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T11:59:18.7206854Z           %160 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T11:59:18.7206988Z           tt.reduce.return %160 : f32
2026-02-21T11:59:18.7207181Z         }) : (tensor<1x4x16xf32, #blocked4>) -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked4}>>
2026-02-21T11:59:18.7207477Z         %109 = ttg.convert_layout %108 : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked4}>> -> tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7207778Z         %110 = arith.truncf %109 : tensor<1x4xf32, #blocked5> to tensor<1x4xbf16, #blocked5>
2026-02-21T11:59:18.7208008Z         %111 = arith.extf %110 : tensor<1x4xbf16, #blocked5> to tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7208212Z         %112 = arith.mulf %111, %cst_7 : tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7208411Z         %113 = arith.truncf %112 : tensor<1x4xf32, #blocked5> to tensor<1x4xbf16, #blocked5>
2026-02-21T11:59:18.7208657Z         %114 = arith.extf %113 : tensor<1x4xbf16, #blocked5> to tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7208856Z         %115 = arith.cmpf ogt, %arg6, %114 : tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7209036Z         %116 = arith.cmpf une, %arg6, %arg6 : tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7209211Z         %117 = arith.ori %115, %116 : tensor<1x4xi1, #blocked5>
2026-02-21T11:59:18.7209407Z         %118 = arith.select %117, %arg6, %114 : tensor<1x4xi1, #blocked5>, tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7209620Z         %119 = arith.mulf %107, %cst_6 : tensor<1x4x16xf32, #blocked4>
2026-02-21T11:59:18.7209825Z         %120 = arith.truncf %119 : tensor<1x4x16xf32, #blocked4> to tensor<1x4x16xbf16, #blocked4>
2026-02-21T11:59:18.7210117Z         %121 = ttg.convert_layout %118 : tensor<1x4xf32, #blocked5> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>>
2026-02-21T11:59:18.7210467Z         %122 = tt.expand_dims %121 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x4x1xf32, #blocked15>
2026-02-21T11:59:18.7210773Z         %123 = ttg.convert_layout %122 : tensor<1x4x1xf32, #blocked15> -> tensor<1x4x1xf32, #blocked2>
2026-02-21T11:59:18.7211044Z         %124 = arith.extf %120 : tensor<1x4x16xbf16, #blocked4> to tensor<1x4x16xf32, #blocked4>
2026-02-21T11:59:18.7211279Z         %125 = tt.broadcast %123 : tensor<1x4x1xf32, #blocked2> -> tensor<1x4x16xf32, #blocked2>
2026-02-21T11:59:18.7211530Z         %126 = ttg.convert_layout %125 : tensor<1x4x16xf32, #blocked2> -> tensor<1x4x16xf32, #blocked4>
2026-02-21T11:59:18.7211746Z         %127 = arith.subf %124, %126 : tensor<1x4x16xf32, #blocked4>
2026-02-21T11:59:18.7212052Z         %128 = tt.extern_elementwise %127 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x4x16xf32, #blocked4>) -> tensor<1x4x16xf32, #blocked4>
2026-02-21T11:59:18.7212364Z         %129 = "tt.reduce"(%128) <{axis = 2 : i32}> ({
2026-02-21T11:59:18.7212494Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T11:59:18.7212620Z           %160 = arith.addf %arg9, %arg10 : f32
2026-02-21T11:59:18.7212745Z           tt.reduce.return %160 : f32
2026-02-21T11:59:18.7212939Z         }) : (tensor<1x4x16xf32, #blocked4>) -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked4}>>
2026-02-21T11:59:18.7213234Z         %130 = ttg.convert_layout %129 : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked4}>> -> tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7213480Z         %131 = arith.subf %arg6, %118 : tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7213772Z         %132 = tt.extern_elementwise %131 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x4xf32, #blocked5>) -> tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7214064Z         %133 = arith.mulf %arg7, %132 : tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7214226Z         %134 = arith.addf %133, %130 : tensor<1x4xf32, #blocked5>
2026-02-21T11:59:18.7214472Z         %135 = ttg.convert_layout %132 : tensor<1x4xf32, #blocked5> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>>
2026-02-21T11:59:18.7214816Z         %136 = tt.expand_dims %135 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x4x1xf32, #blocked15>
2026-02-21T11:59:18.7215122Z         %137 = ttg.convert_layout %136 : tensor<1x4x1xf32, #blocked15> -> tensor<1x4x1xf32, #blocked2>
2026-02-21T11:59:18.7215373Z         %138 = tt.broadcast %137 : tensor<1x4x1xf32, #blocked2> -> tensor<1x4x128xf32, #blocked2>
2026-02-21T11:59:18.7215641Z         %139 = ttg.convert_layout %138 : tensor<1x4x128xf32, #blocked2> -> tensor<1x4x128xf32, #blocked>
2026-02-21T11:59:18.7215860Z         %140 = arith.mulf %arg8, %139 : tensor<1x4x128xf32, #blocked>
2026-02-21T11:59:18.7216114Z         %141 = ttg.convert_layout %90 : tensor<1x16xi32, #blocked6> -> tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked17}>>
2026-02-21T11:59:18.7216465Z         %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> -> tensor<1x16x1xi32, #blocked17>
2026-02-21T11:59:18.7216796Z         %143 = ttg.convert_layout %142 : tensor<1x16x1xi32, #blocked17> -> tensor<1x16x1xi32, #blocked3>
2026-02-21T11:59:18.7217014Z         %144 = arith.muli %143, %cst_5 : tensor<1x16x1xi32, #blocked3>
2026-02-21T11:59:18.7217188Z         %145 = arith.addi %76, %144 : tensor<1x16x1xi32, #blocked3>
2026-02-21T11:59:18.7217390Z         %146 = tt.broadcast %145 : tensor<1x16x1xi32, #blocked3> -> tensor<1x16x128xi32, #blocked3>
2026-02-21T11:59:18.7217652Z         %147 = ttg.convert_layout %146 : tensor<1x16x128xi32, #blocked3> -> tensor<1x16x128xi32, #blocked>
2026-02-21T11:59:18.7217871Z         %148 = arith.addi %147, %34 : tensor<1x16x128xi32, #blocked>
2026-02-21T11:59:18.7218086Z         %149 = tt.addptr %35, %148 : tensor<1x16x128x!tt.ptr<bf16>, #blocked>, tensor<1x16x128xi32, #blocked>
2026-02-21T11:59:18.7218307Z         %150 = tt.load %149 : tensor<1x16x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:59:18.7218513Z         %151 = arith.truncf %128 : tensor<1x4x16xf32, #blocked4> to tensor<1x4x16xbf16, #blocked4>
2026-02-21T11:59:18.7226940Z         %152 = tt.reshape %140 : tensor<1x4x128xf32, #blocked> -> tensor<4x128xf32, #blocked10>
2026-02-21T11:59:18.7227185Z         %153 = tt.reshape %151 : tensor<1x4x16xbf16, #blocked4> -> tensor<4x16xbf16, #blocked6>
2026-02-21T11:59:18.7227429Z         %154 = tt.reshape %150 : tensor<1x16x128xbf16, #blocked> -> tensor<16x128xbf16, #blocked10>
2026-02-21T11:59:18.7227737Z         %155 = ttg.convert_layout %153 : tensor<4x16xbf16, #blocked6> -> tensor<4x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>>
2026-02-21T11:59:18.7228096Z         %156 = ttg.convert_layout %154 : tensor<16x128xbf16, #blocked10> -> tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>>
2026-02-21T11:59:18.7228434Z         %157 = ttg.convert_layout %152 : tensor<4x128xf32, #blocked10> -> tensor<4x128xf32, #blocked10>
2026-02-21T11:59:18.7228864Z         %158 = tt.dot %155, %156, %157, inputPrecision = tf32 : tensor<4x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked10}>> * tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked10}>> -> tensor<4x128xf32, #blocked10>
2026-02-21T11:59:18.7229265Z         %159 = tt.reshape %158 : tensor<4x128xf32, #blocked10> -> tensor<1x4x128xf32, #blocked>
2026-02-21T11:59:18.7229539Z         scf.yield %118, %134, %159 : tensor<1x4xf32, #blocked5>, tensor<1x4xf32, #blocked5>, tensor<1x4x128xf32, #blocked>
2026-02-21T11:59:18.7229759Z       } {tt.num_stages = 4 : i32}
2026-02-21T11:59:18.7229979Z       %78 = ttg.convert_layout %77#1 : tensor<1x4xf32, #blocked5> -> tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>>
2026-02-21T11:59:18.7230323Z       %79 = tt.expand_dims %78 {axis = 2 : i32} : tensor<1x4xf32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x4x1xf32, #blocked15>
2026-02-21T11:59:18.7230620Z       %80 = ttg.convert_layout %79 : tensor<1x4x1xf32, #blocked15> -> tensor<1x4x1xf32, #blocked2>
2026-02-21T11:59:18.7230864Z       %81 = tt.broadcast %80 : tensor<1x4x1xf32, #blocked2> -> tensor<1x4x128xf32, #blocked2>
2026-02-21T11:59:18.7231111Z       %82 = ttg.convert_layout %81 : tensor<1x4x128xf32, #blocked2> -> tensor<1x4x128xf32, #blocked>
2026-02-21T11:59:18.7231323Z       %83 = arith.divf %77#2, %82 : tensor<1x4x128xf32, #blocked>
2026-02-21T11:59:18.7231526Z       %84 = arith.truncf %83 : tensor<1x4x128xf32, #blocked> to tensor<1x4x128xbf16, #blocked>
2026-02-21T11:59:18.7231771Z       %85 = tt.addptr %36, %56 : tensor<1x4x128x!tt.ptr<bf16>, #blocked>, tensor<1x4x128xi64, #blocked>
2026-02-21T11:59:18.7232003Z       tt.store %85, %84, %68 : tensor<1x4x128x!tt.ptr<bf16>, #blocked>
2026-02-21T11:59:18.7232203Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.loop_unroll_factor = 1 : i32}
2026-02-21T11:59:18.7232368Z     tt.return
2026-02-21T11:59:18.7232455Z   }
2026-02-21T11:59:18.7232535Z }
2026-02-21T11:59:18.7232582Z 
2026-02-21T11:59:18.7232620Z {-#
2026-02-21T11:59:18.7232702Z   external_resources: {
2026-02-21T11:59:18.7232836Z     mlir_reproducer: {
2026-02-21T11:59:18.7235080Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T11:59:18.7237383Z       disable_threading: false,
2026-02-21T11:59:18.7237502Z       verify_each: true
2026-02-21T11:59:18.7237596Z     }
2026-02-21T11:59:18.7237674Z   }
2026-02-21T11:59:18.7237744Z #-}
2026-02-21T11:59:18.7238025Z /tmp/torchinductor_root/o6/co6rjuctqpaz75vipxwjv5tecvb22lqonz5hnx6w6pycwovwezru.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T11:59:18.7238755Z /tmp/torchinductor_root/o6/co6rjuctqpaz75vipxwjv5tecvb22lqonz5hnx6w6pycwovwezru.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T11:59:18.7239311Z [1101s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T11:59:18.7240106Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 4, 16], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=4, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[0, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T11:59:18.7240831Z Error: RuntimeError: PassManager::run failed
2026-02-21T11:59:18.7241006Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T12:00:16.9520987Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 0.3 configs/s
2026-02-21T12:00:16.9532676Z [1159s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s])
2026-02-21T12:00:17.0246768Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6/6 - configs/s
2026-02-21T12:00:17.4295027Z [1160s] Initial random population of 100, 5 starting points: 
2026-02-21T12:00:17.4298417Z error=19
2026-02-21T12:00:17.4298741Z timeout=25
2026-02-21T12:00:17.4298945Z ok=56
2026-02-21T12:00:17.4299136Z min=31.6083
2026-02-21T12:00:17.4299767Z mid=276.2712
2026-02-21T12:00:17.4299996Z max=56214.3750
2026-02-21T12:00:17.4300257Z best={'block_sizes': [1, 256, 16],
2026-02-21T12:00:17.4300684Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T12:00:17.4301087Z  'l2_groupings': [8],
2026-02-21T12:00:17.4301356Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:00:17.4301639Z  'loop_orders': [[0, 1]],
2026-02-21T12:00:17.4301889Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:00:17.4302122Z  'num_sm_multiplier': 16,
2026-02-21T12:00:17.4302478Z  'num_stages': 4,
2026-02-21T12:00:17.4302665Z  'num_warps': 16,
2026-02-21T12:00:17.4302885Z  'pid_type': 'persistent_blocked',
2026-02-21T12:00:17.4303156Z  'range_flattens': [False, True],
2026-02-21T12:00:17.4303429Z  'range_multi_buffers': [None, False],
2026-02-21T12:00:17.4303684Z  'range_num_stages': [1, 3],
2026-02-21T12:00:17.4303919Z  'range_unroll_factors': [2, 3],
2026-02-21T12:00:17.4304161Z  'range_warp_specializes': [],
2026-02-21T12:00:17.4304387Z  'waves_per_eu': 1}
2026-02-21T12:00:17.4309385Z [1160s] Fitting surrogate: 100 points, 100 targets
2026-02-21T12:00:18.2834306Z [1160s] Generation 1 starting: 82 neighbors, 5 active search path(s)
2026-02-21T12:00:39.0100903Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 2.1 configs/s
2026-02-21T12:01:44.0125525Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 82/82 1.5 configs/s
2026-02-21T12:01:44.7338082Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 7/7 - configs/s
2026-02-21T12:01:49.7223982Z [1252s] Generation 1 complete: 
2026-02-21T12:01:49.7224357Z error=10
2026-02-21T12:01:49.7224566Z ok=77
2026-02-21T12:01:49.7224770Z min=25.8854
2026-02-21T12:01:49.7224974Z mid=64.4642
2026-02-21T12:01:49.7225597Z max=1690.6870
2026-02-21T12:01:49.7225846Z best={'block_sizes': [1, 256, 32],
2026-02-21T12:01:49.7226266Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T12:01:49.7226661Z  'l2_groupings': [8],
2026-02-21T12:01:49.7226959Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:01:49.7227291Z  'loop_orders': [[0, 1]],
2026-02-21T12:01:49.7227571Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:01:49.7227854Z  'num_sm_multiplier': 16,
2026-02-21T12:01:49.7228113Z  'num_stages': 4,
2026-02-21T12:01:49.7228343Z  'num_warps': 8,
2026-02-21T12:01:49.7228597Z  'pid_type': 'persistent_blocked',
2026-02-21T12:01:49.7228908Z  'range_flattens': [False, True],
2026-02-21T12:01:49.7229375Z  'range_multi_buffers': [True, False],
2026-02-21T12:01:49.7229691Z  'range_num_stages': [1, 3],
2026-02-21T12:01:49.7229982Z  'range_unroll_factors': [2, 3],
2026-02-21T12:01:49.7230286Z  'range_warp_specializes': [],
2026-02-21T12:01:49.7230557Z  'waves_per_eu': 1}
2026-02-21T12:01:49.7239075Z [1252s] Fitting surrogate: 187 points, 187 targets
2026-02-21T12:01:50.5941738Z [1253s] Generation 2 starting: 80 neighbors, 5 active search path(s)
2026-02-21T12:02:27.4604732Z [1290s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:02:29.4240209Z [1292s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:02:29.4265216Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 0.7 configs/s
2026-02-21T12:03:16.0525739Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 83/83 1.8 configs/s
2026-02-21T12:03:17.5014775Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 7/7 - configs/s
2026-02-21T12:03:27.5499759Z [1350s] Generation 2 complete: 
2026-02-21T12:03:27.5500179Z error=3
2026-02-21T12:03:27.5500392Z timeout=2
2026-02-21T12:03:27.5500604Z ok=81
2026-02-21T12:03:27.5500802Z min=25.8371
2026-02-21T12:03:27.5501056Z mid=31.6133
2026-02-21T12:03:27.5501251Z max=552.2924
2026-02-21T12:03:27.5501496Z best={'block_sizes': [1, 256, 32],
2026-02-21T12:03:27.5502424Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T12:03:27.5502831Z  'l2_groupings': [8],
2026-02-21T12:03:27.5503101Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:03:27.5503441Z  'loop_orders': [[0, 1]],
2026-02-21T12:03:27.5503713Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:03:27.5503997Z  'num_sm_multiplier': 16,
2026-02-21T12:03:27.5504258Z  'num_stages': 4,
2026-02-21T12:03:27.5504482Z  'num_warps': 8,
2026-02-21T12:03:27.5504755Z  'pid_type': 'persistent_blocked',
2026-02-21T12:03:27.5505058Z  'range_flattens': [False, True],
2026-02-21T12:03:27.5505364Z  'range_multi_buffers': [True, False],
2026-02-21T12:03:27.5505664Z  'range_num_stages': [1, 3],
2026-02-21T12:03:27.5505944Z  'range_unroll_factors': [2, 3],
2026-02-21T12:03:27.5506170Z  'range_warp_specializes': [],
2026-02-21T12:03:27.5506329Z  'waves_per_eu': 1}
2026-02-21T12:03:27.5519022Z [1350s] Fitting surrogate: 273 points, 273 targets
2026-02-21T12:03:28.4512333Z [1351s] Generation 3 starting: 77 neighbors, 5 active search path(s)
2026-02-21T12:03:53.4077472Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 0.8 configs/s
2026-02-21T12:04:39.8044797Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 79/79 1.6 configs/s
2026-02-21T12:04:41.0677143Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 7/7 - configs/s
2026-02-21T12:04:49.8138323Z [1432s] Generation 3 complete: 
2026-02-21T12:04:49.8138743Z error=1
2026-02-21T12:04:49.8139000Z ok=82
2026-02-21T12:04:49.8139207Z min=25.6264
2026-02-21T12:04:49.8139427Z mid=36.8950
2026-02-21T12:04:49.8139629Z max=402.1087
2026-02-21T12:04:49.8140081Z best={'block_sizes': [1, 256, 32],
2026-02-21T12:04:49.8140521Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T12:04:49.8140922Z  'l2_groupings': [8],
2026-02-21T12:04:49.8141756Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:04:49.8142070Z  'loop_orders': [[0, 1]],
2026-02-21T12:04:49.8142356Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:04:49.8142660Z  'num_sm_multiplier': 16,
2026-02-21T12:04:49.8143015Z  'num_stages': 4,
2026-02-21T12:04:49.8143268Z  'num_warps': 8,
2026-02-21T12:04:49.8143522Z  'pid_type': 'persistent_blocked',
2026-02-21T12:04:49.8143958Z  'range_flattens': [False, True],
2026-02-21T12:04:49.8144291Z  'range_multi_buffers': [True, False],
2026-02-21T12:04:49.8144673Z  'range_num_stages': [1, 3],
2026-02-21T12:04:49.8144954Z  'range_unroll_factors': [2, 3],
2026-02-21T12:04:49.8145241Z  'range_warp_specializes': [],
2026-02-21T12:04:49.8145452Z  'waves_per_eu': 1}
2026-02-21T12:04:49.8160953Z [1432s] Fitting surrogate: 356 points, 356 targets
2026-02-21T12:04:50.5773437Z [1433s] Generation 4 starting: 68 neighbors, 4 active search path(s)
2026-02-21T12:05:27.3670864Z [1469s] Timeout after 30s compiling Config(block_sizes=[1, 64, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:05:27.3693343Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70/70 0.5 configs/s
2026-02-21T12:06:08.6951612Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 70/70 1.4 configs/s
2026-02-21T12:06:09.3859292Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:06:15.5215444Z [1518s] Generation 4 complete: 
2026-02-21T12:06:15.5215839Z error=2
2026-02-21T12:06:15.5216054Z timeout=1
2026-02-21T12:06:15.5216251Z ok=70
2026-02-21T12:06:15.5216456Z min=20.8176
2026-02-21T12:06:15.5216659Z mid=44.1705
2026-02-21T12:06:15.5216861Z max=379.6740
2026-02-21T12:06:15.5217148Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:06:15.5217563Z  'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:06:15.5218469Z  'l2_groupings': [8],
2026-02-21T12:06:15.5218750Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:06:15.5219070Z  'loop_orders': [[0, 1]],
2026-02-21T12:06:15.5219340Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:06:15.5219773Z  'num_sm_multiplier': 128,
2026-02-21T12:06:15.5220046Z  'num_stages': 2,
2026-02-21T12:06:15.5220282Z  'num_warps': 8,
2026-02-21T12:06:15.5220540Z  'pid_type': 'persistent_blocked',
2026-02-21T12:06:15.5220860Z  'range_flattens': [False, None],
2026-02-21T12:06:15.5221185Z  'range_multi_buffers': [False, False],
2026-02-21T12:06:15.5221496Z  'range_num_stages': [3, 2],
2026-02-21T12:06:15.5221786Z  'range_unroll_factors': [4, 3],
2026-02-21T12:06:15.5222080Z  'range_warp_specializes': [],
2026-02-21T12:06:15.5222360Z  'waves_per_eu': 1}
2026-02-21T12:06:15.5237394Z [1518s] Fitting surrogate: 429 points, 429 targets
2026-02-21T12:06:16.2432993Z [1518s] Generation 5 starting: 71 neighbors, 4 active search path(s)
2026-02-21T12:06:52.4546048Z [1555s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:06:53.6054693Z [1556s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:06:54.3550180Z [1556s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:06:55.9905227Z [1558s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:07:05.0747387Z [1567s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:07:05.0761583Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.3 configs/s
2026-02-21T12:07:39.9752118Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 73/73 1.9 configs/s
2026-02-21T12:07:40.5741964Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:07:45.8667701Z [1608s] Generation 5 complete: 
2026-02-21T12:07:45.8668120Z error=5
2026-02-21T12:07:45.8668490Z timeout=5
2026-02-21T12:07:45.8668818Z ok=65
2026-02-21T12:07:45.8669125Z min=20.9904
2026-02-21T12:07:45.8669431Z mid=36.2941
2026-02-21T12:07:45.8669688Z max=326.3378
2026-02-21T12:07:45.8669970Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:07:45.8670410Z  'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:07:45.8670810Z  'l2_groupings': [8],
2026-02-21T12:07:45.8671849Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:07:45.8672210Z  'loop_orders': [[0, 1]],
2026-02-21T12:07:45.8672489Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:07:45.8672778Z  'num_sm_multiplier': 128,
2026-02-21T12:07:45.8673055Z  'num_stages': 2,
2026-02-21T12:07:45.8673292Z  'num_warps': 8,
2026-02-21T12:07:45.8673568Z  'pid_type': 'persistent_blocked',
2026-02-21T12:07:45.8673878Z  'range_flattens': [False, True],
2026-02-21T12:07:45.8674156Z  'range_multi_buffers': [False, False],
2026-02-21T12:07:45.8674401Z  'range_num_stages': [3, 2],
2026-02-21T12:07:45.8674621Z  'range_unroll_factors': [4, 3],
2026-02-21T12:07:45.8674848Z  'range_warp_specializes': [],
2026-02-21T12:07:45.8675213Z  'waves_per_eu': 2}
2026-02-21T12:07:45.8693123Z [1608s] Fitting surrogate: 504 points, 504 targets
2026-02-21T12:07:47.6951487Z [1610s] Generation 6 starting: 75 neighbors, 4 active search path(s)
2026-02-21T12:08:22.0283281Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 0.8 configs/s
2026-02-21T12:09:17.2038236Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 77/77 1.0 configs/s
2026-02-21T12:09:18.0854051Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:09:25.8956464Z [1708s] Generation 6 complete: 
2026-02-21T12:09:25.8956933Z error=2
2026-02-21T12:09:25.8957143Z ok=77
2026-02-21T12:09:25.8957347Z min=20.8192
2026-02-21T12:09:25.8957556Z mid=35.9649
2026-02-21T12:09:25.8957771Z max=296.3518
2026-02-21T12:09:25.8958031Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:09:25.8958431Z  'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:09:25.8960366Z  'l2_groupings': [8],
2026-02-21T12:09:25.8960728Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:09:25.8961057Z  'loop_orders': [[0, 1]],
2026-02-21T12:09:25.8961360Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:09:25.8961636Z  'num_sm_multiplier': 128,
2026-02-21T12:09:25.8961888Z  'num_stages': 2,
2026-02-21T12:09:25.8962099Z  'num_warps': 8,
2026-02-21T12:09:25.8962348Z  'pid_type': 'persistent_blocked',
2026-02-21T12:09:25.8962781Z  'range_flattens': [False, True],
2026-02-21T12:09:25.8963072Z  'range_multi_buffers': [False, False],
2026-02-21T12:09:25.8963356Z  'range_num_stages': [3, 2],
2026-02-21T12:09:25.8963631Z  'range_unroll_factors': [4, 3],
2026-02-21T12:09:25.8963913Z  'range_warp_specializes': [],
2026-02-21T12:09:25.8964168Z  'waves_per_eu': 2}
2026-02-21T12:09:25.8982723Z [1708s] Fitting surrogate: 583 points, 583 targets
2026-02-21T12:09:26.6863973Z [1709s] Generation 7 starting: 69 neighbors, 4 active search path(s)
2026-02-21T12:10:02.9214048Z [1745s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:10:04.1996603Z [1746s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:10:04.2020917Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.5 configs/s
2026-02-21T12:10:36.6106238Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 71/71 2.2 configs/s
2026-02-21T12:10:37.1603571Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:10:42.0149661Z [1784s] Generation 7 complete: 
2026-02-21T12:10:42.0149910Z error=9
2026-02-21T12:10:42.0150004Z timeout=2
2026-02-21T12:10:42.0150361Z ok=62
2026-02-21T12:10:42.0150452Z min=20.9846
2026-02-21T12:10:42.0150550Z mid=43.6457
2026-02-21T12:10:42.0150643Z max=215.3094
2026-02-21T12:10:42.0150755Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:10:42.0150952Z  'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:10:42.0151147Z  'l2_groupings': [8],
2026-02-21T12:10:42.0151274Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:10:42.0151421Z  'loop_orders': [[0, 1]],
2026-02-21T12:10:42.0151549Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:10:42.0151672Z  'num_sm_multiplier': 128,
2026-02-21T12:10:42.0151789Z  'num_stages': 1,
2026-02-21T12:10:42.0151893Z  'num_warps': 8,
2026-02-21T12:10:42.0152130Z  'pid_type': 'persistent_blocked',
2026-02-21T12:10:42.0152268Z  'range_flattens': [False, True],
2026-02-21T12:10:42.0152420Z  'range_multi_buffers': [False, False],
2026-02-21T12:10:42.0152565Z  'range_num_stages': [3, 2],
2026-02-21T12:10:42.0152691Z  'range_unroll_factors': [4, 3],
2026-02-21T12:10:42.0152828Z  'range_warp_specializes': [],
2026-02-21T12:10:42.0152949Z  'waves_per_eu': 2}
2026-02-21T12:10:42.0178456Z [1784s] Fitting surrogate: 656 points, 656 targets
2026-02-21T12:10:42.8180057Z [1785s] Generation 8 starting: 75 neighbors, 4 active search path(s)
2026-02-21T12:11:18.5984667Z [1821s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, False], range_num_stages=[3, 3], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:11:22.7572129Z [1825s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:11:23.1353673Z [1825s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, False], range_num_stages=[4, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:11:24.5823146Z [1827s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:11:26.8732251Z [1829s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[False, False], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:11:28.5244763Z [1831s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[4, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:11:30.4062016Z [1832s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:11:30.4089469Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 0.6 configs/s
2026-02-21T12:12:02.6255597Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 76/76 2.2 configs/s
2026-02-21T12:12:03.4341598Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:12:10.5909830Z [1873s] Generation 8 complete: 
2026-02-21T12:12:10.5910259Z error=10
2026-02-21T12:12:10.5910480Z timeout=7
2026-02-21T12:12:10.5910683Z ok=62
2026-02-21T12:12:10.5910888Z min=20.9229
2026-02-21T12:12:10.5911096Z mid=28.2292
2026-02-21T12:12:10.5911295Z max=546.9233
2026-02-21T12:12:10.5911542Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:12:10.5912019Z  'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:12:10.5912518Z  'l2_groupings': [8],
2026-02-21T12:12:10.5912807Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:12:10.5913135Z  'loop_orders': [[0, 1]],
2026-02-21T12:12:10.5913410Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:12:10.5913695Z  'num_sm_multiplier': 128,
2026-02-21T12:12:10.5913981Z  'num_stages': 2,
2026-02-21T12:12:10.5914234Z  'num_warps': 8,
2026-02-21T12:12:10.5914495Z  'pid_type': 'persistent_blocked',
2026-02-21T12:12:10.5914820Z  'range_flattens': [False, True],
2026-02-21T12:12:10.5915137Z  'range_multi_buffers': [True, False],
2026-02-21T12:12:10.5915446Z  'range_num_stages': [3, 2],
2026-02-21T12:12:10.5915736Z  'range_unroll_factors': [4, 3],
2026-02-21T12:12:10.5916030Z  'range_warp_specializes': [],
2026-02-21T12:12:10.5916308Z  'waves_per_eu': 2}
2026-02-21T12:12:10.5989631Z [1873s] Fitting surrogate: 735 points, 735 targets
2026-02-21T12:12:11.0253443Z [1873s] Generation 9 starting: 36 neighbors, 2 active search path(s)
2026-02-21T12:12:47.1357879Z [1909s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, None], range_num_stages=[4, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:12:47.1379761Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 0.5 configs/s
2026-02-21T12:13:12.4859477Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 37/37 1.4 configs/s
2026-02-21T12:13:12.8917710Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:13:16.4606362Z [1939s] Generation 9 complete: 
2026-02-21T12:13:16.4606794Z error=2
2026-02-21T12:13:16.4607043Z timeout=1
2026-02-21T12:13:16.4607252Z ok=35
2026-02-21T12:13:16.4607460Z min=20.9333
2026-02-21T12:13:16.4607663Z mid=31.0326
2026-02-21T12:13:16.4607865Z max=735.9344
2026-02-21T12:13:16.4608106Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:13:16.4608523Z  'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:13:16.4608925Z  'l2_groupings': [8],
2026-02-21T12:13:16.4609222Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:13:16.4609536Z  'loop_orders': [[0, 1]],
2026-02-21T12:13:16.4609831Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:13:16.4610108Z  'num_sm_multiplier': 64,
2026-02-21T12:13:16.4610371Z  'num_stages': 2,
2026-02-21T12:13:16.4610604Z  'num_warps': 8,
2026-02-21T12:13:16.4611378Z  'pid_type': 'persistent_blocked',
2026-02-21T12:13:16.4611701Z  'range_flattens': [True, True],
2026-02-21T12:13:16.4612001Z  'range_multi_buffers': [True, False],
2026-02-21T12:13:16.4612312Z  'range_num_stages': [3, 2],
2026-02-21T12:13:16.4612596Z  'range_unroll_factors': [4, 3],
2026-02-21T12:13:16.4612870Z  'range_warp_specializes': [],
2026-02-21T12:13:16.4613100Z  'waves_per_eu': 2}
2026-02-21T12:13:16.4632983Z [1939s] Fitting surrogate: 773 points, 773 targets
2026-02-21T12:13:16.9488714Z [1939s] Generation 10 starting: 35 neighbors, 2 active search path(s)
2026-02-21T12:13:52.3739692Z [1974s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:13:52.3762754Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0.8 configs/s
2026-02-21T12:14:04.9455579Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 2.7 configs/s
2026-02-21T12:14:05.1704646Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:14:07.1394471Z [1989s] Generation 10 complete: 
2026-02-21T12:14:07.1396428Z error=13
2026-02-21T12:14:07.1396625Z timeout=1
2026-02-21T12:14:07.1396719Z ok=23
2026-02-21T12:14:07.1396822Z min=20.9181
2026-02-21T12:14:07.1396908Z mid=38.9961
2026-02-21T12:14:07.1396990Z max=241.0125
2026-02-21T12:14:07.1397117Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:14:07.1397278Z  'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:14:07.1397444Z  'l2_groupings': [8],
2026-02-21T12:14:07.1397552Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:14:07.1397671Z  'loop_orders': [[0, 1]],
2026-02-21T12:14:07.1397777Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:14:07.1397891Z  'num_sm_multiplier': 32,
2026-02-21T12:14:07.1397990Z  'num_stages': 1,
2026-02-21T12:14:07.1398080Z  'num_warps': 8,
2026-02-21T12:14:07.1398175Z  'pid_type': 'persistent_blocked',
2026-02-21T12:14:07.1398295Z  'range_flattens': [True, True],
2026-02-21T12:14:07.1398668Z  'range_multi_buffers': [True, False],
2026-02-21T12:14:07.1398785Z  'range_num_stages': [3, 2],
2026-02-21T12:14:07.1398888Z  'range_unroll_factors': [4, 3],
2026-02-21T12:14:07.1399001Z  'range_warp_specializes': [],
2026-02-21T12:14:07.1399101Z  'waves_per_eu': 2}
2026-02-21T12:14:07.1411237Z [1989s] Fitting surrogate: 810 points, 810 targets
2026-02-21T12:14:07.6132016Z [1990s] Generation 11 starting: 35 neighbors, 2 active search path(s)
2026-02-21T12:14:39.4023916Z [2021s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:14:41.6968215Z [2024s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[4, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:14:43.8362494Z [2026s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:14:44.7578805Z [2027s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:14:44.7607117Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0.9 configs/s
2026-02-21T12:15:00.7957825Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 2.4 configs/s
2026-02-21T12:15:01.0920903Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:15:03.6902983Z [2046s] Generation 11 complete: 
2026-02-21T12:15:03.6903442Z error=2
2026-02-21T12:15:03.6903655Z timeout=4
2026-02-21T12:15:03.6903853Z ok=31
2026-02-21T12:15:03.6905327Z min=20.8761
2026-02-21T12:15:03.6905536Z mid=37.7577
2026-02-21T12:15:03.6905743Z max=209.6498
2026-02-21T12:15:03.6905995Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:15:03.6906448Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:15:03.6906836Z  'l2_groupings': [8],
2026-02-21T12:15:03.6907118Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:15:03.6907439Z  'loop_orders': [[0, 1]],
2026-02-21T12:15:03.6907710Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:15:03.6907994Z  'num_sm_multiplier': 32,
2026-02-21T12:15:03.6908263Z  'num_stages': 1,
2026-02-21T12:15:03.6908501Z  'num_warps': 8,
2026-02-21T12:15:03.6908755Z  'pid_type': 'persistent_blocked',
2026-02-21T12:15:03.6909573Z  'range_flattens': [True, True],
2026-02-21T12:15:03.6909876Z  'range_multi_buffers': [True, False],
2026-02-21T12:15:03.6910194Z  'range_num_stages': [3, 3],
2026-02-21T12:15:03.6910495Z  'range_unroll_factors': [4, 3],
2026-02-21T12:15:03.6910788Z  'range_warp_specializes': [],
2026-02-21T12:15:03.6910988Z  'waves_per_eu': 2}
2026-02-21T12:15:03.6932657Z [2046s] Fitting surrogate: 847 points, 847 targets
2026-02-21T12:15:04.1315872Z [2046s] Generation 12 starting: 35 neighbors, 2 active search path(s)
2026-02-21T12:15:36.7480115Z [2079s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, False], range_num_stages=[3, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:15:38.7938117Z [2081s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:15:39.1888827Z [2081s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[3, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:15:39.7470904Z [2082s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:15:39.7487340Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 1.0 configs/s
2026-02-21T12:15:58.1339230Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 1.9 configs/s
2026-02-21T12:15:58.3960478Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:16:00.6832188Z [2103s] Generation 12 complete: 
2026-02-21T12:16:00.6832447Z error=3
2026-02-21T12:16:00.6832933Z timeout=4
2026-02-21T12:16:00.6833063Z ok=30
2026-02-21T12:16:00.6833162Z min=20.9142
2026-02-21T12:16:00.6833279Z mid=36.8290
2026-02-21T12:16:00.6833380Z max=469.2883
2026-02-21T12:16:00.6833500Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:16:00.6833699Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:16:00.6833906Z  'l2_groupings': [8],
2026-02-21T12:16:00.6834045Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:16:00.6834200Z  'loop_orders': [[0, 1]],
2026-02-21T12:16:00.6834336Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:16:00.6834471Z  'num_sm_multiplier': 32,
2026-02-21T12:16:00.6834609Z  'num_stages': 2,
2026-02-21T12:16:00.6834718Z  'num_warps': 8,
2026-02-21T12:16:00.6834849Z  'pid_type': 'persistent_blocked',
2026-02-21T12:16:00.6834991Z  'range_flattens': [True, True],
2026-02-21T12:16:00.6835139Z  'range_multi_buffers': [True, False],
2026-02-21T12:16:00.6835292Z  'range_num_stages': [3, 3],
2026-02-21T12:16:00.6835428Z  'range_unroll_factors': [4, 3],
2026-02-21T12:16:00.6835574Z  'range_warp_specializes': [],
2026-02-21T12:16:00.6835702Z  'waves_per_eu': 2}
2026-02-21T12:16:00.6862605Z [2103s] Fitting surrogate: 884 points, 884 targets
2026-02-21T12:16:01.1579161Z [2103s] Generation 13 starting: 39 neighbors, 2 active search path(s)
2026-02-21T12:16:36.2046509Z [2138s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[3, 2], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:16:36.7042188Z [2139s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[3, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:16:39.4617238Z [2142s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:16:39.4639099Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 0.6 configs/s
2026-02-21T12:17:10.0427574Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 40/40 1.2 configs/s
2026-02-21T12:17:10.3244765Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:17:12.7938290Z [2175s] Generation 13 complete: 
2026-02-21T12:17:12.7938790Z error=5
2026-02-21T12:17:12.7939014Z timeout=3
2026-02-21T12:17:12.7939219Z ok=33
2026-02-21T12:17:12.7939433Z min=20.9302
2026-02-21T12:17:12.7939637Z mid=45.8267
2026-02-21T12:17:12.7939859Z max=532.5824
2026-02-21T12:17:12.7940114Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:17:12.7940517Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:17:12.7940954Z  'l2_groupings': [8],
2026-02-21T12:17:12.7941235Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:17:12.7941583Z  'loop_orders': [[0, 1]],
2026-02-21T12:17:12.7941864Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:17:12.7942153Z  'num_sm_multiplier': 32,
2026-02-21T12:17:12.7942415Z  'num_stages': 2,
2026-02-21T12:17:12.7942646Z  'num_warps': 8,
2026-02-21T12:17:12.7942917Z  'pid_type': 'persistent_blocked',
2026-02-21T12:17:12.7943229Z  'range_flattens': [None, True],
2026-02-21T12:17:12.7943546Z  'range_multi_buffers': [True, False],
2026-02-21T12:17:12.7943855Z  'range_num_stages': [3, 3],
2026-02-21T12:17:12.7944151Z  'range_unroll_factors': [4, 3],
2026-02-21T12:17:12.7944440Z  'range_warp_specializes': [],
2026-02-21T12:17:12.7944719Z  'waves_per_eu': 2}
2026-02-21T12:17:12.7968824Z [2175s] Fitting surrogate: 925 points, 925 targets
2026-02-21T12:17:14.3649088Z [2176s] Generation 14 starting: 32 neighbors, 2 active search path(s)
2026-02-21T12:17:50.1003982Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 0.7 configs/s
2026-02-21T12:18:24.2373133Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 33/33 0.7 configs/s
2026-02-21T12:18:24.4762061Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:18:26.5537660Z [2249s] Generation 14 complete: 
2026-02-21T12:18:26.5538044Z error=3
2026-02-21T12:18:26.5538249Z ok=31
2026-02-21T12:18:26.5538450Z min=20.8891
2026-02-21T12:18:26.5538671Z mid=42.1927
2026-02-21T12:18:26.5538905Z max=690.1064
2026-02-21T12:18:26.5539155Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:18:26.5540003Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:18:26.5540394Z  'l2_groupings': [8],
2026-02-21T12:18:26.5540675Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:18:26.5541006Z  'loop_orders': [[0, 1]],
2026-02-21T12:18:26.5541282Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:18:26.5541560Z  'num_sm_multiplier': 32,
2026-02-21T12:18:26.5541819Z  'num_stages': 2,
2026-02-21T12:18:26.5542045Z  'num_warps': 8,
2026-02-21T12:18:26.5542304Z  'pid_type': 'persistent_blocked',
2026-02-21T12:18:26.5542624Z  'range_flattens': [False, None],
2026-02-21T12:18:26.5542933Z  'range_multi_buffers': [True, False],
2026-02-21T12:18:26.5543245Z  'range_num_stages': [3, 3],
2026-02-21T12:18:26.5543525Z  'range_unroll_factors': [4, 3],
2026-02-21T12:18:26.5572466Z  'range_warp_specializes': [],
2026-02-21T12:18:26.5572634Z  'waves_per_eu': 2}
2026-02-21T12:18:26.5572796Z [2249s] Fitting surrogate: 959 points, 959 targets
2026-02-21T12:18:27.0047181Z [2249s] Generation 15 starting: 33 neighbors, 2 active search path(s)
2026-02-21T12:19:04.5151017Z [2287s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:19:04.5173676Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 0.7 configs/s
2026-02-21T12:19:21.7322711Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 2.0 configs/s
2026-02-21T12:19:22.0298053Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:19:24.6371414Z [2307s] Generation 15 complete: 
2026-02-21T12:19:24.6372370Z error=6
2026-02-21T12:19:24.6372588Z timeout=1
2026-02-21T12:19:24.6372797Z ok=28
2026-02-21T12:19:24.6372998Z min=20.8988
2026-02-21T12:19:24.6373221Z mid=39.6252
2026-02-21T12:19:24.6373421Z max=224.5784
2026-02-21T12:19:24.6373665Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:19:24.6374091Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:19:24.6374483Z  'l2_groupings': [8],
2026-02-21T12:19:24.6374764Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:19:24.6375076Z  'loop_orders': [[0, 1]],
2026-02-21T12:19:24.6375356Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:19:24.6375649Z  'num_sm_multiplier': 32,
2026-02-21T12:19:24.6375904Z  'num_stages': 2,
2026-02-21T12:19:24.6376133Z  'num_warps': 8,
2026-02-21T12:19:24.6376387Z  'pid_type': 'persistent_blocked',
2026-02-21T12:19:24.6376692Z  'range_flattens': [False, False],
2026-02-21T12:19:24.6376994Z  'range_multi_buffers': [True, False],
2026-02-21T12:19:24.6377301Z  'range_num_stages': [3, 3],
2026-02-21T12:19:24.6377584Z  'range_unroll_factors': [4, 3],
2026-02-21T12:19:24.6377876Z  'range_warp_specializes': [],
2026-02-21T12:19:24.6378156Z  'waves_per_eu': 2}
2026-02-21T12:19:24.6405771Z [2307s] Fitting surrogate: 994 points, 994 targets
2026-02-21T12:19:25.0667668Z [2307s] Generation 16 starting: 35 neighbors, 2 active search path(s)
2026-02-21T12:19:58.8084936Z [2341s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, False], range_num_stages=[4, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:20:00.9846234Z [2343s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:20:00.9868042Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 1.0 configs/s
2026-02-21T12:20:24.0674241Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 1.5 configs/s
2026-02-21T12:20:24.2946003Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:20:26.2798198Z [2368s] Generation 16 complete: 
2026-02-21T12:20:26.2798551Z error=4
2026-02-21T12:20:26.2798759Z timeout=2
2026-02-21T12:20:26.2798963Z ok=31
2026-02-21T12:20:26.2799162Z min=20.9688
2026-02-21T12:20:26.2799373Z mid=37.6421
2026-02-21T12:20:26.2799584Z max=430.6540
2026-02-21T12:20:26.2799828Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:20:26.2800261Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:20:26.2800654Z  'l2_groupings': [8],
2026-02-21T12:20:26.2800942Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:20:26.2801279Z  'loop_orders': [[0, 1]],
2026-02-21T12:20:26.2801560Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:20:26.2801845Z  'num_sm_multiplier': 32,
2026-02-21T12:20:26.2802110Z  'num_stages': 2,
2026-02-21T12:20:26.2802968Z  'num_warps': 8,
2026-02-21T12:20:26.2803253Z  'pid_type': 'persistent_blocked',
2026-02-21T12:20:26.2803567Z  'range_flattens': [False, False],
2026-02-21T12:20:26.2803872Z  'range_multi_buffers': [True, None],
2026-02-21T12:20:26.2804157Z  'range_num_stages': [3, 3],
2026-02-21T12:20:26.2804397Z  'range_unroll_factors': [4, 3],
2026-02-21T12:20:26.2804654Z  'range_warp_specializes': [],
2026-02-21T12:20:26.2804890Z  'waves_per_eu': 2}
2026-02-21T12:20:26.2833094Z [2368s] Fitting surrogate: 1031 points, 1031 targets
2026-02-21T12:20:26.7030342Z [2369s] Generation 17 starting: 35 neighbors, 2 active search path(s)
2026-02-21T12:21:00.5663761Z [2403s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[3, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:21:02.5056773Z [2405s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:21:03.9431023Z [2406s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:21:03.9456470Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 0.7 configs/s
2026-02-21T12:21:25.1470984Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 1.7 configs/s
2026-02-21T12:21:25.4945458Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:21:28.5497840Z [2431s] Generation 17 complete: 
2026-02-21T12:21:28.5501614Z error=2
2026-02-21T12:21:28.5501897Z timeout=3
2026-02-21T12:21:28.5502136Z ok=32
2026-02-21T12:21:28.5502343Z min=20.8908
2026-02-21T12:21:28.5502558Z mid=39.5303
2026-02-21T12:21:28.5502759Z max=401.4762
2026-02-21T12:21:28.5503061Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:21:28.5503487Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:21:28.5504308Z  'l2_groupings': [8],
2026-02-21T12:21:28.5504592Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:21:28.5504912Z  'loop_orders': [[0, 1]],
2026-02-21T12:21:28.5505215Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:21:28.5505496Z  'num_sm_multiplier': 32,
2026-02-21T12:21:28.5505763Z  'num_stages': 2,
2026-02-21T12:21:28.5505996Z  'num_warps': 8,
2026-02-21T12:21:28.5506264Z  'pid_type': 'persistent_blocked',
2026-02-21T12:21:28.5506619Z  'range_flattens': [None, False],
2026-02-21T12:21:28.5506914Z  'range_multi_buffers': [True, None],
2026-02-21T12:21:28.5507166Z  'range_num_stages': [3, 3],
2026-02-21T12:21:28.5507407Z  'range_unroll_factors': [4, 3],
2026-02-21T12:21:28.5507653Z  'range_warp_specializes': [],
2026-02-21T12:21:28.5507879Z  'waves_per_eu': 2}
2026-02-21T12:21:28.5538624Z [2431s] Fitting surrogate: 1068 points, 1068 targets
2026-02-21T12:21:29.0342580Z [2431s] Generation 18 starting: 36 neighbors, 2 active search path(s)
2026-02-21T12:22:01.1201730Z [2463s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[3, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:22:02.7427700Z [2465s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:22:03.8355569Z [2466s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[3, 4], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:22:03.8378664Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 0.9 configs/s
2026-02-21T12:22:30.8763307Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 1.3 configs/s
2026-02-21T12:22:31.1170837Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:22:33.2107101Z [2495s] Generation 18 complete: 
2026-02-21T12:22:33.2107482Z error=8
2026-02-21T12:22:33.2107689Z timeout=3
2026-02-21T12:22:33.2107921Z ok=27
2026-02-21T12:22:33.2108130Z min=20.9218
2026-02-21T12:22:33.2108335Z mid=44.1776
2026-02-21T12:22:33.2108568Z max=839.8673
2026-02-21T12:22:33.2108814Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:22:33.2109265Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:22:33.2109666Z  'l2_groupings': [8],
2026-02-21T12:22:33.2109950Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:22:33.2110267Z  'loop_orders': [[0, 1]],
2026-02-21T12:22:33.2110550Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:22:33.2110825Z  'num_sm_multiplier': 32,
2026-02-21T12:22:33.2111439Z  'num_stages': 3,
2026-02-21T12:22:33.2111661Z  'num_warps': 8,
2026-02-21T12:22:33.2111921Z  'pid_type': 'persistent_blocked',
2026-02-21T12:22:33.2112240Z  'range_flattens': [None, False],
2026-02-21T12:22:33.2112537Z  'range_multi_buffers': [True, None],
2026-02-21T12:22:33.2112847Z  'range_num_stages': [4, 3],
2026-02-21T12:22:33.2113133Z  'range_unroll_factors': [4, 3],
2026-02-21T12:22:33.2113438Z  'range_warp_specializes': [],
2026-02-21T12:22:33.2113724Z  'waves_per_eu': 2}
2026-02-21T12:22:33.2142528Z [2495s] Fitting surrogate: 1106 points, 1106 targets
2026-02-21T12:22:33.6610090Z [2496s] Generation 19 starting: 34 neighbors, 2 active search path(s)
2026-02-21T12:23:07.4084416Z [2529s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[4, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:23:08.3940116Z [2530s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[4, 3], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:23:09.2113004Z [2531s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[4, 3], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:23:09.2137638Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 0.9 configs/s
2026-02-21T12:23:23.9905397Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 2.3 configs/s
2026-02-21T12:23:24.2821016Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:23:26.8389744Z [2549s] Generation 19 complete: 
2026-02-21T12:23:26.8390248Z error=5
2026-02-21T12:23:26.8390463Z timeout=3
2026-02-21T12:23:26.8390668Z ok=28
2026-02-21T12:23:26.8390866Z min=20.8872
2026-02-21T12:23:26.8391081Z mid=37.1607
2026-02-21T12:23:26.8391280Z max=191.2476
2026-02-21T12:23:26.8391542Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:23:26.8391947Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:23:26.8392318Z  'l2_groupings': [8],
2026-02-21T12:23:26.8392580Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:23:26.8392893Z  'loop_orders': [[0, 1]],
2026-02-21T12:23:26.8393151Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:23:26.8393416Z  'num_sm_multiplier': 32,
2026-02-21T12:23:26.8393664Z  'num_stages': 3,
2026-02-21T12:23:26.8393877Z  'num_warps': 8,
2026-02-21T12:23:26.8394122Z  'pid_type': 'persistent_blocked',
2026-02-21T12:23:26.8394413Z  'range_flattens': [None, False],
2026-02-21T12:23:26.8394706Z  'range_multi_buffers': [True, None],
2026-02-21T12:23:26.8394994Z  'range_num_stages': [3, 3],
2026-02-21T12:23:26.8395256Z  'range_unroll_factors': [3, 3],
2026-02-21T12:23:26.8396167Z  'range_warp_specializes': [],
2026-02-21T12:23:26.8396425Z  'waves_per_eu': 2}
2026-02-21T12:23:26.8427409Z [2549s] Fitting surrogate: 1142 points, 1142 targets
2026-02-21T12:23:27.2989949Z [2549s] Generation 20 starting: 36 neighbors, 2 active search path(s)
2026-02-21T12:24:00.2943138Z [2582s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[3, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:24:04.3360787Z [2586s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:24:05.8988677Z [2588s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:24:05.9003724Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 0.5 configs/s
2026-02-21T12:24:26.3858938Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 1.8 configs/s
2026-02-21T12:24:26.7931238Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T12:24:30.4027808Z [2612s] Generation 20 complete: 
2026-02-21T12:24:30.4028274Z error=5
2026-02-21T12:24:30.4028498Z timeout=3
2026-02-21T12:24:30.4028695Z ok=30
2026-02-21T12:24:30.4028931Z min=20.9518
2026-02-21T12:24:30.4029135Z mid=22.1884
2026-02-21T12:24:30.4029443Z max=433.0034
2026-02-21T12:24:30.4029807Z best={'block_sizes': [1, 256, 64],
2026-02-21T12:24:30.4030433Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:24:30.4031005Z  'l2_groupings': [8],
2026-02-21T12:24:30.4031288Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:24:30.4031616Z  'loop_orders': [[0, 1]],
2026-02-21T12:24:30.4032038Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:24:30.4032492Z  'num_sm_multiplier': 32,
2026-02-21T12:24:30.4032919Z  'num_stages': 3,
2026-02-21T12:24:30.4033267Z  'num_warps': 8,
2026-02-21T12:24:30.4033582Z  'pid_type': 'persistent_blocked',
2026-02-21T12:24:30.4033897Z  'range_flattens': [False, False],
2026-02-21T12:24:30.4034207Z  'range_multi_buffers': [True, None],
2026-02-21T12:24:30.4034441Z  'range_num_stages': [3, 3],
2026-02-21T12:24:30.4034648Z  'range_unroll_factors': [3, 3],
2026-02-21T12:24:30.4034860Z  'range_warp_specializes': [],
2026-02-21T12:24:30.4035061Z  'waves_per_eu': 2}
2026-02-21T12:24:30.4066578Z [2612s] Fitting surrogate: 1180 points, 1180 targets
2026-02-21T12:24:30.5462549Z [2613s] Autotuning complete in 2613.1s after searching 1074 configs.
2026-02-21T12:24:30.5462791Z One can hardcode the best config and skip autotuning with:
2026-02-21T12:24:30.5463593Z     @helion.kernel(config=helion.Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[3, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T12:24:30.5464712Z 
2026-02-21T12:24:30.5464887Z [2613s] Code of selected kernel: /tmp/torchinductor_root/dw/cdwlh63qytjqmhnny24trv2tdjay2tlbktwxnqcdldr37fjxfebq.py
2026-02-21T12:24:31.6746970Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T12:24:31.6750244Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[3, 3], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T12:24:31.6754782Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T12:24:31.6755816Z WARNING:tritonbench.utils.triton_op:Completed input ID 6:
2026-02-21T12:24:31.6756269Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T12:24:31.6756619Z ------------------------------------------
2026-02-21T12:24:31.6757064Z (4, 48, 8192, 8192, 128)
2026-02-21T12:24:31.6757240Z 
2026-02-21T12:24:31.6757802Z 100%|██████████| 6/6 [1:40:05<00:00, 1492.33s/it]
2026-02-21T12:24:31.6758232Z 100%|██████████| 6/6 [1:40:05<00:00, 1000.88s/it]
2026-02-21T12:24:31.6758675Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmpli665d5a.csv
2026-02-21T12:24:34.9230162Z   (Batch, Heads, SeqLen, SeqLen_KV, Dhead)    flex_attention-speedup    flex_attention-accuracy    helion_attention-speedup    helion_attention-accuracy
2026-02-21T12:24:34.9231373Z ------------------------------------------  ------------------------  -------------------------  --------------------------  ---------------------------
2026-02-21T12:24:34.9233081Z                     (4, 48, 128, 128, 128)                   2.027                     0                            3.37355                            0
2026-02-21T12:24:34.9233786Z                     (4, 48, 256, 256, 128)                   2.18314                   0                            3.25082                            0
2026-02-21T12:24:34.9234363Z                     (4, 48, 512, 512, 128)                   2.73606                   1                            3.31177                            0
2026-02-21T12:24:34.9234955Z                   (4, 48, 2048, 2048, 128)                   3.51616                   1                            5.07911                            0
2026-02-21T12:24:34.9235550Z                   (4, 48, 4096, 4096, 128)                   3.65348                   1                            4.96449                            0
2026-02-21T12:24:34.9236140Z                   (4, 48, 8192, 8192, 128)                   3.89044                   1                            5.52917                            0
2026-02-21T12:24:34.9236737Z                                    average                   3.00105                   0.666667                     4.25149                            0
2026-02-21T12:26:53.4609172Z Applying custom args for flash_attention: {'d_head': 128, 'num_inputs': 6}
2026-02-21T12:26:53.4741760Z INFO:root:TMA benchmarks will be running without grid constant TMA descriptor.
2026-02-21T12:26:53.4763150Z TMA benchmarks will be running without grid constant TMA descriptor.
2026-02-21T12:26:53.4875354Z Running flash_attention benchmark with Helion implementation...
2026-02-21T12:26:53.4875565Z 
2026-02-21T12:26:53.6321606Z Equally-spaced-k mode: Selected 6 equally spaced inputs (total available: 7)
2026-02-21T12:26:53.6322306Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 1, 2, 4, 5, 6]
2026-02-21T12:26:53.6328527Z 
2026-02-21T12:26:53.6334149Z   0%|          | 0/6 [00:00<?, ?it/s]WARNING:tritonbench.utils.triton_op:Running input ID 0:
2026-02-21T12:26:53.6334417Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T12:26:53.6334860Z ------------------------------------------
2026-02-21T12:26:53.6335028Z (4, 48, 128, 128, 128)
2026-02-21T12:26:53.6338245Z INFO:tritonbench.utils.triton_op:Took 0.13ms to get benchmark function for aten
2026-02-21T12:26:57.3877712Z INFO:tritonbench.utils.triton_op:Took 30.03ms to get benchmark function for flex_attention
2026-02-21T12:26:59.3557235Z WARNING:__main__:Input tensor metadata:
2026-02-21T12:26:59.3557717Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T12:26:59.3558061Z               'dtype': 'torch.bfloat16',
2026-02-21T12:26:59.3558829Z               'shape': (4, 48, 128, 128),
2026-02-21T12:26:59.3559155Z               'stride': (786432, 16384, 128, 1)},
2026-02-21T12:26:59.3559491Z             { 'device': 'cuda:0',
2026-02-21T12:26:59.3559791Z               'dtype': 'torch.bfloat16',
2026-02-21T12:26:59.3560106Z               'shape': (4, 48, 128, 128),
2026-02-21T12:26:59.3560431Z               'stride': (786432, 16384, 128, 1)},
2026-02-21T12:26:59.3560753Z             { 'device': 'cuda:0',
2026-02-21T12:26:59.3561045Z               'dtype': 'torch.bfloat16',
2026-02-21T12:26:59.3561354Z               'shape': (4, 48, 128, 128),
2026-02-21T12:26:59.3561682Z               'stride': (786432, 16384, 128, 1)}),
2026-02-21T12:26:59.3562105Z   'kwargs': {}}
2026-02-21T12:26:59.3562673Z INFO:tritonbench.utils.triton_op:Took 0.50ms to get benchmark function for helion_attention
2026-02-21T12:26:59.8672043Z [0s] Autotune random seed: 2150287535
2026-02-21T12:26:59.9106776Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T12:27:12.1850085Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 2.5 configs/s
2026-02-21T12:27:17.8175962Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 18.2 configs/s
2026-02-21T12:27:17.8185256Z [17s] Adaptive compile timeout: 30s (90% percentile=3.5s, bounds=[30.0s, 60s])
2026-02-21T12:27:18.3945312Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1598.2 configs/s
2026-02-21T12:27:18.6491572Z [18s] Initial random population of 100, 5 starting points: 
2026-02-21T12:27:18.6492227Z error=10
2026-02-21T12:27:18.6492440Z ok=90
2026-02-21T12:27:18.6492647Z min=0.0201
2026-02-21T12:27:18.6492882Z mid=0.1452
2026-02-21T12:27:18.6493086Z max=4.3308
2026-02-21T12:27:18.6493313Z best={'block_sizes': [1, 128, 128],
2026-02-21T12:27:18.6493729Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T12:27:18.6494124Z  'l2_groupings': [2],
2026-02-21T12:27:18.6494411Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:27:18.6494731Z  'loop_orders': [[0, 1]],
2026-02-21T12:27:18.6495007Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:27:18.6495282Z  'num_stages': 3,
2026-02-21T12:27:18.6495509Z  'num_warps': 8,
2026-02-21T12:27:18.6495707Z  'pid_type': 'xyz',
2026-02-21T12:27:18.6495921Z  'range_flattens': [None, False],
2026-02-21T12:27:18.6496194Z  'range_multi_buffers': [None, True],
2026-02-21T12:27:18.6496449Z  'range_num_stages': [0, 0],
2026-02-21T12:27:18.6496692Z  'range_unroll_factors': [0, 1],
2026-02-21T12:27:18.6496946Z  'range_warp_specializes': [],
2026-02-21T12:27:18.6497176Z  'waves_per_eu': 4}
2026-02-21T12:27:18.6535681Z [18s] Fitting surrogate: 100 points, 100 targets
2026-02-21T12:27:19.4038960Z [19s] Generation 1 starting: 75 neighbors, 5 active search path(s)
2026-02-21T12:27:30.2869750Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 6.4 configs/s
2026-02-21T12:27:35.0659902Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 16.4 configs/s
2026-02-21T12:27:38.8850315Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 279.1 configs/s
2026-02-21T12:27:39.3847735Z [39s] Generation 1 complete: 
2026-02-21T12:27:39.3848197Z ok=80
2026-02-21T12:27:39.3848543Z min=0.0205
2026-02-21T12:27:39.3848853Z mid=0.0274
2026-02-21T12:27:39.3849106Z max=0.2548
2026-02-21T12:27:39.3849610Z best={'block_sizes': [1, 128, 128],
2026-02-21T12:27:39.3850034Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T12:27:39.3850441Z  'l2_groupings': [2],
2026-02-21T12:27:39.3850763Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:27:39.3851086Z  'loop_orders': [[0, 1]],
2026-02-21T12:27:39.3851365Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:27:39.3851640Z  'num_stages': 3,
2026-02-21T12:27:39.3851873Z  'num_warps': 8,
2026-02-21T12:27:39.3852114Z  'pid_type': 'xyz',
2026-02-21T12:27:39.3852377Z  'range_flattens': [None, False],
2026-02-21T12:27:39.3852685Z  'range_multi_buffers': [None, True],
2026-02-21T12:27:39.3852993Z  'range_num_stages': [0, 0],
2026-02-21T12:27:39.3853244Z  'range_unroll_factors': [0, 1],
2026-02-21T12:27:39.3853510Z  'range_warp_specializes': [],
2026-02-21T12:27:39.3853753Z  'waves_per_eu': 4}
2026-02-21T12:27:39.4165622Z [39s] Fitting surrogate: 180 points, 180 targets
2026-02-21T12:27:40.1320366Z [40s] Generation 2 starting: 71 neighbors, 5 active search path(s)
2026-02-21T12:27:50.7553944Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 5.5 configs/s
2026-02-21T12:27:55.6213788Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 15.4 configs/s
2026-02-21T12:27:58.7895572Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 312.4 configs/s
2026-02-21T12:27:59.2093364Z [59s] Generation 2 complete: 
2026-02-21T12:27:59.2093635Z ok=76
2026-02-21T12:27:59.2093793Z min=0.0200
2026-02-21T12:27:59.2094053Z mid=0.0271
2026-02-21T12:27:59.2094283Z max=0.2613
2026-02-21T12:27:59.2094461Z best={'block_sizes': [1, 64, 64],
2026-02-21T12:27:59.2094746Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:27:59.2095015Z  'l2_groupings': [64],
2026-02-21T12:27:59.2095499Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:27:59.2095715Z  'loop_orders': [[0, 1]],
2026-02-21T12:27:59.2095919Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:27:59.2096100Z  'num_stages': 2,
2026-02-21T12:27:59.2096262Z  'num_warps': 4,
2026-02-21T12:27:59.2096423Z  'pid_type': 'xyz',
2026-02-21T12:27:59.2096602Z  'range_flattens': [None, None],
2026-02-21T12:27:59.2096814Z  'range_multi_buffers': [None, False],
2026-02-21T12:27:59.2097022Z  'range_num_stages': [0, 1],
2026-02-21T12:27:59.2097211Z  'range_unroll_factors': [0, 3],
2026-02-21T12:27:59.2097409Z  'range_warp_specializes': [],
2026-02-21T12:27:59.2097603Z  'waves_per_eu': 2}
2026-02-21T12:27:59.2739745Z [59s] Fitting surrogate: 256 points, 256 targets
2026-02-21T12:28:00.0235810Z [60s] Generation 3 starting: 74 neighbors, 5 active search path(s)
2026-02-21T12:28:28.2612500Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 0.5 configs/s
2026-02-21T12:28:32.9071288Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 76/76 16.9 configs/s
2026-02-21T12:28:35.8545157Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 366.3 configs/s
2026-02-21T12:28:36.2708849Z [96s] Generation 3 complete: 
2026-02-21T12:28:36.2709242Z error=2
2026-02-21T12:28:36.2709479Z ok=77
2026-02-21T12:28:36.2709690Z min=0.0198
2026-02-21T12:28:36.2709898Z mid=0.0280
2026-02-21T12:28:36.2710103Z max=0.2141
2026-02-21T12:28:36.2710333Z best={'block_sizes': [1, 64, 64],
2026-02-21T12:28:36.2710739Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'pointer'],
2026-02-21T12:28:36.2711457Z  'l2_groupings': [64],
2026-02-21T12:28:36.2711737Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:28:36.2712064Z  'loop_orders': [[0, 1]],
2026-02-21T12:28:36.2712347Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:28:36.2712617Z  'num_stages': 2,
2026-02-21T12:28:36.2712853Z  'num_warps': 4,
2026-02-21T12:28:36.2713082Z  'pid_type': 'xyz',
2026-02-21T12:28:36.2713347Z  'range_flattens': [None, None],
2026-02-21T12:28:36.2713665Z  'range_multi_buffers': [None, False],
2026-02-21T12:28:36.2714144Z  'range_num_stages': [0, 2],
2026-02-21T12:28:36.2714432Z  'range_unroll_factors': [0, 3],
2026-02-21T12:28:36.2714721Z  'range_warp_specializes': [],
2026-02-21T12:28:36.2715009Z  'waves_per_eu': 2}
2026-02-21T12:28:36.3278273Z [96s] Fitting surrogate: 335 points, 335 targets
2026-02-21T12:28:37.0143447Z [97s] Generation 4 starting: 71 neighbors, 5 active search path(s)
2026-02-21T12:28:50.0792721Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 2.1 configs/s
2026-02-21T12:28:54.7179956Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 74/74 16.5 configs/s
2026-02-21T12:28:58.2352041Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 304.9 configs/s
2026-02-21T12:28:58.6824031Z [118s] Generation 4 complete: 
2026-02-21T12:28:58.6824487Z ok=76
2026-02-21T12:28:58.6824700Z min=0.0198
2026-02-21T12:28:58.6824923Z mid=0.0259
2026-02-21T12:28:58.6825147Z max=0.4225
2026-02-21T12:28:58.6825388Z best={'block_sizes': [1, 64, 64],
2026-02-21T12:28:58.6825808Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:28:58.6826779Z  'l2_groupings': [64],
2026-02-21T12:28:58.6827066Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:28:58.6827384Z  'loop_orders': [[0, 1]],
2026-02-21T12:28:58.6827672Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:28:58.6827946Z  'num_stages': 2,
2026-02-21T12:28:58.6828203Z  'num_warps': 4,
2026-02-21T12:28:58.6828436Z  'pid_type': 'xyz',
2026-02-21T12:28:58.6828699Z  'range_flattens': [None, None],
2026-02-21T12:28:58.6829011Z  'range_multi_buffers': [None, False],
2026-02-21T12:28:58.6829329Z  'range_num_stages': [0, 2],
2026-02-21T12:28:58.6829615Z  'range_unroll_factors': [0, 3],
2026-02-21T12:28:58.6829909Z  'range_warp_specializes': [],
2026-02-21T12:28:58.6830367Z  'waves_per_eu': 2}
2026-02-21T12:28:58.7544741Z [118s] Fitting surrogate: 411 points, 411 targets
2026-02-21T12:28:59.4742883Z [119s] Generation 5 starting: 68 neighbors, 5 active search path(s)
2026-02-21T12:29:09.4635743Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 4.6 configs/s
2026-02-21T12:29:13.7804397Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 16.5 configs/s
2026-02-21T12:29:16.7568320Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 333.2 configs/s
2026-02-21T12:29:17.1680959Z [137s] Generation 5 complete: 
2026-02-21T12:29:17.1681274Z ok=73
2026-02-21T12:29:17.1681487Z min=0.0195
2026-02-21T12:29:17.1681710Z mid=0.0216
2026-02-21T12:29:17.1681919Z max=0.1718
2026-02-21T12:29:17.1682145Z best={'block_sizes': [1, 64, 64],
2026-02-21T12:29:17.1682655Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:29:17.1683093Z  'l2_groupings': [64],
2026-02-21T12:29:17.1683375Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:29:17.1683710Z  'loop_orders': [[0, 1]],
2026-02-21T12:29:17.1683991Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:29:17.1684268Z  'num_stages': 2,
2026-02-21T12:29:17.1684501Z  'num_warps': 4,
2026-02-21T12:29:17.1684749Z  'pid_type': 'xyz',
2026-02-21T12:29:17.1685008Z  'range_flattens': [None, None],
2026-02-21T12:29:17.1685319Z  'range_multi_buffers': [None, False],
2026-02-21T12:29:17.1685627Z  'range_num_stages': [0, 2],
2026-02-21T12:29:17.1685911Z  'range_unroll_factors': [0, 3],
2026-02-21T12:29:17.1686544Z  'range_warp_specializes': [],
2026-02-21T12:29:17.1686819Z  'waves_per_eu': 2}
2026-02-21T12:29:17.2358270Z [137s] Fitting surrogate: 484 points, 484 targets
2026-02-21T12:29:18.1797129Z [138s] Generation 6 starting: 58 neighbors, 5 active search path(s)
2026-02-21T12:29:27.8073797Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 3.3 configs/s
2026-02-21T12:29:31.5063203Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 16.9 configs/s
2026-02-21T12:29:33.5382371Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 560.4 configs/s
2026-02-21T12:29:33.8927921Z [153s] Generation 6 complete: 
2026-02-21T12:29:33.8928291Z error=1
2026-02-21T12:29:33.8928450Z ok=62
2026-02-21T12:29:33.8928603Z min=0.0197
2026-02-21T12:29:33.8928764Z mid=0.0285
2026-02-21T12:29:33.8928913Z max=0.2056
2026-02-21T12:29:33.8929090Z best={'block_sizes': [1, 64, 64],
2026-02-21T12:29:33.8929470Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:29:33.8929942Z  'l2_groupings': [64],
2026-02-21T12:29:33.8930278Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:29:33.8930658Z  'loop_orders': [[0, 1]],
2026-02-21T12:29:33.8930992Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:29:33.8931303Z  'num_stages': 2,
2026-02-21T12:29:33.8931692Z  'num_warps': 4,
2026-02-21T12:29:33.8931870Z  'pid_type': 'xyz',
2026-02-21T12:29:33.8932066Z  'range_flattens': [None, None],
2026-02-21T12:29:33.8932304Z  'range_multi_buffers': [None, False],
2026-02-21T12:29:33.8932550Z  'range_num_stages': [0, 1],
2026-02-21T12:29:33.8932761Z  'range_unroll_factors': [0, 3],
2026-02-21T12:29:33.8932986Z  'range_warp_specializes': [],
2026-02-21T12:29:33.8933197Z  'waves_per_eu': 2}
2026-02-21T12:29:33.9307861Z [154s] Fitting surrogate: 547 points, 547 targets
2026-02-21T12:29:34.6464366Z [154s] Generation 7 starting: 59 neighbors, 5 active search path(s)
2026-02-21T12:30:02.1595330Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 0.5 configs/s
2026-02-21T12:30:05.8205468Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 17.4 configs/s
2026-02-21T12:30:08.5030901Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 414.6 configs/s
2026-02-21T12:30:08.9135637Z [189s] Generation 7 complete: 
2026-02-21T12:30:08.9136113Z error=4
2026-02-21T12:30:08.9136319Z ok=60
2026-02-21T12:30:08.9136526Z min=0.0198
2026-02-21T12:30:08.9136733Z mid=0.0250
2026-02-21T12:30:08.9136937Z max=0.2548
2026-02-21T12:30:08.9137166Z best={'block_sizes': [1, 64, 64],
2026-02-21T12:30:08.9137603Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:30:08.9138017Z  'l2_groupings': [64],
2026-02-21T12:30:08.9138301Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:30:08.9138629Z  'loop_orders': [[0, 1]],
2026-02-21T12:30:08.9138926Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:30:08.9139200Z  'num_stages': 2,
2026-02-21T12:30:08.9139431Z  'num_warps': 4,
2026-02-21T12:30:08.9139668Z  'pid_type': 'xyz',
2026-02-21T12:30:08.9139920Z  'range_flattens': [None, None],
2026-02-21T12:30:08.9140236Z  'range_multi_buffers': [None, False],
2026-02-21T12:30:08.9140545Z  'range_num_stages': [0, 1],
2026-02-21T12:30:08.9140836Z  'range_unroll_factors': [0, 3],
2026-02-21T12:30:08.9141130Z  'range_warp_specializes': [],
2026-02-21T12:30:08.9141927Z  'waves_per_eu': 2}
2026-02-21T12:30:08.9746918Z [189s] Fitting surrogate: 611 points, 611 targets
2026-02-21T12:30:09.5601849Z [189s] Generation 8 starting: 50 neighbors, 4 active search path(s)
2026-02-21T12:30:17.1986769Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 4.6 configs/s
2026-02-21T12:30:20.5764031Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 16.1 configs/s
2026-02-21T12:30:22.5413866Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 501.7 configs/s
2026-02-21T12:30:22.9149097Z [203s] Generation 8 complete: 
2026-02-21T12:30:22.9149437Z ok=55
2026-02-21T12:30:22.9149677Z min=0.0198
2026-02-21T12:30:22.9149899Z mid=0.0236
2026-02-21T12:30:22.9152913Z max=0.1164
2026-02-21T12:30:22.9153246Z best={'block_sizes': [1, 128, 128],
2026-02-21T12:30:22.9153691Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:30:22.9154158Z  'l2_groupings': [8],
2026-02-21T12:30:22.9154413Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:30:22.9154716Z  'loop_orders': [[0, 1]],
2026-02-21T12:30:22.9154973Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:30:22.9155436Z  'num_stages': 1,
2026-02-21T12:30:22.9155660Z  'num_warps': 8,
2026-02-21T12:30:22.9155878Z  'pid_type': 'xyz',
2026-02-21T12:30:22.9156116Z  'range_flattens': [None, None],
2026-02-21T12:30:22.9156400Z  'range_multi_buffers': [None, False],
2026-02-21T12:30:22.9156704Z  'range_num_stages': [0, 3],
2026-02-21T12:30:22.9156961Z  'range_unroll_factors': [0, 2],
2026-02-21T12:30:22.9157244Z  'range_warp_specializes': [],
2026-02-21T12:30:22.9157495Z  'waves_per_eu': 1}
2026-02-21T12:30:22.9617183Z [203s] Fitting surrogate: 666 points, 666 targets
2026-02-21T12:30:23.4325363Z [203s] Generation 9 starting: 39 neighbors, 3 active search path(s)
2026-02-21T12:30:31.2480828Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 1.9 configs/s
2026-02-21T12:30:33.7990220Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 40/40 16.6 configs/s
2026-02-21T12:30:35.2877211Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 657.8 configs/s
2026-02-21T12:30:35.6177878Z [215s] Generation 9 complete:
2026-02-21T12:30:35.6178598Z error=1
2026-02-21T12:30:35.6178769Z ok=41
2026-02-21T12:30:35.6178931Z min=0.0192
2026-02-21T12:30:35.6179119Z mid=0.0218
2026-02-21T12:30:35.6179279Z max=0.0605
2026-02-21T12:30:35.6179464Z best={'block_sizes': [1, 64, 128],
2026-02-21T12:30:35.6179783Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T12:30:35.6180105Z  'l2_groupings': [4],
2026-02-21T12:30:35.6180327Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:30:35.6180588Z  'loop_orders': [[0, 1]],
2026-02-21T12:30:35.6180830Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:30:35.6181043Z  'num_stages': 3,
2026-02-21T12:30:35.6181246Z  'num_warps': 4,
2026-02-21T12:30:35.6181432Z  'pid_type': 'xyz',
2026-02-21T12:30:35.6181641Z  'range_flattens': [None, True],
2026-02-21T12:30:35.6181883Z  'range_multi_buffers': [None, True],
2026-02-21T12:30:35.6182137Z  'range_num_stages': [0, 2],
2026-02-21T12:30:35.6182355Z  'range_unroll_factors': [0, 4],
2026-02-21T12:30:35.6182581Z  'range_warp_specializes': [],
2026-02-21T12:30:35.6182789Z  'waves_per_eu': 2}
2026-02-21T12:30:35.6519261Z [215s] Fitting surrogate: 708 points, 708 targets
2026-02-21T12:30:36.1007994Z [216s] Generation 10 starting: 39 neighbors, 3 active search path(s)
2026-02-21T12:30:42.4787759Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 3.4 configs/s
2026-02-21T12:30:45.0451680Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 39/39 16.1 configs/s
2026-02-21T12:30:47.2334239Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 527.1 configs/s
2026-02-21T12:30:47.6236727Z [227s] Generation 10 complete: 
2026-02-21T12:30:47.6236952Z ok=42
2026-02-21T12:30:47.6237035Z min=0.0196
2026-02-21T12:30:47.6237120Z mid=0.0205
2026-02-21T12:30:47.6237234Z max=0.0956
2026-02-21T12:30:47.6237322Z best={'block_sizes': [1, 128, 128],
2026-02-21T12:30:47.6237485Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:30:47.6237640Z  'l2_groupings': [8],
2026-02-21T12:30:47.6237746Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:30:47.6237875Z  'loop_orders': [[0, 1]],
2026-02-21T12:30:47.6237982Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:30:47.6238082Z  'num_stages': 1,
2026-02-21T12:30:47.6238193Z  'num_warps': 8,
2026-02-21T12:30:47.6238280Z  'pid_type': 'xyz',
2026-02-21T12:30:47.6238377Z  'range_flattens': [None, None],
2026-02-21T12:30:47.6238507Z  'range_multi_buffers': [None, False],
2026-02-21T12:30:47.6238621Z  'range_num_stages': [0, 1],
2026-02-21T12:30:47.6238728Z  'range_unroll_factors': [0, 2],
2026-02-21T12:30:47.6238849Z  'range_warp_specializes': [],
2026-02-21T12:30:47.6238952Z  'waves_per_eu': 1}
2026-02-21T12:30:47.6718962Z [227s] Fitting surrogate: 750 points, 750 targets
2026-02-21T12:30:48.1492024Z [228s] Generation 11 starting: 38 neighbors, 3 active search path(s)
2026-02-21T12:31:05.4532306Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 0.5 configs/s
2026-02-21T12:31:07.8876222Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 38/38 16.6 configs/s
2026-02-21T12:31:09.5908743Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 577.1 configs/s
2026-02-21T12:31:09.9365708Z [250s] Generation 11 complete: 
2026-02-21T12:31:09.9366154Z error=1
2026-02-21T12:31:09.9366354Z ok=40
2026-02-21T12:31:09.9366551Z min=0.0195
2026-02-21T12:31:09.9367346Z mid=0.0207
2026-02-21T12:31:09.9367536Z max=0.4359
2026-02-21T12:31:09.9367754Z best={'block_sizes': [1, 64, 128],
2026-02-21T12:31:09.9368182Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:31:09.9368565Z  'l2_groupings': [8],
2026-02-21T12:31:09.9368846Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:31:09.9369146Z  'loop_orders': [[0, 1]],
2026-02-21T12:31:09.9369417Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:31:09.9369677Z  'num_stages': 3,
2026-02-21T12:31:09.9369890Z  'num_warps': 4,
2026-02-21T12:31:09.9370111Z  'pid_type': 'xyz',
2026-02-21T12:31:09.9370365Z  'range_flattens': [None, True],
2026-02-21T12:31:09.9370655Z  'range_multi_buffers': [None, True],
2026-02-21T12:31:09.9370949Z  'range_num_stages': [0, 3],
2026-02-21T12:31:09.9371212Z  'range_unroll_factors': [0, 4],
2026-02-21T12:31:09.9371483Z  'range_warp_specializes': [],
2026-02-21T12:31:09.9371746Z  'waves_per_eu': 4}
2026-02-21T12:31:09.9766266Z [250s] Fitting surrogate: 791 points, 791 targets
2026-02-21T12:31:10.6799886Z [250s] Generation 12 starting: 28 neighbors, 2 active search path(s)
2026-02-21T12:31:17.2497991Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 1.3 configs/s
2026-02-21T12:31:19.0403482Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 28/28 17.0 configs/s
2026-02-21T12:31:20.1764232Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 857.9 configs/s
2026-02-21T12:31:20.4997221Z [260s] Generation 12 complete: 
2026-02-21T12:31:20.4997711Z error=1
2026-02-21T12:31:20.4997860Z ok=29
2026-02-21T12:31:20.4998010Z min=0.0194
2026-02-21T12:31:20.4998159Z mid=0.0203
2026-02-21T12:31:20.4998311Z max=0.1006
2026-02-21T12:31:20.4998469Z best={'block_sizes': [1, 64, 128],
2026-02-21T12:31:20.4998759Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T12:31:20.4999041Z  'l2_groupings': [8],
2026-02-21T12:31:20.4999248Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:31:20.4999476Z  'loop_orders': [[0, 1]],
2026-02-21T12:31:20.4999816Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:31:20.5000013Z  'num_stages': 3,
2026-02-21T12:31:20.5000178Z  'num_warps': 4,
2026-02-21T12:31:20.5000372Z  'pid_type': 'flat',
2026-02-21T12:31:20.5000568Z  'range_flattens': [None, True],
2026-02-21T12:31:20.5000785Z  'range_multi_buffers': [None, True],
2026-02-21T12:31:20.5001004Z  'range_num_stages': [0, 2],
2026-02-21T12:31:20.5001208Z  'range_unroll_factors': [0, 4],
2026-02-21T12:31:20.5001424Z  'range_warp_specializes': [],
2026-02-21T12:31:20.5001625Z  'waves_per_eu': 2}
2026-02-21T12:31:20.5279359Z [260s] Fitting surrogate: 821 points, 821 targets
2026-02-21T12:31:20.8866411Z [260s] Generation 13 starting: 28 neighbors, 2 active search path(s)
2026-02-21T12:31:25.6337699Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 3.7 configs/s
2026-02-21T12:31:27.4029482Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 28/28 16.0 configs/s
2026-02-21T12:31:28.4697305Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 912.9 configs/s
2026-02-21T12:31:28.7859575Z [268s] Generation 13 complete: 
2026-02-21T12:31:28.7860302Z error=1
2026-02-21T12:31:28.7860511Z ok=29
2026-02-21T12:31:28.7860711Z min=0.0193
2026-02-21T12:31:28.7860924Z mid=0.0203
2026-02-21T12:31:28.7861121Z max=0.1724
2026-02-21T12:31:28.7861351Z best={'block_sizes': [1, 64, 128],
2026-02-21T12:31:28.7861772Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T12:31:28.7862174Z  'l2_groupings': [8],
2026-02-21T12:31:28.7862459Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:31:28.7862777Z  'loop_orders': [[0, 1]],
2026-02-21T12:31:28.7863058Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:31:28.7863323Z  'num_stages': 3,
2026-02-21T12:31:28.7863556Z  'num_warps': 4,
2026-02-21T12:31:28.7863938Z  'pid_type': 'flat',
2026-02-21T12:31:28.7864230Z  'range_flattens': [None, True],
2026-02-21T12:31:28.7864438Z  'range_multi_buffers': [None, True],
2026-02-21T12:31:28.7864626Z  'range_num_stages': [0, 2],
2026-02-21T12:31:28.7864792Z  'range_unroll_factors': [0, 4],
2026-02-21T12:31:28.7864972Z  'range_warp_specializes': [],
2026-02-21T12:31:28.7865147Z  'waves_per_eu': 2}
2026-02-21T12:31:28.8104600Z [268s] Fitting surrogate: 851 points, 851 targets
2026-02-21T12:31:29.3821238Z [269s] Generation 14 starting: 14 neighbors, 1 active search path(s)
2026-02-21T12:31:32.1287920Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 4.8 configs/s
2026-02-21T12:31:33.0902106Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 17.1 configs/s
2026-02-21T12:31:33.6034987Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1900.2 configs/s
2026-02-21T12:31:33.8772688Z [273s] Generation 14 complete: 
2026-02-21T12:31:33.8773049Z ok=16
2026-02-21T12:31:33.8773265Z min=0.0193
2026-02-21T12:31:33.8773881Z mid=0.0199
2026-02-21T12:31:33.8774091Z max=0.1733
2026-02-21T12:31:33.8774322Z best={'block_sizes': [1, 64, 128],
2026-02-21T12:31:33.8774747Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T12:31:33.8775145Z  'l2_groupings': [8],
2026-02-21T12:31:33.8775421Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:31:33.8775741Z  'loop_orders': [[0, 1]],
2026-02-21T12:31:33.8776017Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:31:33.8776411Z  'num_stages': 3,
2026-02-21T12:31:33.8776640Z  'num_warps': 4,
2026-02-21T12:31:33.8776875Z  'pid_type': 'flat',
2026-02-21T12:31:33.8777139Z  'range_flattens': [None, True],
2026-02-21T12:31:33.8777442Z  'range_multi_buffers': [None, True],
2026-02-21T12:31:33.8777751Z  'range_num_stages': [0, 2],
2026-02-21T12:31:33.8778033Z  'range_unroll_factors': [0, 4],
2026-02-21T12:31:33.8778336Z  'range_warp_specializes': [],
2026-02-21T12:31:33.8778618Z  'waves_per_eu': 2}
2026-02-21T12:31:33.8906749Z [273s] Fitting surrogate: 867 points, 867 targets
2026-02-21T12:31:33.9975666Z [274s] Autotuning complete in 274.1s after searching 828 configs.
2026-02-21T12:31:33.9975912Z One can hardcode the best config and skip autotuning with:
2026-02-21T12:31:33.9976823Z     @helion.kernel(config=helion.Config(block_sizes=[1, 64, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T12:31:33.9977463Z 
2026-02-21T12:31:33.9977632Z [274s] Code of selected kernel: /tmp/torchinductor_root/m5/cm5gfiotqfrfv4rcdgan2kcbhl65ognsotfsvupfbhmyivl3cca4.py
2026-02-21T12:31:34.0202740Z from __future__ import annotations
2026-02-21T12:31:34.0202987Z 
2026-02-21T12:31:34.0203053Z import torch
2026-02-21T12:31:34.0203198Z import triton
2026-02-21T12:31:34.0203333Z import triton.language as tl
2026-02-21T12:31:34.0203532Z from torch._inductor.runtime import triton_helpers
2026-02-21T12:31:34.0203807Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T12:31:34.0204087Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T12:31:34.0204286Z 
2026-02-21T12:31:34.0204350Z _BLOCK_SIZE_1 = tl.constexpr(64)
2026-02-21T12:31:34.0204522Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T12:31:34.0204699Z _BLOCK_SIZE_3 = tl.constexpr(128)
2026-02-21T12:31:34.0204869Z _SHAPE_DIM_6 = tl.constexpr(128)
2026-02-21T12:31:34.0204975Z 
2026-02-21T12:31:34.0205028Z @triton.jit
2026-02-21T12:31:34.0205247Z def _helion_attention(q_view, k_view, v_view, out, _RDIM_SIZE_2: tl.constexpr):
2026-02-21T12:31:34.0205901Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T12:31:34.0206147Z     num_pid_m = 192
2026-02-21T12:31:34.0206291Z     num_pid_n = tl.cdiv(128, _BLOCK_SIZE_1)
2026-02-21T12:31:34.0206484Z     inner_2d_pid = tl.program_id(0)
2026-02-21T12:31:34.0206656Z     num_pid_in_group = 8 * num_pid_n
2026-02-21T12:31:34.0206846Z     group_id = inner_2d_pid // num_pid_in_group
2026-02-21T12:31:34.0207045Z     first_pid_m = group_id * 8
2026-02-21T12:31:34.0207285Z     group_size_m = min(num_pid_m - first_pid_m, 8)
2026-02-21T12:31:34.0207540Z     pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m
2026-02-21T12:31:34.0207816Z     pid_1 = inner_2d_pid % num_pid_in_group // group_size_m
2026-02-21T12:31:34.0208021Z     offset_0 = pid_0
2026-02-21T12:31:34.0208181Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T12:31:34.0208376Z     offset_1 = pid_1 * _BLOCK_SIZE_1
2026-02-21T12:31:34.0208594Z     indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32)
2026-02-21T12:31:34.0208847Z     indices_4 = tl.arange(0, _RDIM_SIZE_2).to(tl.int32)
2026-02-21T12:31:34.0209147Z     # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T12:31:34.0209617Z     m_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], float('-inf'), tl.float32)
2026-02-21T12:31:34.0209885Z     # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0)
2026-02-21T12:31:34.0210136Z     l_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 1.0, tl.float32)
2026-02-21T12:31:34.0210436Z     # src[attention.py:70]: acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32)
2026-02-21T12:31:34.0210754Z     acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128], 0.0, tl.float32)
2026-02-21T12:31:34.0211124Z     # src[attention.py:71]: q = q_view[tile_b, tile_m, :]
2026-02-21T12:31:34.0211482Z     q = tl.load(q_view + (indices_0[:, None, None] * 16384 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), None)
2026-02-21T12:31:34.0211864Z     # src[attention.py:72]: for tile_n in hl.tile(v_view.size(1)):
2026-02-21T12:31:34.0212119Z     # src[attention.py:73]:     k = k_view[tile_b, :, tile_n]
2026-02-21T12:31:34.0212345Z     # src[attention.py:74]:     qk = torch.bmm(q, k)
2026-02-21T12:31:34.0212540Z     # src[attention.py:72-85]: ...
2026-02-21T12:31:34.0212987Z     for offset_2 in tl.range(0, 128, _BLOCK_SIZE_3, loop_unroll_factor=4, num_stages=1, disallow_acc_multi_buffer=False, flatten=True):
2026-02-21T12:31:34.0213400Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_3).to(tl.int32)
2026-02-21T12:31:34.0213610Z         q_copy = q
2026-02-21T12:31:34.0213744Z         m_i_copy = m_i
2026-02-21T12:31:34.0213880Z         l_i_copy = l_i
2026-02-21T12:31:34.0214019Z         acc_copy = acc
2026-02-21T12:31:34.0214163Z         q_copy_0 = q_copy
2026-02-21T12:31:34.0214310Z         m_i_copy_0 = m_i_copy
2026-02-21T12:31:34.0214460Z         l_i_copy_0 = l_i_copy
2026-02-21T12:31:34.0214608Z         acc_copy_0 = acc_copy
2026-02-21T12:31:34.0214814Z         # src[attention.py:73]: k = k_view[tile_b, :, tile_n]
2026-02-21T12:31:34.0215110Z         k = tl.load(k_view + (indices_0[:, None, None] * 16384 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None)
2026-02-21T12:31:34.0215386Z         # src[attention.py:74]: qk = torch.bmm(q, k)
2026-02-21T12:31:34.0215869Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T12:31:34.0216498Z         # src[attention.py:75]: m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale)
2026-02-21T12:31:34.0216718Z         amax = tl.cast(tl.max(qk, 2), tl.bfloat16)
2026-02-21T12:31:34.0216855Z         v_0 = 0.12751743074602467
2026-02-21T12:31:34.0216995Z         v_1 = tl.cast(amax * v_0, tl.bfloat16)
2026-02-21T12:31:34.0217136Z         v_2 = tl.cast(v_1, tl.float32)
2026-02-21T12:31:34.0217287Z         v_3 = triton_helpers.maximum(m_i_copy_0, v_2)
2026-02-21T12:31:34.0217491Z         # src[attention.py:76]: qk = qk * qk_scale - m_ij[:, :, None]
2026-02-21T12:31:34.0217657Z         v_4 = 0.12751743074602467
2026-02-21T12:31:34.0217781Z         v_5 = tl.cast(qk * v_4, tl.bfloat16)
2026-02-21T12:31:34.0217920Z         subscript = v_3[:, :, None]
2026-02-21T12:31:34.0218055Z         v_6 = tl.cast(v_5, tl.float32)
2026-02-21T12:31:34.0218178Z         v_7 = v_6 - subscript
2026-02-21T12:31:34.0218313Z         # src[attention.py:77]: p = torch.exp2(qk)
2026-02-21T12:31:34.0218454Z         v_8 = libdevice.exp2(v_7)
2026-02-21T12:31:34.0218602Z         # src[attention.py:78]: l_ij = torch.sum(p, -1)
2026-02-21T12:31:34.0218771Z         l_ij = tl.cast(tl.sum(v_8, 2), tl.float32)
2026-02-21T12:31:34.0218938Z         # src[attention.py:79]: alpha = torch.exp2(m_i - m_ij)
2026-02-21T12:31:34.0219099Z         v_9 = m_i_copy_0 - v_3
2026-02-21T12:31:34.0219221Z         v_10 = libdevice.exp2(v_9)
2026-02-21T12:31:34.0219369Z         # src[attention.py:80]: l_i = l_i * alpha + l_ij
2026-02-21T12:31:34.0219517Z         v_11 = l_i_copy_0 * v_10
2026-02-21T12:31:34.0219643Z         l_i = v_11 + l_ij
2026-02-21T12:31:34.0219805Z         # src[attention.py:81]: acc = acc * alpha[:, :, None]
2026-02-21T12:31:34.0219963Z         subscript_1 = v_10[:, :, None]
2026-02-21T12:31:34.0220102Z         v_13 = acc_copy_0 * subscript_1
2026-02-21T12:31:34.0220258Z         # src[attention.py:82]: v = v_view[tile_b, tile_n, :]
2026-02-21T12:31:34.0220649Z         v = tl.load(tl.make_block_ptr(v_view, [192, 128, 128], [16384, 128, 1], [offset_0, offset_2, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_3, _SHAPE_DIM_6], [2, 1, 0]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T12:31:34.0221041Z         # src[attention.py:83]: p = p.to(v.dtype)
2026-02-21T12:31:34.0221192Z         v_14 = tl.cast(v_8, tl.bfloat16)
2026-02-21T12:31:34.0221356Z         # src[attention.py:84]: acc = torch.baddbmm(acc, p, v)
2026-02-21T12:31:34.0221882Z         acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128])
2026-02-21T12:31:34.0222389Z         # src[attention.py:85]: m_i = m_ij
2026-02-21T12:31:34.0222520Z         m_i = v_3
2026-02-21T12:31:34.0222663Z     # src[attention.py:87]: acc = acc / l_i[:, :, None]
2026-02-21T12:31:34.0222819Z     subscript_2 = l_i[:, :, None]
2026-02-21T12:31:34.0222942Z     v_15 = acc / subscript_2
2026-02-21T12:31:34.0223103Z     # src[attention.py:88]: out[tile_b, tile_m, :] = acc.to(out.dtype)
2026-02-21T12:31:34.0223285Z     v_16 = tl.cast(v_15, tl.bfloat16)
2026-02-21T12:31:34.0223539Z     tl.store(out + (indices_0[:, None, None] * 16384 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), v_16, None)
2026-02-21T12:31:34.0223749Z 
2026-02-21T12:31:34.0223949Z def attention(q_in: torch.Tensor, k_in: torch.Tensor, v_in: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T12:31:34.0224187Z     """
2026-02-21T12:31:34.0224292Z     Computes scaled dot-product attention.
2026-02-21T12:31:34.0224413Z 
2026-02-21T12:31:34.0224543Z     Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
2026-02-21T12:31:34.0224724Z 
2026-02-21T12:31:34.0224763Z     Args:
2026-02-21T12:31:34.0224874Z         q_in: Query tensor of shape [..., seq_len_q, head_dim]
2026-02-21T12:31:34.0225032Z         k_in: Key tensor of shape [..., seq_len_k, head_dim]
2026-02-21T12:31:34.0225200Z         v_in: Value tensor of shape [..., seq_len_k, head_dim]
2026-02-21T12:31:34.0225299Z 
2026-02-21T12:31:34.0225332Z     Returns:
2026-02-21T12:31:34.0225437Z         Output tensor of shape [..., seq_len_q, head_dim]
2026-02-21T12:31:34.0225558Z     """
2026-02-21T12:31:34.0225652Z     # src[attention.py:56]: m_dim = q_in.size(-2)
2026-02-21T12:31:34.0225774Z     m_dim = q_in.size(-2)
2026-02-21T12:31:34.0225902Z     # src[attention.py:57]: n_dim = k_in.size(-2)
2026-02-21T12:31:34.0226022Z     n_dim = k_in.size(-2)
2026-02-21T12:31:34.0226139Z     # src[attention.py:58]: assert n_dim == v_in.size(-2)
2026-02-21T12:31:34.0226274Z     assert n_dim == v_in.size(-2)
2026-02-21T12:31:34.0226414Z     # src[attention.py:59]: head_dim = hl.specialize(q_in.size(-1))
2026-02-21T12:31:34.0226559Z     head_dim = 128
2026-02-21T12:31:34.0226687Z     # src[attention.py:60]: assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T12:31:34.0226862Z     assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T12:31:34.0227025Z     # src[attention.py:61]: q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T12:31:34.0227188Z     q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T12:31:34.0227343Z     # src[attention.py:62]: v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T12:31:34.0227499Z     v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T12:31:34.0227672Z     # src[attention.py:63]: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T12:31:34.0227894Z     k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T12:31:34.0228084Z     # src[attention.py:64]: out = torch.empty_like(q_view)
2026-02-21T12:31:34.0228225Z     out = torch.empty_like(q_view)
2026-02-21T12:31:34.0228391Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T12:31:34.0228561Z     _BLOCK_SIZE_1 = 64
2026-02-21T12:31:34.0228657Z     _RDIM_SIZE_2 = 128
2026-02-21T12:31:34.0228804Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T12:31:34.0229032Z     # src[attention.py:68]:     m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T12:31:34.0229313Z     # src[attention.py:69]:     l_i = torch.full_like(m_i, 1.0)
2026-02-21T12:31:34.0229458Z     # src[attention.py:67-88]: ...
2026-02-21T12:31:34.0229768Z     _launcher(_helion_attention, (192 * triton.cdiv(128, _BLOCK_SIZE_1),), q_view, k_view, v_view, out, _RDIM_SIZE_2, num_warps=4, num_stages=3, waves_per_eu=2, matrix_instr_nonkdim=16)
2026-02-21T12:31:34.0230091Z     # src[attention.py:89]: return out.view(q_in.size())
2026-02-21T12:31:34.0230233Z     return out.view(q_in.size())
2026-02-21T12:31:34.5741484Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T12:31:34.5743403Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 64, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T12:31:34.5744970Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T12:31:34.5745336Z WARNING:tritonbench.utils.triton_op:Completed input ID 0:
2026-02-21T12:31:34.5745768Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T12:31:34.5746031Z ------------------------------------------
2026-02-21T12:31:34.5746265Z (4, 48, 128, 128, 128)
2026-02-21T12:31:34.5746401Z 
2026-02-21T12:31:34.5748672Z  17%|█▋        | 1/6 [04:40<23:24, 280.94s/it]
2026-02-21T12:31:34.5748672Z WARNING:tritonbench.utils.triton_op:Running input ID 1:
2026-02-21T12:31:34.5749087Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T12:31:34.5749352Z ------------------------------------------
2026-02-21T12:31:34.5749588Z (4, 48, 256, 256, 128)
2026-02-21T12:31:34.5749906Z INFO:tritonbench.utils.triton_op:Took 0.06ms to get benchmark function for aten
2026-02-21T12:31:35.8469022Z INFO:tritonbench.utils.triton_op:Took 2.08ms to get benchmark function for flex_attention
2026-02-21T12:31:37.2290782Z WARNING:__main__:Input tensor metadata:
2026-02-21T12:31:37.2291191Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T12:31:37.2291529Z               'dtype': 'torch.bfloat16',
2026-02-21T12:31:37.2292248Z               'shape': (4, 48, 256, 128),
2026-02-21T12:31:37.2292586Z               'stride': (1572864, 32768, 128, 1)},
2026-02-21T12:31:37.2292939Z             { 'device': 'cuda:0',
2026-02-21T12:31:37.2293241Z               'dtype': 'torch.bfloat16',
2026-02-21T12:31:37.2293547Z               'shape': (4, 48, 256, 128),
2026-02-21T12:31:37.2293872Z               'stride': (1572864, 32768, 128, 1)},
2026-02-21T12:31:37.2294194Z             { 'device': 'cuda:0',
2026-02-21T12:31:37.2294480Z               'dtype': 'torch.bfloat16',
2026-02-21T12:31:37.2294790Z               'shape': (4, 48, 256, 128),
2026-02-21T12:31:37.2295115Z               'stride': (1572864, 32768, 128, 1)}),
2026-02-21T12:31:37.2295435Z   'kwargs': {}}
2026-02-21T12:31:37.2312542Z INFO:tritonbench.utils.triton_op:Took 2.59ms to get benchmark function for helion_attention
2026-02-21T12:31:37.4796140Z [0s] Autotune random seed: 2150287535
2026-02-21T12:31:37.5084998Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T12:32:00.6852044Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.1 configs/s
2026-02-21T12:32:02.9942443Z /tmp/torchinductor_root/4e/c4eorrzxwkgp4flh757wxxie4yjblurji4dyrobpe2unl45lzbz2.py:49:20: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T12:32:02.9944008Z         k = tl.load(tl.make_block_ptr(k_view, [192, 128, 256], [32768, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T12:32:02.9945242Z                    ^
2026-02-21T12:32:02.9946965Z /tmp/torchinductor_root/4e/c4eorrzxwkgp4flh757wxxie4yjblurji4dyrobpe2unl45lzbz2.py:51:141: note: - use: %140 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x16xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 4], order = [1, 0, 2]}>>) -> tensor<128x16xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 4], order = [0, 1]}>>
2026-02-21T12:32:02.9948624Z 
2026-02-21T12:32:02.9949675Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T12:32:02.9950956Z                                                                                                                                             ^
2026-02-21T12:32:02.9951473Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T12:32:02.9952231Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 4, 1], order = [2, 1, 0]}>
2026-02-21T12:32:02.9953030Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:02.9953767Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}>
2026-02-21T12:32:02.9954351Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T12:32:02.9954949Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [2, 1, 0]}>
2026-02-21T12:32:02.9955536Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:02.9956122Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:02.9956717Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 2, 16], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:02.9957298Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T12:32:02.9957864Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T12:32:02.9958411Z #blocked10 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [4], order = [0]}>
2026-02-21T12:32:02.9958957Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [0, 1]}>
2026-02-21T12:32:02.9959551Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [32, 2, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:32:02.9960139Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T12:32:02.9960716Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [0, 1, 2]}>
2026-02-21T12:32:02.9961309Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [0, 1, 2]}>
2026-02-21T12:32:02.9961900Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 1, 16], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:32:02.9962707Z #blocked17 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [4, 16, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:32:02.9963349Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T12:32:02.9964112Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T12:32:02.9964699Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T12:32:02.9964877Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T12:32:02.9965042Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T12:32:02.9965213Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T12:32:02.9965439Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x128x16xbf16, #blocked>
2026-02-21T12:32:02.9965723Z     %cst_0 = arith.constant dense<256> : tensor<1x1x16xi64, #blocked1>
2026-02-21T12:32:02.9965970Z     %cst_1 = arith.constant dense<0> : tensor<1x1x16xi64, #blocked1>
2026-02-21T12:32:02.9966249Z     %cst_2 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2>
2026-02-21T12:32:02.9966502Z     %cst_3 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2>
2026-02-21T12:32:02.9966744Z     %cst_4 = arith.constant dense<128> : tensor<1x1x16xi64, #blocked1>
2026-02-21T12:32:02.9967012Z     %cst_5 = arith.constant dense<0.000000e+00> : tensor<1x2x128xbf16, #blocked3>
2026-02-21T12:32:02.9967280Z     %cst_6 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked4>
2026-02-21T12:32:02.9967527Z     %cst_7 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked4>
2026-02-21T12:32:02.9967792Z     %cst_8 = arith.constant dense<256> : tensor<1x2x1xi64, #blocked5>
2026-02-21T12:32:02.9968039Z     %cst_9 = arith.constant dense<0> : tensor<1x2x1xi64, #blocked5>
2026-02-21T12:32:02.9968285Z     %cst_10 = arith.constant dense<128> : tensor<1x2x1xi64, #blocked5>
2026-02-21T12:32:02.9968491Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T12:32:02.9968658Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T12:32:02.9968859Z     %cst_11 = arith.constant dense<128> : tensor<1x2x1xi32, #blocked5>
2026-02-21T12:32:02.9969108Z     %cst_12 = arith.constant dense<128> : tensor<1x16x1xi32, #blocked6>
2026-02-21T12:32:02.9969367Z     %cst_13 = arith.constant dense<0.127517432> : tensor<1x2x16xf32, #blocked7>
2026-02-21T12:32:02.9969646Z     %cst_14 = arith.constant dense<0.127517432> : tensor<1x2xf32, #blocked8>
2026-02-21T12:32:02.9969925Z     %cst_15 = arith.constant dense<0.000000e+00> : tensor<2x16xf32, #blocked9>
2026-02-21T12:32:02.9970147Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T12:32:02.9970366Z     %cst_16 = arith.constant dense<0.000000e+00> : tensor<1x2x128xf32, #blocked3>
2026-02-21T12:32:02.9970639Z     %cst_17 = arith.constant dense<1.000000e+00> : tensor<1x2xf32, #blocked8>
2026-02-21T12:32:02.9970911Z     %cst_18 = arith.constant dense<0xFF800000> : tensor<1x2xf32, #blocked8>
2026-02-21T12:32:02.9971124Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T12:32:02.9971290Z     %0 = tt.get_program_id x : i32
2026-02-21T12:32:02.9971448Z     %1 = tt.get_program_id y : i32
2026-02-21T12:32:02.9971600Z     %2 = arith.muli %1, %c2_i32 : i32
2026-02-21T12:32:02.9971821Z     %3 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked10>
2026-02-21T12:32:02.9972071Z     %4 = tt.splat %2 : i32 -> tensor<2xi32, #blocked10>
2026-02-21T12:32:02.9972273Z     %5 = arith.addi %4, %3 : tensor<2xi32, #blocked10>
2026-02-21T12:32:02.9972510Z     %6 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked10>
2026-02-21T12:32:02.9972744Z     %7 = arith.extsi %0 : i32 to i64
2026-02-21T12:32:02.9972902Z     %8 = arith.extsi %2 : i32 to i64
2026-02-21T12:32:02.9973115Z     %9 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T12:32:02.9973373Z     %10 = arith.muli %7, %c32768_i64 : i64
2026-02-21T12:32:02.9973565Z     %11 = tt.splat %10 : i64 -> tensor<1x2x128xi64, #blocked3>
2026-02-21T12:32:02.9973784Z     %12 = tt.splat %8 : i64 -> tensor<2xi64, #blocked10>
2026-02-21T12:32:02.9974019Z     %13 = arith.extsi %3 : tensor<2xi32, #blocked10> to tensor<2xi64, #blocked10>
2026-02-21T12:32:02.9974269Z     %14 = arith.addi %12, %13 : tensor<2xi64, #blocked10>
2026-02-21T12:32:02.9974555Z     %15 = ttg.convert_layout %14 : tensor<2xi64, #blocked10> -> tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked11}>>
2026-02-21T12:32:02.9974910Z     %16 = tt.expand_dims %15 {axis = 0 : i32} : tensor<2xi64, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x2xi64, #blocked11>
2026-02-21T12:32:02.9975220Z     %17 = ttg.convert_layout %16 : tensor<1x2xi64, #blocked11> -> tensor<1x2xi64, #blocked8>
2026-02-21T12:32:02.9975522Z     %18 = ttg.convert_layout %17 : tensor<1x2xi64, #blocked8> -> tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked12}>>
2026-02-21T12:32:02.9975885Z     %19 = tt.expand_dims %18 {axis = 2 : i32} : tensor<1x2xi64, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x2x1xi64, #blocked12>
2026-02-21T12:32:02.9976226Z     %20 = ttg.convert_layout %19 : tensor<1x2x1xi64, #blocked12> -> tensor<1x2x1xi64, #blocked5>
2026-02-21T12:32:02.9976452Z     %21 = arith.muli %20, %cst_10 : tensor<1x2x1xi64, #blocked5>
2026-02-21T12:32:02.9976667Z     %22 = tt.broadcast %21 : tensor<1x2x1xi64, #blocked5> -> tensor<1x2x128xi64, #blocked5>
2026-02-21T12:32:02.9976929Z     %23 = ttg.convert_layout %22 : tensor<1x2x128xi64, #blocked5> -> tensor<1x2x128xi64, #blocked3>
2026-02-21T12:32:02.9977188Z     %24 = arith.extsi %6 : tensor<128xi32, #blocked10> to tensor<128xi64, #blocked10>
2026-02-21T12:32:02.9977503Z     %25 = ttg.convert_layout %24 : tensor<128xi64, #blocked10> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked11}>>
2026-02-21T12:32:02.9977859Z     %26 = tt.expand_dims %25 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x128xi64, #blocked11>
2026-02-21T12:32:02.9978180Z     %27 = ttg.convert_layout %26 : tensor<1x128xi64, #blocked11> -> tensor<1x128xi64, #blocked13>
2026-02-21T12:32:02.9978495Z     %28 = ttg.convert_layout %27 : tensor<1x128xi64, #blocked13> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked14}>>
2026-02-21T12:32:02.9978872Z     %29 = tt.expand_dims %28 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked14}>> -> tensor<1x1x128xi64, #blocked14>
2026-02-21T12:32:02.9979204Z     %30 = ttg.convert_layout %29 : tensor<1x1x128xi64, #blocked14> -> tensor<1x1x128xi64, #blocked4>
2026-02-21T12:32:02.9979469Z     %31 = tt.broadcast %30 : tensor<1x1x128xi64, #blocked4> -> tensor<1x2x128xi64, #blocked4>
2026-02-21T12:32:02.9979734Z     %32 = ttg.convert_layout %31 : tensor<1x2x128xi64, #blocked4> -> tensor<1x2x128xi64, #blocked3>
2026-02-21T12:32:02.9979960Z     %33 = arith.addi %23, %32 : tensor<1x2x128xi64, #blocked3>
2026-02-21T12:32:02.9980135Z     %34 = arith.addi %11, %33 : tensor<1x2x128xi64, #blocked3>
2026-02-21T12:32:02.9980364Z     %35 = tt.addptr %9, %34 : tensor<1x2x128x!tt.ptr<bf16>, #blocked3>, tensor<1x2x128xi64, #blocked3>
2026-02-21T12:32:02.9980574Z     %36 = arith.cmpi sge, %7, %c0_i64 : i64
2026-02-21T12:32:02.9980714Z     %37 = arith.cmpi slt, %7, %c192_i64 : i64
2026-02-21T12:32:02.9980846Z     %38 = arith.andi %36, %37 : i1
2026-02-21T12:32:02.9981003Z     %39 = arith.cmpi sge, %20, %cst_9 : tensor<1x2x1xi64, #blocked5>
2026-02-21T12:32:02.9981190Z     %40 = arith.cmpi slt, %20, %cst_8 : tensor<1x2x1xi64, #blocked5>
2026-02-21T12:32:02.9981370Z     %41 = arith.andi %39, %40 : tensor<1x2x1xi1, #blocked5>
2026-02-21T12:32:02.9981540Z     %42 = tt.splat %38 : i1 -> tensor<1x2x1xi1, #blocked5>
2026-02-21T12:32:02.9981702Z     %43 = arith.andi %42, %41 : tensor<1x2x1xi1, #blocked5>
2026-02-21T12:32:02.9981907Z     %44 = tt.broadcast %43 : tensor<1x2x1xi1, #blocked5> -> tensor<1x2x128xi1, #blocked5>
2026-02-21T12:32:02.9982205Z     %45 = ttg.convert_layout %44 : tensor<1x2x128xi1, #blocked5> -> tensor<1x2x128xi1, #blocked3>
2026-02-21T12:32:02.9982440Z     %46 = arith.cmpi sge, %30, %cst_7 : tensor<1x1x128xi64, #blocked4>
2026-02-21T12:32:02.9982633Z     %47 = arith.cmpi slt, %30, %cst_6 : tensor<1x1x128xi64, #blocked4>
2026-02-21T12:32:02.9982811Z     %48 = arith.andi %46, %47 : tensor<1x1x128xi1, #blocked4>
2026-02-21T12:32:02.9983023Z     %49 = tt.broadcast %48 : tensor<1x1x128xi1, #blocked4> -> tensor<1x2x128xi1, #blocked4>
2026-02-21T12:32:02.9983280Z     %50 = ttg.convert_layout %49 : tensor<1x2x128xi1, #blocked4> -> tensor<1x2x128xi1, #blocked3>
2026-02-21T12:32:02.9983487Z     %51 = arith.andi %45, %50 : tensor<1x2x128xi1, #blocked3>
2026-02-21T12:32:02.9983654Z     %52 = tt.load %35, %51, %cst_5 : tensor<1x2x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T12:32:02.9983855Z     %53 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked10>
2026-02-21T12:32:02.9984072Z     %54 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x16x!tt.ptr<bf16>, #blocked>
2026-02-21T12:32:02.9984260Z     %55 = tt.splat %10 : i64 -> tensor<1x128x16xi64, #blocked>
2026-02-21T12:32:02.9984527Z     %56 = ttg.convert_layout %27 : tensor<1x128xi64, #blocked13> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked15}>>
2026-02-21T12:32:02.9984870Z     %57 = tt.expand_dims %56 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x128x1xi64, #blocked15>
2026-02-21T12:32:02.9985179Z     %58 = ttg.convert_layout %57 : tensor<1x128x1xi64, #blocked15> -> tensor<1x128x1xi64, #blocked2>
2026-02-21T12:32:02.9985430Z     %59 = tt.broadcast %58 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x16xi64, #blocked2>
2026-02-21T12:32:02.9985674Z     %60 = ttg.convert_layout %59 : tensor<1x128x16xi64, #blocked2> -> tensor<1x128x16xi64, #blocked>
2026-02-21T12:32:02.9985925Z     %61 = arith.extsi %53 : tensor<16xi32, #blocked10> to tensor<16xi64, #blocked10>
2026-02-21T12:32:02.9986120Z     %62 = arith.cmpi sge, %58, %cst_3 : tensor<1x128x1xi64, #blocked2>
2026-02-21T12:32:02.9986300Z     %63 = arith.cmpi slt, %58, %cst_2 : tensor<1x128x1xi64, #blocked2>
2026-02-21T12:32:02.9986472Z     %64 = arith.andi %62, %63 : tensor<1x128x1xi1, #blocked2>
2026-02-21T12:32:02.9986626Z     %65 = tt.splat %38 : i1 -> tensor<1x128x1xi1, #blocked2>
2026-02-21T12:32:02.9986784Z     %66 = arith.andi %65, %64 : tensor<1x128x1xi1, #blocked2>
2026-02-21T12:32:02.9986973Z     %67 = tt.broadcast %66 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x16xi1, #blocked2>
2026-02-21T12:32:02.9987220Z     %68 = ttg.convert_layout %67 : tensor<1x128x16xi1, #blocked2> -> tensor<1x128x16xi1, #blocked>
2026-02-21T12:32:02.9987463Z     %69 = tt.reshape %52 : tensor<1x2x128xbf16, #blocked3> -> tensor<2x128xbf16, #blocked13>
2026-02-21T12:32:02.9987709Z     %70 = arith.muli %0, %c32768_i32 : i32
2026-02-21T12:32:02.9987859Z     %71 = tt.splat %70 : i32 -> tensor<1x16x1xi32, #blocked6>
2026-02-21T12:32:02.9988097Z     %72 = ttg.convert_layout %6 : tensor<128xi32, #blocked10> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked11}>>
2026-02-21T12:32:02.9988435Z     %73 = tt.expand_dims %72 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x128xi32, #blocked11>
2026-02-21T12:32:02.9988734Z     %74 = ttg.convert_layout %73 : tensor<1x128xi32, #blocked11> -> tensor<1x128xi32, #blocked13>
2026-02-21T12:32:02.9989024Z     %75 = ttg.convert_layout %74 : tensor<1x128xi32, #blocked13> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked14}>>
2026-02-21T12:32:02.9989371Z     %76 = tt.expand_dims %75 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked14}>> -> tensor<1x1x128xi32, #blocked14>
2026-02-21T12:32:02.9989674Z     %77 = ttg.convert_layout %76 : tensor<1x1x128xi32, #blocked14> -> tensor<1x1x128xi32, #blocked4>
2026-02-21T12:32:02.9989928Z     %78 = tt.broadcast %77 : tensor<1x1x128xi32, #blocked4> -> tensor<1x16x128xi32, #blocked4>
2026-02-21T12:32:02.9998448Z     %79 = ttg.convert_layout %78 : tensor<1x16x128xi32, #blocked4> -> tensor<1x16x128xi32, #blocked3>
2026-02-21T12:32:02.9998745Z     %80 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x16x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T12:32:02.9999148Z     %81:3 = scf.for %arg4 = %c0_i32 to %c256_i32 step %c16_i32 iter_args(%arg5 = %cst_18, %arg6 = %cst_17, %arg7 = %cst_16) -> (tensor<1x2xf32, #blocked8>, tensor<1x2xf32, #blocked8>, tensor<1x2x128xf32, #blocked3>)  : i32 {
2026-02-21T12:32:02.9999515Z       %112 = tt.splat %arg4 : i32 -> tensor<16xi32, #blocked10>
2026-02-21T12:32:02.9999705Z       %113 = arith.addi %112, %53 : tensor<16xi32, #blocked10>
2026-02-21T12:32:02.9999857Z       %114 = arith.extsi %arg4 : i32 to i64
2026-02-21T12:32:02.9999998Z       %115 = tt.splat %114 : i64 -> tensor<16xi64, #blocked10>
2026-02-21T12:32:03.0000159Z       %116 = arith.addi %115, %61 : tensor<16xi64, #blocked10>
2026-02-21T12:32:03.0000410Z       %117 = ttg.convert_layout %116 : tensor<16xi64, #blocked10> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked11}>>
2026-02-21T12:32:03.0000762Z       %118 = tt.expand_dims %117 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x16xi64, #blocked11>
2026-02-21T12:32:03.0001087Z       %119 = ttg.convert_layout %118 : tensor<1x16xi64, #blocked11> -> tensor<1x16xi64, #blocked9>
2026-02-21T12:32:03.0001382Z       %120 = ttg.convert_layout %119 : tensor<1x16xi64, #blocked9> -> tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked16}>>
2026-02-21T12:32:03.0001738Z       %121 = tt.expand_dims %120 {axis = 1 : i32} : tensor<1x16xi64, #ttg.slice<{dim = 1, parent = #blocked16}>> -> tensor<1x1x16xi64, #blocked16>
2026-02-21T12:32:03.0002049Z       %122 = ttg.convert_layout %121 : tensor<1x1x16xi64, #blocked16> -> tensor<1x1x16xi64, #blocked1>
2026-02-21T12:32:03.0002271Z       %123 = arith.muli %122, %cst_4 : tensor<1x1x16xi64, #blocked1>
2026-02-21T12:32:03.0002505Z       %124 = tt.broadcast %123 : tensor<1x1x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked1>
2026-02-21T12:32:03.0002813Z       %125 = ttg.convert_layout %124 : tensor<1x128x16xi64, #blocked1> -> tensor<1x128x16xi64, #blocked>
2026-02-21T12:32:03.0003036Z       %126 = arith.addi %60, %125 : tensor<1x128x16xi64, #blocked>
2026-02-21T12:32:03.0003202Z       %127 = arith.addi %55, %126 : tensor<1x128x16xi64, #blocked>
2026-02-21T12:32:03.0003423Z       %128 = tt.addptr %54, %127 : tensor<1x128x16x!tt.ptr<bf16>, #blocked>, tensor<1x128x16xi64, #blocked>
2026-02-21T12:32:03.0003681Z       %129 = arith.cmpi sge, %122, %cst_1 : tensor<1x1x16xi64, #blocked1>
2026-02-21T12:32:03.0003869Z       %130 = arith.cmpi slt, %122, %cst_0 : tensor<1x1x16xi64, #blocked1>
2026-02-21T12:32:03.0004045Z       %131 = arith.andi %129, %130 : tensor<1x1x16xi1, #blocked1>
2026-02-21T12:32:03.0004253Z       %132 = tt.broadcast %131 : tensor<1x1x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked1>
2026-02-21T12:32:03.0004508Z       %133 = ttg.convert_layout %132 : tensor<1x128x16xi1, #blocked1> -> tensor<1x128x16xi1, #blocked>
2026-02-21T12:32:03.0004722Z       %134 = arith.andi %68, %133 : tensor<1x128x16xi1, #blocked>
2026-02-21T12:32:03.0004903Z       %135 = tt.load %128, %134, %cst : tensor<1x128x16x!tt.ptr<bf16>, #blocked>
2026-02-21T12:32:03.0005125Z       %136 = tt.reshape %135 : tensor<1x128x16xbf16, #blocked> -> tensor<128x16xbf16, #blocked9>
2026-02-21T12:32:03.0005433Z       %137 = ttg.convert_layout %69 : tensor<2x128xbf16, #blocked13> -> tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>>
2026-02-21T12:32:03.0005794Z       %138 = ttg.convert_layout %136 : tensor<128x16xbf16, #blocked9> -> tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>>
2026-02-21T12:32:03.0006104Z       %139 = ttg.convert_layout %cst_15 : tensor<2x16xf32, #blocked9> -> tensor<2x16xf32, #blocked9>
2026-02-21T12:32:03.0006526Z       %140 = tt.dot %137, %138, %139, inputPrecision = tf32 : tensor<2x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked9}>> * tensor<128x16xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked9}>> -> tensor<2x16xf32, #blocked9>
2026-02-21T12:32:03.0006950Z       %141 = tt.reshape %140 : tensor<2x16xf32, #blocked9> -> tensor<1x2x16xf32, #blocked7>
2026-02-21T12:32:03.0007197Z       %142 = arith.truncf %141 : tensor<1x2x16xf32, #blocked7> to tensor<1x2x16xbf16, #blocked7>
2026-02-21T12:32:03.0007440Z       %143 = arith.extf %142 : tensor<1x2x16xbf16, #blocked7> to tensor<1x2x16xf32, #blocked7>
2026-02-21T12:32:03.0007631Z       %144 = "tt.reduce"(%143) <{axis = 2 : i32}> ({
2026-02-21T12:32:03.0007787Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T12:32:03.0007910Z         %199 = arith.maxnumf %arg8, %arg9 : f32
2026-02-21T12:32:03.0008039Z         tt.reduce.return %199 : f32
2026-02-21T12:32:03.0008228Z       }) : (tensor<1x2x16xf32, #blocked7>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:32:03.0008527Z       %145 = ttg.convert_layout %144 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0008807Z       %146 = arith.truncf %145 : tensor<1x2xf32, #blocked8> to tensor<1x2xbf16, #blocked8>
2026-02-21T12:32:03.0009034Z       %147 = arith.extf %146 : tensor<1x2xbf16, #blocked8> to tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0009251Z       %148 = arith.mulf %147, %cst_14 : tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0009445Z       %149 = arith.truncf %148 : tensor<1x2xf32, #blocked8> to tensor<1x2xbf16, #blocked8>
2026-02-21T12:32:03.0009671Z       %150 = arith.extf %149 : tensor<1x2xbf16, #blocked8> to tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0009877Z       %151 = arith.cmpf ogt, %arg5, %150 : tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0010051Z       %152 = arith.cmpf une, %arg5, %arg5 : tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0010222Z       %153 = arith.ori %151, %152 : tensor<1x2xi1, #blocked8>
2026-02-21T12:32:03.0010438Z       %154 = arith.select %153, %arg5, %150 : tensor<1x2xi1, #blocked8>, tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0010653Z       %155 = arith.mulf %143, %cst_13 : tensor<1x2x16xf32, #blocked7>
2026-02-21T12:32:03.0010863Z       %156 = arith.truncf %155 : tensor<1x2x16xf32, #blocked7> to tensor<1x2x16xbf16, #blocked7>
2026-02-21T12:32:03.0011157Z       %157 = ttg.convert_layout %154 : tensor<1x2xf32, #blocked8> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>>
2026-02-21T12:32:03.0011504Z       %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x2x1xf32, #blocked12>
2026-02-21T12:32:03.0011810Z       %159 = ttg.convert_layout %158 : tensor<1x2x1xf32, #blocked12> -> tensor<1x2x1xf32, #blocked5>
2026-02-21T12:32:03.0012060Z       %160 = arith.extf %156 : tensor<1x2x16xbf16, #blocked7> to tensor<1x2x16xf32, #blocked7>
2026-02-21T12:32:03.0012300Z       %161 = tt.broadcast %159 : tensor<1x2x1xf32, #blocked5> -> tensor<1x2x16xf32, #blocked5>
2026-02-21T12:32:03.0012546Z       %162 = ttg.convert_layout %161 : tensor<1x2x16xf32, #blocked5> -> tensor<1x2x16xf32, #blocked7>
2026-02-21T12:32:03.0012762Z       %163 = arith.subf %160, %162 : tensor<1x2x16xf32, #blocked7>
2026-02-21T12:32:03.0013069Z       %164 = tt.extern_elementwise %163 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2x16xf32, #blocked7>) -> tensor<1x2x16xf32, #blocked7>
2026-02-21T12:32:03.0013363Z       %165 = "tt.reduce"(%164) <{axis = 2 : i32}> ({
2026-02-21T12:32:03.0013497Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T12:32:03.0013617Z         %199 = arith.addf %arg8, %arg9 : f32
2026-02-21T12:32:03.0013743Z         tt.reduce.return %199 : f32
2026-02-21T12:32:03.0013929Z       }) : (tensor<1x2x16xf32, #blocked7>) -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:32:03.0014224Z       %166 = ttg.convert_layout %165 : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0014472Z       %167 = arith.subf %arg5, %154 : tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0014766Z       %168 = tt.extern_elementwise %167 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x2xf32, #blocked8>) -> tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0015077Z       %169 = arith.mulf %arg6, %168 : tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0015240Z       %170 = arith.addf %169, %166 : tensor<1x2xf32, #blocked8>
2026-02-21T12:32:03.0015492Z       %171 = ttg.convert_layout %168 : tensor<1x2xf32, #blocked8> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>>
2026-02-21T12:32:03.0015836Z       %172 = tt.expand_dims %171 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x2x1xf32, #blocked12>
2026-02-21T12:32:03.0016157Z       %173 = ttg.convert_layout %172 : tensor<1x2x1xf32, #blocked12> -> tensor<1x2x1xf32, #blocked5>
2026-02-21T12:32:03.0016410Z       %174 = tt.broadcast %173 : tensor<1x2x1xf32, #blocked5> -> tensor<1x2x128xf32, #blocked5>
2026-02-21T12:32:03.0016661Z       %175 = ttg.convert_layout %174 : tensor<1x2x128xf32, #blocked5> -> tensor<1x2x128xf32, #blocked3>
2026-02-21T12:32:03.0016883Z       %176 = arith.mulf %arg7, %175 : tensor<1x2x128xf32, #blocked3>
2026-02-21T12:32:03.0017145Z       %177 = ttg.convert_layout %113 : tensor<16xi32, #blocked10> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked11}>>
2026-02-21T12:32:03.0017483Z       %178 = tt.expand_dims %177 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x16xi32, #blocked11>
2026-02-21T12:32:03.0017782Z       %179 = ttg.convert_layout %178 : tensor<1x16xi32, #blocked11> -> tensor<1x16xi32, #blocked9>
2026-02-21T12:32:03.0018074Z       %180 = ttg.convert_layout %179 : tensor<1x16xi32, #blocked9> -> tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked17}>>
2026-02-21T12:32:03.0018424Z       %181 = tt.expand_dims %180 {axis = 2 : i32} : tensor<1x16xi32, #ttg.slice<{dim = 2, parent = #blocked17}>> -> tensor<1x16x1xi32, #blocked17>
2026-02-21T12:32:03.0018752Z       %182 = ttg.convert_layout %181 : tensor<1x16x1xi32, #blocked17> -> tensor<1x16x1xi32, #blocked6>
2026-02-21T12:32:03.0018971Z       %183 = arith.muli %182, %cst_12 : tensor<1x16x1xi32, #blocked6>
2026-02-21T12:32:03.0019142Z       %184 = arith.addi %71, %183 : tensor<1x16x1xi32, #blocked6>
2026-02-21T12:32:03.0019344Z       %185 = tt.broadcast %184 : tensor<1x16x1xi32, #blocked6> -> tensor<1x16x128xi32, #blocked6>
2026-02-21T12:32:03.0019603Z       %186 = ttg.convert_layout %185 : tensor<1x16x128xi32, #blocked6> -> tensor<1x16x128xi32, #blocked3>
2026-02-21T12:32:03.0019821Z       %187 = arith.addi %186, %79 : tensor<1x16x128xi32, #blocked3>
2026-02-21T12:32:03.0020048Z       %188 = tt.addptr %80, %187 : tensor<1x16x128x!tt.ptr<bf16>, #blocked3>, tensor<1x16x128xi32, #blocked3>
2026-02-21T12:32:03.0020274Z       %189 = tt.load %188 : tensor<1x16x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T12:32:03.0020481Z       %190 = arith.truncf %164 : tensor<1x2x16xf32, #blocked7> to tensor<1x2x16xbf16, #blocked7>
2026-02-21T12:32:03.0020729Z       %191 = tt.reshape %176 : tensor<1x2x128xf32, #blocked3> -> tensor<2x128xf32, #blocked13>
2026-02-21T12:32:03.0020965Z       %192 = tt.reshape %190 : tensor<1x2x16xbf16, #blocked7> -> tensor<2x16xbf16, #blocked9>
2026-02-21T12:32:03.0021217Z       %193 = tt.reshape %189 : tensor<1x16x128xbf16, #blocked3> -> tensor<16x128xbf16, #blocked13>
2026-02-21T12:32:03.0021527Z       %194 = ttg.convert_layout %192 : tensor<2x16xbf16, #blocked9> -> tensor<2x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked13}>>
2026-02-21T12:32:03.0021889Z       %195 = ttg.convert_layout %193 : tensor<16x128xbf16, #blocked13> -> tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked13}>>
2026-02-21T12:32:03.0022205Z       %196 = ttg.convert_layout %191 : tensor<2x128xf32, #blocked13> -> tensor<2x128xf32, #blocked13>
2026-02-21T12:32:03.0022623Z       %197 = tt.dot %194, %195, %196, inputPrecision = tf32 : tensor<2x16xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked13}>> * tensor<16x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked13}>> -> tensor<2x128xf32, #blocked13>
2026-02-21T12:32:03.0023028Z       %198 = tt.reshape %197 : tensor<2x128xf32, #blocked13> -> tensor<1x2x128xf32, #blocked3>
2026-02-21T12:32:03.0023322Z       scf.yield %154, %170, %198 : tensor<1x2xf32, #blocked8>, tensor<1x2xf32, #blocked8>, tensor<1x2x128xf32, #blocked3>
2026-02-21T12:32:03.0023539Z     } {tt.num_stages = 4 : i32}
2026-02-21T12:32:03.0023759Z     %82 = ttg.convert_layout %81#1 : tensor<1x2xf32, #blocked8> -> tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>>
2026-02-21T12:32:03.0024099Z     %83 = tt.expand_dims %82 {axis = 2 : i32} : tensor<1x2xf32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x2x1xf32, #blocked12>
2026-02-21T12:32:03.0029531Z     %84 = ttg.convert_layout %83 : tensor<1x2x1xf32, #blocked12> -> tensor<1x2x1xf32, #blocked5>
2026-02-21T12:32:03.0029777Z     %85 = tt.broadcast %84 : tensor<1x2x1xf32, #blocked5> -> tensor<1x2x128xf32, #blocked5>
2026-02-21T12:32:03.0030022Z     %86 = ttg.convert_layout %85 : tensor<1x2x128xf32, #blocked5> -> tensor<1x2x128xf32, #blocked3>
2026-02-21T12:32:03.0030241Z     %87 = arith.divf %81#2, %86 : tensor<1x2x128xf32, #blocked3>
2026-02-21T12:32:03.0030446Z     %88 = arith.truncf %87 : tensor<1x2x128xf32, #blocked3> to tensor<1x2x128xbf16, #blocked3>
2026-02-21T12:32:03.0030666Z     %89 = arith.muli %0, %c32768_i32 : i32
2026-02-21T12:32:03.0030888Z     %90 = ttg.convert_layout %5 : tensor<2xi32, #blocked10> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked11}>>
2026-02-21T12:32:03.0031208Z     %91 = tt.expand_dims %90 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x2xi32, #blocked11>
2026-02-21T12:32:03.0031498Z     %92 = ttg.convert_layout %91 : tensor<1x2xi32, #blocked11> -> tensor<1x2xi32, #blocked8>
2026-02-21T12:32:03.0031780Z     %93 = ttg.convert_layout %92 : tensor<1x2xi32, #blocked8> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked12}>>
2026-02-21T12:32:03.0032130Z     %94 = tt.expand_dims %93 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x2x1xi32, #blocked12>
2026-02-21T12:32:03.0032430Z     %95 = ttg.convert_layout %94 : tensor<1x2x1xi32, #blocked12> -> tensor<1x2x1xi32, #blocked5>
2026-02-21T12:32:03.0032639Z     %96 = arith.muli %95, %cst_11 : tensor<1x2x1xi32, #blocked5>
2026-02-21T12:32:03.0032805Z     %97 = tt.splat %89 : i32 -> tensor<1x2x1xi32, #blocked5>
2026-02-21T12:32:03.0032959Z     %98 = arith.addi %97, %96 : tensor<1x2x1xi32, #blocked5>
2026-02-21T12:32:03.0033202Z     %99 = ttg.convert_layout %6 : tensor<128xi32, #blocked10> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked11}>>
2026-02-21T12:32:03.0033557Z     %100 = tt.expand_dims %99 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked11}>> -> tensor<1x128xi32, #blocked11>
2026-02-21T12:32:03.0033862Z     %101 = ttg.convert_layout %100 : tensor<1x128xi32, #blocked11> -> tensor<1x128xi32, #blocked13>
2026-02-21T12:32:03.0034171Z     %102 = ttg.convert_layout %101 : tensor<1x128xi32, #blocked13> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked14}>>
2026-02-21T12:32:03.0034528Z     %103 = tt.expand_dims %102 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked14}>> -> tensor<1x1x128xi32, #blocked14>
2026-02-21T12:32:03.0034844Z     %104 = ttg.convert_layout %103 : tensor<1x1x128xi32, #blocked14> -> tensor<1x1x128xi32, #blocked4>
2026-02-21T12:32:03.0035097Z     %105 = tt.broadcast %98 : tensor<1x2x1xi32, #blocked5> -> tensor<1x2x128xi32, #blocked5>
2026-02-21T12:32:03.0035344Z     %106 = ttg.convert_layout %105 : tensor<1x2x128xi32, #blocked5> -> tensor<1x2x128xi32, #blocked3>
2026-02-21T12:32:03.0035601Z     %107 = tt.broadcast %104 : tensor<1x1x128xi32, #blocked4> -> tensor<1x2x128xi32, #blocked4>
2026-02-21T12:32:03.0035855Z     %108 = ttg.convert_layout %107 : tensor<1x2x128xi32, #blocked4> -> tensor<1x2x128xi32, #blocked3>
2026-02-21T12:32:03.0036072Z     %109 = arith.addi %106, %108 : tensor<1x2x128xi32, #blocked3>
2026-02-21T12:32:03.0036268Z     %110 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x2x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T12:32:03.0036523Z     %111 = tt.addptr %110, %109 : tensor<1x2x128x!tt.ptr<bf16>, #blocked3>, tensor<1x2x128xi32, #blocked3>
2026-02-21T12:32:03.0036745Z     tt.store %111, %88 : tensor<1x2x128x!tt.ptr<bf16>, #blocked3>
2026-02-21T12:32:03.0036883Z     tt.return
2026-02-21T12:32:03.0036969Z   }
2026-02-21T12:32:03.0037053Z }
2026-02-21T12:32:03.0037099Z 
2026-02-21T12:32:03.0037131Z {-#
2026-02-21T12:32:03.0037218Z   external_resources: {
2026-02-21T12:32:03.0037321Z     mlir_reproducer: {
2026-02-21T12:32:03.0039564Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T12:32:03.0041896Z       disable_threading: false,
2026-02-21T12:32:03.0042009Z       verify_each: true
2026-02-21T12:32:03.0042101Z     }
2026-02-21T12:32:03.0042184Z   }
2026-02-21T12:32:03.0042254Z #-}
2026-02-21T12:32:03.0042540Z /tmp/torchinductor_root/4e/c4eorrzxwkgp4flh757wxxie4yjblurji4dyrobpe2unl45lzbz2.py:17:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T12:32:03.0043297Z /tmp/torchinductor_root/4e/c4eorrzxwkgp4flh757wxxie4yjblurji4dyrobpe2unl45lzbz2.py:17:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T12:32:03.0043857Z [25s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T12:32:03.0044595Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 2, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='xyz', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T12:32:03.0045264Z Error: RuntimeError: PassManager::run failed
2026-02-21T12:32:03.0045438Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T12:32:06.7100839Z /tmp/torchinductor_root/l7/cl7esyxzbgkcnqcndexrw6mheqxzagayrpobtydvrxos4vagr6b7.py:66:133: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T12:32:06.7102139Z             k = tl.load(k_view + (indices_0[:, None, None] * 32768 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None)
2026-02-21T12:32:06.7102887Z                                                                                                                                     ^
2026-02-21T12:32:06.7104815Z /tmp/torchinductor_root/l7/cl7esyxzbgkcnqcndexrw6mheqxzagayrpobtydvrxos4vagr6b7.py:68:145: note: - use: %118 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x64xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 4], order = [1, 0, 2]}>>) -> tensor<128x64xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 4], order = [0, 1]}>>
2026-02-21T12:32:06.7107361Z 
2026-02-21T12:32:06.7108303Z             qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T12:32:06.7109728Z                                                                                                                                                 ^
2026-02-21T12:32:06.7110239Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T12:32:06.7110983Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 2, 2], order = [2, 1, 0]}>
2026-02-21T12:32:06.7111789Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:06.7112459Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [2, 1, 0]}>
2026-02-21T12:32:06.7113020Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:06.7113577Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T12:32:06.7114126Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 1], order = [1, 0]}>
2026-02-21T12:32:06.7114659Z #blocked6 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [4], order = [0]}>
2026-02-21T12:32:06.7115310Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 4], order = [0, 1]}>
2026-02-21T12:32:06.7115837Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [2, 2], order = [1, 0]}>
2026-02-21T12:32:06.7116400Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [2, 1, 2], order = [0, 1, 2]}>
2026-02-21T12:32:06.7116980Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [0, 1, 2]}>
2026-02-21T12:32:06.7117555Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [2, 2, 1], order = [2, 1, 0]}>
2026-02-21T12:32:06.7118129Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 1], order = [2, 1, 0]}>
2026-02-21T12:32:06.7118731Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:32:06.7119308Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:32:06.7119874Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [4, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:06.7120451Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:32:06.7121071Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T12:32:06.7121907Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T12:32:06.7122445Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T12:32:06.7122695Z     %c256_i64 = arith.constant 256 : i64
2026-02-21T12:32:06.7122861Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T12:32:06.7123059Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T12:32:06.7123211Z     %c128_i64 = arith.constant 128 : i64
2026-02-21T12:32:06.7123373Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T12:32:06.7123598Z     %cst = arith.constant dense<0.000000e+00> : tensor<1x64x128xbf16, #blocked>
2026-02-21T12:32:06.7123868Z     %cst_0 = arith.constant dense<256> : tensor<1x64x1xi64, #blocked1>
2026-02-21T12:32:06.7124108Z     %cst_1 = arith.constant dense<0> : tensor<1x64x1xi64, #blocked1>
2026-02-21T12:32:06.7124376Z     %cst_2 = arith.constant dense<128> : tensor<1x64x1xi64, #blocked1>
2026-02-21T12:32:06.7124637Z     %cst_3 = arith.constant dense<0.000000e+00> : tensor<1x1x128xbf16, #blocked2>
2026-02-21T12:32:06.7124898Z     %cst_4 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked2>
2026-02-21T12:32:06.7125143Z     %cst_5 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked2>
2026-02-21T12:32:06.7125344Z     %c12288_i32 = arith.constant 12288 : i32
2026-02-21T12:32:06.7125509Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T12:32:06.7125718Z     %cst_6 = arith.constant dense<0.127517432> : tensor<1x1x64xf32, #blocked3>
2026-02-21T12:32:06.7126007Z     %cst_7 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7126269Z     %cst_8 = arith.constant dense<0.000000e+00> : tensor<1x64xf32, #blocked5>
2026-02-21T12:32:06.7126519Z     %cst_9 = arith.constant dense<128> : tensor<1x1x64xi32, #blocked3>
2026-02-21T12:32:06.7126721Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T12:32:06.7126929Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked2>
2026-02-21T12:32:06.7127200Z     %cst_11 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7127460Z     %cst_12 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7127700Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T12:32:06.7127858Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T12:32:06.7128016Z     %c49152_i32 = arith.constant 49152 : i32
2026-02-21T12:32:06.7128181Z     %c162_i32 = arith.constant 162 : i32
2026-02-21T12:32:06.7128339Z     %0 = tt.get_program_id x : i32
2026-02-21T12:32:06.7128495Z     %1 = arith.muli %0, %c162_i32 : i32
2026-02-21T12:32:06.7128646Z     %2 = arith.addi %1, %c162_i32 : i32
2026-02-21T12:32:06.7128803Z     %3 = arith.minsi %2, %c49152_i32 : i32
2026-02-21T12:32:06.7129023Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked6>
2026-02-21T12:32:06.7129307Z     %5 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T12:32:06.7129588Z     %6 = arith.extsi %4 : tensor<128xi32, #blocked6> to tensor<128xi64, #blocked6>
2026-02-21T12:32:06.7129944Z     %7 = ttg.convert_layout %6 : tensor<128xi64, #blocked6> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T12:32:06.7130389Z     %8 = tt.expand_dims %7 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi64, #blocked7>
2026-02-21T12:32:06.7130775Z     %9 = ttg.convert_layout %8 : tensor<1x128xi64, #blocked7> -> tensor<1x128xi64, #blocked8>
2026-02-21T12:32:06.7131160Z     %10 = ttg.convert_layout %9 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T12:32:06.7131619Z     %11 = tt.expand_dims %10 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi64, #blocked9>
2026-02-21T12:32:06.7132026Z     %12 = ttg.convert_layout %11 : tensor<1x1x128xi64, #blocked9> -> tensor<1x1x128xi64, #blocked2>
2026-02-21T12:32:06.7132323Z     %13 = arith.cmpi sge, %12, %cst_5 : tensor<1x1x128xi64, #blocked2>
2026-02-21T12:32:06.7132558Z     %14 = arith.cmpi slt, %12, %cst_4 : tensor<1x1x128xi64, #blocked2>
2026-02-21T12:32:06.7132754Z     %15 = arith.andi %13, %14 : tensor<1x1x128xi1, #blocked2>
2026-02-21T12:32:06.7132955Z     %16 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #blocked6>
2026-02-21T12:32:06.7133254Z     %17 = ttg.convert_layout %4 : tensor<128xi32, #blocked6> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T12:32:06.7133602Z     %18 = tt.expand_dims %17 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7>
2026-02-21T12:32:06.7133913Z     %19 = ttg.convert_layout %18 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked8>
2026-02-21T12:32:06.7134222Z     %20 = ttg.convert_layout %19 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked10}>>
2026-02-21T12:32:06.7134608Z     %21 = tt.expand_dims %20 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x128x1xi32, #blocked10>
2026-02-21T12:32:06.7134940Z     %22 = ttg.convert_layout %21 : tensor<1x128x1xi32, #blocked10> -> tensor<1x128x1xi32, #blocked11>
2026-02-21T12:32:06.7135200Z     %23 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x64x!tt.ptr<bf16>, #blocked12>
2026-02-21T12:32:06.7135429Z     %24 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x64x128x!tt.ptr<bf16>, #blocked>
2026-02-21T12:32:06.7135651Z     %25 = arith.extsi %16 : tensor<64xi32, #blocked6> to tensor<64xi64, #blocked6>
2026-02-21T12:32:06.7135912Z     %26 = tt.broadcast %12 : tensor<1x1x128xi64, #blocked2> -> tensor<1x64x128xi64, #blocked2>
2026-02-21T12:32:06.7136174Z     %27 = ttg.convert_layout %26 : tensor<1x64x128xi64, #blocked2> -> tensor<1x64x128xi64, #blocked>
2026-02-21T12:32:06.7136443Z     %28 = tt.broadcast %15 : tensor<1x1x128xi1, #blocked2> -> tensor<1x64x128xi1, #blocked2>
2026-02-21T12:32:06.7136702Z     %29 = ttg.convert_layout %28 : tensor<1x64x128xi1, #blocked2> -> tensor<1x64x128xi1, #blocked>
2026-02-21T12:32:06.7136945Z     %30 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T12:32:06.7137137Z     scf.for %arg4 = %1 to %3 step %c1_i32  : i32 {
2026-02-21T12:32:06.7137306Z       %31 = arith.divsi %arg4, %c12288_i32 : i32
2026-02-21T12:32:06.7137448Z       %32 = arith.muli %31, %c64_i32 : i32
2026-02-21T12:32:06.7137579Z       %33 = arith.subi %c256_i32, %32 : i32
2026-02-21T12:32:06.7137709Z       %34 = arith.minsi %33, %c64_i32 : i32
2026-02-21T12:32:06.7137842Z       %35 = arith.remsi %arg4, %c12288_i32 : i32
2026-02-21T12:32:06.7137976Z       %36 = arith.remsi %35, %34 : i32
2026-02-21T12:32:06.7138103Z       %37 = arith.addi %32, %36 : i32
2026-02-21T12:32:06.7138222Z       %38 = arith.divsi %35, %34 : i32
2026-02-21T12:32:06.7138347Z       %39 = arith.extsi %38 : i32 to i64
2026-02-21T12:32:06.7138472Z       %40 = arith.extsi %37 : i32 to i64
2026-02-21T12:32:06.7138604Z       %41 = arith.muli %39, %c32768_i64 : i64
2026-02-21T12:32:06.7138756Z       %42 = tt.splat %41 : i64 -> tensor<1x1x128xi64, #blocked2>
2026-02-21T12:32:06.7138909Z       %43 = arith.muli %40, %c128_i64 : i64
2026-02-21T12:32:06.7139062Z       %44 = tt.splat %43 : i64 -> tensor<1x1x128xi64, #blocked2>
2026-02-21T12:32:06.7139228Z       %45 = arith.addi %44, %12 : tensor<1x1x128xi64, #blocked2>
2026-02-21T12:32:06.7139400Z       %46 = arith.addi %42, %45 : tensor<1x1x128xi64, #blocked2>
2026-02-21T12:32:06.7139618Z       %47 = tt.addptr %5, %46 : tensor<1x1x128x!tt.ptr<bf16>, #blocked2>, tensor<1x1x128xi64, #blocked2>
2026-02-21T12:32:06.7139829Z       %48 = arith.cmpi sge, %39, %c0_i64 : i64
2026-02-21T12:32:06.7139962Z       %49 = arith.cmpi slt, %39, %c192_i64 : i64
2026-02-21T12:32:06.7140094Z       %50 = arith.andi %48, %49 : i1
2026-02-21T12:32:06.7140224Z       %51 = arith.cmpi sge, %40, %c0_i64 : i64
2026-02-21T12:32:06.7140355Z       %52 = arith.cmpi slt, %40, %c256_i64 : i64
2026-02-21T12:32:06.7140486Z       %53 = arith.andi %51, %52 : i1
2026-02-21T12:32:06.7140602Z       %54 = arith.andi %50, %53 : i1
2026-02-21T12:32:06.7140748Z       %55 = tt.splat %54 : i1 -> tensor<1x1x128xi1, #blocked2>
2026-02-21T12:32:06.7140917Z       %56 = arith.andi %55, %15 : tensor<1x1x128xi1, #blocked2>
2026-02-21T12:32:06.7141101Z       %57 = tt.load %47, %56, %cst_3 : tensor<1x1x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T12:32:06.7141288Z       %58 = arith.muli %38, %c32768_i32 : i32
2026-02-21T12:32:06.7141441Z       %59 = tt.splat %58 : i32 -> tensor<1x128x1xi32, #blocked11>
2026-02-21T12:32:06.7141618Z       %60 = arith.addi %59, %22 : tensor<1x128x1xi32, #blocked11>
2026-02-21T12:32:06.7141842Z       %61 = tt.broadcast %60 : tensor<1x128x1xi32, #blocked11> -> tensor<1x128x64xi32, #blocked11>
2026-02-21T12:32:06.7142124Z       %62 = ttg.convert_layout %61 : tensor<1x128x64xi32, #blocked11> -> tensor<1x128x64xi32, #blocked12>
2026-02-21T12:32:06.7142393Z       %63 = tt.reshape %57 : tensor<1x1x128xbf16, #blocked2> -> tensor<1x128xbf16, #blocked8>
2026-02-21T12:32:06.7142597Z       %64 = tt.splat %41 : i64 -> tensor<1x64x128xi64, #blocked>
2026-02-21T12:32:06.7142760Z       %65 = tt.splat %50 : i1 -> tensor<1x64x1xi1, #blocked1>
2026-02-21T12:32:06.7143126Z       %66:3 = scf.for %arg5 = %c0_i32 to %c256_i32 step %c64_i32 iter_args(%arg6 = %cst_12, %arg7 = %cst_11, %arg8 = %cst_10) -> (tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked2>)  : i32 {
2026-02-21T12:32:06.7143492Z         %75 = tt.splat %arg5 : i32 -> tensor<64xi32, #blocked6>
2026-02-21T12:32:06.7143696Z         %76 = arith.addi %75, %16 : tensor<64xi32, #blocked6>
2026-02-21T12:32:06.7143938Z         %77 = ttg.convert_layout %76 : tensor<64xi32, #blocked6> -> tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T12:32:06.7144279Z         %78 = tt.expand_dims %77 {axis = 0 : i32} : tensor<64xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x64xi32, #blocked7>
2026-02-21T12:32:06.7144575Z         %79 = ttg.convert_layout %78 : tensor<1x64xi32, #blocked7> -> tensor<1x64xi32, #blocked5>
2026-02-21T12:32:06.7144872Z         %80 = ttg.convert_layout %79 : tensor<1x64xi32, #blocked5> -> tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked13}>>
2026-02-21T12:32:06.7145238Z         %81 = tt.expand_dims %80 {axis = 1 : i32} : tensor<1x64xi32, #ttg.slice<{dim = 1, parent = #blocked13}>> -> tensor<1x1x64xi32, #blocked13>
2026-02-21T12:32:06.7145551Z         %82 = ttg.convert_layout %81 : tensor<1x1x64xi32, #blocked13> -> tensor<1x1x64xi32, #blocked3>
2026-02-21T12:32:06.7145773Z         %83 = arith.muli %82, %cst_9 : tensor<1x1x64xi32, #blocked3>
2026-02-21T12:32:06.7145979Z         %84 = tt.broadcast %83 : tensor<1x1x64xi32, #blocked3> -> tensor<1x128x64xi32, #blocked3>
2026-02-21T12:32:06.7146239Z         %85 = ttg.convert_layout %84 : tensor<1x128x64xi32, #blocked3> -> tensor<1x128x64xi32, #blocked12>
2026-02-21T12:32:06.7146464Z         %86 = arith.addi %62, %85 : tensor<1x128x64xi32, #blocked12>
2026-02-21T12:32:06.7146690Z         %87 = tt.addptr %23, %86 : tensor<1x128x64x!tt.ptr<bf16>, #blocked12>, tensor<1x128x64xi32, #blocked12>
2026-02-21T12:32:06.7146919Z         %88 = tt.load %87 : tensor<1x128x64x!tt.ptr<bf16>, #blocked12>
2026-02-21T12:32:06.7147128Z         %89 = tt.reshape %88 : tensor<1x128x64xbf16, #blocked12> -> tensor<128x64xbf16, #blocked5>
2026-02-21T12:32:06.7147437Z         %90 = ttg.convert_layout %63 : tensor<1x128xbf16, #blocked8> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>>
2026-02-21T12:32:06.7147799Z         %91 = ttg.convert_layout %89 : tensor<128x64xbf16, #blocked5> -> tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>>
2026-02-21T12:32:06.7148109Z         %92 = ttg.convert_layout %cst_8 : tensor<1x64xf32, #blocked5> -> tensor<1x64xf32, #blocked5>
2026-02-21T12:32:06.7148525Z         %93 = tt.dot %90, %91, %92, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>> * tensor<128x64xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>> -> tensor<1x64xf32, #blocked5>
2026-02-21T12:32:06.7148920Z         %94 = tt.reshape %93 : tensor<1x64xf32, #blocked5> -> tensor<1x1x64xf32, #blocked3>
2026-02-21T12:32:06.7149163Z         %95 = arith.truncf %94 : tensor<1x1x64xf32, #blocked3> to tensor<1x1x64xbf16, #blocked3>
2026-02-21T12:32:06.7149404Z         %96 = arith.extf %95 : tensor<1x1x64xbf16, #blocked3> to tensor<1x1x64xf32, #blocked3>
2026-02-21T12:32:06.7149610Z         %97 = "tt.reduce"(%96) <{axis = 2 : i32}> ({
2026-02-21T12:32:06.7149745Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:32:06.7149871Z           %162 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:32:06.7150007Z           tt.reduce.return %162 : f32
2026-02-21T12:32:06.7150200Z         }) : (tensor<1x1x64xf32, #blocked3>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked3}>>
2026-02-21T12:32:06.7150499Z         %98 = ttg.convert_layout %97 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7150793Z         %99 = arith.truncf %98 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4>
2026-02-21T12:32:06.7151017Z         %100 = arith.extf %99 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7151218Z         %101 = arith.mulf %100, %cst_7 : tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7151414Z         %102 = arith.truncf %101 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4>
2026-02-21T12:32:06.7151643Z         %103 = arith.extf %102 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7151864Z         %104 = arith.cmpf ogt, %arg6, %103 : tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7152041Z         %105 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7152217Z         %106 = arith.ori %104, %105 : tensor<1x1xi1, #blocked4>
2026-02-21T12:32:06.7152419Z         %107 = arith.select %106, %arg6, %103 : tensor<1x1xi1, #blocked4>, tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7152636Z         %108 = arith.mulf %96, %cst_6 : tensor<1x1x64xf32, #blocked3>
2026-02-21T12:32:06.7152843Z         %109 = arith.truncf %108 : tensor<1x1x64xf32, #blocked3> to tensor<1x1x64xbf16, #blocked3>
2026-02-21T12:32:06.7153174Z         %110 = ttg.convert_layout %107 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T12:32:06.7153524Z         %111 = tt.expand_dims %110 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T12:32:06.7153833Z         %112 = ttg.convert_layout %111 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15>
2026-02-21T12:32:06.7154092Z         %113 = arith.extf %109 : tensor<1x1x64xbf16, #blocked3> to tensor<1x1x64xf32, #blocked3>
2026-02-21T12:32:06.7154336Z         %114 = tt.broadcast %112 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x64xf32, #blocked15>
2026-02-21T12:32:06.7154591Z         %115 = ttg.convert_layout %114 : tensor<1x1x64xf32, #blocked15> -> tensor<1x1x64xf32, #blocked3>
2026-02-21T12:32:06.7154811Z         %116 = arith.subf %113, %115 : tensor<1x1x64xf32, #blocked3>
2026-02-21T12:32:06.7155119Z         %117 = tt.extern_elementwise %116 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x64xf32, #blocked3>) -> tensor<1x1x64xf32, #blocked3>
2026-02-21T12:32:06.7155416Z         %118 = "tt.reduce"(%117) <{axis = 2 : i32}> ({
2026-02-21T12:32:06.7155553Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:32:06.7155676Z           %162 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:32:06.7155808Z           tt.reduce.return %162 : f32
2026-02-21T12:32:06.7155996Z         }) : (tensor<1x1x64xf32, #blocked3>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked3}>>
2026-02-21T12:32:06.7156296Z         %119 = ttg.convert_layout %118 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7156542Z         %120 = arith.subf %arg6, %107 : tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7156838Z         %121 = tt.extern_elementwise %120 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked4>) -> tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7157140Z         %122 = arith.mulf %arg7, %121 : tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7157303Z         %123 = arith.addf %122, %119 : tensor<1x1xf32, #blocked4>
2026-02-21T12:32:06.7157574Z         %124 = ttg.convert_layout %121 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T12:32:06.7157919Z         %125 = tt.expand_dims %124 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T12:32:06.7158229Z         %126 = ttg.convert_layout %125 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15>
2026-02-21T12:32:06.7158502Z         %127 = tt.broadcast %126 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15>
2026-02-21T12:32:06.7158788Z         %128 = ttg.convert_layout %127 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked2>
2026-02-21T12:32:06.7159007Z         %129 = arith.mulf %arg8, %128 : tensor<1x1x128xf32, #blocked2>
2026-02-21T12:32:06.7159163Z         %130 = arith.extsi %arg5 : i32 to i64
2026-02-21T12:32:06.7159309Z         %131 = tt.splat %130 : i64 -> tensor<64xi64, #blocked6>
2026-02-21T12:32:06.7159472Z         %132 = arith.addi %131, %25 : tensor<64xi64, #blocked6>
2026-02-21T12:32:06.7159714Z         %133 = ttg.convert_layout %132 : tensor<64xi64, #blocked6> -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T12:32:06.7160070Z         %134 = tt.expand_dims %133 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x64xi64, #blocked7>
2026-02-21T12:32:06.7160369Z         %135 = ttg.convert_layout %134 : tensor<1x64xi64, #blocked7> -> tensor<1x64xi64, #blocked5>
2026-02-21T12:32:06.7160666Z         %136 = ttg.convert_layout %135 : tensor<1x64xi64, #blocked5> -> tensor<1x64xi64, #ttg.slice<{dim = 2, parent = #blocked16}>>
2026-02-21T12:32:06.7161020Z         %137 = tt.expand_dims %136 {axis = 2 : i32} : tensor<1x64xi64, #ttg.slice<{dim = 2, parent = #blocked16}>> -> tensor<1x64x1xi64, #blocked16>
2026-02-21T12:32:06.7161354Z         %138 = ttg.convert_layout %137 : tensor<1x64x1xi64, #blocked16> -> tensor<1x64x1xi64, #blocked1>
2026-02-21T12:32:06.7161573Z         %139 = arith.muli %138, %cst_2 : tensor<1x64x1xi64, #blocked1>
2026-02-21T12:32:06.7161784Z         %140 = tt.broadcast %139 : tensor<1x64x1xi64, #blocked1> -> tensor<1x64x128xi64, #blocked1>
2026-02-21T12:32:06.7162045Z         %141 = ttg.convert_layout %140 : tensor<1x64x128xi64, #blocked1> -> tensor<1x64x128xi64, #blocked>
2026-02-21T12:32:06.7162269Z         %142 = arith.addi %141, %27 : tensor<1x64x128xi64, #blocked>
2026-02-21T12:32:06.7162437Z         %143 = arith.addi %64, %142 : tensor<1x64x128xi64, #blocked>
2026-02-21T12:32:06.7162713Z         %144 = tt.addptr %24, %143 : tensor<1x64x128x!tt.ptr<bf16>, #blocked>, tensor<1x64x128xi64, #blocked>
2026-02-21T12:32:06.7162946Z         %145 = arith.cmpi sge, %138, %cst_1 : tensor<1x64x1xi64, #blocked1>
2026-02-21T12:32:06.7163130Z         %146 = arith.cmpi slt, %138, %cst_0 : tensor<1x64x1xi64, #blocked1>
2026-02-21T12:32:06.7163316Z         %147 = arith.andi %145, %146 : tensor<1x64x1xi1, #blocked1>
2026-02-21T12:32:06.7163483Z         %148 = arith.andi %65, %147 : tensor<1x64x1xi1, #blocked1>
2026-02-21T12:32:06.7163690Z         %149 = tt.broadcast %148 : tensor<1x64x1xi1, #blocked1> -> tensor<1x64x128xi1, #blocked1>
2026-02-21T12:32:06.7163950Z         %150 = ttg.convert_layout %149 : tensor<1x64x128xi1, #blocked1> -> tensor<1x64x128xi1, #blocked>
2026-02-21T12:32:06.7164164Z         %151 = arith.andi %150, %29 : tensor<1x64x128xi1, #blocked>
2026-02-21T12:32:06.7164350Z         %152 = tt.load %144, %151, %cst : tensor<1x64x128x!tt.ptr<bf16>, #blocked>
2026-02-21T12:32:06.7164575Z         %153 = arith.truncf %117 : tensor<1x1x64xf32, #blocked3> to tensor<1x1x64xbf16, #blocked3>
2026-02-21T12:32:06.7164821Z         %154 = tt.reshape %129 : tensor<1x1x128xf32, #blocked2> -> tensor<1x128xf32, #blocked8>
2026-02-21T12:32:06.7165061Z         %155 = tt.reshape %153 : tensor<1x1x64xbf16, #blocked3> -> tensor<1x64xbf16, #blocked5>
2026-02-21T12:32:06.7165301Z         %156 = tt.reshape %152 : tensor<1x64x128xbf16, #blocked> -> tensor<64x128xbf16, #blocked8>
2026-02-21T12:32:06.7168895Z         %157 = ttg.convert_layout %155 : tensor<1x64xbf16, #blocked5> -> tensor<1x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T12:32:06.7169253Z         %158 = ttg.convert_layout %156 : tensor<64x128xbf16, #blocked8> -> tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T12:32:06.7169560Z         %159 = ttg.convert_layout %154 : tensor<1x128xf32, #blocked8> -> tensor<1x128xf32, #blocked8>
2026-02-21T12:32:06.7170004Z         %160 = tt.dot %157, %158, %159, inputPrecision = tf32 : tensor<1x64xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<64x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x128xf32, #blocked8>
2026-02-21T12:32:06.7170401Z         %161 = tt.reshape %160 : tensor<1x128xf32, #blocked8> -> tensor<1x1x128xf32, #blocked2>
2026-02-21T12:32:06.7170686Z         scf.yield %107, %123, %161 : tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked2>
2026-02-21T12:32:06.7170903Z       } {tt.flatten}
2026-02-21T12:32:06.7171114Z       %67 = ttg.convert_layout %66#1 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>>
2026-02-21T12:32:06.7171479Z       %68 = tt.expand_dims %67 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked14}>> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T12:32:06.7171777Z       %69 = ttg.convert_layout %68 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x1xf32, #blocked15>
2026-02-21T12:32:06.7172031Z       %70 = tt.broadcast %69 : tensor<1x1x1xf32, #blocked15> -> tensor<1x1x128xf32, #blocked15>
2026-02-21T12:32:06.7172279Z       %71 = ttg.convert_layout %70 : tensor<1x1x128xf32, #blocked15> -> tensor<1x1x128xf32, #blocked2>
2026-02-21T12:32:06.7172501Z       %72 = arith.divf %66#2, %71 : tensor<1x1x128xf32, #blocked2>
2026-02-21T12:32:06.7172733Z       %73 = arith.truncf %72 : tensor<1x1x128xf32, #blocked2> to tensor<1x1x128xbf16, #blocked2>
2026-02-21T12:32:06.7172984Z       %74 = tt.addptr %30, %46 : tensor<1x1x128x!tt.ptr<bf16>, #blocked2>, tensor<1x1x128xi64, #blocked2>
2026-02-21T12:32:06.7173208Z       tt.store %74, %73, %56 : tensor<1x1x128x!tt.ptr<bf16>, #blocked2>
2026-02-21T12:32:06.7173352Z     }
2026-02-21T12:32:06.7173439Z     tt.return
2026-02-21T12:32:06.7173523Z   }
2026-02-21T12:32:06.7173609Z }
2026-02-21T12:32:06.7173654Z 
2026-02-21T12:32:06.7173690Z {-#
2026-02-21T12:32:06.7173776Z   external_resources: {
2026-02-21T12:32:06.7173883Z     mlir_reproducer: {
2026-02-21T12:32:06.7176127Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T12:32:06.7178432Z       disable_threading: false,
2026-02-21T12:32:06.7178548Z       verify_each: true
2026-02-21T12:32:06.7178642Z     }
2026-02-21T12:32:06.7178739Z   }
2026-02-21T12:32:06.7178812Z #-}
2026-02-21T12:32:06.7179095Z /tmp/torchinductor_root/l7/cl7esyxzbgkcnqcndexrw6mheqxzagayrpobtydvrxos4vagr6b7.py:19:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T12:32:06.7179820Z /tmp/torchinductor_root/l7/cl7esyxzbgkcnqcndexrw6mheqxzagayrpobtydvrxos4vagr6b7.py:19:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T12:32:06.7180404Z [29s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T12:32:06.7181203Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T12:32:06.7181948Z Error: RuntimeError: PassManager::run failed
2026-02-21T12:32:06.7182119Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T12:32:06.8255250Z /tmp/torchinductor_root/fe/cfedhhxnlm7fu6mbsljolujgdmvn2hjqtzdhyrtxbjqnqca3oatr.py:56:129: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T12:32:06.8256438Z         k = tl.load(k_view + (indices_0[:, None, None] * 32768 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None)
2026-02-21T12:32:06.8257163Z                                                                                                                                 ^
2026-02-21T12:32:06.8259156Z /tmp/torchinductor_root/fe/cfedhhxnlm7fu6mbsljolujgdmvn2hjqtzdhyrtxbjqnqca3oatr.py:58:141: note: - use: %140 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x32xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 1], order = [1, 0, 2]}>>) -> tensor<128x32xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [0, 1]}>>
2026-02-21T12:32:06.8260762Z 
2026-02-21T12:32:06.8261703Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T12:32:06.8262947Z                                                                                                                                             ^
2026-02-21T12:32:06.8263388Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T12:32:06.8263700Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [8, 8, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:06.8264037Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:06.8264358Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 32, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:06.8264679Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 2, 32], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:06.8264998Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T12:32:06.8265300Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T12:32:06.8265610Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 1, 32], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:06.8265909Z #blocked7 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}>
2026-02-21T12:32:06.8266238Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}>
2026-02-21T12:32:06.8266544Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [8, 8, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:32:06.8266861Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T12:32:06.8267175Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:32:06.8267517Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:32:06.8267843Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:32:06.8268172Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 1, 32], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:32:06.8268489Z #blocked15 = #ttg.blocked<{sizePerThread = [2, 2], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T12:32:06.8268827Z #blocked16 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [2, 32, 1], warpsPerCTA = [1, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:32:06.8269137Z #blocked17 = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}>
2026-02-21T12:32:06.8269472Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T12:32:06.8270029Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T12:32:06.8270438Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T12:32:06.8270565Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T12:32:06.8270690Z     %c32768_i64 = arith.constant 32768 : i64
2026-02-21T12:32:06.8270814Z     %c32768_i32 = arith.constant 32768 : i32
2026-02-21T12:32:06.8270971Z     %cst = arith.constant dense<256> : tensor<1x8x1xi64, #blocked>
2026-02-21T12:32:06.8271150Z     %cst_0 = arith.constant dense<0> : tensor<1x8x1xi64, #blocked>
2026-02-21T12:32:06.8271338Z     %cst_1 = arith.constant dense<128> : tensor<1x8x1xi64, #blocked>
2026-02-21T12:32:06.8271535Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<1x32x128xbf16, #blocked1>
2026-02-21T12:32:06.8271740Z     %cst_3 = arith.constant dense<128> : tensor<1x1x128xi64, #blocked1>
2026-02-21T12:32:06.8271921Z     %cst_4 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked1>
2026-02-21T12:32:06.8272102Z     %cst_5 = arith.constant dense<256> : tensor<1x32x1xi64, #blocked2>
2026-02-21T12:32:06.8272281Z     %cst_6 = arith.constant dense<0> : tensor<1x32x1xi64, #blocked2>
2026-02-21T12:32:06.8272456Z     %cst_7 = arith.constant dense<128> : tensor<1x32x1xi64, #blocked2>
2026-02-21T12:32:06.8272607Z     %c32_i32 = arith.constant 32 : i32
2026-02-21T12:32:06.8272720Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T12:32:06.8272841Z     %c768_i32 = arith.constant 768 : i32
2026-02-21T12:32:06.8273001Z     %cst_8 = arith.constant dense<0.127517432> : tensor<1x8x32xf32, #blocked3>
2026-02-21T12:32:06.8273197Z     %cst_9 = arith.constant dense<0.127517432> : tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8273397Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<8x32xf32, #blocked5>
2026-02-21T12:32:06.8273588Z     %cst_11 = arith.constant dense<128> : tensor<1x1x32xi32, #blocked6>
2026-02-21T12:32:06.8273739Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T12:32:06.8273881Z     %cst_12 = arith.constant dense<128> : tensor<1x8x1xi32, #blocked>
2026-02-21T12:32:06.8274074Z     %cst_13 = arith.constant dense<0.000000e+00> : tensor<1x8x128xf32, #blocked1>
2026-02-21T12:32:06.8274294Z     %cst_14 = arith.constant dense<1.000000e+00> : tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8274493Z     %cst_15 = arith.constant dense<0xFF800000> : tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8274663Z     %c8_i32 = arith.constant 8 : i32
2026-02-21T12:32:06.8274782Z     %c4_i32 = arith.constant 4 : i32
2026-02-21T12:32:06.8274898Z     %0 = tt.get_program_id x : i32
2026-02-21T12:32:06.8275009Z     %1 = arith.divsi %0, %c768_i32 : i32
2026-02-21T12:32:06.8275125Z     %2 = arith.muli %1, %c4_i32 : i32
2026-02-21T12:32:06.8275251Z     %3 = arith.subi %c32_i32, %2 : i32
2026-02-21T12:32:06.8275364Z     %4 = arith.minsi %3, %c4_i32 : i32
2026-02-21T12:32:06.8275476Z     %5 = arith.remsi %0, %c768_i32 : i32
2026-02-21T12:32:06.8275587Z     %6 = arith.remsi %5, %4 : i32
2026-02-21T12:32:06.8275698Z     %7 = arith.addi %2, %6 : i32
2026-02-21T12:32:06.8275803Z     %8 = arith.divsi %5, %4 : i32
2026-02-21T12:32:06.8275913Z     %9 = arith.muli %7, %c8_i32 : i32
2026-02-21T12:32:06.8276067Z     %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked7>
2026-02-21T12:32:06.8276250Z     %11 = tt.splat %9 : i32 -> tensor<8xi32, #blocked7>
2026-02-21T12:32:06.8276416Z     %12 = arith.addi %11, %10 : tensor<8xi32, #blocked7>
2026-02-21T12:32:06.8276590Z     %13 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked7>
2026-02-21T12:32:06.8276763Z     %14 = arith.muli %8, %c32768_i32 : i32
2026-02-21T12:32:06.8276981Z     %15 = ttg.convert_layout %12 : tensor<8xi32, #blocked7> -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T12:32:06.8277305Z     %16 = tt.expand_dims %15 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x8xi32, #blocked8>
2026-02-21T12:32:06.8277585Z     %17 = ttg.convert_layout %16 : tensor<1x8xi32, #blocked8> -> tensor<1x8xi32, #blocked4>
2026-02-21T12:32:06.8277882Z     %18 = ttg.convert_layout %17 : tensor<1x8xi32, #blocked4> -> tensor<1x8xi32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T12:32:06.8278211Z     %19 = tt.expand_dims %18 {axis = 2 : i32} : tensor<1x8xi32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x8x1xi32, #blocked9>
2026-02-21T12:32:06.8278504Z     %20 = ttg.convert_layout %19 : tensor<1x8x1xi32, #blocked9> -> tensor<1x8x1xi32, #blocked>
2026-02-21T12:32:06.8278711Z     %21 = arith.muli %20, %cst_12 : tensor<1x8x1xi32, #blocked>
2026-02-21T12:32:06.8278869Z     %22 = tt.splat %14 : i32 -> tensor<1x8x1xi32, #blocked>
2026-02-21T12:32:06.8279023Z     %23 = arith.addi %22, %21 : tensor<1x8x1xi32, #blocked>
2026-02-21T12:32:06.8279255Z     %24 = ttg.convert_layout %13 : tensor<128xi32, #blocked7> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T12:32:06.8279577Z     %25 = tt.expand_dims %24 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi32, #blocked8>
2026-02-21T12:32:06.8279872Z     %26 = ttg.convert_layout %25 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #blocked10>
2026-02-21T12:32:06.8280165Z     %27 = ttg.convert_layout %26 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>>
2026-02-21T12:32:06.8280511Z     %28 = tt.expand_dims %27 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi32, #blocked11>
2026-02-21T12:32:06.8280816Z     %29 = ttg.convert_layout %28 : tensor<1x1x128xi32, #blocked11> -> tensor<1x1x128xi32, #blocked1>
2026-02-21T12:32:06.8281059Z     %30 = tt.broadcast %23 : tensor<1x8x1xi32, #blocked> -> tensor<1x8x128xi32, #blocked>
2026-02-21T12:32:06.8281300Z     %31 = ttg.convert_layout %30 : tensor<1x8x128xi32, #blocked> -> tensor<1x8x128xi32, #blocked1>
2026-02-21T12:32:06.8281542Z     %32 = tt.broadcast %29 : tensor<1x1x128xi32, #blocked1> -> tensor<1x8x128xi32, #blocked1>
2026-02-21T12:32:06.8281739Z     %33 = arith.addi %31, %32 : tensor<1x8x128xi32, #blocked1>
2026-02-21T12:32:06.8281924Z     %34 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x8x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:32:06.8282176Z     %35 = tt.addptr %34, %33 : tensor<1x8x128x!tt.ptr<bf16>, #blocked1>, tensor<1x8x128xi32, #blocked1>
2026-02-21T12:32:06.8282386Z     %36 = tt.load %35 : tensor<1x8x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:32:06.8282624Z     %37 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked7>
2026-02-21T12:32:06.8282985Z     %38 = ttg.convert_layout %26 : tensor<1x128xi32, #blocked10> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked12}>>
2026-02-21T12:32:06.8283351Z     %39 = tt.expand_dims %38 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked12}>> -> tensor<1x128x1xi32, #blocked12>
2026-02-21T12:32:06.8283658Z     %40 = ttg.convert_layout %39 : tensor<1x128x1xi32, #blocked12> -> tensor<1x128x1xi32, #blocked13>
2026-02-21T12:32:06.8283867Z     %41 = tt.splat %14 : i32 -> tensor<1x128x1xi32, #blocked13>
2026-02-21T12:32:06.8284027Z     %42 = arith.addi %41, %40 : tensor<1x128x1xi32, #blocked13>
2026-02-21T12:32:06.8284229Z     %43 = tt.broadcast %42 : tensor<1x128x1xi32, #blocked13> -> tensor<1x128x32xi32, #blocked13>
2026-02-21T12:32:06.8284482Z     %44 = ttg.convert_layout %43 : tensor<1x128x32xi32, #blocked13> -> tensor<1x128x32xi32, #blocked3>
2026-02-21T12:32:06.8284739Z     %45 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x32x!tt.ptr<bf16>, #blocked3>
2026-02-21T12:32:06.8284959Z     %46 = tt.reshape %36 : tensor<1x8x128xbf16, #blocked1> -> tensor<8x128xbf16, #blocked10>
2026-02-21T12:32:06.8285139Z     %47 = arith.extsi %8 : i32 to i64
2026-02-21T12:32:06.8285299Z     %48 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x32x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:32:06.8285465Z     %49 = arith.muli %47, %c32768_i64 : i64
2026-02-21T12:32:06.8285604Z     %50 = tt.splat %49 : i64 -> tensor<1x32x128xi64, #blocked1>
2026-02-21T12:32:06.8285802Z     %51 = arith.extsi %37 : tensor<32xi32, #blocked7> to tensor<32xi64, #blocked7>
2026-02-21T12:32:06.8286011Z     %52 = arith.extsi %13 : tensor<128xi32, #blocked7> to tensor<128xi64, #blocked7>
2026-02-21T12:32:06.8286277Z     %53 = ttg.convert_layout %52 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T12:32:06.8286608Z     %54 = tt.expand_dims %53 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8>
2026-02-21T12:32:06.8286899Z     %55 = ttg.convert_layout %54 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked10>
2026-02-21T12:32:06.8287187Z     %56 = ttg.convert_layout %55 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>>
2026-02-21T12:32:06.8287533Z     %57 = tt.expand_dims %56 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi64, #blocked11>
2026-02-21T12:32:06.8287838Z     %58 = ttg.convert_layout %57 : tensor<1x1x128xi64, #blocked11> -> tensor<1x1x128xi64, #blocked1>
2026-02-21T12:32:06.8288086Z     %59 = tt.broadcast %58 : tensor<1x1x128xi64, #blocked1> -> tensor<1x32x128xi64, #blocked1>
2026-02-21T12:32:06.8288273Z     %60 = arith.cmpi sge, %47, %c0_i64 : i64
2026-02-21T12:32:06.8288396Z     %61 = arith.cmpi slt, %47, %c192_i64 : i64
2026-02-21T12:32:06.8288518Z     %62 = arith.andi %60, %61 : i1
2026-02-21T12:32:06.8288647Z     %63 = tt.splat %62 : i1 -> tensor<1x32x1xi1, #blocked2>
2026-02-21T12:32:06.8288815Z     %64 = arith.cmpi sge, %58, %cst_4 : tensor<1x1x128xi64, #blocked1>
2026-02-21T12:32:06.8288990Z     %65 = arith.cmpi slt, %58, %cst_3 : tensor<1x1x128xi64, #blocked1>
2026-02-21T12:32:06.8289158Z     %66 = arith.andi %64, %65 : tensor<1x1x128xi1, #blocked1>
2026-02-21T12:32:06.8289356Z     %67 = tt.broadcast %66 : tensor<1x1x128xi1, #blocked1> -> tensor<1x32x128xi1, #blocked1>
2026-02-21T12:32:06.8289760Z     %68:3 = scf.for %arg4 = %c0_i32 to %c256_i32 step %c32_i32 iter_args(%arg5 = %cst_15, %arg6 = %cst_14, %arg7 = %cst_13) -> (tensor<1x8xf32, #blocked4>, tensor<1x8xf32, #blocked4>, tensor<1x8x128xf32, #blocked1>)  : i32 {
2026-02-21T12:32:06.8290117Z       %119 = tt.splat %arg4 : i32 -> tensor<32xi32, #blocked7>
2026-02-21T12:32:06.8290295Z       %120 = arith.addi %119, %37 : tensor<32xi32, #blocked7>
2026-02-21T12:32:06.8290532Z       %121 = ttg.convert_layout %120 : tensor<32xi32, #blocked7> -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T12:32:06.8290863Z       %122 = tt.expand_dims %121 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x32xi32, #blocked8>
2026-02-21T12:32:06.8291175Z       %123 = ttg.convert_layout %122 : tensor<1x32xi32, #blocked8> -> tensor<1x32xi32, #blocked5>
2026-02-21T12:32:06.8291469Z       %124 = ttg.convert_layout %123 : tensor<1x32xi32, #blocked5> -> tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked14}>>
2026-02-21T12:32:06.8291817Z       %125 = tt.expand_dims %124 {axis = 1 : i32} : tensor<1x32xi32, #ttg.slice<{dim = 1, parent = #blocked14}>> -> tensor<1x1x32xi32, #blocked14>
2026-02-21T12:32:06.8292125Z       %126 = ttg.convert_layout %125 : tensor<1x1x32xi32, #blocked14> -> tensor<1x1x32xi32, #blocked6>
2026-02-21T12:32:06.8292343Z       %127 = arith.muli %126, %cst_11 : tensor<1x1x32xi32, #blocked6>
2026-02-21T12:32:06.8292562Z       %128 = tt.broadcast %127 : tensor<1x1x32xi32, #blocked6> -> tensor<1x128x32xi32, #blocked6>
2026-02-21T12:32:06.8292821Z       %129 = ttg.convert_layout %128 : tensor<1x128x32xi32, #blocked6> -> tensor<1x128x32xi32, #blocked3>
2026-02-21T12:32:06.8293038Z       %130 = arith.addi %44, %129 : tensor<1x128x32xi32, #blocked3>
2026-02-21T12:32:06.8293259Z       %131 = tt.addptr %45, %130 : tensor<1x128x32x!tt.ptr<bf16>, #blocked3>, tensor<1x128x32xi32, #blocked3>
2026-02-21T12:32:06.8293481Z       %132 = tt.load %131 : tensor<1x128x32x!tt.ptr<bf16>, #blocked3>
2026-02-21T12:32:06.8293686Z       %133 = tt.reshape %132 : tensor<1x128x32xbf16, #blocked3> -> tensor<128x32xbf16, #blocked5>
2026-02-21T12:32:06.8294004Z       %134 = ttg.convert_layout %46 : tensor<8x128xbf16, #blocked10> -> tensor<8x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked15}>>
2026-02-21T12:32:06.8294366Z       %135 = ttg.convert_layout %133 : tensor<128x32xbf16, #blocked5> -> tensor<128x32xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked15}>>
2026-02-21T12:32:06.8294673Z       %136 = ttg.convert_layout %cst_10 : tensor<8x32xf32, #blocked5> -> tensor<8x32xf32, #blocked15>
2026-02-21T12:32:06.8295092Z       %137 = tt.dot %134, %135, %136, inputPrecision = tf32 : tensor<8x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked15}>> * tensor<128x32xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked15}>> -> tensor<8x32xf32, #blocked15>
2026-02-21T12:32:06.8295499Z       %138 = ttg.convert_layout %137 : tensor<8x32xf32, #blocked15> -> tensor<8x32xf32, #blocked5>
2026-02-21T12:32:06.8295741Z       %139 = tt.reshape %138 : tensor<8x32xf32, #blocked5> -> tensor<1x8x32xf32, #blocked3>
2026-02-21T12:32:06.8295978Z       %140 = arith.truncf %139 : tensor<1x8x32xf32, #blocked3> to tensor<1x8x32xbf16, #blocked3>
2026-02-21T12:32:06.8296218Z       %141 = arith.extf %140 : tensor<1x8x32xbf16, #blocked3> to tensor<1x8x32xf32, #blocked3>
2026-02-21T12:32:06.8296411Z       %142 = "tt.reduce"(%141) <{axis = 2 : i32}> ({
2026-02-21T12:32:06.8296539Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T12:32:06.8296663Z         %208 = arith.maxnumf %arg8, %arg9 : f32
2026-02-21T12:32:06.8296790Z         tt.reduce.return %208 : f32
2026-02-21T12:32:06.8296976Z       }) : (tensor<1x8x32xf32, #blocked3>) -> tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked3}>>
2026-02-21T12:32:06.8297271Z       %143 = ttg.convert_layout %142 : tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8297543Z       %144 = arith.truncf %143 : tensor<1x8xf32, #blocked4> to tensor<1x8xbf16, #blocked4>
2026-02-21T12:32:06.8297766Z       %145 = arith.extf %144 : tensor<1x8xbf16, #blocked4> to tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8297960Z       %146 = arith.mulf %145, %cst_9 : tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8298150Z       %147 = arith.truncf %146 : tensor<1x8xf32, #blocked4> to tensor<1x8xbf16, #blocked4>
2026-02-21T12:32:06.8298388Z       %148 = arith.extf %147 : tensor<1x8xbf16, #blocked4> to tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8298583Z       %149 = arith.cmpf ogt, %arg5, %148 : tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8298757Z       %150 = arith.cmpf une, %arg5, %arg5 : tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8298920Z       %151 = arith.ori %149, %150 : tensor<1x8xi1, #blocked4>
2026-02-21T12:32:06.8299118Z       %152 = arith.select %151, %arg5, %148 : tensor<1x8xi1, #blocked4>, tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8299358Z       %153 = arith.mulf %141, %cst_8 : tensor<1x8x32xf32, #blocked3>
2026-02-21T12:32:06.8299563Z       %154 = arith.truncf %153 : tensor<1x8x32xf32, #blocked3> to tensor<1x8x32xbf16, #blocked3>
2026-02-21T12:32:06.8299857Z       %155 = ttg.convert_layout %152 : tensor<1x8xf32, #blocked4> -> tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T12:32:06.8300198Z       %156 = tt.expand_dims %155 {axis = 2 : i32} : tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x8x1xf32, #blocked9>
2026-02-21T12:32:06.8300531Z       %157 = ttg.convert_layout %156 : tensor<1x8x1xf32, #blocked9> -> tensor<1x8x1xf32, #blocked>
2026-02-21T12:32:06.8300780Z       %158 = arith.extf %154 : tensor<1x8x32xbf16, #blocked3> to tensor<1x8x32xf32, #blocked3>
2026-02-21T12:32:06.8301014Z       %159 = tt.broadcast %157 : tensor<1x8x1xf32, #blocked> -> tensor<1x8x32xf32, #blocked>
2026-02-21T12:32:06.8301261Z       %160 = ttg.convert_layout %159 : tensor<1x8x32xf32, #blocked> -> tensor<1x8x32xf32, #blocked3>
2026-02-21T12:32:06.8301474Z       %161 = arith.subf %158, %160 : tensor<1x8x32xf32, #blocked3>
2026-02-21T12:32:06.8301803Z       %162 = tt.extern_elementwise %161 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x8x32xf32, #blocked3>) -> tensor<1x8x32xf32, #blocked3>
2026-02-21T12:32:06.8302098Z       %163 = "tt.reduce"(%162) <{axis = 2 : i32}> ({
2026-02-21T12:32:06.8302228Z       ^bb0(%arg8: f32, %arg9: f32):
2026-02-21T12:32:06.8302358Z         %208 = arith.addf %arg8, %arg9 : f32
2026-02-21T12:32:06.8302480Z         tt.reduce.return %208 : f32
2026-02-21T12:32:06.8302674Z       }) : (tensor<1x8x32xf32, #blocked3>) -> tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked3}>>
2026-02-21T12:32:06.8302967Z       %164 = ttg.convert_layout %163 : tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked3}>> -> tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8303221Z       %165 = arith.subf %arg5, %152 : tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8303521Z       %166 = tt.extern_elementwise %165 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x8xf32, #blocked4>) -> tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8303813Z       %167 = arith.mulf %arg6, %166 : tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8303981Z       %168 = arith.addf %167, %164 : tensor<1x8xf32, #blocked4>
2026-02-21T12:32:06.8304227Z       %169 = ttg.convert_layout %166 : tensor<1x8xf32, #blocked4> -> tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T12:32:06.8304577Z       %170 = tt.expand_dims %169 {axis = 2 : i32} : tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x8x1xf32, #blocked9>
2026-02-21T12:32:06.8304878Z       %171 = ttg.convert_layout %170 : tensor<1x8x1xf32, #blocked9> -> tensor<1x8x1xf32, #blocked>
2026-02-21T12:32:06.8305125Z       %172 = tt.broadcast %171 : tensor<1x8x1xf32, #blocked> -> tensor<1x8x128xf32, #blocked>
2026-02-21T12:32:06.8305375Z       %173 = ttg.convert_layout %172 : tensor<1x8x128xf32, #blocked> -> tensor<1x8x128xf32, #blocked1>
2026-02-21T12:32:06.8305593Z       %174 = arith.mulf %arg7, %173 : tensor<1x8x128xf32, #blocked1>
2026-02-21T12:32:06.8305746Z       %175 = arith.extsi %arg4 : i32 to i64
2026-02-21T12:32:06.8305890Z       %176 = tt.splat %175 : i64 -> tensor<32xi64, #blocked7>
2026-02-21T12:32:06.8306047Z       %177 = arith.addi %176, %51 : tensor<32xi64, #blocked7>
2026-02-21T12:32:06.8306309Z       %178 = ttg.convert_layout %177 : tensor<32xi64, #blocked7> -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T12:32:06.8306638Z       %179 = tt.expand_dims %178 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x32xi64, #blocked8>
2026-02-21T12:32:06.8306935Z       %180 = ttg.convert_layout %179 : tensor<1x32xi64, #blocked8> -> tensor<1x32xi64, #blocked5>
2026-02-21T12:32:06.8307229Z       %181 = ttg.convert_layout %180 : tensor<1x32xi64, #blocked5> -> tensor<1x32xi64, #ttg.slice<{dim = 2, parent = #blocked16}>>
2026-02-21T12:32:06.8307592Z       %182 = tt.expand_dims %181 {axis = 2 : i32} : tensor<1x32xi64, #ttg.slice<{dim = 2, parent = #blocked16}>> -> tensor<1x32x1xi64, #blocked16>
2026-02-21T12:32:06.8307907Z       %183 = ttg.convert_layout %182 : tensor<1x32x1xi64, #blocked16> -> tensor<1x32x1xi64, #blocked2>
2026-02-21T12:32:06.8308124Z       %184 = arith.muli %183, %cst_7 : tensor<1x32x1xi64, #blocked2>
2026-02-21T12:32:06.8308333Z       %185 = tt.broadcast %184 : tensor<1x32x1xi64, #blocked2> -> tensor<1x32x128xi64, #blocked2>
2026-02-21T12:32:06.8308598Z       %186 = ttg.convert_layout %185 : tensor<1x32x128xi64, #blocked2> -> tensor<1x32x128xi64, #blocked1>
2026-02-21T12:32:06.8308839Z       %187 = arith.addi %186, %59 : tensor<1x32x128xi64, #blocked1>
2026-02-21T12:32:06.8309011Z       %188 = arith.addi %50, %187 : tensor<1x32x128xi64, #blocked1>
2026-02-21T12:32:06.8309232Z       %189 = tt.addptr %48, %188 : tensor<1x32x128x!tt.ptr<bf16>, #blocked1>, tensor<1x32x128xi64, #blocked1>
2026-02-21T12:32:06.8309465Z       %190 = arith.cmpi sge, %183, %cst_6 : tensor<1x32x1xi64, #blocked2>
2026-02-21T12:32:06.8309648Z       %191 = arith.cmpi slt, %183, %cst_5 : tensor<1x32x1xi64, #blocked2>
2026-02-21T12:32:06.8309820Z       %192 = arith.andi %190, %191 : tensor<1x32x1xi1, #blocked2>
2026-02-21T12:32:06.8310002Z       %193 = arith.andi %63, %192 : tensor<1x32x1xi1, #blocked2>
2026-02-21T12:32:06.8310203Z       %194 = tt.broadcast %193 : tensor<1x32x1xi1, #blocked2> -> tensor<1x32x128xi1, #blocked2>
2026-02-21T12:32:06.8310465Z       %195 = ttg.convert_layout %194 : tensor<1x32x128xi1, #blocked2> -> tensor<1x32x128xi1, #blocked1>
2026-02-21T12:32:06.8310685Z       %196 = arith.andi %195, %67 : tensor<1x32x128xi1, #blocked1>
2026-02-21T12:32:06.8310867Z       %197 = tt.load %189, %196, %cst_2 : tensor<1x32x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:32:06.8311094Z       %198 = arith.truncf %162 : tensor<1x8x32xf32, #blocked3> to tensor<1x8x32xbf16, #blocked3>
2026-02-21T12:32:06.8311337Z       %199 = tt.reshape %174 : tensor<1x8x128xf32, #blocked1> -> tensor<8x128xf32, #blocked10>
2026-02-21T12:32:06.8311576Z       %200 = tt.reshape %198 : tensor<1x8x32xbf16, #blocked3> -> tensor<8x32xbf16, #blocked5>
2026-02-21T12:32:06.8311814Z       %201 = tt.reshape %197 : tensor<1x32x128xbf16, #blocked1> -> tensor<32x128xbf16, #blocked10>
2026-02-21T12:32:06.8312125Z       %202 = ttg.convert_layout %200 : tensor<8x32xbf16, #blocked5> -> tensor<8x32xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked17}>>
2026-02-21T12:32:06.8312492Z       %203 = ttg.convert_layout %201 : tensor<32x128xbf16, #blocked10> -> tensor<32x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked17}>>
2026-02-21T12:32:06.8312805Z       %204 = ttg.convert_layout %199 : tensor<8x128xf32, #blocked10> -> tensor<8x128xf32, #blocked17>
2026-02-21T12:32:06.8313228Z       %205 = tt.dot %202, %203, %204, inputPrecision = tf32 : tensor<8x32xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked17}>> * tensor<32x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked17}>> -> tensor<8x128xf32, #blocked17>
2026-02-21T12:32:06.8313647Z       %206 = ttg.convert_layout %205 : tensor<8x128xf32, #blocked17> -> tensor<8x128xf32, #blocked10>
2026-02-21T12:32:06.8313896Z       %207 = tt.reshape %206 : tensor<8x128xf32, #blocked10> -> tensor<1x8x128xf32, #blocked1>
2026-02-21T12:32:06.8314177Z       scf.yield %152, %168, %207 : tensor<1x8xf32, #blocked4>, tensor<1x8xf32, #blocked4>, tensor<1x8x128xf32, #blocked1>
2026-02-21T12:32:06.8314401Z     }
2026-02-21T12:32:06.8314593Z     %69 = ttg.convert_layout %68#1 : tensor<1x8xf32, #blocked4> -> tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T12:32:06.8314929Z     %70 = tt.expand_dims %69 {axis = 2 : i32} : tensor<1x8xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x8x1xf32, #blocked9>
2026-02-21T12:32:06.8315224Z     %71 = ttg.convert_layout %70 : tensor<1x8x1xf32, #blocked9> -> tensor<1x8x1xf32, #blocked>
2026-02-21T12:32:06.8315477Z     %72 = tt.broadcast %71 : tensor<1x8x1xf32, #blocked> -> tensor<1x8x128xf32, #blocked>
2026-02-21T12:32:06.8315713Z     %73 = ttg.convert_layout %72 : tensor<1x8x128xf32, #blocked> -> tensor<1x8x128xf32, #blocked1>
2026-02-21T12:32:06.8315927Z     %74 = arith.divf %68#2, %73 : tensor<1x8x128xf32, #blocked1>
2026-02-21T12:32:06.8316137Z     %75 = arith.truncf %74 : tensor<1x8x128xf32, #blocked1> to tensor<1x8x128xbf16, #blocked1>
2026-02-21T12:32:06.8316320Z     %76 = arith.extsi %8 : i32 to i64
2026-02-21T12:32:06.8316445Z     %77 = arith.extsi %9 : i32 to i64
2026-02-21T12:32:06.8316608Z     %78 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x8x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:32:06.8316784Z     %79 = arith.muli %76, %c32768_i64 : i64
2026-02-21T12:32:06.8316940Z     %80 = tt.splat %79 : i64 -> tensor<1x8x128xi64, #blocked1>
2026-02-21T12:32:06.8317097Z     %81 = tt.splat %77 : i64 -> tensor<8xi64, #blocked7>
2026-02-21T12:32:06.8317274Z     %82 = arith.extsi %10 : tensor<8xi32, #blocked7> to tensor<8xi64, #blocked7>
2026-02-21T12:32:06.8317453Z     %83 = arith.addi %81, %82 : tensor<8xi64, #blocked7>
2026-02-21T12:32:06.8317683Z     %84 = ttg.convert_layout %83 : tensor<8xi64, #blocked7> -> tensor<8xi64, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T12:32:06.8318020Z     %85 = tt.expand_dims %84 {axis = 0 : i32} : tensor<8xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x8xi64, #blocked8>
2026-02-21T12:32:06.8318306Z     %86 = ttg.convert_layout %85 : tensor<1x8xi64, #blocked8> -> tensor<1x8xi64, #blocked4>
2026-02-21T12:32:06.8318588Z     %87 = ttg.convert_layout %86 : tensor<1x8xi64, #blocked4> -> tensor<1x8xi64, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T12:32:06.8318917Z     %88 = tt.expand_dims %87 {axis = 2 : i32} : tensor<1x8xi64, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x8x1xi64, #blocked9>
2026-02-21T12:32:06.8319213Z     %89 = ttg.convert_layout %88 : tensor<1x8x1xi64, #blocked9> -> tensor<1x8x1xi64, #blocked>
2026-02-21T12:32:06.8319415Z     %90 = arith.muli %89, %cst_1 : tensor<1x8x1xi64, #blocked>
2026-02-21T12:32:06.8319612Z     %91 = tt.broadcast %90 : tensor<1x8x1xi64, #blocked> -> tensor<1x8x128xi64, #blocked>
2026-02-21T12:32:06.8319851Z     %92 = ttg.convert_layout %91 : tensor<1x8x128xi64, #blocked> -> tensor<1x8x128xi64, #blocked1>
2026-02-21T12:32:06.8320084Z     %93 = arith.extsi %13 : tensor<128xi32, #blocked7> to tensor<128xi64, #blocked7>
2026-02-21T12:32:06.8320354Z     %94 = ttg.convert_layout %93 : tensor<128xi64, #blocked7> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>>
2026-02-21T12:32:06.8320680Z     %95 = tt.expand_dims %94 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked8}>> -> tensor<1x128xi64, #blocked8>
2026-02-21T12:32:06.8320973Z     %96 = ttg.convert_layout %95 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #blocked10>
2026-02-21T12:32:06.8321266Z     %97 = ttg.convert_layout %96 : tensor<1x128xi64, #blocked10> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>>
2026-02-21T12:32:06.8321610Z     %98 = tt.expand_dims %97 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked11}>> -> tensor<1x1x128xi64, #blocked11>
2026-02-21T12:32:06.8321915Z     %99 = ttg.convert_layout %98 : tensor<1x1x128xi64, #blocked11> -> tensor<1x1x128xi64, #blocked1>
2026-02-21T12:32:06.8322162Z     %100 = tt.broadcast %99 : tensor<1x1x128xi64, #blocked1> -> tensor<1x8x128xi64, #blocked1>
2026-02-21T12:32:06.8322367Z     %101 = arith.addi %92, %100 : tensor<1x8x128xi64, #blocked1>
2026-02-21T12:32:06.8322555Z     %102 = arith.addi %80, %101 : tensor<1x8x128xi64, #blocked1>
2026-02-21T12:32:06.8322807Z     %103 = tt.addptr %78, %102 : tensor<1x8x128x!tt.ptr<bf16>, #blocked1>, tensor<1x8x128xi64, #blocked1>
2026-02-21T12:32:06.8323013Z     %104 = arith.cmpi sge, %76, %c0_i64 : i64
2026-02-21T12:32:06.8323146Z     %105 = arith.cmpi slt, %76, %c192_i64 : i64
2026-02-21T12:32:06.8323276Z     %106 = arith.andi %104, %105 : i1
2026-02-21T12:32:06.8323418Z     %107 = arith.cmpi sge, %89, %cst_0 : tensor<1x8x1xi64, #blocked>
2026-02-21T12:32:06.8323619Z     %108 = arith.cmpi slt, %89, %cst : tensor<1x8x1xi64, #blocked>
2026-02-21T12:32:06.8323790Z     %109 = arith.andi %107, %108 : tensor<1x8x1xi1, #blocked>
2026-02-21T12:32:06.8323949Z     %110 = tt.splat %106 : i1 -> tensor<1x8x1xi1, #blocked>
2026-02-21T12:32:06.8324110Z     %111 = arith.andi %110, %109 : tensor<1x8x1xi1, #blocked>
2026-02-21T12:32:06.8324307Z     %112 = tt.broadcast %111 : tensor<1x8x1xi1, #blocked> -> tensor<1x8x128xi1, #blocked>
2026-02-21T12:32:06.8324555Z     %113 = ttg.convert_layout %112 : tensor<1x8x128xi1, #blocked> -> tensor<1x8x128xi1, #blocked1>
2026-02-21T12:32:06.8324779Z     %114 = arith.cmpi sge, %99, %cst_4 : tensor<1x1x128xi64, #blocked1>
2026-02-21T12:32:06.8324981Z     %115 = arith.cmpi slt, %99, %cst_3 : tensor<1x1x128xi64, #blocked1>
2026-02-21T12:32:06.8325156Z     %116 = arith.andi %114, %115 : tensor<1x1x128xi1, #blocked1>
2026-02-21T12:32:06.8325354Z     %117 = tt.broadcast %116 : tensor<1x1x128xi1, #blocked1> -> tensor<1x8x128xi1, #blocked1>
2026-02-21T12:32:06.8325562Z     %118 = arith.andi %113, %117 : tensor<1x8x128xi1, #blocked1>
2026-02-21T12:32:06.8325734Z     tt.store %103, %75, %118 : tensor<1x8x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:32:06.8325883Z     tt.return
2026-02-21T12:32:06.8325966Z   }
2026-02-21T12:32:06.8326046Z }
2026-02-21T12:32:06.8326090Z 
2026-02-21T12:32:06.8326148Z {-#
2026-02-21T12:32:06.8326232Z   external_resources: {
2026-02-21T12:32:06.8326344Z     mlir_reproducer: {
2026-02-21T12:32:06.8328576Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T12:32:06.8330870Z       disable_threading: false,
2026-02-21T12:32:06.8330984Z       verify_each: true
2026-02-21T12:32:06.8331083Z     }
2026-02-21T12:32:06.8331158Z   }
2026-02-21T12:32:06.8331237Z #-}
2026-02-21T12:32:06.8331517Z /tmp/torchinductor_root/fe/cfedhhxnlm7fu6mbsljolujgdmvn2hjqtzdhyrtxbjqnqca3oatr.py:17:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T12:32:06.8332213Z /tmp/torchinductor_root/fe/cfedhhxnlm7fu6mbsljolujgdmvn2hjqtzdhyrtxbjqnqca3oatr.py:17:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T12:32:06.8332793Z [29s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T12:32:06.8333536Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 8, 32], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True)
2026-02-21T12:32:06.8334221Z Error: RuntimeError: PassManager::run failed
2026-02-21T12:32:06.8334393Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T12:32:07.2405671Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 15.4 configs/s
2026-02-21T12:32:07.2415371Z [29s] Adaptive compile timeout: 30s (90% percentile=10.2s, bounds=[30.0s, 30s])
2026-02-21T12:32:07.3569493Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 9401.3 configs/s
2026-02-21T12:32:07.5917491Z [30s] Initial random population of 100, 5 starting points: 
2026-02-21T12:32:07.5918108Z error=13
2026-02-21T12:32:07.5918270Z ok=87
2026-02-21T12:32:07.5918417Z min=0.0540
2026-02-21T12:32:07.5918575Z mid=0.3969
2026-02-21T12:32:07.5918726Z max=23.3332
2026-02-21T12:32:07.5918896Z best={'block_sizes': [1, 32, 16],
2026-02-21T12:32:07.5919201Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T12:32:07.5919490Z  'l2_groupings': [2],
2026-02-21T12:32:07.5919697Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:32:07.5919927Z  'loop_orders': [[1, 0]],
2026-02-21T12:32:07.5920130Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:32:07.5920325Z  'num_stages': 4,
2026-02-21T12:32:07.5920500Z  'num_warps': 2,
2026-02-21T12:32:07.5921550Z  'pid_type': 'flat',
2026-02-21T12:32:07.5921745Z  'range_flattens': [None, True],
2026-02-21T12:32:07.5921974Z  'range_multi_buffers': [None, None],
2026-02-21T12:32:07.5922196Z  'range_num_stages': [0, 1],
2026-02-21T12:32:07.5922403Z  'range_unroll_factors': [0, 1],
2026-02-21T12:32:07.5922678Z  'range_warp_specializes': [],
2026-02-21T12:32:07.5922888Z  'waves_per_eu': 4}
2026-02-21T12:32:07.5942905Z [30s] Fitting surrogate: 100 points, 100 targets
2026-02-21T12:32:08.3848108Z [30s] Generation 1 starting: 82 neighbors, 5 active search path(s)
2026-02-21T12:32:41.5971881Z [64s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T12:32:41.7833436Z [64s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:32:41.8929811Z [64s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:32:42.0007970Z [64s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:32:42.3106355Z [64s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:32:42.9128525Z [65s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:32:43.2531468Z [65s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[3, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:32:43.2556787Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 0.7 configs/s
2026-02-21T12:32:48.3860170Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 16.1 configs/s
2026-02-21T12:32:50.9430541Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 370.0 configs/s
2026-02-21T12:32:51.5011359Z [73s] Generation 1 complete: 
2026-02-21T12:32:51.5011508Z error=1
2026-02-21T12:32:51.5011600Z timeout=7
2026-02-21T12:32:51.5011681Z ok=79
2026-02-21T12:32:51.5011763Z min=0.0505
2026-02-21T12:32:51.5011866Z mid=0.0945
2026-02-21T12:32:51.5011952Z max=0.8442
2026-02-21T12:32:51.5012044Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:32:51.5012209Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:32:51.5012369Z  'l2_groupings': [16],
2026-02-21T12:32:51.5012483Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:32:51.5012606Z  'loop_orders': [[0, 1]],
2026-02-21T12:32:51.5012725Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:32:51.5012833Z  'num_stages': 4,
2026-02-21T12:32:51.5012932Z  'num_warps': 4,
2026-02-21T12:32:51.5013024Z  'pid_type': 'xyz',
2026-02-21T12:32:51.5013124Z  'range_flattens': [None, None],
2026-02-21T12:32:51.5013248Z  'range_multi_buffers': [None, None],
2026-02-21T12:32:51.5013374Z  'range_num_stages': [0, 2],
2026-02-21T12:32:51.5013483Z  'range_unroll_factors': [0, 3],
2026-02-21T12:32:51.5013595Z  'range_warp_specializes': [],
2026-02-21T12:32:51.5013705Z  'waves_per_eu': 4}
2026-02-21T12:32:51.5450158Z [74s] Fitting surrogate: 187 points, 187 targets
2026-02-21T12:32:52.3953440Z [74s] Generation 2 starting: 69 neighbors, 5 active search path(s)
2026-02-21T12:33:25.7801845Z [108s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:33:26.1596613Z [108s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:33:26.1618121Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 0.3 configs/s
2026-02-21T12:33:30.1891356Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.3 configs/s
2026-02-21T12:33:34.7353264Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 242.6 configs/s
2026-02-21T12:33:35.4147659Z [117s] Generation 2 complete: 
2026-02-21T12:33:35.4148059Z error=2
2026-02-21T12:33:35.4148280Z timeout=2
2026-02-21T12:33:35.4148490Z ok=70
2026-02-21T12:33:35.4148737Z min=0.0463
2026-02-21T12:33:35.4149518Z mid=0.0753
2026-02-21T12:33:35.4149721Z max=0.8389
2026-02-21T12:33:35.4149914Z best={'block_sizes': [1, 64, 16],
2026-02-21T12:33:35.4150075Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:33:35.4150240Z  'l2_groupings': [16],
2026-02-21T12:33:35.4150364Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:33:35.4150484Z  'loop_orders': [[0, 1]],
2026-02-21T12:33:35.4150597Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:33:35.4150700Z  'num_stages': 4,
2026-02-21T12:33:35.4150792Z  'num_warps': 4,
2026-02-21T12:33:35.4150883Z  'pid_type': 'xyz',
2026-02-21T12:33:35.4150986Z  'range_flattens': [None, None],
2026-02-21T12:33:35.4151209Z  'range_multi_buffers': [None, None],
2026-02-21T12:33:35.4151329Z  'range_num_stages': [0, 2],
2026-02-21T12:33:35.4151442Z  'range_unroll_factors': [0, 3],
2026-02-21T12:33:35.4151568Z  'range_warp_specializes': [],
2026-02-21T12:33:35.4151679Z  'waves_per_eu': 2}
2026-02-21T12:33:35.4200217Z [117s] Fitting surrogate: 261 points, 261 targets
2026-02-21T12:33:36.1281545Z [118s] Generation 3 starting: 66 neighbors, 4 active search path(s)
2026-02-21T12:34:10.7538596Z [153s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[2, 3], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:34:11.1585657Z [153s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:34:11.1607737Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67/67 0.6 configs/s
2026-02-21T12:34:15.7049285Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 67/67 14.9 configs/s
2026-02-21T12:34:18.0776291Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 403.1 configs/s
2026-02-21T12:34:18.5874984Z [161s] Generation 3 complete: 
2026-02-21T12:34:18.5875379Z error=1
2026-02-21T12:34:18.5875594Z timeout=2
2026-02-21T12:34:18.5875839Z ok=67
2026-02-21T12:34:18.5876044Z min=0.0465
2026-02-21T12:34:18.5876263Z mid=0.0817
2026-02-21T12:34:18.5876685Z max=3.5996
2026-02-21T12:34:18.5876919Z best={'block_sizes': [1, 64, 16],
2026-02-21T12:34:18.5877348Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:34:18.5877763Z  'l2_groupings': [16],
2026-02-21T12:34:18.5878068Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:34:18.5878381Z  'loop_orders': [[0, 1]],
2026-02-21T12:34:18.5878672Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:34:18.5878938Z  'num_stages': 4,
2026-02-21T12:34:18.5879182Z  'num_warps': 4,
2026-02-21T12:34:18.5879414Z  'pid_type': 'xyz',
2026-02-21T12:34:18.5879672Z  'range_flattens': [None, None],
2026-02-21T12:34:18.5879981Z  'range_multi_buffers': [None, None],
2026-02-21T12:34:18.5880292Z  'range_num_stages': [0, 2],
2026-02-21T12:34:18.5880572Z  'range_unroll_factors': [0, 2],
2026-02-21T12:34:18.5880866Z  'range_warp_specializes': [],
2026-02-21T12:34:18.5881152Z  'waves_per_eu': 2}
2026-02-21T12:34:18.6375022Z [161s] Fitting surrogate: 331 points, 331 targets
2026-02-21T12:34:19.1801934Z [161s] Generation 4 starting: 41 neighbors, 3 active search path(s)
2026-02-21T12:34:52.1214507Z [194s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 3], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:34:52.3775259Z [194s] Timeout after 30s compiling Config(block_sizes=[1, 64, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[2, 3], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:34:52.3785142Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 0.4 configs/s
2026-02-21T12:34:54.8821172Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 42/42 17.1 configs/s
2026-02-21T12:34:56.6667477Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 529.1 configs/s
2026-02-21T12:34:57.1444426Z [199s] Generation 4 complete: 
2026-02-21T12:34:57.1444793Z timeout=2
2026-02-21T12:34:57.1445016Z ok=42
2026-02-21T12:34:57.1445247Z min=0.0447
2026-02-21T12:34:57.1445451Z mid=0.0686
2026-02-21T12:34:57.1445658Z max=1.1353
2026-02-21T12:34:57.1445884Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:34:57.1446300Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:34:57.1446733Z  'l2_groupings': [16],
2026-02-21T12:34:57.1447014Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:34:57.1447348Z  'loop_orders': [[0, 1]],
2026-02-21T12:34:57.1447633Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:34:57.1447922Z  'num_stages': 3,
2026-02-21T12:34:57.1448155Z  'num_warps': 4,
2026-02-21T12:34:57.1448407Z  'pid_type': 'xyz',
2026-02-21T12:34:57.1448658Z  'range_flattens': [None, None],
2026-02-21T12:34:57.1448966Z  'range_multi_buffers': [None, None],
2026-02-21T12:34:57.1449280Z  'range_num_stages': [0, 2],
2026-02-21T12:34:57.1449565Z  'range_unroll_factors': [0, 2],
2026-02-21T12:34:57.1450200Z  'range_warp_specializes': [],
2026-02-21T12:34:57.1450410Z  'waves_per_eu': 2}
2026-02-21T12:34:57.1823106Z [199s] Fitting surrogate: 375 points, 375 targets
2026-02-21T12:34:58.2761352Z [200s] Generation 5 starting: 46 neighbors, 3 active search path(s)
2026-02-21T12:35:20.9022428Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47/47 0.3 configs/s
2026-02-21T12:35:23.9701457Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 47/47 16.1 configs/s
2026-02-21T12:35:26.3387567Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 403.1 configs/s
2026-02-21T12:35:26.8392102Z [229s] Generation 5 complete: 
2026-02-21T12:35:26.8392449Z ok=49
2026-02-21T12:35:26.8392674Z min=0.0437
2026-02-21T12:35:26.8392901Z mid=0.0653
2026-02-21T12:35:26.8395589Z max=3.4227
2026-02-21T12:35:26.8395826Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:35:26.8396155Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:35:26.8396463Z  'l2_groupings': [16],
2026-02-21T12:35:26.8396667Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:35:26.8396901Z  'loop_orders': [[0, 1]],
2026-02-21T12:35:26.8397105Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:35:26.8397307Z  'num_stages': 3,
2026-02-21T12:35:26.8397474Z  'num_warps': 4,
2026-02-21T12:35:26.8397648Z  'pid_type': 'xyz',
2026-02-21T12:35:26.8397829Z  'range_flattens': [None, None],
2026-02-21T12:35:26.8398059Z  'range_multi_buffers': [None, None],
2026-02-21T12:35:26.8398285Z  'range_num_stages': [0, 2],
2026-02-21T12:35:26.8398485Z  'range_unroll_factors': [0, 2],
2026-02-21T12:35:26.8398994Z  'range_warp_specializes': [],
2026-02-21T12:35:26.8399193Z  'waves_per_eu': 3}
2026-02-21T12:35:26.8898436Z [229s] Fitting surrogate: 424 points, 424 targets
2026-02-21T12:35:27.3103188Z [229s] Generation 6 starting: 30 neighbors, 2 active search path(s)
2026-02-21T12:35:37.2704191Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 1.0 configs/s
2026-02-21T12:35:39.2762879Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 16.6 configs/s
2026-02-21T12:35:40.9987763Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 802.1 configs/s
2026-02-21T12:35:41.4292487Z [243s] Generation 6 complete: 
2026-02-21T12:35:41.4292688Z ok=32
2026-02-21T12:35:41.4292827Z min=0.0432
2026-02-21T12:35:41.4292954Z mid=0.0710
2026-02-21T12:35:41.4293072Z max=0.6745
2026-02-21T12:35:41.4293211Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:35:41.4293471Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:35:41.4293697Z  'l2_groupings': [16],
2026-02-21T12:35:41.4293869Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:35:41.4294043Z  'loop_orders': [[0, 1]],
2026-02-21T12:35:41.4294210Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:35:41.4294390Z  'num_stages': 4,
2026-02-21T12:35:41.4294519Z  'num_warps': 4,
2026-02-21T12:35:41.4294657Z  'pid_type': 'xyz',
2026-02-21T12:35:41.4294801Z  'range_flattens': [None, None],
2026-02-21T12:35:41.4295010Z  'range_multi_buffers': [None, None],
2026-02-21T12:35:41.4295192Z  'range_num_stages': [0, 2],
2026-02-21T12:35:41.4295348Z  'range_unroll_factors': [0, 2],
2026-02-21T12:35:41.4295522Z  'range_warp_specializes': [],
2026-02-21T12:35:41.4295682Z  'waves_per_eu': 3}
2026-02-21T12:35:41.4553567Z [243s] Fitting surrogate: 456 points, 456 targets
2026-02-21T12:35:41.8622532Z [244s] Generation 7 starting: 31 neighbors, 2 active search path(s)
2026-02-21T12:36:06.8587069Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32/32 0.2 configs/s
2026-02-21T12:36:08.9216747Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 32/32 16.7 configs/s
2026-02-21T12:36:10.2860096Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 679.9 configs/s
2026-02-21T12:36:10.7333242Z [273s] Generation 7 complete: 
2026-02-21T12:36:10.7333638Z ok=33
2026-02-21T12:36:10.7333849Z min=0.0433
2026-02-21T12:36:10.7334057Z mid=0.0638
2026-02-21T12:36:10.7334268Z max=0.4157
2026-02-21T12:36:10.7334504Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:36:10.7334969Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:36:10.7335386Z  'l2_groupings': [16],
2026-02-21T12:36:10.7336111Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:36:10.7336430Z  'loop_orders': [[0, 1]],
2026-02-21T12:36:10.7336711Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:36:10.7336977Z  'num_stages': 3,
2026-02-21T12:36:10.7337352Z  'num_warps': 4,
2026-02-21T12:36:10.7337588Z  'pid_type': 'xyz',
2026-02-21T12:36:10.7337836Z  'range_flattens': [None, None],
2026-02-21T12:36:10.7338133Z  'range_multi_buffers': [None, None],
2026-02-21T12:36:10.7338443Z  'range_num_stages': [0, 1],
2026-02-21T12:36:10.7338724Z  'range_unroll_factors': [0, 2],
2026-02-21T12:36:10.7339039Z  'range_warp_specializes': [],
2026-02-21T12:36:10.7339312Z  'waves_per_eu': 3}
2026-02-21T12:36:10.7531808Z [273s] Fitting surrogate: 489 points, 489 targets
2026-02-21T12:36:11.1736711Z [273s] Generation 8 starting: 35 neighbors, 2 active search path(s)
2026-02-21T12:36:22.8443188Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 1.2 configs/s
2026-02-21T12:36:25.1157744Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 16.9 configs/s
2026-02-21T12:36:26.9059853Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 656.2 configs/s
2026-02-21T12:36:27.3418522Z [289s] Generation 8 complete: 
2026-02-21T12:36:27.3418795Z ok=37
2026-02-21T12:36:27.3418965Z min=0.0438
2026-02-21T12:36:27.3419128Z mid=0.0653
2026-02-21T12:36:27.3419283Z max=0.8369
2026-02-21T12:36:27.3419487Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:36:27.3419816Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:36:27.3420137Z  'l2_groupings': [16],
2026-02-21T12:36:27.3420358Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:36:27.3420610Z  'loop_orders': [[0, 1]],
2026-02-21T12:36:27.3420823Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:36:27.3421438Z  'num_stages': 3,
2026-02-21T12:36:27.3421625Z  'num_warps': 4,
2026-02-21T12:36:27.3421808Z  'pid_type': 'xyz',
2026-02-21T12:36:27.3422017Z  'range_flattens': [None, True],
2026-02-21T12:36:27.3422257Z  'range_multi_buffers': [None, None],
2026-02-21T12:36:27.3422506Z  'range_num_stages': [0, 1],
2026-02-21T12:36:27.3422720Z  'range_unroll_factors': [0, 2],
2026-02-21T12:36:27.3422959Z  'range_warp_specializes': [],
2026-02-21T12:36:27.3423179Z  'waves_per_eu': 3}
2026-02-21T12:36:27.3653435Z [289s] Fitting surrogate: 526 points, 526 targets
2026-02-21T12:36:27.7508810Z [290s] Generation 9 starting: 31 neighbors, 2 active search path(s)
2026-02-21T12:36:32.6301145Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 5.3 configs/s
2026-02-21T12:36:34.5580847Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 31/31 16.4 configs/s
2026-02-21T12:36:36.4893503Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 487.3 configs/s
2026-02-21T12:36:36.9425715Z [299s] Generation 9 complete: 
2026-02-21T12:36:36.9426258Z ok=33
2026-02-21T12:36:36.9426417Z min=0.0436
2026-02-21T12:36:36.9426571Z mid=0.0476
2026-02-21T12:36:36.9426723Z max=0.3350
2026-02-21T12:36:36.9426895Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:36:36.9427223Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:36:36.9427534Z  'l2_groupings': [16],
2026-02-21T12:36:36.9427742Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:36:36.9427985Z  'loop_orders': [[0, 1]],
2026-02-21T12:36:36.9428308Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:36:36.9428510Z  'num_stages': 3,
2026-02-21T12:36:36.9428681Z  'num_warps': 4,
2026-02-21T12:36:36.9428855Z  'pid_type': 'xyz',
2026-02-21T12:36:36.9429041Z  'range_flattens': [None, True],
2026-02-21T12:36:36.9429272Z  'range_multi_buffers': [None, None],
2026-02-21T12:36:36.9429507Z  'range_num_stages': [0, 1],
2026-02-21T12:36:36.9429721Z  'range_unroll_factors': [0, 2],
2026-02-21T12:36:36.9429941Z  'range_warp_specializes': [],
2026-02-21T12:36:36.9430151Z  'waves_per_eu': 3}
2026-02-21T12:36:36.9782932Z [299s] Fitting surrogate: 559 points, 559 targets
2026-02-21T12:36:37.3037073Z [299s] Generation 10 starting: 25 neighbors, 2 active search path(s)
2026-02-21T12:36:41.6288313Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 25/25 6.3 configs/s
2026-02-21T12:36:43.2781231Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 25/25 16.6 configs/s
2026-02-21T12:36:45.2908815Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 579.3 configs/s
2026-02-21T12:36:45.7188748Z [308s] Generation 10 complete: 
2026-02-21T12:36:45.7189132Z ok=27
2026-02-21T12:36:45.7189337Z min=0.0434
2026-02-21T12:36:45.7189552Z mid=0.0466
2026-02-21T12:36:45.7189754Z max=0.2063
2026-02-21T12:36:45.7190441Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:36:45.7190864Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:36:45.7191307Z  'l2_groupings': [16],
2026-02-21T12:36:45.7191586Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:36:45.7191906Z  'loop_orders': [[0, 1]],
2026-02-21T12:36:45.7192189Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:36:45.7192469Z  'num_stages': 4,
2026-02-21T12:36:45.7192707Z  'num_warps': 4,
2026-02-21T12:36:45.7192975Z  'pid_type': 'xyz',
2026-02-21T12:36:45.7193231Z  'range_flattens': [None, True],
2026-02-21T12:36:45.7193537Z  'range_multi_buffers': [None, None],
2026-02-21T12:36:45.7193850Z  'range_num_stages': [0, 1],
2026-02-21T12:36:45.7194134Z  'range_unroll_factors': [0, 2],
2026-02-21T12:36:45.7194424Z  'range_warp_specializes': [],
2026-02-21T12:36:45.7194707Z  'waves_per_eu': 3}
2026-02-21T12:36:45.7435655Z [308s] Fitting surrogate: 586 points, 586 targets
2026-02-21T12:36:46.1290766Z [308s] Generation 11 starting: 28 neighbors, 2 active search path(s)
2026-02-21T12:36:50.2697082Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 9.3 configs/s
2026-02-21T12:36:52.0521750Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 28/28 16.5 configs/s
2026-02-21T12:36:53.4932562Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 641.9 configs/s
2026-02-21T12:36:53.9340422Z [316s] Generation 11 complete: 
2026-02-21T12:36:53.9340718Z ok=30
2026-02-21T12:36:53.9340907Z min=0.0443
2026-02-21T12:36:53.9341096Z mid=0.0542
2026-02-21T12:36:53.9341635Z max=0.4257
2026-02-21T12:36:53.9341841Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:36:53.9342219Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:36:53.9342591Z  'l2_groupings': [16],
2026-02-21T12:36:53.9342837Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:36:53.9343118Z  'loop_orders': [[0, 1]],
2026-02-21T12:36:53.9343377Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:36:53.9343621Z  'num_stages': 3,
2026-02-21T12:36:53.9343823Z  'num_warps': 4,
2026-02-21T12:36:53.9344198Z  'pid_type': 'xyz',
2026-02-21T12:36:53.9344446Z  'range_flattens': [None, True],
2026-02-21T12:36:53.9344705Z  'range_multi_buffers': [None, True],
2026-02-21T12:36:53.9344967Z  'range_num_stages': [0, 3],
2026-02-21T12:36:53.9345210Z  'range_unroll_factors': [0, 2],
2026-02-21T12:36:53.9345471Z  'range_warp_specializes': [],
2026-02-21T12:36:53.9345710Z  'waves_per_eu': 2}
2026-02-21T12:36:53.9566153Z [316s] Fitting surrogate: 616 points, 616 targets
2026-02-21T12:36:54.2131825Z [316s] Generation 12 starting: 15 neighbors, 1 active search path(s)
2026-02-21T12:36:56.5480057Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 14.1 configs/s
2026-02-21T12:36:57.5496716Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 17.5 configs/s
2026-02-21T12:36:58.1792729Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1398.1 configs/s
2026-02-21T12:36:58.5560320Z [321s] Generation 12 complete: 
2026-02-21T12:36:58.5560698Z ok=16
2026-02-21T12:36:58.5560909Z min=0.0438
2026-02-21T12:36:58.5561132Z mid=0.0487
2026-02-21T12:36:58.5561338Z max=0.1585
2026-02-21T12:36:58.5570008Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:36:58.5570435Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:36:58.5570827Z  'l2_groupings': [16],
2026-02-21T12:36:58.5571099Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:36:58.5571418Z  'loop_orders': [[0, 1]],
2026-02-21T12:36:58.5571685Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:36:58.5571940Z  'num_stages': 3,
2026-02-21T12:36:58.5572164Z  'num_warps': 4,
2026-02-21T12:36:58.5572382Z  'pid_type': 'xyz',
2026-02-21T12:36:58.5572628Z  'range_flattens': [None, True],
2026-02-21T12:36:58.5572914Z  'range_multi_buffers': [None, True],
2026-02-21T12:36:58.5573384Z  'range_num_stages': [0, 2],
2026-02-21T12:36:58.5573650Z  'range_unroll_factors': [0, 2],
2026-02-21T12:36:58.5573854Z  'range_warp_specializes': [],
2026-02-21T12:36:58.5574036Z  'waves_per_eu': 2}
2026-02-21T12:36:58.5686275Z [321s] Fitting surrogate: 632 points, 632 targets
2026-02-21T12:36:58.7908394Z [321s] Generation 13 starting: 11 neighbors, 1 active search path(s)
2026-02-21T12:37:01.2159779Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11/11 3.3 configs/s
2026-02-21T12:37:01.9770714Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 11/11 18.0 configs/s
2026-02-21T12:37:02.7652700Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1109.4 configs/s
2026-02-21T12:37:03.1459268Z [325s] Generation 13 complete: 
2026-02-21T12:37:03.1459628Z ok=12
2026-02-21T12:37:03.1459839Z min=0.0480
2026-02-21T12:37:03.1460065Z mid=0.0486
2026-02-21T12:37:03.1460270Z max=0.0668
2026-02-21T12:37:03.1460533Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:37:03.1460959Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:37:03.1461403Z  'l2_groupings': [16],
2026-02-21T12:37:03.1461686Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:37:03.1462013Z  'loop_orders': [[0, 1]],
2026-02-21T12:37:03.1462313Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:37:03.1462607Z  'num_stages': 2,
2026-02-21T12:37:03.1462844Z  'num_warps': 4,
2026-02-21T12:37:03.1463077Z  'pid_type': 'xyz',
2026-02-21T12:37:03.1463342Z  'range_flattens': [None, True],
2026-02-21T12:37:03.1463966Z  'range_multi_buffers': [None, True],
2026-02-21T12:37:03.1464287Z  'range_num_stages': [0, 2],
2026-02-21T12:37:03.1464568Z  'range_unroll_factors': [0, 2],
2026-02-21T12:37:03.1464870Z  'range_warp_specializes': [],
2026-02-21T12:37:03.1465153Z  'waves_per_eu': 3}
2026-02-21T12:37:03.1572424Z [325s] Fitting surrogate: 644 points, 644 targets
2026-02-21T12:37:03.4086969Z [325s] Generation 14 starting: 15 neighbors, 1 active search path(s)
2026-02-21T12:37:06.3661983Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 8.5 configs/s
2026-02-21T12:37:07.4297915Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 16.3 configs/s
2026-02-21T12:37:08.5782901Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1659.6 configs/s
2026-02-21T12:37:08.9876031Z [331s] Generation 14 complete: 
2026-02-21T12:37:08.9876377Z ok=17
2026-02-21T12:37:08.9876596Z min=0.0438
2026-02-21T12:37:08.9876835Z mid=0.0558
2026-02-21T12:37:08.9877035Z max=0.2195
2026-02-21T12:37:08.9877270Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:37:08.9877692Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:37:08.9878109Z  'l2_groupings': [16],
2026-02-21T12:37:08.9878390Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:37:08.9878732Z  'loop_orders': [[0, 1]],
2026-02-21T12:37:08.9879014Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:37:08.9879291Z  'num_stages': 2,
2026-02-21T12:37:08.9879537Z  'num_warps': 4,
2026-02-21T12:37:08.9879772Z  'pid_type': 'xyz',
2026-02-21T12:37:08.9880033Z  'range_flattens': [None, True],
2026-02-21T12:37:08.9880335Z  'range_multi_buffers': [None, True],
2026-02-21T12:37:08.9880882Z  'range_num_stages': [0, 2],
2026-02-21T12:37:08.9881165Z  'range_unroll_factors': [0, 2],
2026-02-21T12:37:08.9881467Z  'range_warp_specializes': [],
2026-02-21T12:37:08.9881747Z  'waves_per_eu': 3}
2026-02-21T12:37:09.0028488Z [331s] Fitting surrogate: 661 points, 661 targets
2026-02-21T12:37:09.3025989Z [331s] Generation 15 starting: 16 neighbors, 1 active search path(s)
2026-02-21T12:37:12.7548887Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 4.1 configs/s
2026-02-21T12:37:13.8366006Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.1 configs/s
2026-02-21T12:37:14.5471161Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1237.9 configs/s
2026-02-21T12:37:14.9477852Z [337s] Generation 15 complete: 
2026-02-21T12:37:14.9478166Z ok=18
2026-02-21T12:37:14.9478383Z min=0.0473
2026-02-21T12:37:14.9478589Z mid=0.0673
2026-02-21T12:37:14.9478803Z max=0.2220
2026-02-21T12:37:14.9479027Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:37:14.9479430Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:37:14.9479823Z  'l2_groupings': [16],
2026-02-21T12:37:14.9480111Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:37:14.9480416Z  'loop_orders': [[0, 1]],
2026-02-21T12:37:14.9480691Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:37:14.9480949Z  'num_stages': 2,
2026-02-21T12:37:14.9481176Z  'num_warps': 4,
2026-02-21T12:37:14.9481398Z  'pid_type': 'xyz',
2026-02-21T12:37:14.9481647Z  'range_flattens': [None, True],
2026-02-21T12:37:14.9481948Z  'range_multi_buffers': [None, None],
2026-02-21T12:37:14.9482247Z  'range_num_stages': [0, 2],
2026-02-21T12:37:14.9482514Z  'range_unroll_factors': [0, 2],
2026-02-21T12:37:14.9482916Z  'range_warp_specializes': [],
2026-02-21T12:37:14.9483187Z  'waves_per_eu': 3}
2026-02-21T12:37:14.9646095Z [337s] Fitting surrogate: 679 points, 679 targets
2026-02-21T12:37:15.2523969Z [337s] Generation 16 starting: 16 neighbors, 1 active search path(s)
2026-02-21T12:37:18.5171288Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 6.5 configs/s
2026-02-21T12:37:19.6633611Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.0 configs/s
2026-02-21T12:37:20.3714995Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1247.0 configs/s
2026-02-21T12:37:20.7323470Z [343s] Generation 16 complete: 
2026-02-21T12:37:20.7323819Z ok=18
2026-02-21T12:37:20.7324027Z min=0.0441
2026-02-21T12:37:20.7324261Z mid=0.0584
2026-02-21T12:37:20.7324466Z max=0.1888
2026-02-21T12:37:20.7324701Z best={'block_sizes': [1, 64, 32],
2026-02-21T12:37:20.7325379Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:37:20.7325786Z  'l2_groupings': [16],
2026-02-21T12:37:20.7326068Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:37:20.7326509Z  'loop_orders': [[0, 1]],
2026-02-21T12:37:20.7326794Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:37:20.7327060Z  'num_stages': 1,
2026-02-21T12:37:20.7327293Z  'num_warps': 4,
2026-02-21T12:37:20.7327522Z  'pid_type': 'xyz',
2026-02-21T12:37:20.7327788Z  'range_flattens': [None, True],
2026-02-21T12:37:20.7328090Z  'range_multi_buffers': [None, True],
2026-02-21T12:37:20.7328398Z  'range_num_stages': [0, 2],
2026-02-21T12:37:20.7328678Z  'range_unroll_factors': [0, 2],
2026-02-21T12:37:20.7328970Z  'range_warp_specializes': [],
2026-02-21T12:37:20.7329251Z  'waves_per_eu': 3}
2026-02-21T12:37:20.7488538Z [343s] Fitting surrogate: 697 points, 697 targets
2026-02-21T12:37:21.0151439Z [343s] Generation 17 starting: 15 neighbors, 1 active search path(s)
2026-02-21T12:37:24.1297676Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 7.8 configs/s
2026-02-21T12:37:25.1042880Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 16.1 configs/s
2026-02-21T12:37:25.9279155Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1080.0 configs/s
2026-02-21T12:37:26.3143396Z [348s] Generation 17 complete: 
2026-02-21T12:37:26.3143767Z ok=16
2026-02-21T12:37:26.3143980Z min=0.0467
2026-02-21T12:37:26.3144181Z mid=0.0542
2026-02-21T12:37:26.3144378Z max=0.2231
2026-02-21T12:37:26.3144605Z best={'block_sizes': [1, 64, 64],
2026-02-21T12:37:26.3145021Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:37:26.3145427Z  'l2_groupings': [16],
2026-02-21T12:37:26.3145702Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:37:26.3146378Z  'loop_orders': [[0, 1]],
2026-02-21T12:37:26.3146658Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:37:26.3146949Z  'num_stages': 1,
2026-02-21T12:37:26.3147175Z  'num_warps': 4,
2026-02-21T12:37:26.3147403Z  'pid_type': 'xyz',
2026-02-21T12:37:26.3147649Z  'range_flattens': [None, True],
2026-02-21T12:37:26.3147958Z  'range_multi_buffers': [None, True],
2026-02-21T12:37:26.3148259Z  'range_num_stages': [0, 2],
2026-02-21T12:37:26.3148533Z  'range_unroll_factors': [0, 3],
2026-02-21T12:37:26.3148821Z  'range_warp_specializes': [],
2026-02-21T12:37:26.3149103Z  'waves_per_eu': 3}
2026-02-21T12:37:26.3329529Z [348s] Fitting surrogate: 713 points, 713 targets
2026-02-21T12:37:26.6240678Z [349s] Generation 18 starting: 17 neighbors, 1 active search path(s)
2026-02-21T12:37:31.5150334Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 2.8 configs/s
2026-02-21T12:37:32.7275487Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 16.8 configs/s
2026-02-21T12:37:33.6381078Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 3288.8 configs/s
2026-02-21T12:37:33.9911328Z [356s] Generation 18 complete: 
2026-02-21T12:37:33.9911544Z ok=19
2026-02-21T12:37:33.9911674Z min=0.0460
2026-02-21T12:37:33.9911815Z mid=0.0804
2026-02-21T12:37:33.9911944Z max=0.3543
2026-02-21T12:37:33.9912081Z best={'block_sizes': [1, 64, 64],
2026-02-21T12:37:33.9912329Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:37:33.9912580Z  'l2_groupings': [16],
2026-02-21T12:37:33.9912948Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:37:33.9913143Z  'loop_orders': [[0, 1]],
2026-02-21T12:37:33.9913312Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:37:33.9913475Z  'num_stages': 1,
2026-02-21T12:37:33.9913608Z  'num_warps': 4,
2026-02-21T12:37:33.9913750Z  'pid_type': 'xyz',
2026-02-21T12:37:33.9913901Z  'range_flattens': [None, True],
2026-02-21T12:37:33.9914089Z  'range_multi_buffers': [None, True],
2026-02-21T12:37:33.9914270Z  'range_num_stages': [0, 2],
2026-02-21T12:37:33.9914444Z  'range_unroll_factors': [0, 3],
2026-02-21T12:37:33.9914622Z  'range_warp_specializes': [],
2026-02-21T12:37:33.9914781Z  'waves_per_eu': 3}
2026-02-21T12:37:34.0026208Z [356s] Fitting surrogate: 732 points, 732 targets
2026-02-21T12:37:34.3109759Z [356s] Generation 19 starting: 18 neighbors, 1 active search path(s)
2026-02-21T12:37:38.8361936Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 2.8 configs/s
2026-02-21T12:37:40.0832550Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.2 configs/s
2026-02-21T12:37:40.2086129Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 7963.7 configs/s
2026-02-21T12:37:40.4189982Z [362s] Generation 19 complete: 
2026-02-21T12:37:40.4190345Z ok=20
2026-02-21T12:37:40.4190944Z min=0.0462
2026-02-21T12:37:40.4191160Z mid=0.1308
2026-02-21T12:37:40.4191364Z max=0.3769
2026-02-21T12:37:40.4191618Z best={'block_sizes': [1, 64, 64],
2026-02-21T12:37:40.4192046Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:37:40.4192451Z  'l2_groupings': [16],
2026-02-21T12:37:40.4192754Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:37:40.4193075Z  'loop_orders': [[0, 1]],
2026-02-21T12:37:40.4193352Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:37:40.4193625Z  'num_stages': 1,
2026-02-21T12:37:40.4193860Z  'num_warps': 4,
2026-02-21T12:37:40.4194104Z  'pid_type': 'xyz',
2026-02-21T12:37:40.4194353Z  'range_flattens': [None, True],
2026-02-21T12:37:40.4194658Z  'range_multi_buffers': [None, True],
2026-02-21T12:37:40.4194964Z  'range_num_stages': [0, 2],
2026-02-21T12:37:40.4195243Z  'range_unroll_factors': [0, 3],
2026-02-21T12:37:40.4195534Z  'range_warp_specializes': [],
2026-02-21T12:37:40.4195812Z  'waves_per_eu': 3}
2026-02-21T12:37:40.4253834Z [362s] Fitting surrogate: 752 points, 752 targets
2026-02-21T12:37:40.6766867Z [363s] Generation 20 starting: 15 neighbors, 1 active search path(s)
2026-02-21T12:37:43.8756475Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 10.0 configs/s
2026-02-21T12:37:44.9477149Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 16/16 17.3 configs/s
2026-02-21T12:37:45.1065983Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 6564.9 configs/s
2026-02-21T12:37:45.3630271Z [367s] Generation 20 complete: 
2026-02-21T12:37:45.3630619Z ok=17
2026-02-21T12:37:45.3630821Z min=0.0417
2026-02-21T12:37:45.3631034Z mid=0.1098
2026-02-21T12:37:45.3631235Z max=0.2755
2026-02-21T12:37:45.3631459Z best={'block_sizes': [1, 64, 64],
2026-02-21T12:37:45.3631868Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T12:37:45.3632274Z  'l2_groupings': [16],
2026-02-21T12:37:45.3632568Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:37:45.3632884Z  'loop_orders': [[0, 1]],
2026-02-21T12:37:45.3633414Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:37:45.3633687Z  'num_stages': 1,
2026-02-21T12:37:45.3633915Z  'num_warps': 4,
2026-02-21T12:37:45.3634138Z  'pid_type': 'xyz',
2026-02-21T12:37:45.3634402Z  'range_flattens': [None, True],
2026-02-21T12:37:45.3634696Z  'range_multi_buffers': [None, True],
2026-02-21T12:37:45.3635003Z  'range_num_stages': [0, 2],
2026-02-21T12:37:45.3635277Z  'range_unroll_factors': [0, 3],
2026-02-21T12:37:45.3635565Z  'range_warp_specializes': [],
2026-02-21T12:37:45.3635847Z  'waves_per_eu': 3}
2026-02-21T12:37:45.3721051Z [367s] Fitting surrogate: 769 points, 769 targets
2026-02-21T12:37:45.5029307Z [367s] Autotuning complete in 368.0s after searching 720 configs.
2026-02-21T12:37:45.5029833Z One can hardcode the best config and skip autotuning with:
2026-02-21T12:37:45.5031896Z     @helion.kernel(config=helion.Config(block_sizes=[1, 64, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T12:37:45.5033680Z 
2026-02-21T12:37:45.5034137Z [367s] Code of selected kernel: /tmp/torchinductor_root/sc/csc5gevho6toucdyyzbxtu6mbk3qvwwzcnzxldinj7cxnesrc5vi.py
2026-02-21T12:37:45.5274849Z from __future__ import annotations
2026-02-21T12:37:45.5275022Z 
2026-02-21T12:37:45.5275082Z import torch
2026-02-21T12:37:45.5275209Z import triton
2026-02-21T12:37:45.5275354Z import triton.language as tl
2026-02-21T12:37:45.5275553Z from torch._inductor.runtime import triton_helpers
2026-02-21T12:37:45.5275816Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T12:37:45.5276285Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T12:37:45.5276465Z 
2026-02-21T12:37:45.5276532Z _BLOCK_SIZE_1 = tl.constexpr(64)
2026-02-21T12:37:45.5276704Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T12:37:45.5276867Z _BLOCK_SIZE_3 = tl.constexpr(64)
2026-02-21T12:37:45.5277032Z _SHAPE_DIM_3 = tl.constexpr(128)
2026-02-21T12:37:45.5277193Z _SHAPE_DIM_7 = tl.constexpr(128)
2026-02-21T12:37:45.5277299Z 
2026-02-21T12:37:45.5277349Z @triton.jit
2026-02-21T12:37:45.5277560Z def _helion_attention(q_view, k_view, v_view, out, _RDIM_SIZE_2: tl.constexpr):
2026-02-21T12:37:45.5277910Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T12:37:45.5278172Z     num_blocks_0 = 192
2026-02-21T12:37:45.5278314Z     num_pid_m = 192
2026-02-21T12:37:45.5278464Z     num_pid_n = tl.cdiv(256, _BLOCK_SIZE_1)
2026-02-21T12:37:45.5278706Z     inner_2d_pid = tl.program_id(0) + tl.program_id(1) * num_blocks_0
2026-02-21T12:37:45.5278946Z     num_pid_in_group = 16 * num_pid_n
2026-02-21T12:37:45.5279140Z     group_id = inner_2d_pid // num_pid_in_group
2026-02-21T12:37:45.5279328Z     first_pid_m = group_id * 16
2026-02-21T12:37:45.5279513Z     group_size_m = min(num_pid_m - first_pid_m, 16)
2026-02-21T12:37:45.5279770Z     pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m
2026-02-21T12:37:45.5280044Z     pid_1 = inner_2d_pid % num_pid_in_group // group_size_m
2026-02-21T12:37:45.5280246Z     offset_0 = pid_0
2026-02-21T12:37:45.5280410Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T12:37:45.5280604Z     offset_1 = pid_1 * _BLOCK_SIZE_1
2026-02-21T12:37:45.5280868Z     indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32)
2026-02-21T12:37:45.5281119Z     indices_4 = tl.arange(0, _RDIM_SIZE_2).to(tl.int32)
2026-02-21T12:37:45.5281420Z     # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T12:37:45.5281759Z     m_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], float('-inf'), tl.float32)
2026-02-21T12:37:45.5282035Z     # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0)
2026-02-21T12:37:45.5282288Z     l_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 1.0, tl.float32)
2026-02-21T12:37:45.5282691Z     # src[attention.py:70]: acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32)
2026-02-21T12:37:45.5283011Z     acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128], 0.0, tl.float32)
2026-02-21T12:37:45.5283266Z     # src[attention.py:71]: q = q_view[tile_b, tile_m, :]
2026-02-21T12:37:45.5283630Z     q = tl.load(q_view + (indices_0[:, None, None] * 32768 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), None)
2026-02-21T12:37:45.5284017Z     # src[attention.py:72]: for tile_n in hl.tile(v_view.size(1)):
2026-02-21T12:37:45.5284274Z     # src[attention.py:73]:     k = k_view[tile_b, :, tile_n]
2026-02-21T12:37:45.5284505Z     # src[attention.py:74]:     qk = torch.bmm(q, k)
2026-02-21T12:37:45.5284702Z     # src[attention.py:72-85]: ...
2026-02-21T12:37:45.5285066Z     for offset_2 in tl.range(0, 256, _BLOCK_SIZE_3, loop_unroll_factor=3, num_stages=1, disallow_acc_multi_buffer=False, flatten=True):
2026-02-21T12:37:45.5285486Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_3).to(tl.int32)
2026-02-21T12:37:45.5285705Z         q_copy = q
2026-02-21T12:37:45.5285837Z         m_i_copy = m_i
2026-02-21T12:37:45.5286004Z         l_i_copy = l_i
2026-02-21T12:37:45.5286141Z         acc_copy = acc
2026-02-21T12:37:45.5286277Z         q_copy_0 = q_copy
2026-02-21T12:37:45.5286431Z         m_i_copy_0 = m_i_copy
2026-02-21T12:37:45.5286585Z         l_i_copy_0 = l_i_copy
2026-02-21T12:37:45.5286739Z         acc_copy_0 = acc_copy
2026-02-21T12:37:45.5286925Z         # src[attention.py:73]: k = k_view[tile_b, :, tile_n]
2026-02-21T12:37:45.5287451Z         k = tl.load(tl.make_block_ptr(k_view, [192, 128, 256], [32768, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_3, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T12:37:45.5287980Z         # src[attention.py:74]: qk = torch.bmm(q, k)
2026-02-21T12:37:45.5288624Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T12:37:45.5289292Z         # src[attention.py:75]: m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale)
2026-02-21T12:37:45.5289507Z         amax = tl.cast(tl.max(qk, 2), tl.bfloat16)
2026-02-21T12:37:45.5289648Z         v_0 = 0.12751743074602467
2026-02-21T12:37:45.5289783Z         v_1 = tl.cast(amax * v_0, tl.bfloat16)
2026-02-21T12:37:45.5289922Z         v_2 = tl.cast(v_1, tl.float32)
2026-02-21T12:37:45.5290073Z         v_3 = triton_helpers.maximum(m_i_copy_0, v_2)
2026-02-21T12:37:45.5290256Z         # src[attention.py:76]: qk = qk * qk_scale - m_ij[:, :, None]
2026-02-21T12:37:45.5290420Z         v_4 = 0.12751743074602467
2026-02-21T12:37:45.5290551Z         v_5 = tl.cast(qk * v_4, tl.bfloat16)
2026-02-21T12:37:45.5290688Z         subscript = v_3[:, :, None]
2026-02-21T12:37:45.5290821Z         v_6 = tl.cast(v_5, tl.float32)
2026-02-21T12:37:45.5290977Z         v_7 = v_6 - subscript
2026-02-21T12:37:45.5291111Z         # src[attention.py:77]: p = torch.exp2(qk)
2026-02-21T12:37:45.5291260Z         v_8 = libdevice.exp2(v_7)
2026-02-21T12:37:45.5291406Z         # src[attention.py:78]: l_ij = torch.sum(p, -1)
2026-02-21T12:37:45.5291567Z         l_ij = tl.cast(tl.sum(v_8, 2), tl.float32)
2026-02-21T12:37:45.5291736Z         # src[attention.py:79]: alpha = torch.exp2(m_i - m_ij)
2026-02-21T12:37:45.5291924Z         v_9 = m_i_copy_0 - v_3
2026-02-21T12:37:45.5292046Z         v_10 = libdevice.exp2(v_9)
2026-02-21T12:37:45.5292193Z         # src[attention.py:80]: l_i = l_i * alpha + l_ij
2026-02-21T12:37:45.5292343Z         v_11 = l_i_copy_0 * v_10
2026-02-21T12:37:45.5292462Z         l_i = v_11 + l_ij
2026-02-21T12:37:45.5300110Z         # src[attention.py:81]: acc = acc * alpha[:, :, None]
2026-02-21T12:37:45.5300270Z         subscript_1 = v_10[:, :, None]
2026-02-21T12:37:45.5300390Z         v_13 = acc_copy_0 * subscript_1
2026-02-21T12:37:45.5300569Z         # src[attention.py:82]: v = v_view[tile_b, tile_n, :]
2026-02-21T12:37:45.5300818Z         v = tl.load(v_view + (indices_0[:, None, None] * 32768 + indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None)
2026-02-21T12:37:45.5301055Z         # src[attention.py:83]: p = p.to(v.dtype)
2026-02-21T12:37:45.5301181Z         v_14 = tl.cast(v_8, tl.bfloat16)
2026-02-21T12:37:45.5301319Z         # src[attention.py:84]: acc = torch.baddbmm(acc, p, v)
2026-02-21T12:37:45.5301771Z         acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128])
2026-02-21T12:37:45.5302199Z         # src[attention.py:85]: m_i = m_ij
2026-02-21T12:37:45.5302310Z         m_i = v_3
2026-02-21T12:37:45.5302415Z     # src[attention.py:87]: acc = acc / l_i[:, :, None]
2026-02-21T12:37:45.5302549Z     subscript_2 = l_i[:, :, None]
2026-02-21T12:37:45.5302655Z     v_15 = acc / subscript_2
2026-02-21T12:37:45.5302824Z     # src[attention.py:88]: out[tile_b, tile_m, :] = acc.to(out.dtype)
2026-02-21T12:37:45.5302973Z     v_16 = tl.cast(v_15, tl.bfloat16)
2026-02-21T12:37:45.5303252Z     tl.store(tl.make_block_ptr(out, [192, 256, 128], [32768, 128, 1], [offset_0, offset_1, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _SHAPE_DIM_7], [2, 1, 0]), v_16, boundary_check=[0, 1, 2])
2026-02-21T12:37:45.5303499Z 
2026-02-21T12:37:45.5303632Z def attention(q_in: torch.Tensor, k_in: torch.Tensor, v_in: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T12:37:45.5303829Z     """
2026-02-21T12:37:45.5303919Z     Computes scaled dot-product attention.
2026-02-21T12:37:45.5304001Z 
2026-02-21T12:37:45.5304129Z     Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
2026-02-21T12:37:45.5304280Z 
2026-02-21T12:37:45.5304314Z     Args:
2026-02-21T12:37:45.5304424Z         q_in: Query tensor of shape [..., seq_len_q, head_dim]
2026-02-21T12:37:45.5304577Z         k_in: Key tensor of shape [..., seq_len_k, head_dim]
2026-02-21T12:37:45.5304730Z         v_in: Value tensor of shape [..., seq_len_k, head_dim]
2026-02-21T12:37:45.5304825Z 
2026-02-21T12:37:45.5304865Z     Returns:
2026-02-21T12:37:45.5304969Z         Output tensor of shape [..., seq_len_q, head_dim]
2026-02-21T12:37:45.5305094Z     """
2026-02-21T12:37:45.5305184Z     # src[attention.py:56]: m_dim = q_in.size(-2)
2026-02-21T12:37:45.5305308Z     m_dim = q_in.size(-2)
2026-02-21T12:37:45.5305414Z     # src[attention.py:57]: n_dim = k_in.size(-2)
2026-02-21T12:37:45.5305536Z     n_dim = k_in.size(-2)
2026-02-21T12:37:45.5305650Z     # src[attention.py:58]: assert n_dim == v_in.size(-2)
2026-02-21T12:37:45.5305785Z     assert n_dim == v_in.size(-2)
2026-02-21T12:37:45.5305930Z     # src[attention.py:59]: head_dim = hl.specialize(q_in.size(-1))
2026-02-21T12:37:45.5306071Z     head_dim = 128
2026-02-21T12:37:45.5306200Z     # src[attention.py:60]: assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T12:37:45.5306371Z     assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T12:37:45.5306539Z     # src[attention.py:61]: q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T12:37:45.5306696Z     q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T12:37:45.5306851Z     # src[attention.py:62]: v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T12:37:45.5307021Z     v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T12:37:45.5307195Z     # src[attention.py:63]: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T12:37:45.5307392Z     k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T12:37:45.5307553Z     # src[attention.py:64]: out = torch.empty_like(q_view)
2026-02-21T12:37:45.5307691Z     out = torch.empty_like(q_view)
2026-02-21T12:37:45.5307847Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T12:37:45.5308040Z     _BLOCK_SIZE_1 = 64
2026-02-21T12:37:45.5308130Z     _RDIM_SIZE_2 = 128
2026-02-21T12:37:45.5308273Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T12:37:45.5308500Z     # src[attention.py:68]:     m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T12:37:45.5308704Z     # src[attention.py:69]:     l_i = torch.full_like(m_i, 1.0)
2026-02-21T12:37:45.5308848Z     # src[attention.py:67-88]: ...
2026-02-21T12:37:45.5309144Z     _launcher(_helion_attention, (192, triton.cdiv(256, _BLOCK_SIZE_1)), q_view, k_view, v_view, out, _RDIM_SIZE_2, num_warps=4, num_stages=1, waves_per_eu=3, matrix_instr_nonkdim=16)
2026-02-21T12:37:45.5309461Z     # src[attention.py:89]: return out.view(q_in.size())
2026-02-21T12:37:45.5309594Z     return out.view(q_in.size())
2026-02-21T12:37:46.1794524Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T12:37:46.1796863Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 64, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='xyz', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3), static_shapes=True)
2026-02-21T12:37:46.1798934Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T12:37:46.1799354Z WARNING:tritonbench.utils.triton_op:Completed input ID 1:
2026-02-21T12:37:46.1799719Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T12:37:46.1800014Z ------------------------------------------
2026-02-21T12:37:46.1800280Z (4, 48, 256, 256, 128)
2026-02-21T12:37:46.1800483Z 
2026-02-21T12:37:46.1802894Z  33%|███▎      | 2/6 [10:52<22:17, 334.27s/it]
2026-02-21T12:37:46.1802894Z WARNING:tritonbench.utils.triton_op:Running input ID 2:
2026-02-21T12:37:46.1803383Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T12:37:46.1803684Z ------------------------------------------
2026-02-21T12:37:46.1803953Z (4, 48, 512, 512, 128)
2026-02-21T12:37:46.1804967Z INFO:tritonbench.utils.triton_op:Took 0.08ms to get benchmark function for aten
2026-02-21T12:37:47.2958230Z INFO:tritonbench.utils.triton_op:Took 2.51ms to get benchmark function for flex_attention
2026-02-21T12:37:48.6761114Z WARNING:__main__:Input tensor metadata:
2026-02-21T12:37:48.6761559Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T12:37:48.6761886Z               'dtype': 'torch.bfloat16',
2026-02-21T12:37:48.6762210Z               'shape': (4, 48, 512, 128),
2026-02-21T12:37:48.6762535Z               'stride': (3145728, 65536, 128, 1)},
2026-02-21T12:37:48.6762969Z             { 'device': 'cuda:0',
2026-02-21T12:37:48.6763276Z               'dtype': 'torch.bfloat16',
2026-02-21T12:37:48.6763579Z               'shape': (4, 48, 512, 128),
2026-02-21T12:37:48.6763905Z               'stride': (3145728, 65536, 128, 1)},
2026-02-21T12:37:48.6764224Z             { 'device': 'cuda:0',
2026-02-21T12:37:48.6764512Z               'dtype': 'torch.bfloat16',
2026-02-21T12:37:48.6764811Z               'shape': (4, 48, 512, 128),
2026-02-21T12:37:48.6765127Z               'stride': (3145728, 65536, 128, 1)}),
2026-02-21T12:37:48.6765437Z   'kwargs': {}}
2026-02-21T12:37:48.6795468Z INFO:tritonbench.utils.triton_op:Took 3.84ms to get benchmark function for helion_attention
2026-02-21T12:37:48.9337947Z [0s] Autotune random seed: 2150287535
2026-02-21T12:37:48.9641749Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T12:38:23.2682115Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 64, 512], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[1, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:38:28.8678999Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[2, 1], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T12:38:31.2610902Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 4, 512], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, None], range_num_stages=[0, 3], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:38:31.3795442Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:38:32.4741520Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 8, 512], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[0, 2], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T12:38:32.6952167Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:38:32.6978204Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s
2026-02-21T12:38:38.1819582Z /tmp/torchinductor_root/ne/cnewmtkcvegrxtiokl5mr4xyshhjpytfcvozpnfvt7sf7g5ytajx.py:93:133: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T12:38:38.1820795Z             v = tl.load(v_view + (indices_0[:, None, None] * 65536 + indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None)
2026-02-21T12:38:38.1821532Z                                                                                                                                     ^
2026-02-21T12:38:38.1823489Z /tmp/torchinductor_root/ne/cnewmtkcvegrxtiokl5mr4xyshhjpytfcvozpnfvt7sf7g5ytajx.py:97:144: note: - use: %121 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x256x128xbf16, #ttg.blocked<{sizePerThread = [1, 1, 8], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 2, 1], order = [2, 0, 1]}>>) -> tensor<256x128xbf16, #ttg.linear<{register = [[0, 1], [0, 2], [0, 4], [8, 0], [16, 0], [32, 0], [64, 0], [128, 0]], lane = [[0, 8], [0, 16], [0, 32], [0, 64], [1, 0], [2, 0]], warp = [[4, 0]], block = []}>>
2026-02-21T12:38:38.1825445Z 
2026-02-21T12:38:38.1826345Z             acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128])
2026-02-21T12:38:38.1827661Z                                                                                                                                                ^
2026-02-21T12:38:38.1828126Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T12:38:38.1854412Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}>
2026-02-21T12:38:38.1854989Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T12:38:38.1855514Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T12:38:38.1856005Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T12:38:38.1856481Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}>
2026-02-21T12:38:38.1856947Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}>
2026-02-21T12:38:38.1857557Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [0, 1, 2]}>
2026-02-21T12:38:38.1858070Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 2, 1], order = [0, 1, 2]}>
2026-02-21T12:38:38.1858583Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:38:38.1859099Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:38:38.1859749Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T12:38:38.1860607Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T12:38:38.1861268Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T12:38:38.1861474Z     %c65536_i32 = arith.constant 65536 : i32
2026-02-21T12:38:38.1861675Z     %c256_i32 = arith.constant 256 : i32
2026-02-21T12:38:38.1861859Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T12:38:38.1862049Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T12:38:38.1862241Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T12:38:38.1862429Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T12:38:38.1862672Z     %cst = arith.constant dense<128> : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1862993Z     %cst_0 = arith.constant dense<0.127517432> : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1863325Z     %cst_1 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1863645Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.1863959Z     %cst_3 = arith.constant dense<128> : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1864275Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1864598Z     %cst_5 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1864911Z     %cst_6 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1865200Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T12:38:38.1865388Z     %c192_i32 = arith.constant 192 : i32
2026-02-21T12:38:38.1865591Z     %c98304_i32 = arith.constant 98304 : i32
2026-02-21T12:38:38.1865757Z     %c21_i32 = arith.constant 21 : i32
2026-02-21T12:38:38.1865920Z     %0 = tt.get_program_id x : i32
2026-02-21T12:38:38.1866078Z     %1 = arith.muli %0, %c21_i32 : i32
2026-02-21T12:38:38.1866234Z     %2 = arith.addi %1, %c21_i32 : i32
2026-02-21T12:38:38.1866391Z     %3 = arith.minsi %2, %c98304_i32 : i32
2026-02-21T12:38:38.1866652Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked4>
2026-02-21T12:38:38.1867039Z     %5 = ttg.convert_layout %4 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.1867503Z     %6 = tt.expand_dims %5 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x128xi32, #blocked5>
2026-02-21T12:38:38.1867913Z     %7 = ttg.convert_layout %6 : tensor<1x128xi32, #blocked5> -> tensor<1x128xi32, #blocked3>
2026-02-21T12:38:38.1868313Z     %8 = ttg.convert_layout %7 : tensor<1x128xi32, #blocked3> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.1868786Z     %9 = tt.expand_dims %8 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x128xi32, #blocked6>
2026-02-21T12:38:38.1869210Z     %10 = ttg.convert_layout %9 : tensor<1x1x128xi32, #blocked6> -> tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.1869541Z     %11 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1869864Z     %12 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1870247Z     %13 = ttg.convert_layout %7 : tensor<1x128xi32, #blocked3> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.1870722Z     %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x128x1xi32, #blocked7>
2026-02-21T12:38:38.1871147Z     %15 = ttg.convert_layout %14 : tensor<1x128x1xi32, #blocked7> -> tensor<1x128x1xi32, #blocked>
2026-02-21T12:38:38.1871474Z     %16 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1871824Z     %17 = tt.broadcast %10 : tensor<1x1x128xi32, #blocked1> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1872163Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1872456Z     %19 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1872683Z     %20 = arith.subi %3, %1 : i32
2026-02-21T12:38:38.1872847Z     %c1_i32_7 = arith.constant 1 : i32
2026-02-21T12:38:38.1873021Z     %21 = arith.subi %c1_i32, %c1_i32_7 : i32
2026-02-21T12:38:38.1873187Z     %22 = arith.addi %20, %21 : i32
2026-02-21T12:38:38.1873348Z     %23 = arith.divui %22, %c1_i32 : i32
2026-02-21T12:38:38.1873510Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T12:38:38.1873655Z     %24 = arith.remsi %23, %c2_i32 : i32
2026-02-21T12:38:38.1873777Z     %25 = arith.subi %23, %24 : i32
2026-02-21T12:38:38.1873898Z     %26 = arith.muli %25, %c1_i32 : i32
2026-02-21T12:38:38.1874020Z     %27 = arith.addi %1, %26 : i32
2026-02-21T12:38:38.1874146Z     %28 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T12:38:38.1874287Z     scf.for %arg4 = %1 to %27 step %28  : i32 {
2026-02-21T12:38:38.1874429Z       %29 = arith.divsi %arg4, %c8192_i32 : i32
2026-02-21T12:38:38.1874565Z       %30 = arith.muli %29, %c16_i32 : i32
2026-02-21T12:38:38.1874692Z       %31 = arith.subi %c192_i32, %30 : i32
2026-02-21T12:38:38.1874821Z       %32 = arith.minsi %31, %c16_i32 : i32
2026-02-21T12:38:38.1874950Z       %33 = arith.remsi %arg4, %c8192_i32 : i32
2026-02-21T12:38:38.1875080Z       %34 = arith.remsi %33, %32 : i32
2026-02-21T12:38:38.1875204Z       %35 = arith.addi %30, %34 : i32
2026-02-21T12:38:38.1875343Z       %36 = arith.divsi %33, %32 : i32
2026-02-21T12:38:38.1875469Z       %37 = arith.muli %35, %c65536_i32 : i32
2026-02-21T12:38:38.1875596Z       %38 = arith.muli %36, %c128_i32 : i32
2026-02-21T12:38:38.1875723Z       %39 = arith.addi %37, %38 : i32
2026-02-21T12:38:38.1875871Z       %40 = tt.splat %39 : i32 -> tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.1876050Z       %41 = arith.addi %40, %10 : tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.1876280Z       %42 = tt.addptr %11, %41 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.1876537Z       %43 = tt.load %42 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1876716Z       %44 = tt.splat %37 : i32 -> tensor<1x128x1xi32, #blocked>
2026-02-21T12:38:38.1876886Z       %45 = arith.addi %44, %15 : tensor<1x128x1xi32, #blocked>
2026-02-21T12:38:38.1877104Z       %46 = tt.broadcast %45 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x256xi32, #blocked>
2026-02-21T12:38:38.1877382Z       %47 = ttg.convert_layout %46 : tensor<1x128x256xi32, #blocked> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1877662Z       %48 = tt.reshape %43 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3>
2026-02-21T12:38:38.1877876Z       %49 = tt.splat %37 : i32 -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1878026Z       %c0_i32_8 = arith.constant 0 : i32
2026-02-21T12:38:38.1878162Z       %c1024_i32 = arith.constant 1024 : i32
2026-02-21T12:38:38.1878540Z       %50:3 = scf.for %arg5 = %c0_i32 to %c0_i32_8 step %c1024_i32 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T12:38:38.1878954Z         %93 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1879127Z         %94 = arith.addi %93, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1879386Z         %95 = ttg.convert_layout %94 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.1879757Z         %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.1880078Z         %97 = ttg.convert_layout %96 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.1880419Z         %98 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.1880795Z         %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.1881134Z         %100 = ttg.convert_layout %99 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1881375Z         %101 = arith.muli %100, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1881606Z         %102 = tt.broadcast %101 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1881841Z         %103 = arith.addi %47, %102 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1882087Z         %104 = tt.addptr %16, %103 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1882333Z         %105 = tt.load %104 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1882638Z         %106 = tt.reshape %105 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.1882970Z         %107 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.1883333Z         %108 = ttg.convert_layout %106 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.1883641Z         %109 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.1884051Z         %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.1884463Z         %111 = tt.reshape %110 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1884705Z         %112 = arith.truncf %111 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1884946Z         %113 = arith.extf %112 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1885156Z         %114 = "tt.reduce"(%113) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.1885284Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.1885405Z           %391 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.1885532Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.1885722Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.1886016Z         %115 = ttg.convert_layout %114 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1886284Z         %116 = arith.truncf %115 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.1886504Z         %117 = arith.extf %116 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1886698Z         %118 = arith.mulf %117, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1886905Z         %119 = arith.truncf %118 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.1887127Z         %120 = arith.extf %119 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1887355Z         %121 = arith.cmpf ogt, %arg6, %120 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1887523Z         %122 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1887690Z         %123 = arith.ori %121, %122 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.1887883Z         %124 = arith.select %123, %arg6, %120 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1888092Z         %125 = arith.mulf %113, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1888298Z         %126 = arith.truncf %125 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1888601Z         %127 = ttg.convert_layout %124 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.1888939Z         %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.1889235Z         %129 = ttg.convert_layout %128 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.1889478Z         %130 = arith.extf %126 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1889719Z         %131 = tt.broadcast %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.1889967Z         %132 = ttg.convert_layout %131 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1890183Z         %133 = arith.subf %130, %132 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1890489Z         %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1890782Z         %135 = "tt.reduce"(%134) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.1890908Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.1891026Z           %391 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.1891148Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.1891333Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.1891622Z         %136 = ttg.convert_layout %135 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1891881Z         %137 = arith.subf %arg6, %124 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1892165Z         %138 = tt.extern_elementwise %137 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1892453Z         %139 = arith.mulf %arg7, %138 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1892611Z         %140 = arith.addf %139, %136 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1892869Z         %141 = ttg.convert_layout %138 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.1893203Z         %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.1893495Z         %143 = ttg.convert_layout %142 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.1893738Z         %144 = tt.broadcast %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.1893983Z         %145 = ttg.convert_layout %144 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1894199Z         %146 = arith.mulf %arg8, %145 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1894450Z         %147 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.1894793Z         %148 = tt.expand_dims %147 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.1895115Z         %149 = ttg.convert_layout %148 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1895326Z         %150 = arith.muli %149, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1895489Z         %151 = arith.addi %49, %150 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1895689Z         %152 = tt.broadcast %151 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.1895943Z         %153 = ttg.convert_layout %152 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1896162Z         %154 = arith.addi %153, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1896394Z         %155 = tt.addptr %18, %154 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1896620Z         %156 = tt.load %155 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1896828Z         %157 = arith.truncf %134 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1897067Z         %158 = tt.reshape %146 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1897301Z         %159 = tt.reshape %157 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.1897540Z         %160 = tt.reshape %156 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.1897842Z         %161 = ttg.convert_layout %159 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.1898199Z         %162 = ttg.convert_layout %160 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.1898500Z         %163 = ttg.convert_layout %158 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1898909Z         %164 = tt.dot %161, %162, %163, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1899308Z         %165 = tt.reshape %164 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1899490Z         %c1_i32_12 = arith.constant 1 : i32
2026-02-21T12:38:38.1899637Z         %166 = arith.muli %c256_i32, %c1_i32_12 : i32
2026-02-21T12:38:38.1899763Z         %167 = arith.addi %arg5, %166 : i32
2026-02-21T12:38:38.1899900Z         %168 = tt.splat %167 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1900056Z         %169 = arith.addi %168, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1900297Z         %170 = ttg.convert_layout %169 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.1900651Z         %171 = tt.expand_dims %170 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.1900941Z         %172 = ttg.convert_layout %171 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.1901230Z         %173 = ttg.convert_layout %172 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.1901570Z         %174 = tt.expand_dims %173 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.1901878Z         %175 = ttg.convert_layout %174 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1902093Z         %176 = arith.muli %175, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1902297Z         %177 = tt.broadcast %176 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1902510Z         %178 = arith.addi %47, %177 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1902724Z         %179 = tt.addptr %16, %178 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1902960Z         %180 = tt.load %179 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1903167Z         %181 = tt.reshape %180 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.1903463Z         %182 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.1903819Z         %183 = ttg.convert_layout %181 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.1911708Z         %184 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.1912126Z         %185 = tt.dot %182, %183, %184, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.1912525Z         %186 = tt.reshape %185 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1912764Z         %187 = arith.truncf %186 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1913006Z         %188 = arith.extf %187 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1913200Z         %189 = "tt.reduce"(%188) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.1913325Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.1913447Z           %391 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.1913571Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.1913764Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.1914054Z         %190 = ttg.convert_layout %189 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1914328Z         %191 = arith.truncf %190 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.1914551Z         %192 = arith.extf %191 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1914742Z         %193 = arith.mulf %192, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1914931Z         %194 = arith.truncf %193 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.1915169Z         %195 = arith.extf %194 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1915366Z         %196 = arith.cmpf ogt, %124, %195 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1915533Z         %197 = arith.cmpf une, %124, %124 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1915692Z         %198 = arith.ori %196, %197 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.1915883Z         %199 = arith.select %198, %124, %195 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1916105Z         %200 = arith.mulf %188, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1916312Z         %201 = arith.truncf %200 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1916595Z         %202 = ttg.convert_layout %199 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.1916928Z         %203 = tt.expand_dims %202 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.1917225Z         %204 = ttg.convert_layout %203 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.1917469Z         %205 = arith.extf %201 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1917707Z         %206 = tt.broadcast %204 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.1917957Z         %207 = ttg.convert_layout %206 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1918187Z         %208 = arith.subf %205, %207 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1918490Z         %209 = tt.extern_elementwise %208 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1918777Z         %210 = "tt.reduce"(%209) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.1918906Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.1919023Z           %391 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.1919145Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.1919334Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.1919638Z         %211 = ttg.convert_layout %210 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1919881Z         %212 = arith.subf %124, %199 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1920165Z         %213 = tt.extern_elementwise %212 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1920452Z         %214 = arith.mulf %140, %213 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1920609Z         %215 = arith.addf %214, %211 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1920848Z         %216 = ttg.convert_layout %213 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.1921182Z         %217 = tt.expand_dims %216 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.1921475Z         %218 = ttg.convert_layout %217 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.1921715Z         %219 = tt.broadcast %218 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.1921961Z         %220 = ttg.convert_layout %219 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1922172Z         %221 = arith.mulf %165, %220 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1922420Z         %222 = ttg.convert_layout %172 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.1922815Z         %223 = tt.expand_dims %222 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.1923120Z         %224 = ttg.convert_layout %223 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1923331Z         %225 = arith.muli %224, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1923493Z         %226 = arith.addi %49, %225 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1923692Z         %227 = tt.broadcast %226 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.1923973Z         %228 = ttg.convert_layout %227 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1924194Z         %229 = arith.addi %228, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1924416Z         %230 = tt.addptr %18, %229 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1924637Z         %231 = tt.load %230 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1924845Z         %232 = arith.truncf %209 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1925082Z         %233 = tt.reshape %221 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1925320Z         %234 = tt.reshape %232 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.1925562Z         %235 = tt.reshape %231 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.1925861Z         %236 = ttg.convert_layout %234 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.1926240Z         %237 = ttg.convert_layout %235 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.1926543Z         %238 = ttg.convert_layout %233 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1926948Z         %239 = tt.dot %236, %237, %238, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1927367Z         %240 = tt.reshape %239 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1927545Z         %c2_i32_13 = arith.constant 2 : i32
2026-02-21T12:38:38.1927677Z         %241 = arith.muli %c256_i32, %c2_i32_13 : i32
2026-02-21T12:38:38.1927805Z         %242 = arith.addi %arg5, %241 : i32
2026-02-21T12:38:38.1927943Z         %243 = tt.splat %242 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1928101Z         %244 = arith.addi %243, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1928340Z         %245 = ttg.convert_layout %244 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.1928673Z         %246 = tt.expand_dims %245 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.1928964Z         %247 = ttg.convert_layout %246 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.1929254Z         %248 = ttg.convert_layout %247 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.1929598Z         %249 = tt.expand_dims %248 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.1929905Z         %250 = ttg.convert_layout %249 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1930122Z         %251 = arith.muli %250, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1930326Z         %252 = tt.broadcast %251 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1930535Z         %253 = arith.addi %47, %252 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1930797Z         %254 = tt.addptr %16, %253 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1931018Z         %255 = tt.load %254 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1931228Z         %256 = tt.reshape %255 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.1931525Z         %257 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.1931898Z         %258 = ttg.convert_layout %256 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.1932203Z         %259 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.1932605Z         %260 = tt.dot %257, %258, %259, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.1932999Z         %261 = tt.reshape %260 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1933239Z         %262 = arith.truncf %261 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1933481Z         %263 = arith.extf %262 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1933673Z         %264 = "tt.reduce"(%263) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.1933797Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.1933945Z           %391 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.1934070Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.1934256Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.1934549Z         %265 = ttg.convert_layout %264 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1934823Z         %266 = arith.truncf %265 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.1935049Z         %267 = arith.extf %266 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1935267Z         %268 = arith.mulf %267, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1935458Z         %269 = arith.truncf %268 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.1935684Z         %270 = arith.extf %269 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1935881Z         %271 = arith.cmpf ogt, %199, %270 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1936054Z         %272 = arith.cmpf une, %199, %199 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1936217Z         %273 = arith.ori %271, %272 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.1936415Z         %274 = arith.select %273, %199, %270 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1936628Z         %275 = arith.mulf %263, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1936837Z         %276 = arith.truncf %275 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1937132Z         %277 = ttg.convert_layout %274 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.1937470Z         %278 = tt.expand_dims %277 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.1937775Z         %279 = ttg.convert_layout %278 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.1938025Z         %280 = arith.extf %276 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1938268Z         %281 = tt.broadcast %279 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.1938540Z         %282 = ttg.convert_layout %281 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1938755Z         %283 = arith.subf %280, %282 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1939066Z         %284 = tt.extern_elementwise %283 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1939361Z         %285 = "tt.reduce"(%284) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.1939506Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.1939632Z           %391 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.1939756Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.1939949Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.1940243Z         %286 = ttg.convert_layout %285 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1940485Z         %287 = arith.subf %199, %274 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1940776Z         %288 = tt.extern_elementwise %287 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1941066Z         %289 = arith.mulf %215, %288 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1941228Z         %290 = arith.addf %289, %286 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1941474Z         %291 = ttg.convert_layout %288 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.1941825Z         %292 = tt.expand_dims %291 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.1942127Z         %293 = ttg.convert_layout %292 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.1942370Z         %294 = tt.broadcast %293 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.1942623Z         %295 = ttg.convert_layout %294 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1942844Z         %296 = arith.mulf %240, %295 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1943112Z         %297 = ttg.convert_layout %247 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.1943461Z         %298 = tt.expand_dims %297 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.1943771Z         %299 = ttg.convert_layout %298 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1943990Z         %300 = arith.muli %299, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1944160Z         %301 = arith.addi %49, %300 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1944360Z         %302 = tt.broadcast %301 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.1944623Z         %303 = ttg.convert_layout %302 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1944844Z         %304 = arith.addi %303, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1945068Z         %305 = tt.addptr %18, %304 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1945297Z         %306 = tt.load %305 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1945509Z         %307 = arith.truncf %284 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1945758Z         %308 = tt.reshape %296 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1945991Z         %309 = tt.reshape %307 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.1946235Z         %310 = tt.reshape %306 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.1946559Z         %311 = ttg.convert_layout %309 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.1946915Z         %312 = ttg.convert_layout %310 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.1947223Z         %313 = ttg.convert_layout %308 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1947650Z         %314 = tt.dot %311, %312, %313, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1948051Z         %315 = tt.reshape %314 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1948239Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T12:38:38.1948366Z         %316 = arith.muli %c256_i32, %c3_i32 : i32
2026-02-21T12:38:38.1948497Z         %317 = arith.addi %arg5, %316 : i32
2026-02-21T12:38:38.1948637Z         %318 = tt.splat %317 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1948799Z         %319 = arith.addi %318, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1949045Z         %320 = ttg.convert_layout %319 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.1949379Z         %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.1949680Z         %322 = ttg.convert_layout %321 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.1949984Z         %323 = ttg.convert_layout %322 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.1950332Z         %324 = tt.expand_dims %323 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.1950645Z         %325 = ttg.convert_layout %324 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1950862Z         %326 = arith.muli %325, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1951093Z         %327 = tt.broadcast %326 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1951304Z         %328 = arith.addi %47, %327 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1951526Z         %329 = tt.addptr %16, %328 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1951755Z         %330 = tt.load %329 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1951965Z         %331 = tt.reshape %330 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.1952270Z         %332 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.1952630Z         %333 = ttg.convert_layout %331 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.1952946Z         %334 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.1953360Z         %335 = tt.dot %332, %333, %334, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.1953758Z         %336 = tt.reshape %335 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1954004Z         %337 = arith.truncf %336 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1954252Z         %338 = arith.extf %337 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1954481Z         %339 = "tt.reduce"(%338) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.1954615Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.1954738Z           %391 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.1954868Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.1955058Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.1955357Z         %340 = ttg.convert_layout %339 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1955650Z         %341 = arith.truncf %340 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.1955874Z         %342 = arith.extf %341 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1956075Z         %343 = arith.mulf %342, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1956269Z         %344 = arith.truncf %343 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.1956495Z         %345 = arith.extf %344 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1956694Z         %346 = arith.cmpf ogt, %274, %345 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1956862Z         %347 = arith.cmpf une, %274, %274 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1957028Z         %348 = arith.ori %346, %347 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.1957221Z         %349 = arith.select %348, %274, %345 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1957432Z         %350 = arith.mulf %338, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1957656Z         %351 = arith.truncf %350 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1957948Z         %352 = ttg.convert_layout %349 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.1958283Z         %353 = tt.expand_dims %352 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.1958580Z         %354 = ttg.convert_layout %353 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.1958828Z         %355 = arith.extf %351 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1959089Z         %356 = tt.broadcast %354 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.1959338Z         %357 = ttg.convert_layout %356 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1959559Z         %358 = arith.subf %355, %357 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1959865Z         %359 = tt.extern_elementwise %358 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1960161Z         %360 = "tt.reduce"(%359) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.1960296Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.1960417Z           %391 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.1960543Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.1960730Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.1961025Z         %361 = ttg.convert_layout %360 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1961267Z         %362 = arith.subf %274, %349 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1961557Z         %363 = tt.extern_elementwise %362 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1961849Z         %364 = arith.mulf %290, %363 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1962007Z         %365 = arith.addf %364, %361 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1962251Z         %366 = ttg.convert_layout %363 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.1962642Z         %367 = tt.expand_dims %366 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.1962945Z         %368 = ttg.convert_layout %367 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.1963193Z         %369 = tt.broadcast %368 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.1963463Z         %370 = ttg.convert_layout %369 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1963681Z         %371 = arith.mulf %315, %370 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1963930Z         %372 = ttg.convert_layout %322 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.1964278Z         %373 = tt.expand_dims %372 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.1964588Z         %374 = ttg.convert_layout %373 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1964800Z         %375 = arith.muli %374, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1964973Z         %376 = arith.addi %49, %375 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1965171Z         %377 = tt.broadcast %376 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.1965438Z         %378 = ttg.convert_layout %377 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1965681Z         %379 = arith.addi %378, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1965901Z         %380 = tt.addptr %18, %379 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1966130Z         %381 = tt.load %380 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1966338Z         %382 = arith.truncf %359 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1966584Z         %383 = tt.reshape %371 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1966823Z         %384 = tt.reshape %382 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.1967083Z         %385 = tt.reshape %381 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.1967393Z         %386 = ttg.convert_layout %384 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.1967753Z         %387 = ttg.convert_layout %385 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.1968063Z         %388 = ttg.convert_layout %383 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1968476Z         %389 = tt.dot %386, %387, %388, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1968871Z         %390 = tt.reshape %389 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1969150Z         scf.yield %349, %365, %390 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1969378Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T12:38:38.1969714Z       %51:3 = scf.for %arg5 = %c0_i32_8 to %c512_i32 step %c256_i32 iter_args(%arg6 = %50#0, %arg7 = %50#1, %arg8 = %50#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T12:38:38.1970062Z         %93 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1970221Z         %94 = arith.addi %93, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1970481Z         %95 = ttg.convert_layout %94 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.1970817Z         %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.1971107Z         %97 = ttg.convert_layout %96 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.1971397Z         %98 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.1971753Z         %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.1972065Z         %100 = ttg.convert_layout %99 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1972286Z         %101 = arith.muli %100, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1972496Z         %102 = tt.broadcast %101 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1972704Z         %103 = arith.addi %47, %102 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1972920Z         %104 = tt.addptr %16, %103 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1973146Z         %105 = tt.load %104 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1973354Z         %106 = tt.reshape %105 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.1973681Z         %107 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.1974035Z         %108 = ttg.convert_layout %106 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.1974338Z         %109 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.1974746Z         %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.1975158Z         %111 = tt.reshape %110 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1975396Z         %112 = arith.truncf %111 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1975641Z         %113 = arith.extf %112 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1975831Z         %114 = "tt.reduce"(%113) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.1975955Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.1976076Z           %166 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.1976200Z           tt.reduce.return %166 : f32
2026-02-21T12:38:38.1976390Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.1976679Z         %115 = ttg.convert_layout %114 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1976951Z         %116 = arith.truncf %115 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.1977174Z         %117 = arith.extf %116 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1977368Z         %118 = arith.mulf %117, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1977563Z         %119 = arith.truncf %118 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.1977782Z         %120 = arith.extf %119 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1977978Z         %121 = arith.cmpf ogt, %arg6, %120 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1978151Z         %122 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1978333Z         %123 = arith.ori %121, %122 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.1978530Z         %124 = arith.select %123, %arg6, %120 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1978735Z         %125 = arith.mulf %113, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1978942Z         %126 = arith.truncf %125 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1979229Z         %127 = ttg.convert_layout %124 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.1979587Z         %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.1979885Z         %129 = ttg.convert_layout %128 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.1980129Z         %130 = arith.extf %126 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1980368Z         %131 = tt.broadcast %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.1980615Z         %132 = ttg.convert_layout %131 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1980827Z         %133 = arith.subf %130, %132 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1981130Z         %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.1981420Z         %135 = "tt.reduce"(%134) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.1981563Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.1981682Z           %166 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.1981802Z           tt.reduce.return %166 : f32
2026-02-21T12:38:38.1981989Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.1982278Z         %136 = ttg.convert_layout %135 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1982520Z         %137 = arith.subf %arg6, %124 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1982830Z         %138 = tt.extern_elementwise %137 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1983123Z         %139 = arith.mulf %arg7, %138 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1983282Z         %140 = arith.addf %139, %136 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.1983519Z         %141 = ttg.convert_layout %138 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.1983851Z         %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.1984147Z         %143 = ttg.convert_layout %142 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.1984387Z         %144 = tt.broadcast %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.1984633Z         %145 = ttg.convert_layout %144 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1984847Z         %146 = arith.mulf %arg8, %145 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1985096Z         %147 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.1985443Z         %148 = tt.expand_dims %147 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.1985745Z         %149 = ttg.convert_layout %148 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1985958Z         %150 = arith.muli %149, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1986134Z         %151 = arith.addi %49, %150 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1986332Z         %152 = tt.broadcast %151 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.1986586Z         %153 = ttg.convert_layout %152 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1986807Z         %154 = arith.addi %153, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1987041Z         %155 = tt.addptr %18, %154 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.1987261Z         %156 = tt.load %155 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1987468Z         %157 = arith.truncf %134 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.1987707Z         %158 = tt.reshape %146 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1987940Z         %159 = tt.reshape %157 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.1988180Z         %160 = tt.reshape %156 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.1988478Z         %161 = ttg.convert_layout %159 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.1988836Z         %162 = ttg.convert_layout %160 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.1989139Z         %163 = ttg.convert_layout %158 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1989559Z         %164 = tt.dot %161, %162, %163, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.1989954Z         %165 = tt.reshape %164 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1990225Z         scf.yield %124, %140, %165 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1990447Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T12:38:38.1990688Z       %52 = ttg.convert_layout %51#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.1991017Z       %53 = tt.expand_dims %52 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.1991309Z       %54 = ttg.convert_layout %53 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.1991544Z       %55 = tt.broadcast %54 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.1991783Z       %56 = ttg.convert_layout %55 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1991994Z       %57 = arith.divf %51#2, %56 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.1992192Z       %58 = arith.truncf %57 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1>
2026-02-21T12:38:38.1992438Z       %59 = tt.addptr %19, %41 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.1992648Z       tt.store %59, %58 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1992792Z       %c1_i32_9 = arith.constant 1 : i32
2026-02-21T12:38:38.1992917Z       %60 = arith.muli %c1_i32, %c1_i32_9 : i32
2026-02-21T12:38:38.1993037Z       %61 = arith.addi %arg4, %60 : i32
2026-02-21T12:38:38.1993157Z       %62 = arith.divsi %61, %c8192_i32 : i32
2026-02-21T12:38:38.1993276Z       %63 = arith.muli %62, %c16_i32 : i32
2026-02-21T12:38:38.1993394Z       %64 = arith.subi %c192_i32, %63 : i32
2026-02-21T12:38:38.1993507Z       %65 = arith.minsi %64, %c16_i32 : i32
2026-02-21T12:38:38.1993625Z       %66 = arith.remsi %61, %c8192_i32 : i32
2026-02-21T12:38:38.1993757Z       %67 = arith.remsi %66, %65 : i32
2026-02-21T12:38:38.1993869Z       %68 = arith.addi %63, %67 : i32
2026-02-21T12:38:38.1993980Z       %69 = arith.divsi %66, %65 : i32
2026-02-21T12:38:38.1994092Z       %70 = arith.muli %68, %c65536_i32 : i32
2026-02-21T12:38:38.1994212Z       %71 = arith.muli %69, %c128_i32 : i32
2026-02-21T12:38:38.1994324Z       %72 = arith.addi %70, %71 : i32
2026-02-21T12:38:38.1994457Z       %73 = tt.splat %72 : i32 -> tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.1994629Z       %74 = arith.addi %73, %10 : tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.1994834Z       %75 = tt.addptr %11, %74 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.1995045Z       %76 = tt.load %75 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.1995201Z       %77 = tt.splat %70 : i32 -> tensor<1x128x1xi32, #blocked>
2026-02-21T12:38:38.1995356Z       %78 = arith.addi %77, %15 : tensor<1x128x1xi32, #blocked>
2026-02-21T12:38:38.1995550Z       %79 = tt.broadcast %78 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x256xi32, #blocked>
2026-02-21T12:38:38.1995797Z       %80 = ttg.convert_layout %79 : tensor<1x128x256xi32, #blocked> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1996043Z       %81 = tt.reshape %76 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3>
2026-02-21T12:38:38.1996237Z       %82 = tt.splat %70 : i32 -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.1996374Z       %c0_i32_10 = arith.constant 0 : i32
2026-02-21T12:38:38.1996500Z       %c1024_i32_11 = arith.constant 1024 : i32
2026-02-21T12:38:38.1996854Z       %83:3 = scf.for %arg5 = %c0_i32 to %c0_i32_10 step %c1024_i32_11 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T12:38:38.1997204Z         %93 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1997362Z         %94 = arith.addi %93, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.1997599Z         %95 = ttg.convert_layout %94 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.1997924Z         %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.1998236Z         %97 = ttg.convert_layout %96 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.1998522Z         %98 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.1998861Z         %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.1999163Z         %100 = ttg.convert_layout %99 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1999376Z         %101 = arith.muli %100, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.1999587Z         %102 = tt.broadcast %101 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.1999794Z         %103 = arith.addi %80, %102 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2000012Z         %104 = tt.addptr %16, %103 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2000232Z         %105 = tt.load %104 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2000441Z         %106 = tt.reshape %105 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.2000739Z         %107 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2001093Z         %108 = ttg.convert_layout %106 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2001412Z         %109 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2001825Z         %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2002222Z         %111 = tt.reshape %110 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2002462Z         %112 = arith.truncf %111 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2002763Z         %113 = arith.extf %112 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2002954Z         %114 = "tt.reduce"(%113) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2003080Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2003199Z           %391 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.2003361Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.2003547Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2003838Z         %115 = ttg.convert_layout %114 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2004111Z         %116 = arith.truncf %115 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2004330Z         %117 = arith.extf %116 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2004526Z         %118 = arith.mulf %117, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2004881Z         %119 = arith.truncf %118 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2005102Z         %120 = arith.extf %119 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2005298Z         %121 = arith.cmpf ogt, %arg6, %120 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2005468Z         %122 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2005632Z         %123 = arith.ori %121, %122 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.2005823Z         %124 = arith.select %123, %arg6, %120 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2006030Z         %125 = arith.mulf %113, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2006260Z         %126 = arith.truncf %125 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2006545Z         %127 = ttg.convert_layout %124 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2006879Z         %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2007172Z         %129 = ttg.convert_layout %128 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2007416Z         %130 = arith.extf %126 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2007654Z         %131 = tt.broadcast %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.2007898Z         %132 = ttg.convert_layout %131 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2008114Z         %133 = arith.subf %130, %132 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2008417Z         %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2008709Z         %135 = "tt.reduce"(%134) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2008836Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2008953Z           %391 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.2009074Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.2009258Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2009563Z         %136 = ttg.convert_layout %135 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2009803Z         %137 = arith.subf %arg6, %124 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2010087Z         %138 = tt.extern_elementwise %137 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2010392Z         %139 = arith.mulf %arg7, %138 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2010550Z         %140 = arith.addf %139, %136 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2010789Z         %141 = ttg.convert_layout %138 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2011121Z         %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2011414Z         %143 = ttg.convert_layout %142 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2011657Z         %144 = tt.broadcast %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2011900Z         %145 = ttg.convert_layout %144 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2012113Z         %146 = arith.mulf %arg8, %145 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2012364Z         %147 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.2012727Z         %148 = tt.expand_dims %147 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.2013032Z         %149 = ttg.convert_layout %148 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2013244Z         %150 = arith.muli %149, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2013406Z         %151 = arith.addi %82, %150 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2013606Z         %152 = tt.broadcast %151 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.2013877Z         %153 = ttg.convert_layout %152 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2014098Z         %154 = arith.addi %153, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2014314Z         %155 = tt.addptr %18, %154 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2014538Z         %156 = tt.load %155 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2014744Z         %157 = arith.truncf %134 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2014984Z         %158 = tt.reshape %146 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2015218Z         %159 = tt.reshape %157 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.2015455Z         %160 = tt.reshape %156 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.2015755Z         %161 = ttg.convert_layout %159 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2016115Z         %162 = ttg.convert_layout %160 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2016415Z         %163 = ttg.convert_layout %158 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2016820Z         %164 = tt.dot %161, %162, %163, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2017223Z         %165 = tt.reshape %164 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2017404Z         %c1_i32_12 = arith.constant 1 : i32
2026-02-21T12:38:38.2017536Z         %166 = arith.muli %c256_i32, %c1_i32_12 : i32
2026-02-21T12:38:38.2017661Z         %167 = arith.addi %arg5, %166 : i32
2026-02-21T12:38:38.2017800Z         %168 = tt.splat %167 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2017955Z         %169 = arith.addi %168, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2018210Z         %170 = ttg.convert_layout %169 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.2018543Z         %171 = tt.expand_dims %170 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.2018833Z         %172 = ttg.convert_layout %171 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.2019123Z         %173 = ttg.convert_layout %172 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.2019465Z         %174 = tt.expand_dims %173 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.2019773Z         %175 = ttg.convert_layout %174 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2019989Z         %176 = arith.muli %175, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2028460Z         %177 = tt.broadcast %176 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2028732Z         %178 = arith.addi %80, %177 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2028964Z         %179 = tt.addptr %16, %178 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2029192Z         %180 = tt.load %179 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2029406Z         %181 = tt.reshape %180 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.2029708Z         %182 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2030085Z         %183 = ttg.convert_layout %181 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2030393Z         %184 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2030808Z         %185 = tt.dot %182, %183, %184, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2031211Z         %186 = tt.reshape %185 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2031453Z         %187 = arith.truncf %186 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2031699Z         %188 = arith.extf %187 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2031892Z         %189 = "tt.reduce"(%188) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2032023Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2032150Z           %391 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.2032277Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.2032470Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2032768Z         %190 = ttg.convert_layout %189 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2033042Z         %191 = arith.truncf %190 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2033262Z         %192 = arith.extf %191 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2033471Z         %193 = arith.mulf %192, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2033658Z         %194 = arith.truncf %193 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2033878Z         %195 = arith.extf %194 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2034073Z         %196 = arith.cmpf ogt, %124, %195 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2034238Z         %197 = arith.cmpf une, %124, %124 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2034420Z         %198 = arith.ori %196, %197 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.2034611Z         %199 = arith.select %198, %124, %195 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2034816Z         %200 = arith.mulf %188, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2035020Z         %201 = arith.truncf %200 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2035311Z         %202 = ttg.convert_layout %199 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2035650Z         %203 = tt.expand_dims %202 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2035945Z         %204 = ttg.convert_layout %203 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2036192Z         %205 = arith.extf %201 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2036429Z         %206 = tt.broadcast %204 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.2036696Z         %207 = ttg.convert_layout %206 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2036912Z         %208 = arith.subf %205, %207 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2037215Z         %209 = tt.extern_elementwise %208 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2037510Z         %210 = "tt.reduce"(%209) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2037636Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2037757Z           %391 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.2037895Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.2038082Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2038378Z         %211 = ttg.convert_layout %210 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2038618Z         %212 = arith.subf %124, %199 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2038904Z         %213 = tt.extern_elementwise %212 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2039191Z         %214 = arith.mulf %140, %213 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2039347Z         %215 = arith.addf %214, %211 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2039588Z         %216 = ttg.convert_layout %213 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2039920Z         %217 = tt.expand_dims %216 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2040217Z         %218 = ttg.convert_layout %217 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2040459Z         %219 = tt.broadcast %218 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2040703Z         %220 = ttg.convert_layout %219 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2040916Z         %221 = arith.mulf %165, %220 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2041178Z         %222 = ttg.convert_layout %172 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.2041525Z         %223 = tt.expand_dims %222 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.2041833Z         %224 = ttg.convert_layout %223 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2042043Z         %225 = arith.muli %224, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2042225Z         %226 = arith.addi %82, %225 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2042425Z         %227 = tt.broadcast %226 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.2042734Z         %228 = ttg.convert_layout %227 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2042955Z         %229 = arith.addi %228, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2043172Z         %230 = tt.addptr %18, %229 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2043397Z         %231 = tt.load %230 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2043602Z         %232 = arith.truncf %209 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2043846Z         %233 = tt.reshape %221 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2044082Z         %234 = tt.reshape %232 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.2044320Z         %235 = tt.reshape %231 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.2044642Z         %236 = ttg.convert_layout %234 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2044997Z         %237 = ttg.convert_layout %235 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2045299Z         %238 = ttg.convert_layout %233 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2045724Z         %239 = tt.dot %236, %237, %238, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2046119Z         %240 = tt.reshape %239 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2046302Z         %c2_i32_13 = arith.constant 2 : i32
2026-02-21T12:38:38.2046430Z         %241 = arith.muli %c256_i32, %c2_i32_13 : i32
2026-02-21T12:38:38.2046561Z         %242 = arith.addi %arg5, %241 : i32
2026-02-21T12:38:38.2046700Z         %243 = tt.splat %242 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2046855Z         %244 = arith.addi %243, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2047099Z         %245 = ttg.convert_layout %244 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.2047430Z         %246 = tt.expand_dims %245 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.2047727Z         %247 = ttg.convert_layout %246 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.2048016Z         %248 = ttg.convert_layout %247 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.2048362Z         %249 = tt.expand_dims %248 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.2048671Z         %250 = ttg.convert_layout %249 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2048888Z         %251 = arith.muli %250, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2049113Z         %252 = tt.broadcast %251 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2049321Z         %253 = arith.addi %80, %252 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2049534Z         %254 = tt.addptr %16, %253 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2049760Z         %255 = tt.load %254 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2049965Z         %256 = tt.reshape %255 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.2050285Z         %257 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2050644Z         %258 = ttg.convert_layout %256 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2050949Z         %259 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2051359Z         %260 = tt.dot %257, %258, %259, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2051753Z         %261 = tt.reshape %260 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2051993Z         %262 = arith.truncf %261 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2052238Z         %263 = arith.extf %262 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2052449Z         %264 = "tt.reduce"(%263) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2052578Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2052697Z           %391 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.2052823Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.2053011Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2053300Z         %265 = ttg.convert_layout %264 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2053572Z         %266 = arith.truncf %265 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2053807Z         %267 = arith.extf %266 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2054003Z         %268 = arith.mulf %267, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2054193Z         %269 = arith.truncf %268 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2054412Z         %270 = arith.extf %269 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2054604Z         %271 = arith.cmpf ogt, %199, %270 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2054763Z         %272 = arith.cmpf une, %199, %199 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2054924Z         %273 = arith.ori %271, %272 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.2055114Z         %274 = arith.select %273, %199, %270 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2055314Z         %275 = arith.mulf %263, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2055520Z         %276 = arith.truncf %275 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2055802Z         %277 = ttg.convert_layout %274 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2056136Z         %278 = tt.expand_dims %277 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2056432Z         %279 = ttg.convert_layout %278 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2056672Z         %280 = arith.extf %276 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2056938Z         %281 = tt.broadcast %279 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.2057180Z         %282 = ttg.convert_layout %281 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2057392Z         %283 = arith.subf %280, %282 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2057695Z         %284 = tt.extern_elementwise %283 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2058001Z         %285 = "tt.reduce"(%284) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2058128Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2058244Z           %391 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.2058363Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.2058545Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2058834Z         %286 = ttg.convert_layout %285 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2059073Z         %287 = arith.subf %199, %274 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2059355Z         %288 = tt.extern_elementwise %287 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2059641Z         %289 = arith.mulf %215, %288 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2059795Z         %290 = arith.addf %289, %286 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2060048Z         %291 = ttg.convert_layout %288 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2060379Z         %292 = tt.expand_dims %291 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2060671Z         %293 = ttg.convert_layout %292 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2060912Z         %294 = tt.broadcast %293 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2061154Z         %295 = ttg.convert_layout %294 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2061384Z         %296 = arith.mulf %240, %295 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2061632Z         %297 = ttg.convert_layout %247 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.2061974Z         %298 = tt.expand_dims %297 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.2062280Z         %299 = ttg.convert_layout %298 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2062491Z         %300 = arith.muli %299, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2062654Z         %301 = arith.addi %82, %300 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2062853Z         %302 = tt.broadcast %301 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.2063106Z         %303 = ttg.convert_layout %302 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2063327Z         %304 = arith.addi %303, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2063542Z         %305 = tt.addptr %18, %304 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2063768Z         %306 = tt.load %305 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2063975Z         %307 = arith.truncf %284 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2064214Z         %308 = tt.reshape %296 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2064448Z         %309 = tt.reshape %307 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.2064703Z         %310 = tt.reshape %306 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.2065002Z         %311 = ttg.convert_layout %309 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2065359Z         %312 = ttg.convert_layout %310 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2065678Z         %313 = ttg.convert_layout %308 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2066084Z         %314 = tt.dot %311, %312, %313, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2066476Z         %315 = tt.reshape %314 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2066656Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T12:38:38.2066779Z         %316 = arith.muli %c256_i32, %c3_i32 : i32
2026-02-21T12:38:38.2066900Z         %317 = arith.addi %arg5, %316 : i32
2026-02-21T12:38:38.2067033Z         %318 = tt.splat %317 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2067186Z         %319 = arith.addi %318, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2067424Z         %320 = ttg.convert_layout %319 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.2067776Z         %321 = tt.expand_dims %320 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.2068067Z         %322 = ttg.convert_layout %321 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.2068354Z         %323 = ttg.convert_layout %322 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.2068696Z         %324 = tt.expand_dims %323 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.2069003Z         %325 = ttg.convert_layout %324 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2069234Z         %326 = arith.muli %325, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2069442Z         %327 = tt.broadcast %326 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2069647Z         %328 = arith.addi %80, %327 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2069862Z         %329 = tt.addptr %16, %328 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2070082Z         %330 = tt.load %329 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2070286Z         %331 = tt.reshape %330 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.2070582Z         %332 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2070935Z         %333 = ttg.convert_layout %331 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2071238Z         %334 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2071642Z         %335 = tt.dot %332, %333, %334, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2072032Z         %336 = tt.reshape %335 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2072267Z         %337 = arith.truncf %336 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2072526Z         %338 = arith.extf %337 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2072712Z         %339 = "tt.reduce"(%338) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2072834Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2072953Z           %391 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.2073073Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.2073256Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2073559Z         %340 = ttg.convert_layout %339 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2073829Z         %341 = arith.truncf %340 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2074052Z         %342 = arith.extf %341 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2074245Z         %343 = arith.mulf %342, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2074433Z         %344 = arith.truncf %343 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2074648Z         %345 = arith.extf %344 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2074839Z         %346 = arith.cmpf ogt, %274, %345 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2075003Z         %347 = arith.cmpf une, %274, %274 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2075162Z         %348 = arith.ori %346, %347 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.2075352Z         %349 = arith.select %348, %274, %345 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2075569Z         %350 = arith.mulf %338, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2075772Z         %351 = arith.truncf %350 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2076054Z         %352 = ttg.convert_layout %349 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2076386Z         %353 = tt.expand_dims %352 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2076681Z         %354 = ttg.convert_layout %353 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2076938Z         %355 = arith.extf %351 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2077172Z         %356 = tt.broadcast %354 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.2077416Z         %357 = ttg.convert_layout %356 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2077625Z         %358 = arith.subf %355, %357 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2077922Z         %359 = tt.extern_elementwise %358 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2078209Z         %360 = "tt.reduce"(%359) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2078332Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2078447Z           %391 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.2078566Z           tt.reduce.return %391 : f32
2026-02-21T12:38:38.2078753Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2079040Z         %361 = ttg.convert_layout %360 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2079275Z         %362 = arith.subf %274, %349 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2079557Z         %363 = tt.extern_elementwise %362 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2079839Z         %364 = arith.mulf %290, %363 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2080007Z         %365 = arith.addf %364, %361 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2080243Z         %366 = ttg.convert_layout %363 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2080575Z         %367 = tt.expand_dims %366 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2080864Z         %368 = ttg.convert_layout %367 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2081120Z         %369 = tt.broadcast %368 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2081363Z         %370 = ttg.convert_layout %369 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2081573Z         %371 = arith.mulf %315, %370 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2081819Z         %372 = ttg.convert_layout %322 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.2082159Z         %373 = tt.expand_dims %372 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.2082459Z         %374 = ttg.convert_layout %373 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2082715Z         %375 = arith.muli %374, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2082879Z         %376 = arith.addi %82, %375 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2083076Z         %377 = tt.broadcast %376 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.2083347Z         %378 = ttg.convert_layout %377 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2083562Z         %379 = arith.addi %378, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2083776Z         %380 = tt.addptr %18, %379 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2083995Z         %381 = tt.load %380 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2084198Z         %382 = arith.truncf %359 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2084462Z         %383 = tt.reshape %371 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2084942Z         %384 = tt.reshape %382 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.2085225Z         %385 = tt.reshape %381 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.2085529Z         %386 = ttg.convert_layout %384 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2085887Z         %387 = ttg.convert_layout %385 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2086188Z         %388 = ttg.convert_layout %383 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2086593Z         %389 = tt.dot %386, %387, %388, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2086990Z         %390 = tt.reshape %389 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2087258Z         scf.yield %349, %365, %390 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2087477Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T12:38:38.2087813Z       %84:3 = scf.for %arg5 = %c0_i32_10 to %c512_i32 step %c256_i32 iter_args(%arg6 = %83#0, %arg7 = %83#1, %arg8 = %83#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T12:38:38.2088238Z         %93 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2088394Z         %94 = arith.addi %93, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2088627Z         %95 = ttg.convert_layout %94 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.2088956Z         %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.2089270Z         %97 = ttg.convert_layout %96 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.2089555Z         %98 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.2089890Z         %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.2090190Z         %100 = ttg.convert_layout %99 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2090406Z         %101 = arith.muli %100, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2090614Z         %102 = tt.broadcast %101 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2090820Z         %103 = arith.addi %80, %102 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2091039Z         %104 = tt.addptr %16, %103 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2091260Z         %105 = tt.load %104 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2091485Z         %106 = tt.reshape %105 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.2091784Z         %107 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2092135Z         %108 = ttg.convert_layout %106 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2092441Z         %109 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2092870Z         %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2093262Z         %111 = tt.reshape %110 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2093501Z         %112 = arith.truncf %111 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2093741Z         %113 = arith.extf %112 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2093930Z         %114 = "tt.reduce"(%113) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2094056Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2094175Z           %166 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.2094299Z           tt.reduce.return %166 : f32
2026-02-21T12:38:38.2094486Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2094776Z         %115 = ttg.convert_layout %114 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2095043Z         %116 = arith.truncf %115 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2095265Z         %117 = arith.extf %116 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2095458Z         %118 = arith.mulf %117, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2095647Z         %119 = arith.truncf %118 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2095866Z         %120 = arith.extf %119 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2096075Z         %121 = arith.cmpf ogt, %arg6, %120 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2096245Z         %122 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2096406Z         %123 = arith.ori %121, %122 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.2096598Z         %124 = arith.select %123, %arg6, %120 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2096801Z         %125 = arith.mulf %113, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2097021Z         %126 = arith.truncf %125 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2097309Z         %127 = ttg.convert_layout %124 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2097640Z         %128 = tt.expand_dims %127 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2097932Z         %129 = ttg.convert_layout %128 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2098173Z         %130 = arith.extf %126 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2098407Z         %131 = tt.broadcast %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.2098651Z         %132 = ttg.convert_layout %131 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2098863Z         %133 = arith.subf %130, %132 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2099181Z         %134 = tt.extern_elementwise %133 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2099468Z         %135 = "tt.reduce"(%134) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2099591Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2099708Z           %166 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.2099826Z           tt.reduce.return %166 : f32
2026-02-21T12:38:38.2100010Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2100299Z         %136 = ttg.convert_layout %135 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2100554Z         %137 = arith.subf %arg6, %124 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2100839Z         %138 = tt.extern_elementwise %137 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2101123Z         %139 = arith.mulf %arg7, %138 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2101281Z         %140 = arith.addf %139, %136 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2101517Z         %141 = ttg.convert_layout %138 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2101845Z         %142 = tt.expand_dims %141 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2102138Z         %143 = ttg.convert_layout %142 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2102377Z         %144 = tt.broadcast %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2102620Z         %145 = ttg.convert_layout %144 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2102834Z         %146 = arith.mulf %arg8, %145 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2103080Z         %147 = ttg.convert_layout %97 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.2103420Z         %148 = tt.expand_dims %147 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.2103736Z         %149 = ttg.convert_layout %148 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2103944Z         %150 = arith.muli %149, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2104104Z         %151 = arith.addi %82, %150 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2104299Z         %152 = tt.broadcast %151 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.2104553Z         %153 = ttg.convert_layout %152 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2104782Z         %154 = arith.addi %153, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2104998Z         %155 = tt.addptr %18, %154 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2105219Z         %156 = tt.load %155 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2105422Z         %157 = arith.truncf %134 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2105660Z         %158 = tt.reshape %146 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2105888Z         %159 = tt.reshape %157 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.2106127Z         %160 = tt.reshape %156 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.2106425Z         %161 = ttg.convert_layout %159 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2106779Z         %162 = ttg.convert_layout %160 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2107092Z         %163 = ttg.convert_layout %158 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2107494Z         %164 = tt.dot %161, %162, %163, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2107884Z         %165 = tt.reshape %164 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2108152Z         scf.yield %124, %140, %165 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2108386Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T12:38:38.2108606Z       %85 = ttg.convert_layout %84#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2108932Z       %86 = tt.expand_dims %85 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2109217Z       %87 = ttg.convert_layout %86 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2109451Z       %88 = tt.broadcast %87 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2109688Z       %89 = ttg.convert_layout %88 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2109893Z       %90 = arith.divf %84#2, %89 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2110089Z       %91 = arith.truncf %90 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1>
2026-02-21T12:38:38.2110333Z       %92 = tt.addptr %19, %74 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.2110543Z       tt.store %92, %91 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2110721Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T12:38:38.2110888Z     scf.for %arg4 = %27 to %3 step %c1_i32  : i32 {
2026-02-21T12:38:38.2111020Z       %29 = arith.divsi %arg4, %c8192_i32 : i32
2026-02-21T12:38:38.2111141Z       %30 = arith.muli %29, %c16_i32 : i32
2026-02-21T12:38:38.2111255Z       %31 = arith.subi %c192_i32, %30 : i32
2026-02-21T12:38:38.2111384Z       %32 = arith.minsi %31, %c16_i32 : i32
2026-02-21T12:38:38.2111502Z       %33 = arith.remsi %arg4, %c8192_i32 : i32
2026-02-21T12:38:38.2111618Z       %34 = arith.remsi %33, %32 : i32
2026-02-21T12:38:38.2111727Z       %35 = arith.addi %30, %34 : i32
2026-02-21T12:38:38.2111833Z       %36 = arith.divsi %33, %32 : i32
2026-02-21T12:38:38.2111947Z       %37 = arith.muli %35, %c65536_i32 : i32
2026-02-21T12:38:38.2112062Z       %38 = arith.muli %36, %c128_i32 : i32
2026-02-21T12:38:38.2112170Z       %39 = arith.addi %37, %38 : i32
2026-02-21T12:38:38.2112317Z       %40 = tt.splat %39 : i32 -> tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.2112470Z       %41 = arith.addi %40, %10 : tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.2112671Z       %42 = tt.addptr %11, %41 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.2112877Z       %43 = tt.load %42 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2113032Z       %44 = tt.splat %37 : i32 -> tensor<1x128x1xi32, #blocked>
2026-02-21T12:38:38.2113185Z       %45 = arith.addi %44, %15 : tensor<1x128x1xi32, #blocked>
2026-02-21T12:38:38.2113375Z       %46 = tt.broadcast %45 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x256xi32, #blocked>
2026-02-21T12:38:38.2113622Z       %47 = ttg.convert_layout %46 : tensor<1x128x256xi32, #blocked> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2113865Z       %48 = tt.reshape %43 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3>
2026-02-21T12:38:38.2114058Z       %49 = tt.splat %37 : i32 -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2114191Z       %c0_i32_8 = arith.constant 0 : i32
2026-02-21T12:38:38.2114309Z       %c1024_i32 = arith.constant 1024 : i32
2026-02-21T12:38:38.2114669Z       %50:3 = scf.for %arg5 = %c0_i32 to %c0_i32_8 step %c1024_i32 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T12:38:38.2115014Z         %60 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2115165Z         %61 = arith.addi %60, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2115394Z         %62 = ttg.convert_layout %61 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.2115738Z         %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.2116033Z         %64 = ttg.convert_layout %63 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.2116318Z         %65 = ttg.convert_layout %64 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.2116655Z         %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.2116953Z         %67 = ttg.convert_layout %66 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2117163Z         %68 = arith.muli %67, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2117363Z         %69 = tt.broadcast %68 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2117565Z         %70 = arith.addi %47, %69 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2117780Z         %71 = tt.addptr %16, %70 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2117996Z         %72 = tt.load %71 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2118200Z         %73 = tt.reshape %72 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.2118494Z         %74 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2118843Z         %75 = ttg.convert_layout %73 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2119164Z         %76 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2119574Z         %77 = tt.dot %74, %75, %76, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2119955Z         %78 = tt.reshape %77 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2120204Z         %79 = arith.truncf %78 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2120441Z         %80 = arith.extf %79 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2120627Z         %81 = "tt.reduce"(%80) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2120669Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2120722Z           %358 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.2120763Z           tt.reduce.return %358 : f32
2026-02-21T12:38:38.2120875Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2121011Z         %82 = ttg.convert_layout %81 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2121101Z         %83 = arith.truncf %82 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2121188Z         %84 = arith.extf %83 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2121250Z         %85 = arith.mulf %84, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2121352Z         %86 = arith.truncf %85 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2121434Z         %87 = arith.extf %86 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2121498Z         %88 = arith.cmpf ogt, %arg6, %87 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2121567Z         %89 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2121624Z         %90 = arith.ori %88, %89 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.2121718Z         %91 = arith.select %90, %arg6, %87 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2121797Z         %92 = arith.mulf %80, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2121898Z         %93 = arith.truncf %92 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2122038Z         %94 = ttg.convert_layout %91 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2122186Z         %95 = tt.expand_dims %94 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2122286Z         %96 = ttg.convert_layout %95 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2122382Z         %97 = arith.extf %93 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2122478Z         %98 = tt.broadcast %96 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.2122633Z         %99 = ttg.convert_layout %98 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2122700Z         %100 = arith.subf %97, %99 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2122913Z         %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2122963Z         %102 = "tt.reduce"(%101) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2123004Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2123049Z           %358 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.2123090Z           tt.reduce.return %358 : f32
2026-02-21T12:38:38.2123226Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2123366Z         %103 = ttg.convert_layout %102 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2123428Z         %104 = arith.subf %arg6, %91 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2123616Z         %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2123698Z         %106 = arith.mulf %arg7, %105 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2123760Z         %107 = arith.addf %106, %103 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2123901Z         %108 = ttg.convert_layout %105 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2124051Z         %109 = tt.expand_dims %108 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2124157Z         %110 = ttg.convert_layout %109 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2124253Z         %111 = tt.broadcast %110 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2124362Z         %112 = ttg.convert_layout %111 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2124428Z         %113 = arith.mulf %arg8, %112 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2124578Z         %114 = ttg.convert_layout %64 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.2124749Z         %115 = tt.expand_dims %114 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.2124858Z         %116 = ttg.convert_layout %115 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2124923Z         %117 = arith.muli %116, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2124985Z         %118 = arith.addi %49, %117 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2125085Z         %119 = tt.broadcast %118 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.2125220Z         %120 = ttg.convert_layout %119 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2125286Z         %121 = arith.addi %120, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2125404Z         %122 = tt.addptr %18, %121 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2125471Z         %123 = tt.load %122 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2125573Z         %124 = arith.truncf %101 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2125671Z         %125 = tt.reshape %113 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2125772Z         %126 = tt.reshape %124 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.2125874Z         %127 = tt.reshape %123 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.2126037Z         %128 = ttg.convert_layout %126 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2126199Z         %129 = ttg.convert_layout %127 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2126303Z         %130 = ttg.convert_layout %125 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2126600Z         %131 = tt.dot %128, %129, %130, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2126711Z         %132 = tt.reshape %131 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2126755Z         %c1_i32_9 = arith.constant 1 : i32
2026-02-21T12:38:38.2126808Z         %133 = arith.muli %c256_i32, %c1_i32_9 : i32
2026-02-21T12:38:38.2126851Z         %134 = arith.addi %arg5, %133 : i32
2026-02-21T12:38:38.2126912Z         %135 = tt.splat %134 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2126974Z         %136 = arith.addi %135, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2127132Z         %137 = ttg.convert_layout %136 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.2127282Z         %138 = tt.expand_dims %137 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.2127388Z         %139 = ttg.convert_layout %138 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.2127534Z         %140 = ttg.convert_layout %139 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.2127690Z         %141 = tt.expand_dims %140 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.2127804Z         %142 = ttg.convert_layout %141 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2127872Z         %143 = arith.muli %142, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2127976Z         %144 = tt.broadcast %143 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2128059Z         %145 = arith.addi %47, %144 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2128174Z         %146 = tt.addptr %16, %145 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2128239Z         %147 = tt.load %146 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2128345Z         %148 = tt.reshape %147 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.2128498Z         %149 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2128673Z         %150 = ttg.convert_layout %148 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2128782Z         %151 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2129042Z         %152 = tt.dot %149, %150, %151, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2129138Z         %153 = tt.reshape %152 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2129244Z         %154 = arith.truncf %153 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2129343Z         %155 = arith.extf %154 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2129393Z         %156 = "tt.reduce"(%155) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2129437Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2129486Z           %358 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.2129530Z           tt.reduce.return %358 : f32
2026-02-21T12:38:38.2129642Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2129787Z         %157 = ttg.convert_layout %156 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2129881Z         %158 = arith.truncf %157 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2129971Z         %159 = arith.extf %158 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2130051Z         %160 = arith.mulf %159, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2130141Z         %161 = arith.truncf %160 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2130228Z         %162 = arith.extf %161 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2130293Z         %163 = arith.cmpf ogt, %91, %162 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2130381Z         %164 = arith.cmpf une, %91, %91 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2130442Z         %165 = arith.ori %163, %164 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.2130541Z         %166 = arith.select %165, %91, %162 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2130607Z         %167 = arith.mulf %155, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2130709Z         %168 = arith.truncf %167 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2130856Z         %169 = ttg.convert_layout %166 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2131006Z         %170 = tt.expand_dims %169 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2131112Z         %171 = ttg.convert_layout %170 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2131214Z         %172 = arith.extf %168 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2131326Z         %173 = tt.broadcast %171 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.2131434Z         %174 = ttg.convert_layout %173 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2131502Z         %175 = arith.subf %172, %174 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2131702Z         %176 = tt.extern_elementwise %175 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2131751Z         %177 = "tt.reduce"(%176) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2131791Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2131859Z           %358 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.2131900Z           tt.reduce.return %358 : f32
2026-02-21T12:38:38.2132011Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2132154Z         %178 = ttg.convert_layout %177 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2132214Z         %179 = arith.subf %91, %166 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2132402Z         %180 = tt.extern_elementwise %179 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2132468Z         %181 = arith.mulf %107, %180 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2132527Z         %182 = arith.addf %181, %178 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2132666Z         %183 = ttg.convert_layout %180 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2132822Z         %184 = tt.expand_dims %183 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2132928Z         %185 = ttg.convert_layout %184 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2133025Z         %186 = tt.broadcast %185 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2133137Z         %187 = ttg.convert_layout %186 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2133201Z         %188 = arith.mulf %132, %187 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2133361Z         %189 = ttg.convert_layout %139 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.2133522Z         %190 = tt.expand_dims %189 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.2133630Z         %191 = ttg.convert_layout %190 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2133711Z         %192 = arith.muli %191, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2133776Z         %193 = arith.addi %49, %192 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2133877Z         %194 = tt.broadcast %193 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.2133991Z         %195 = ttg.convert_layout %194 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2134058Z         %196 = arith.addi %195, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2134174Z         %197 = tt.addptr %18, %196 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2134240Z         %198 = tt.load %197 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2134344Z         %199 = arith.truncf %176 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2134438Z         %200 = tt.reshape %188 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2134537Z         %201 = tt.reshape %199 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.2134653Z         %202 = tt.reshape %198 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.2134808Z         %203 = ttg.convert_layout %201 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2134970Z         %204 = ttg.convert_layout %202 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2135075Z         %205 = ttg.convert_layout %200 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2135344Z         %206 = tt.dot %203, %204, %205, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2135442Z         %207 = tt.reshape %206 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2135489Z         %c2_i32_10 = arith.constant 2 : i32
2026-02-21T12:38:38.2135540Z         %208 = arith.muli %c256_i32, %c2_i32_10 : i32
2026-02-21T12:38:38.2135581Z         %209 = arith.addi %arg5, %208 : i32
2026-02-21T12:38:38.2135641Z         %210 = tt.splat %209 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2135702Z         %211 = arith.addi %210, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2135845Z         %212 = ttg.convert_layout %211 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.2135997Z         %213 = tt.expand_dims %212 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.2136104Z         %214 = ttg.convert_layout %213 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.2136253Z         %215 = ttg.convert_layout %214 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.2136412Z         %216 = tt.expand_dims %215 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.2136523Z         %217 = ttg.convert_layout %216 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2136590Z         %218 = arith.muli %217, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2136710Z         %219 = tt.broadcast %218 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2136777Z         %220 = arith.addi %47, %219 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2136893Z         %221 = tt.addptr %16, %220 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2136958Z         %222 = tt.load %221 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2137080Z         %223 = tt.reshape %222 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.2137235Z         %224 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2137394Z         %225 = ttg.convert_layout %223 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2137505Z         %226 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2137761Z         %227 = tt.dot %224, %225, %226, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2137859Z         %228 = tt.reshape %227 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2137963Z         %229 = arith.truncf %228 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2138062Z         %230 = arith.extf %229 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2138125Z         %231 = "tt.reduce"(%230) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2138170Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2138218Z           %358 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.2138258Z           tt.reduce.return %358 : f32
2026-02-21T12:38:38.2138373Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2138511Z         %232 = ttg.convert_layout %231 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2138627Z         %233 = arith.truncf %232 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2138719Z         %234 = arith.extf %233 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2138782Z         %235 = arith.mulf %234, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2138872Z         %236 = arith.truncf %235 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2138960Z         %237 = arith.extf %236 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2139026Z         %238 = arith.cmpf ogt, %166, %237 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2139091Z         %239 = arith.cmpf une, %166, %166 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2139149Z         %240 = arith.ori %238, %239 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.2139248Z         %241 = arith.select %240, %166, %237 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2139315Z         %242 = arith.mulf %230, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2139417Z         %243 = arith.truncf %242 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2139563Z         %244 = ttg.convert_layout %241 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2139715Z         %245 = tt.expand_dims %244 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2139820Z         %246 = ttg.convert_layout %245 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2139935Z         %247 = arith.extf %243 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2140033Z         %248 = tt.broadcast %246 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.2140140Z         %249 = ttg.convert_layout %248 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2140210Z         %250 = arith.subf %247, %249 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2140412Z         %251 = tt.extern_elementwise %250 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2140476Z         %252 = "tt.reduce"(%251) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2140519Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2140565Z           %358 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.2140607Z           tt.reduce.return %358 : f32
2026-02-21T12:38:38.2140722Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2140857Z         %253 = ttg.convert_layout %252 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2140919Z         %254 = arith.subf %166, %241 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2141110Z         %255 = tt.extern_elementwise %254 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2141172Z         %256 = arith.mulf %182, %255 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2141231Z         %257 = arith.addf %256, %253 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2141385Z         %258 = ttg.convert_layout %255 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2141536Z         %259 = tt.expand_dims %258 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2141641Z         %260 = ttg.convert_layout %259 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2141741Z         %261 = tt.broadcast %260 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2141862Z         %262 = ttg.convert_layout %261 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2141927Z         %263 = arith.mulf %207, %262 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2142075Z         %264 = ttg.convert_layout %214 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.2142232Z         %265 = tt.expand_dims %264 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.2142338Z         %266 = ttg.convert_layout %265 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2142405Z         %267 = arith.muli %266, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2142469Z         %268 = arith.addi %49, %267 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2142569Z         %269 = tt.broadcast %268 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.2142684Z         %270 = ttg.convert_layout %269 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2142752Z         %271 = arith.addi %270, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2142869Z         %272 = tt.addptr %18, %271 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2142936Z         %273 = tt.load %272 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2143041Z         %274 = arith.truncf %251 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2143136Z         %275 = tt.reshape %263 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2143254Z         %276 = tt.reshape %274 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.2143358Z         %277 = tt.reshape %273 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.2143514Z         %278 = ttg.convert_layout %276 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2143673Z         %279 = ttg.convert_layout %277 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2143792Z         %280 = ttg.convert_layout %275 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2144055Z         %281 = tt.dot %278, %279, %280, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2144151Z         %282 = tt.reshape %281 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2144200Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T12:38:38.2144248Z         %283 = arith.muli %c256_i32, %c3_i32 : i32
2026-02-21T12:38:38.2144290Z         %284 = arith.addi %arg5, %283 : i32
2026-02-21T12:38:38.2144353Z         %285 = tt.splat %284 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2144410Z         %286 = arith.addi %285, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2144553Z         %287 = ttg.convert_layout %286 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.2144719Z         %288 = tt.expand_dims %287 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.2144820Z         %289 = ttg.convert_layout %288 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.2144964Z         %290 = ttg.convert_layout %289 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.2145123Z         %291 = tt.expand_dims %290 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.2145249Z         %292 = ttg.convert_layout %291 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2145316Z         %293 = arith.muli %292, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2145422Z         %294 = tt.broadcast %293 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2145486Z         %295 = arith.addi %47, %294 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2145602Z         %296 = tt.addptr %16, %295 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2145669Z         %297 = tt.load %296 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2145771Z         %298 = tt.reshape %297 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.2145925Z         %299 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2146087Z         %300 = ttg.convert_layout %298 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2146192Z         %301 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2146452Z         %302 = tt.dot %299, %300, %301, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2146551Z         %303 = tt.reshape %302 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2146652Z         %304 = arith.truncf %303 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2146765Z         %305 = arith.extf %304 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2146817Z         %306 = "tt.reduce"(%305) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2146858Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2146907Z           %358 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.2146948Z           tt.reduce.return %358 : f32
2026-02-21T12:38:38.2147080Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2147220Z         %307 = ttg.convert_layout %306 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2147311Z         %308 = arith.truncf %307 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2147402Z         %309 = arith.extf %308 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2147466Z         %310 = arith.mulf %309, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2147554Z         %311 = arith.truncf %310 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2147643Z         %312 = arith.extf %311 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2147709Z         %313 = arith.cmpf ogt, %241, %312 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2147773Z         %314 = arith.cmpf une, %241, %241 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2147836Z         %315 = arith.ori %313, %314 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.2147944Z         %316 = arith.select %315, %241, %312 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2148010Z         %317 = arith.mulf %305, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2148114Z         %318 = arith.truncf %317 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2148257Z         %319 = ttg.convert_layout %316 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2148407Z         %320 = tt.expand_dims %319 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2148530Z         %321 = ttg.convert_layout %320 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2148629Z         %322 = arith.extf %318 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2148728Z         %323 = tt.broadcast %321 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.2148840Z         %324 = ttg.convert_layout %323 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2148904Z         %325 = arith.subf %322, %324 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2149107Z         %326 = tt.extern_elementwise %325 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2149160Z         %327 = "tt.reduce"(%326) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2149201Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2149250Z           %358 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.2149291Z           tt.reduce.return %358 : f32
2026-02-21T12:38:38.2149407Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2149544Z         %328 = ttg.convert_layout %327 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2149606Z         %329 = arith.subf %241, %316 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2149795Z         %330 = tt.extern_elementwise %329 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2149871Z         %331 = arith.mulf %257, %330 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2149930Z         %332 = arith.addf %331, %328 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2150071Z         %333 = ttg.convert_layout %330 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2150222Z         %334 = tt.expand_dims %333 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2150345Z         %335 = ttg.convert_layout %334 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2150447Z         %336 = tt.broadcast %335 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2150554Z         %337 = ttg.convert_layout %336 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2150618Z         %338 = arith.mulf %282, %337 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2150768Z         %339 = ttg.convert_layout %289 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.2150924Z         %340 = tt.expand_dims %339 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.2151035Z         %341 = ttg.convert_layout %340 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2151100Z         %342 = arith.muli %341, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2151165Z         %343 = arith.addi %49, %342 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2151277Z         %344 = tt.broadcast %343 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.2151394Z         %345 = ttg.convert_layout %344 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2151457Z         %346 = arith.addi %345, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2151574Z         %347 = tt.addptr %18, %346 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2151643Z         %348 = tt.load %347 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2151745Z         %349 = arith.truncf %326 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2151854Z         %350 = tt.reshape %338 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2151954Z         %351 = tt.reshape %349 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.2152056Z         %352 = tt.reshape %348 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.2152212Z         %353 = ttg.convert_layout %351 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2152376Z         %354 = ttg.convert_layout %352 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2152479Z         %355 = ttg.convert_layout %350 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2152739Z         %356 = tt.dot %353, %354, %355, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2152838Z         %357 = tt.reshape %356 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2152972Z         scf.yield %316, %332, %357 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2153021Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T12:38:38.2153272Z       %51:3 = scf.for %arg5 = %c0_i32_8 to %c512_i32 step %c256_i32 iter_args(%arg6 = %50#0, %arg7 = %50#1, %arg8 = %50#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T12:38:38.2153349Z         %60 = tt.splat %arg5 : i32 -> tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2153409Z         %61 = arith.addi %60, %12 : tensor<256xi32, #blocked4>
2026-02-21T12:38:38.2153551Z         %62 = ttg.convert_layout %61 : tensor<256xi32, #blocked4> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T12:38:38.2153701Z         %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5>
2026-02-21T12:38:38.2153816Z         %64 = ttg.convert_layout %63 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3>
2026-02-21T12:38:38.2153964Z         %65 = ttg.convert_layout %64 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T12:38:38.2154118Z         %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x256xi32, #blocked6>
2026-02-21T12:38:38.2154225Z         %67 = ttg.convert_layout %66 : tensor<1x1x256xi32, #blocked6> -> tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2154291Z         %68 = arith.muli %67, %cst_3 : tensor<1x1x256xi32, #blocked1>
2026-02-21T12:38:38.2154391Z         %69 = tt.broadcast %68 : tensor<1x1x256xi32, #blocked1> -> tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2154454Z         %70 = arith.addi %47, %69 : tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2154568Z         %71 = tt.addptr %16, %70 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>, tensor<1x128x256xi32, #blocked1>
2026-02-21T12:38:38.2154634Z         %72 = tt.load %71 : tensor<1x128x256x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2154748Z         %73 = tt.reshape %72 : tensor<1x128x256xbf16, #blocked1> -> tensor<128x256xbf16, #blocked3>
2026-02-21T12:38:38.2154901Z         %74 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2155057Z         %75 = ttg.convert_layout %73 : tensor<128x256xbf16, #blocked3> -> tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2155161Z         %76 = ttg.convert_layout %cst_2 : tensor<1x256xf32, #blocked3> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2155439Z         %77 = tt.dot %74, %75, %76, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<128x256xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x256xf32, #blocked3>
2026-02-21T12:38:38.2155533Z         %78 = tt.reshape %77 : tensor<1x256xf32, #blocked3> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2155635Z         %79 = arith.truncf %78 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2155732Z         %80 = arith.extf %79 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2155782Z         %81 = "tt.reduce"(%80) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2155824Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2155870Z           %133 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:38:38.2155915Z           tt.reduce.return %133 : f32
2026-02-21T12:38:38.2156026Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2156163Z         %82 = ttg.convert_layout %81 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2156253Z         %83 = arith.truncf %82 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2156338Z         %84 = arith.extf %83 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2156400Z         %85 = arith.mulf %84, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2156488Z         %86 = arith.truncf %85 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T12:38:38.2156570Z         %87 = arith.extf %86 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2156650Z         %88 = arith.cmpf ogt, %arg6, %87 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2156719Z         %89 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2156776Z         %90 = arith.ori %88, %89 : tensor<1x1xi1, #blocked2>
2026-02-21T12:38:38.2156872Z         %91 = arith.select %90, %arg6, %87 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2156937Z         %92 = arith.mulf %80, %cst_0 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2157054Z         %93 = arith.truncf %92 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2157193Z         %94 = ttg.convert_layout %91 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2157342Z         %95 = tt.expand_dims %94 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2157444Z         %96 = ttg.convert_layout %95 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2157540Z         %97 = arith.extf %93 : tensor<1x1x256xbf16, #blocked1> to tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2157635Z         %98 = tt.broadcast %96 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x256xf32, #blocked9>
2026-02-21T12:38:38.2157740Z         %99 = ttg.convert_layout %98 : tensor<1x1x256xf32, #blocked9> -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2157803Z         %100 = arith.subf %97, %99 : tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2158024Z         %101 = tt.extern_elementwise %100 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1x256xf32, #blocked1>
2026-02-21T12:38:38.2158074Z         %102 = "tt.reduce"(%101) <{axis = 2 : i32}> ({
2026-02-21T12:38:38.2158114Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:38:38.2158159Z           %133 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:38:38.2158202Z           tt.reduce.return %133 : f32
2026-02-21T12:38:38.2158313Z         }) : (tensor<1x1x256xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:38:38.2158451Z         %103 = ttg.convert_layout %102 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2158530Z         %104 = arith.subf %arg6, %91 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2158716Z         %105 = tt.extern_elementwise %104 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2158779Z         %106 = arith.mulf %arg7, %105 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2158842Z         %107 = arith.addf %106, %103 : tensor<1x1xf32, #blocked2>
2026-02-21T12:38:38.2158983Z         %108 = ttg.convert_layout %105 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2159134Z         %109 = tt.expand_dims %108 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2159242Z         %110 = ttg.convert_layout %109 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2159340Z         %111 = tt.broadcast %110 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2159447Z         %112 = ttg.convert_layout %111 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2159515Z         %113 = arith.mulf %arg8, %112 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2159662Z         %114 = ttg.convert_layout %64 : tensor<1x256xi32, #blocked3> -> tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T12:38:38.2159819Z         %115 = tt.expand_dims %114 {axis = 2 : i32} : tensor<1x256xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x256x1xi32, #blocked7>
2026-02-21T12:38:38.2159927Z         %116 = ttg.convert_layout %115 : tensor<1x256x1xi32, #blocked7> -> tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2160006Z         %117 = arith.muli %116, %cst : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2160067Z         %118 = arith.addi %49, %117 : tensor<1x256x1xi32, #blocked>
2026-02-21T12:38:38.2160168Z         %119 = tt.broadcast %118 : tensor<1x256x1xi32, #blocked> -> tensor<1x256x128xi32, #blocked>
2026-02-21T12:38:38.2160283Z         %120 = ttg.convert_layout %119 : tensor<1x256x128xi32, #blocked> -> tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2160362Z         %121 = arith.addi %120, %17 : tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2160479Z         %122 = tt.addptr %18, %121 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>, tensor<1x256x128xi32, #blocked1>
2026-02-21T12:38:38.2160545Z         %123 = tt.load %122 : tensor<1x256x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2160647Z         %124 = arith.truncf %101 : tensor<1x1x256xf32, #blocked1> to tensor<1x1x256xbf16, #blocked1>
2026-02-21T12:38:38.2160743Z         %125 = tt.reshape %113 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2160842Z         %126 = tt.reshape %124 : tensor<1x1x256xbf16, #blocked1> -> tensor<1x256xbf16, #blocked3>
2026-02-21T12:38:38.2160942Z         %127 = tt.reshape %123 : tensor<1x256x128xbf16, #blocked1> -> tensor<256x128xbf16, #blocked3>
2026-02-21T12:38:38.2161100Z         %128 = ttg.convert_layout %126 : tensor<1x256xbf16, #blocked3> -> tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T12:38:38.2161261Z         %129 = ttg.convert_layout %127 : tensor<256x128xbf16, #blocked3> -> tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T12:38:38.2161376Z         %130 = ttg.convert_layout %125 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2161641Z         %131 = tt.dot %128, %129, %130, inputPrecision = tf32 : tensor<1x256xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<256x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T12:38:38.2161736Z         %132 = tt.reshape %131 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2161866Z         scf.yield %91, %107, %132 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2161930Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T12:38:38.2162069Z       %52 = ttg.convert_layout %51#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>>
2026-02-21T12:38:38.2162216Z       %53 = tt.expand_dims %52 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked8}>> -> tensor<1x1x1xf32, #blocked8>
2026-02-21T12:38:38.2162319Z       %54 = ttg.convert_layout %53 : tensor<1x1x1xf32, #blocked8> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T12:38:38.2162413Z       %55 = tt.broadcast %54 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x128xf32, #blocked9>
2026-02-21T12:38:38.2162517Z       %56 = ttg.convert_layout %55 : tensor<1x1x128xf32, #blocked9> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2162627Z       %57 = arith.divf %51#2, %56 : tensor<1x1x128xf32, #blocked1>
2026-02-21T12:38:38.2162730Z       %58 = arith.truncf %57 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1>
2026-02-21T12:38:38.2162838Z       %59 = tt.addptr %19, %41 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T12:38:38.2162903Z       tt.store %59, %58 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:38:38.2162988Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T12:38:38.2163024Z     tt.return
2026-02-21T12:38:38.2163056Z   }
2026-02-21T12:38:38.2163094Z }
2026-02-21T12:38:38.2163098Z 
2026-02-21T12:38:38.2163128Z {-#
2026-02-21T12:38:38.2163171Z   external_resources: {
2026-02-21T12:38:38.2163209Z     mlir_reproducer: {
2026-02-21T12:38:38.2165378Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T12:38:38.2165463Z       disable_threading: false,
2026-02-21T12:38:38.2165501Z       verify_each: true
2026-02-21T12:38:38.2165532Z     }
2026-02-21T12:38:38.2165564Z   }
2026-02-21T12:38:38.2165594Z #-}
2026-02-21T12:38:38.2165838Z /tmp/torchinductor_root/ne/cnewmtkcvegrxtiokl5mr4xyshhjpytfcvozpnfvt7sf7g5ytajx.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T12:38:38.2166304Z /tmp/torchinductor_root/ne/cnewmtkcvegrxtiokl5mr4xyshhjpytfcvozpnfvt7sf7g5ytajx.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T12:38:38.2166418Z [49s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T12:38:38.2167071Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T12:38:38.2167130Z Error: RuntimeError: PassManager::run failed
2026-02-21T12:38:38.2167212Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T12:38:42.6690209Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 10.0 configs/s
2026-02-21T12:38:42.6701947Z [53s] Adaptive compile timeout: 30s (90% percentile=13.8s, bounds=[30.0s, 30s])
2026-02-21T12:38:42.9868856Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2375.4 configs/s
2026-02-21T12:38:43.9050030Z [54s] Initial random population of 100, 5 starting points: 
2026-02-21T12:38:43.9050516Z error=19
2026-02-21T12:38:43.9050730Z timeout=6
2026-02-21T12:38:43.9050936Z ok=75
2026-02-21T12:38:43.9051133Z min=0.1508
2026-02-21T12:38:43.9051341Z mid=1.3094
2026-02-21T12:38:43.9051562Z max=127.9501
2026-02-21T12:38:43.9051812Z best={'block_sizes': [1, 128, 64],
2026-02-21T12:38:43.9052239Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T12:38:43.9052652Z  'l2_groupings': [32],
2026-02-21T12:38:43.9052933Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:38:43.9053266Z  'loop_orders': [[1, 0]],
2026-02-21T12:38:43.9053549Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:38:43.9053826Z  'num_sm_multiplier': 64,
2026-02-21T12:38:43.9054090Z  'num_stages': 1,
2026-02-21T12:38:43.9054321Z  'num_warps': 4,
2026-02-21T12:38:43.9054589Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:38:43.9055291Z  'range_flattens': [False, False],
2026-02-21T12:38:43.9055597Z  'range_multi_buffers': [False, None],
2026-02-21T12:38:43.9055902Z  'range_num_stages': [2, 3],
2026-02-21T12:38:43.9056183Z  'range_unroll_factors': [1, 2],
2026-02-21T12:38:43.9056480Z  'range_warp_specializes': [],
2026-02-21T12:38:43.9056766Z  'waves_per_eu': 1}
2026-02-21T12:38:43.9140104Z [54s] Fitting surrogate: 100 points, 100 targets
2026-02-21T12:38:44.8828352Z [55s] Generation 1 starting: 91 neighbors, 5 active search path(s)
2026-02-21T12:39:22.3606870Z [93s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:39:24.2049156Z [95s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:39:24.6005320Z [95s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:39:28.9420534Z [99s] Timeout after 30s compiling Config(block_sizes=[1, 128, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[2, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:39:29.6657575Z [100s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:39:30.1054722Z [101s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[2, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:39:30.1074250Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 0.8 configs/s
2026-02-21T12:39:35.5005488Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 93/93 17.4 configs/s
2026-02-21T12:39:44.1356773Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 110.4 configs/s
2026-02-21T12:39:45.2722918Z [116s] Generation 1 complete: 
2026-02-21T12:39:45.2723548Z error=8
2026-02-21T12:39:45.2723761Z timeout=6
2026-02-21T12:39:45.2723958Z ok=82
2026-02-21T12:39:45.2724162Z min=0.1517
2026-02-21T12:39:45.2724363Z mid=0.2301
2026-02-21T12:39:45.2724561Z max=1.3124
2026-02-21T12:39:45.2724786Z best={'block_sizes': [1, 128, 64],
2026-02-21T12:39:45.2725199Z  'indexing': ['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'],
2026-02-21T12:39:45.2725627Z  'l2_groupings': [32],
2026-02-21T12:39:45.2725910Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:39:45.2726227Z  'loop_orders': [[1, 0]],
2026-02-21T12:39:45.2726681Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:39:45.2726966Z  'num_sm_multiplier': 64,
2026-02-21T12:39:45.2727228Z  'num_stages': 1,
2026-02-21T12:39:45.2727460Z  'num_warps': 4,
2026-02-21T12:39:45.2727738Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:39:45.2728065Z  'range_flattens': [False, True],
2026-02-21T12:39:45.2728368Z  'range_multi_buffers': [False, None],
2026-02-21T12:39:45.2728675Z  'range_num_stages': [2, 3],
2026-02-21T12:39:45.2728961Z  'range_unroll_factors': [1, 2],
2026-02-21T12:39:45.2729263Z  'range_warp_specializes': [],
2026-02-21T12:39:45.2729543Z  'waves_per_eu': 1}
2026-02-21T12:39:45.2792145Z [116s] Fitting surrogate: 196 points, 196 targets
2026-02-21T12:39:46.3264766Z [117s] Generation 2 starting: 95 neighbors, 5 active search path(s)
2026-02-21T12:40:24.6721056Z [155s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:40:24.6743800Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 1.1 configs/s
2026-02-21T12:40:30.3219828Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 17.3 configs/s
2026-02-21T12:40:39.0259002Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 110.1 configs/s
2026-02-21T12:40:40.0867599Z [171s] Generation 2 complete: 
2026-02-21T12:40:40.0867939Z error=14
2026-02-21T12:40:40.0868145Z timeout=1
2026-02-21T12:40:40.0868742Z ok=85
2026-02-21T12:40:40.0868947Z min=0.1490
2026-02-21T12:40:40.0869148Z mid=0.2267
2026-02-21T12:40:40.0869369Z max=2.2459
2026-02-21T12:40:40.0869594Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:40:40.0870129Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T12:40:40.0870543Z  'l2_groupings': [64],
2026-02-21T12:40:40.0870838Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:40:40.0871152Z  'loop_orders': [[1, 0]],
2026-02-21T12:40:40.0871427Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:40:40.0871705Z  'num_sm_multiplier': 2,
2026-02-21T12:40:40.0871977Z  'num_stages': 1,
2026-02-21T12:40:40.0872203Z  'num_warps': 8,
2026-02-21T12:40:40.0872464Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:40:40.0872782Z  'range_flattens': [None, None],
2026-02-21T12:40:40.0883787Z  'range_multi_buffers': [False, None],
2026-02-21T12:40:40.0884024Z  'range_num_stages': [2, 1],
2026-02-21T12:40:40.0884249Z  'range_unroll_factors': [1, 2],
2026-02-21T12:40:40.0884463Z  'range_warp_specializes': [],
2026-02-21T12:40:40.0884652Z  'waves_per_eu': 1}
2026-02-21T12:40:40.1733027Z [171s] Fitting surrogate: 296 points, 296 targets
2026-02-21T12:40:41.1042296Z [172s] Generation 3 starting: 86 neighbors, 5 active search path(s)
2026-02-21T12:41:22.7241823Z [213s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:41:23.2620304Z [214s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:41:23.6825907Z [214s] Timeout after 30s compiling Config(block_sizes=[1, 256, 16], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:41:24.5426196Z [215s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:41:24.5444460Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0.5 configs/s
2026-02-21T12:41:29.8797364Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 16.4 configs/s
2026-02-21T12:41:40.6658364Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 89.7 configs/s
2026-02-21T12:41:41.7912071Z [232s] Generation 3 complete: 
2026-02-21T12:41:41.7912422Z error=3
2026-02-21T12:41:41.7912616Z timeout=4
2026-02-21T12:41:41.7912830Z ok=84
2026-02-21T12:41:41.7913005Z min=0.1467
2026-02-21T12:41:41.7913191Z mid=0.1644
2026-02-21T12:41:41.7913364Z max=2.3764
2026-02-21T12:41:41.7913570Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:41:41.7913953Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T12:41:41.7914317Z  'l2_groupings': [64],
2026-02-21T12:41:41.7914914Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:41:41.7915194Z  'loop_orders': [[0, 1]],
2026-02-21T12:41:41.7915445Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:41:41.7915710Z  'num_sm_multiplier': 2,
2026-02-21T12:41:41.7915946Z  'num_stages': 1,
2026-02-21T12:41:41.7916149Z  'num_warps': 8,
2026-02-21T12:41:41.7916402Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:41:41.7916684Z  'range_flattens': [None, None],
2026-02-21T12:41:41.7916956Z  'range_multi_buffers': [False, None],
2026-02-21T12:41:41.7917237Z  'range_num_stages': [2, 1],
2026-02-21T12:41:41.7917482Z  'range_unroll_factors': [1, 2],
2026-02-21T12:41:41.7917759Z  'range_warp_specializes': [],
2026-02-21T12:41:41.7918002Z  'waves_per_eu': 1}
2026-02-21T12:41:41.7957907Z [232s] Fitting surrogate: 387 points, 387 targets
2026-02-21T12:41:42.7834164Z [233s] Generation 4 starting: 98 neighbors, 5 active search path(s)
2026-02-21T12:42:26.6159516Z [277s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:42:28.4166451Z [279s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:42:28.4184512Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 100/100 0.8 configs/s
2026-02-21T12:42:33.6803332Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━ 100/100 19.2 configs/s
2026-02-21T12:42:43.8177052Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 95.0 configs/s
2026-02-21T12:42:44.8879401Z [295s] Generation 4 complete: 
2026-02-21T12:42:44.8883681Z error=11
2026-02-21T12:42:44.8884014Z timeout=2
2026-02-21T12:42:44.8884662Z ok=90
2026-02-21T12:42:44.8884869Z min=0.1470
2026-02-21T12:42:44.8885073Z mid=0.2032
2026-02-21T12:42:44.8885281Z max=1.3009
2026-02-21T12:42:44.8885515Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:42:44.8885990Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:42:44.8886431Z  'l2_groupings': [64],
2026-02-21T12:42:44.8886709Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:42:44.8887025Z  'loop_orders': [[0, 1]],
2026-02-21T12:42:44.8887305Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:42:44.8887586Z  'num_sm_multiplier': 2,
2026-02-21T12:42:44.8887848Z  'num_stages': 1,
2026-02-21T12:42:44.8888086Z  'num_warps': 8,
2026-02-21T12:42:44.8888518Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:42:44.8888847Z  'range_flattens': [None, None],
2026-02-21T12:42:44.8889166Z  'range_multi_buffers': [False, None],
2026-02-21T12:42:44.8889472Z  'range_num_stages': [2, 1],
2026-02-21T12:42:44.8889744Z  'range_unroll_factors': [1, 3],
2026-02-21T12:42:44.8890048Z  'range_warp_specializes': [],
2026-02-21T12:42:44.8890327Z  'waves_per_eu': 1}
2026-02-21T12:42:44.9781102Z [296s] Fitting surrogate: 490 points, 490 targets
2026-02-21T12:42:45.8363495Z [296s] Generation 5 starting: 86 neighbors, 5 active search path(s)
2026-02-21T12:43:29.2250827Z [340s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:43:29.2270296Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0.5 configs/s
2026-02-21T12:43:33.9641635Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 18.6 configs/s
2026-02-21T12:43:43.2643534Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 103.4 configs/s
2026-02-21T12:43:44.2806748Z [355s] Generation 5 complete: 
2026-02-21T12:43:44.2807081Z error=9
2026-02-21T12:43:44.2807311Z timeout=1
2026-02-21T12:43:44.2807513Z ok=81
2026-02-21T12:43:44.2807707Z min=0.1429
2026-02-21T12:43:44.2807913Z mid=0.2024
2026-02-21T12:43:44.2808106Z max=4.2551
2026-02-21T12:43:44.2808331Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:43:44.2808738Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:43:44.2809141Z  'l2_groupings': [64],
2026-02-21T12:43:44.2809432Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:43:44.2809745Z  'loop_orders': [[0, 1]],
2026-02-21T12:43:44.2810344Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:43:44.2810621Z  'num_sm_multiplier': 2,
2026-02-21T12:43:44.2810883Z  'num_stages': 1,
2026-02-21T12:43:44.2811106Z  'num_warps': 8,
2026-02-21T12:43:44.2811374Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:43:44.2811696Z  'range_flattens': [None, None],
2026-02-21T12:43:44.2812001Z  'range_multi_buffers': [False, None],
2026-02-21T12:43:44.2812301Z  'range_num_stages': [2, 1],
2026-02-21T12:43:44.2812573Z  'range_unroll_factors': [1, 4],
2026-02-21T12:43:44.2812995Z  'range_warp_specializes': [],
2026-02-21T12:43:44.2813263Z  'waves_per_eu': 1}
2026-02-21T12:43:44.3703993Z [355s] Fitting surrogate: 581 points, 581 targets
2026-02-21T12:43:45.1527518Z [356s] Generation 6 starting: 76 neighbors, 4 active search path(s)
2026-02-21T12:44:24.9290715Z [395s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:44:26.2162136Z [397s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:44:27.2520029Z [398s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:44:27.6504535Z [398s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:44:27.6522247Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77/77 0.7 configs/s
2026-02-21T12:44:31.7167070Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 77/77 19.2 configs/s
2026-02-21T12:44:40.0736343Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 114.3 configs/s
2026-02-21T12:44:40.9457990Z [411s] Generation 6 complete: 
2026-02-21T12:44:40.9458327Z error=6
2026-02-21T12:44:40.9458535Z timeout=4
2026-02-21T12:44:40.9458736Z ok=70
2026-02-21T12:44:40.9458936Z min=0.1436
2026-02-21T12:44:40.9459139Z mid=0.1978
2026-02-21T12:44:40.9459331Z max=0.9076
2026-02-21T12:44:40.9459831Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:44:40.9460243Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:44:40.9460654Z  'l2_groupings': [64],
2026-02-21T12:44:40.9460929Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:44:40.9461237Z  'loop_orders': [[0, 1]],
2026-02-21T12:44:40.9461517Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:44:40.9461808Z  'num_sm_multiplier': 2,
2026-02-21T12:44:40.9462068Z  'num_stages': 1,
2026-02-21T12:44:40.9462296Z  'num_warps': 8,
2026-02-21T12:44:40.9462694Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:44:40.9463013Z  'range_flattens': [None, None],
2026-02-21T12:44:40.9463318Z  'range_multi_buffers': [False, False],
2026-02-21T12:44:40.9463632Z  'range_num_stages': [2, 1],
2026-02-21T12:44:40.9463907Z  'range_unroll_factors': [1, 4],
2026-02-21T12:44:40.9464199Z  'range_warp_specializes': [],
2026-02-21T12:44:40.9464471Z  'waves_per_eu': 1}
2026-02-21T12:44:41.0217393Z [412s] Fitting surrogate: 661 points, 661 targets
2026-02-21T12:44:41.7752660Z [412s] Generation 7 starting: 74 neighbors, 4 active search path(s)
2026-02-21T12:45:03.6642983Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 2.0 configs/s
2026-02-21T12:45:08.1020976Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 17.5 configs/s
2026-02-21T12:45:15.5915705Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 127.3 configs/s
2026-02-21T12:45:16.6211399Z [447s] Generation 7 complete: 
2026-02-21T12:45:16.6211926Z error=4
2026-02-21T12:45:16.6212196Z ok=74
2026-02-21T12:45:16.6212402Z min=0.1442
2026-02-21T12:45:16.6212609Z mid=0.2146
2026-02-21T12:45:16.6220517Z max=1.6789
2026-02-21T12:45:16.6220770Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:45:16.6221195Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:45:16.6221608Z  'l2_groupings': [64],
2026-02-21T12:45:16.6221899Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:45:16.6222223Z  'loop_orders': [[0, 1]],
2026-02-21T12:45:16.6222510Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:45:16.6222797Z  'num_sm_multiplier': 2,
2026-02-21T12:45:16.6223059Z  'num_stages': 1,
2026-02-21T12:45:16.6223333Z  'num_warps': 8,
2026-02-21T12:45:16.6223734Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:45:16.6224381Z  'range_flattens': [None, None],
2026-02-21T12:45:16.6224690Z  'range_multi_buffers': [False, False],
2026-02-21T12:45:16.6225013Z  'range_num_stages': [2, 2],
2026-02-21T12:45:16.6225243Z  'range_unroll_factors': [2, 4],
2026-02-21T12:45:16.6225455Z  'range_warp_specializes': [],
2026-02-21T12:45:16.6225634Z  'waves_per_eu': 1}
2026-02-21T12:45:16.6996316Z [447s] Fitting surrogate: 739 points, 739 targets
2026-02-21T12:45:17.4442030Z [448s] Generation 8 starting: 64 neighbors, 4 active search path(s)
2026-02-21T12:45:49.6503140Z [480s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[2, 2], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:45:49.6522989Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 0.6 configs/s
2026-02-21T12:45:53.2928170Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 17.8 configs/s
2026-02-21T12:46:00.8844125Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 125.3 configs/s
2026-02-21T12:46:01.8411702Z [492s] Generation 8 complete: 
2026-02-21T12:46:01.8412134Z error=5
2026-02-21T12:46:01.8412345Z timeout=1
2026-02-21T12:46:01.8412552Z ok=62
2026-02-21T12:46:01.8413170Z min=0.1406
2026-02-21T12:46:01.8413374Z mid=0.2013
2026-02-21T12:46:01.8413565Z max=0.8192
2026-02-21T12:46:01.8413797Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:46:01.8414220Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:46:01.8414629Z  'l2_groupings': [32],
2026-02-21T12:46:01.8414905Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:46:01.8415230Z  'loop_orders': [[0, 1]],
2026-02-21T12:46:01.8415512Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:46:01.8415792Z  'num_sm_multiplier': 2,
2026-02-21T12:46:01.8416195Z  'num_stages': 1,
2026-02-21T12:46:01.8416423Z  'num_warps': 8,
2026-02-21T12:46:01.8416682Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:46:01.8417013Z  'range_flattens': [None, False],
2026-02-21T12:46:01.8417329Z  'range_multi_buffers': [False, False],
2026-02-21T12:46:01.8417641Z  'range_num_stages': [2, 2],
2026-02-21T12:46:01.8417918Z  'range_unroll_factors': [2, 4],
2026-02-21T12:46:01.8418216Z  'range_warp_specializes': [],
2026-02-21T12:46:01.8418487Z  'waves_per_eu': 1}
2026-02-21T12:46:01.9192679Z [492s] Fitting surrogate: 807 points, 807 targets
2026-02-21T12:46:02.7283201Z [493s] Generation 9 starting: 71 neighbors, 4 active search path(s)
2026-02-21T12:46:37.3900567Z [528s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:46:45.5357596Z [536s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[None, True], range_num_stages=[3, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:46:45.5374692Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.6 configs/s
2026-02-21T12:46:49.3533156Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 19.1 configs/s
2026-02-21T12:46:58.6533828Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 112.1 configs/s
2026-02-21T12:46:59.5990971Z [550s] Generation 9 complete: 
2026-02-21T12:46:59.5991258Z error=8
2026-02-21T12:46:59.5991464Z timeout=2
2026-02-21T12:46:59.5991632Z ok=65
2026-02-21T12:46:59.5991802Z min=0.1416
2026-02-21T12:46:59.5991971Z mid=0.1618
2026-02-21T12:46:59.5992142Z max=0.7893
2026-02-21T12:46:59.5992334Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:46:59.5992708Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:46:59.5993020Z  'l2_groupings': [32],
2026-02-21T12:46:59.5993234Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:46:59.5993474Z  'loop_orders': [[0, 1]],
2026-02-21T12:46:59.5993687Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:46:59.5993903Z  'num_sm_multiplier': 2,
2026-02-21T12:46:59.5994101Z  'num_stages': 1,
2026-02-21T12:46:59.5994286Z  'num_warps': 8,
2026-02-21T12:46:59.5994487Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:46:59.5994746Z  'range_flattens': [None, None],
2026-02-21T12:46:59.5994984Z  'range_multi_buffers': [False, False],
2026-02-21T12:46:59.5995225Z  'range_num_stages': [2, 2],
2026-02-21T12:46:59.5995438Z  'range_unroll_factors': [2, 4],
2026-02-21T12:46:59.5995667Z  'range_warp_specializes': [],
2026-02-21T12:46:59.5995882Z  'waves_per_eu': 1}
2026-02-21T12:46:59.6964899Z [550s] Fitting surrogate: 882 points, 882 targets
2026-02-21T12:47:00.4726296Z [551s] Generation 10 starting: 70 neighbors, 4 active search path(s)
2026-02-21T12:47:37.4376095Z [588s] Timeout after 30s compiling Config(block_sizes=[1, 256, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[3, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:47:37.4399627Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.6 configs/s
2026-02-21T12:47:41.3466522Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 72/72 18.7 configs/s
2026-02-21T12:47:47.2743758Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 158.5 configs/s
2026-02-21T12:47:48.1848693Z [599s] Generation 10 complete: 
2026-02-21T12:47:48.1849101Z error=6
2026-02-21T12:47:48.1849310Z timeout=1
2026-02-21T12:47:48.1849523Z ok=67
2026-02-21T12:47:48.1849719Z min=0.1417
2026-02-21T12:47:48.1849925Z mid=0.2268
2026-02-21T12:47:48.1850123Z max=0.7611
2026-02-21T12:47:48.1850357Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:47:48.1850770Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:47:48.1851191Z  'l2_groupings': [32],
2026-02-21T12:47:48.1851467Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:47:48.1851808Z  'loop_orders': [[0, 1]],
2026-02-21T12:47:48.1852098Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:47:48.1852375Z  'num_sm_multiplier': 2,
2026-02-21T12:47:48.1852637Z  'num_stages': 1,
2026-02-21T12:47:48.1853195Z  'num_warps': 8,
2026-02-21T12:47:48.1853457Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:47:48.1853786Z  'range_flattens': [None, None],
2026-02-21T12:47:48.1854096Z  'range_multi_buffers': [False, False],
2026-02-21T12:47:48.1854414Z  'range_num_stages': [2, 2],
2026-02-21T12:47:48.1854702Z  'range_unroll_factors': [2, 4],
2026-02-21T12:47:48.1855002Z  'range_warp_specializes': [],
2026-02-21T12:47:48.1855282Z  'waves_per_eu': 1}
2026-02-21T12:47:48.2493831Z [599s] Fitting surrogate: 956 points, 956 targets
2026-02-21T12:47:49.0435453Z [600s] Generation 11 starting: 69 neighbors, 4 active search path(s)
2026-02-21T12:48:31.4965245Z [642s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[2, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:48:32.5723525Z [643s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:48:32.5745752Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71/71 0.5 configs/s
2026-02-21T12:48:36.5031805Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 71/71 18.3 configs/s
2026-02-21T12:48:40.4541863Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 231.5 configs/s
2026-02-21T12:48:41.3096504Z [652s] Generation 11 complete: 
2026-02-21T12:48:41.3096860Z error=4
2026-02-21T12:48:41.3097072Z timeout=2
2026-02-21T12:48:41.3097285Z ok=67
2026-02-21T12:48:41.3101629Z min=0.1220
2026-02-21T12:48:41.3101891Z mid=0.2179
2026-02-21T12:48:41.3102416Z max=1.0024
2026-02-21T12:48:41.3102636Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:48:41.3103044Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T12:48:41.3103424Z  'l2_groupings': [1],
2026-02-21T12:48:41.3103681Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:48:41.3103982Z  'loop_orders': [[0, 1]],
2026-02-21T12:48:41.3104252Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:48:41.3104509Z  'num_stages': 2,
2026-02-21T12:48:41.3104723Z  'num_warps': 4,
2026-02-21T12:48:41.3105084Z  'pid_type': 'flat',
2026-02-21T12:48:41.3105333Z  'range_flattens': [None, True],
2026-02-21T12:48:41.3105615Z  'range_multi_buffers': [None, True],
2026-02-21T12:48:41.3105907Z  'range_num_stages': [0, 2],
2026-02-21T12:48:41.3106265Z  'range_unroll_factors': [0, 2],
2026-02-21T12:48:41.3106548Z  'range_warp_specializes': [],
2026-02-21T12:48:41.3106806Z  'waves_per_eu': 2}
2026-02-21T12:48:41.3458323Z [652s] Fitting surrogate: 1029 points, 1029 targets
2026-02-21T12:48:42.1287853Z [653s] Generation 12 starting: 72 neighbors, 4 active search path(s)
2026-02-21T12:49:17.7777528Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74/74 0.5 configs/s
2026-02-21T12:49:22.8615812Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 74/74 15.0 configs/s
2026-02-21T12:49:29.2054866Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 149.5 configs/s
2026-02-21T12:49:30.1396402Z [701s] Generation 12 complete: 
2026-02-21T12:49:30.1396767Z error=4
2026-02-21T12:49:30.1396986Z ok=72
2026-02-21T12:49:30.1397193Z min=0.1208
2026-02-21T12:49:30.1397418Z mid=0.1917
2026-02-21T12:49:30.1397638Z max=1.3322
2026-02-21T12:49:30.1397879Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:49:30.1398297Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T12:49:30.1398709Z  'l2_groupings': [1],
2026-02-21T12:49:30.1398986Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:49:30.1399324Z  'loop_orders': [[0, 1]],
2026-02-21T12:49:30.1399616Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:49:30.1399883Z  'num_stages': 1,
2026-02-21T12:49:30.1400122Z  'num_warps': 4,
2026-02-21T12:49:30.1400351Z  'pid_type': 'flat',
2026-02-21T12:49:30.1400619Z  'range_flattens': [None, True],
2026-02-21T12:49:30.1400922Z  'range_multi_buffers': [None, True],
2026-02-21T12:49:30.1401512Z  'range_num_stages': [0, 2],
2026-02-21T12:49:30.1401786Z  'range_unroll_factors': [0, 2],
2026-02-21T12:49:30.1402104Z  'range_warp_specializes': [],
2026-02-21T12:49:30.1402391Z  'waves_per_eu': 2}
2026-02-21T12:49:30.2115531Z [701s] Fitting surrogate: 1105 points, 1105 targets
2026-02-21T12:49:30.8939213Z [701s] Generation 13 starting: 55 neighbors, 3 active search path(s)
2026-02-21T12:50:08.3530171Z [739s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:50:08.5218313Z [739s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[3, 4], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:50:08.5230771Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 0.7 configs/s
2026-02-21T12:50:11.7251947Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 57/57 18.1 configs/s
2026-02-21T12:50:13.7228268Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 421.9 configs/s
2026-02-21T12:50:14.4189620Z [745s] Generation 13 complete: 
2026-02-21T12:50:14.4190015Z error=2
2026-02-21T12:50:14.4190222Z timeout=2
2026-02-21T12:50:14.4190430Z ok=54
2026-02-21T12:50:14.4190663Z min=0.1215
2026-02-21T12:50:14.4190867Z mid=0.2396
2026-02-21T12:50:14.4191067Z max=1.9598
2026-02-21T12:50:14.4191307Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:50:14.4191754Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T12:50:14.4192156Z  'l2_groupings': [1],
2026-02-21T12:50:14.4192904Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:50:14.4193222Z  'loop_orders': [[0, 1]],
2026-02-21T12:50:14.4193511Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:50:14.4193777Z  'num_stages': 1,
2026-02-21T12:50:14.4194015Z  'num_warps': 4,
2026-02-21T12:50:14.4194255Z  'pid_type': 'flat',
2026-02-21T12:50:14.4194530Z  'range_flattens': [None, True],
2026-02-21T12:50:14.4194846Z  'range_multi_buffers': [None, True],
2026-02-21T12:50:14.4195153Z  'range_num_stages': [0, 2],
2026-02-21T12:50:14.4195437Z  'range_unroll_factors': [0, 2],
2026-02-21T12:50:14.4195762Z  'range_warp_specializes': [],
2026-02-21T12:50:14.4196036Z  'waves_per_eu': 2}
2026-02-21T12:50:14.4406838Z [745s] Fitting surrogate: 1163 points, 1163 targets
2026-02-21T12:50:14.9072852Z [745s] Generation 14 starting: 40 neighbors, 2 active search path(s)
2026-02-21T12:50:46.7037234Z [777s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[2, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:50:47.2413727Z [778s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[2, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:50:47.9290768Z [778s] Timeout after 30s compiling Config(block_sizes=[1, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[None, None], range_num_stages=[2, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:50:47.9311716Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 1.0 configs/s
2026-02-21T12:50:50.3222830Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 42/42 17.9 configs/s
2026-02-21T12:50:51.9759350Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 721.7 configs/s
2026-02-21T12:50:52.6562386Z [783s] Generation 14 complete: 
2026-02-21T12:50:52.6562774Z timeout=3
2026-02-21T12:50:52.6563273Z ok=40
2026-02-21T12:50:52.6563433Z min=0.1216
2026-02-21T12:50:52.6563594Z mid=0.2526
2026-02-21T12:50:52.6563750Z max=0.8945
2026-02-21T12:50:52.6563926Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:50:52.6564275Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T12:50:52.6564595Z  'l2_groupings': [1],
2026-02-21T12:50:52.6564814Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:50:52.6565067Z  'loop_orders': [[0, 1]],
2026-02-21T12:50:52.6565383Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:50:52.6565596Z  'num_stages': 1,
2026-02-21T12:50:52.6565777Z  'num_warps': 4,
2026-02-21T12:50:52.6565963Z  'pid_type': 'flat',
2026-02-21T12:50:52.6566167Z  'range_flattens': [None, True],
2026-02-21T12:50:52.6566409Z  'range_multi_buffers': [None, True],
2026-02-21T12:50:52.6566656Z  'range_num_stages': [0, 2],
2026-02-21T12:50:52.6566878Z  'range_unroll_factors': [0, 2],
2026-02-21T12:50:52.6567115Z  'range_warp_specializes': [],
2026-02-21T12:50:52.6567330Z  'waves_per_eu': 2}
2026-02-21T12:50:52.6688883Z [783s] Fitting surrogate: 1206 points, 1206 targets
2026-02-21T12:50:53.1456907Z [784s] Generation 15 starting: 40 neighbors, 2 active search path(s)
2026-02-21T12:51:14.3064360Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 1.3 configs/s
2026-02-21T12:51:16.8585867Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 42/42 17.5 configs/s
2026-02-21T12:51:17.3089521Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1620.0 configs/s
2026-02-21T12:51:18.0854997Z [809s] Generation 15 complete: 
2026-02-21T12:51:18.0855323Z error=1
2026-02-21T12:51:18.0855519Z ok=42
2026-02-21T12:51:18.0855714Z min=0.1229
2026-02-21T12:51:18.0858811Z mid=0.2730
2026-02-21T12:51:18.0859040Z max=0.7985
2026-02-21T12:51:18.0859567Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:51:18.0859973Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T12:51:18.0860377Z  'l2_groupings': [1],
2026-02-21T12:51:18.0860634Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:51:18.0860928Z  'loop_orders': [[0, 1]],
2026-02-21T12:51:18.0861172Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:51:18.0861403Z  'num_stages': 1,
2026-02-21T12:51:18.0861599Z  'num_warps': 4,
2026-02-21T12:51:18.0861804Z  'pid_type': 'flat',
2026-02-21T12:51:18.0862022Z  'range_flattens': [None, True],
2026-02-21T12:51:18.0862281Z  'range_multi_buffers': [None, True],
2026-02-21T12:51:18.0862550Z  'range_num_stages': [0, 2],
2026-02-21T12:51:18.0862784Z  'range_unroll_factors': [0, 2],
2026-02-21T12:51:18.0863038Z  'range_warp_specializes': [],
2026-02-21T12:51:18.0863275Z  'waves_per_eu': 2}
2026-02-21T12:51:18.0947075Z [809s] Fitting surrogate: 1249 points, 1249 targets
2026-02-21T12:51:18.5385906Z [809s] Generation 16 starting: 38 neighbors, 2 active search path(s)
2026-02-21T12:51:49.8083945Z [840s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[2, 2], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:51:49.8099557Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 0.8 configs/s
2026-02-21T12:51:52.1218403Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 40/40 17.7 configs/s
2026-02-21T12:51:53.4597128Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 599.8 configs/s
2026-02-21T12:51:54.2497847Z [845s] Generation 16 complete: 
2026-02-21T12:51:54.2498196Z error=1
2026-02-21T12:51:54.2498391Z timeout=1
2026-02-21T12:51:54.2498583Z ok=39
2026-02-21T12:51:54.2499057Z min=0.1248
2026-02-21T12:51:54.2499254Z mid=0.2399
2026-02-21T12:51:54.2499441Z max=1.2635
2026-02-21T12:51:54.2499659Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:51:54.2500070Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T12:51:54.2500458Z  'l2_groupings': [1],
2026-02-21T12:51:54.2500723Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:51:54.2501023Z  'loop_orders': [[0, 1]],
2026-02-21T12:51:54.2501291Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:51:54.2501556Z  'num_stages': 1,
2026-02-21T12:51:54.2501776Z  'num_warps': 4,
2026-02-21T12:51:54.2501999Z  'pid_type': 'flat',
2026-02-21T12:51:54.2502250Z  'range_flattens': [None, True],
2026-02-21T12:51:54.2502538Z  'range_multi_buffers': [None, True],
2026-02-21T12:51:54.2502827Z  'range_num_stages': [0, 2],
2026-02-21T12:51:54.2503089Z  'range_unroll_factors': [0, 2],
2026-02-21T12:51:54.2503375Z  'range_warp_specializes': [],
2026-02-21T12:51:54.2503645Z  'waves_per_eu': 2}
2026-02-21T12:51:54.2660867Z [845s] Fitting surrogate: 1290 points, 1290 targets
2026-02-21T12:51:54.7517903Z [845s] Generation 17 starting: 41 neighbors, 2 active search path(s)
2026-02-21T12:52:31.7872054Z [882s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, True], range_num_stages=[3, 4], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:52:31.7892224Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42/42 0.4 configs/s
2026-02-21T12:52:34.2000394Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 42/42 17.8 configs/s
2026-02-21T12:52:36.4599116Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 376.0 configs/s
2026-02-21T12:52:37.1623938Z [888s] Generation 17 complete: 
2026-02-21T12:52:37.1624327Z error=1
2026-02-21T12:52:37.1624533Z timeout=1
2026-02-21T12:52:37.1624761Z ok=42
2026-02-21T12:52:37.1624960Z min=0.1222
2026-02-21T12:52:37.1625163Z mid=0.2331
2026-02-21T12:52:37.1625362Z max=0.8132
2026-02-21T12:52:37.1625593Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:52:37.1626020Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T12:52:37.1626437Z  'l2_groupings': [1],
2026-02-21T12:52:37.1626720Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:52:37.1627036Z  'loop_orders': [[0, 1]],
2026-02-21T12:52:37.1627318Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:52:37.1627587Z  'num_stages': 1,
2026-02-21T12:52:37.1627819Z  'num_warps': 4,
2026-02-21T12:52:37.1628049Z  'pid_type': 'flat',
2026-02-21T12:52:37.1628324Z  'range_flattens': [None, True],
2026-02-21T12:52:37.1628624Z  'range_multi_buffers': [None, True],
2026-02-21T12:52:37.1628939Z  'range_num_stages': [0, 2],
2026-02-21T12:52:37.1629212Z  'range_unroll_factors': [0, 2],
2026-02-21T12:52:37.1629507Z  'range_warp_specializes': [],
2026-02-21T12:52:37.1629792Z  'waves_per_eu': 2}
2026-02-21T12:52:37.1829173Z [888s] Fitting surrogate: 1334 points, 1334 targets
2026-02-21T12:52:37.4716303Z [888s] Generation 18 starting: 19 neighbors, 1 active search path(s)
2026-02-21T12:53:08.9500802Z [919s] Timeout after 30s compiling Config(block_sizes=[1, 64, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[3, 4], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:53:08.9521175Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 0.6 configs/s
2026-02-21T12:53:10.0999497Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 18.2 configs/s
2026-02-21T12:53:10.4202332Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 2173.6 configs/s
2026-02-21T12:53:11.1090635Z [922s] Generation 18 complete: 
2026-02-21T12:53:11.1090897Z timeout=1
2026-02-21T12:53:11.1091008Z ok=20
2026-02-21T12:53:11.1091146Z min=0.1229
2026-02-21T12:53:11.1100325Z mid=0.2640
2026-02-21T12:53:11.1100459Z max=0.8246
2026-02-21T12:53:11.1100576Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:53:11.1100785Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T12:53:11.1100984Z  'l2_groupings': [1],
2026-02-21T12:53:11.1101124Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:53:11.1101287Z  'loop_orders': [[0, 1]],
2026-02-21T12:53:11.1101426Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:53:11.1101563Z  'num_stages': 1,
2026-02-21T12:53:11.1101697Z  'num_warps': 4,
2026-02-21T12:53:11.1101815Z  'pid_type': 'flat',
2026-02-21T12:53:11.1101943Z  'range_flattens': [None, True],
2026-02-21T12:53:11.1103469Z  'range_multi_buffers': [None, True],
2026-02-21T12:53:11.1104220Z  'range_num_stages': [0, 2],
2026-02-21T12:53:11.1104532Z  'range_unroll_factors': [0, 2],
2026-02-21T12:53:11.1104843Z  'range_warp_specializes': [],
2026-02-21T12:53:11.1105131Z  'waves_per_eu': 2}
2026-02-21T12:53:11.1187601Z [922s] Fitting surrogate: 1355 points, 1355 targets
2026-02-21T12:53:11.3960542Z [922s] Generation 19 starting: 17 neighbors, 1 active search path(s)
2026-02-21T12:53:16.5759503Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 5.2 configs/s
2026-02-21T12:53:17.7064907Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 17.3 configs/s
2026-02-21T12:53:18.5123080Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 984.9 configs/s
2026-02-21T12:53:19.2856281Z [930s] Generation 19 complete: 
2026-02-21T12:53:19.2856527Z ok=19
2026-02-21T12:53:19.2856685Z min=0.1231
2026-02-21T12:53:19.2856843Z mid=0.1896
2026-02-21T12:53:19.2857008Z max=0.4380
2026-02-21T12:53:19.2857176Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:53:19.2857502Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T12:53:19.2857803Z  'l2_groupings': [1],
2026-02-21T12:53:19.2858023Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:53:19.2858264Z  'loop_orders': [[0, 1]],
2026-02-21T12:53:19.2858472Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:53:19.2858674Z  'num_stages': 1,
2026-02-21T12:53:19.2858844Z  'num_warps': 4,
2026-02-21T12:53:19.2859016Z  'pid_type': 'flat',
2026-02-21T12:53:19.2859208Z  'range_flattens': [None, True],
2026-02-21T12:53:19.2859442Z  'range_multi_buffers': [None, True],
2026-02-21T12:53:19.2859671Z  'range_num_stages': [0, 2],
2026-02-21T12:53:19.2859876Z  'range_unroll_factors': [0, 2],
2026-02-21T12:53:19.2860103Z  'range_warp_specializes': [],
2026-02-21T12:53:19.2860311Z  'waves_per_eu': 2}
2026-02-21T12:53:19.3012727Z [930s] Fitting surrogate: 1374 points, 1374 targets
2026-02-21T12:53:19.6408002Z [930s] Generation 20 starting: 19 neighbors, 1 active search path(s)
2026-02-21T12:53:29.2106305Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 2.0 configs/s
2026-02-21T12:53:30.4759224Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 17.0 configs/s
2026-02-21T12:53:32.5983923Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 408.1 configs/s
2026-02-21T12:53:33.2845962Z [944s] Generation 20 complete: 
2026-02-21T12:53:33.2846207Z ok=21
2026-02-21T12:53:33.2846362Z min=0.1248
2026-02-21T12:53:33.2846537Z mid=0.1549
2026-02-21T12:53:33.2846685Z max=0.4356
2026-02-21T12:53:33.2846857Z best={'block_sizes': [1, 128, 32],
2026-02-21T12:53:33.2847353Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T12:53:33.2847648Z  'l2_groupings': [1],
2026-02-21T12:53:33.2847850Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:53:33.2848204Z  'loop_orders': [[0, 1]],
2026-02-21T12:53:33.2848412Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:53:33.2848607Z  'num_stages': 1,
2026-02-21T12:53:33.2848776Z  'num_warps': 4,
2026-02-21T12:53:33.2848948Z  'pid_type': 'flat',
2026-02-21T12:53:33.2849148Z  'range_flattens': [None, True],
2026-02-21T12:53:33.2849368Z  'range_multi_buffers': [None, True],
2026-02-21T12:53:33.2849595Z  'range_num_stages': [0, 2],
2026-02-21T12:53:33.2849800Z  'range_unroll_factors': [0, 2],
2026-02-21T12:53:33.2850028Z  'range_warp_specializes': [],
2026-02-21T12:53:33.2850236Z  'waves_per_eu': 2}
2026-02-21T12:53:33.3105083Z [944s] Fitting surrogate: 1395 points, 1395 targets
2026-02-21T12:53:33.4652619Z [944s] Autotuning complete in 944.5s after searching 1311 configs.
2026-02-21T12:53:33.4653271Z One can hardcode the best config and skip autotuning with:
2026-02-21T12:53:33.4655211Z     @helion.kernel(config=helion.Config(block_sizes=[1, 128, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T12:53:33.4656946Z 
2026-02-21T12:53:33.4657406Z [944s] Code of selected kernel: /tmp/torchinductor_root/rz/crztikscnyuaoookxi6imyujxfonfnzor3eprliz4nchtvjwxhku.py
2026-02-21T12:53:33.4896669Z from __future__ import annotations
2026-02-21T12:53:33.4896968Z 
2026-02-21T12:53:33.4897073Z import torch
2026-02-21T12:53:33.4897497Z import triton
2026-02-21T12:53:33.4897753Z import triton.language as tl
2026-02-21T12:53:33.4898108Z from torch._inductor.runtime import triton_helpers
2026-02-21T12:53:33.4898582Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T12:53:33.4899084Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T12:53:33.4899417Z 
2026-02-21T12:53:33.4899536Z _BLOCK_SIZE_1 = tl.constexpr(128)
2026-02-21T12:53:33.4899842Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T12:53:33.4900144Z _SHAPE_DIM = tl.constexpr(128)
2026-02-21T12:53:33.4900439Z _BLOCK_SIZE_3 = tl.constexpr(32)
2026-02-21T12:53:33.4900725Z _SHAPE_DIM_1 = tl.constexpr(128)
2026-02-21T12:53:33.4900914Z 
2026-02-21T12:53:33.4901007Z @triton.jit
2026-02-21T12:53:33.4901383Z def _helion_attention(q_view, k_view, v_view, out, _RDIM_SIZE_2: tl.constexpr):
2026-02-21T12:53:33.4902001Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T12:53:33.4902451Z     num_blocks_0 = 192
2026-02-21T12:53:33.4902739Z     pid_0 = tl.program_id(0) % num_blocks_0
2026-02-21T12:53:33.4903080Z     pid_1 = tl.program_id(0) // num_blocks_0
2026-02-21T12:53:33.4903392Z     offset_0 = pid_0
2026-02-21T12:53:33.4903689Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T12:53:33.4904020Z     offset_1 = pid_1 * _BLOCK_SIZE_1
2026-02-21T12:53:33.4904342Z     indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32)
2026-02-21T12:53:33.4904628Z     indices_4 = tl.arange(0, _RDIM_SIZE_2).to(tl.int32)
2026-02-21T12:53:33.4904971Z     # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T12:53:33.4905419Z     m_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], float('-inf'), tl.float32)
2026-02-21T12:53:33.4905730Z     # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0)
2026-02-21T12:53:33.4906017Z     l_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 1.0, tl.float32)
2026-02-21T12:53:33.4906361Z     # src[attention.py:70]: acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32)
2026-02-21T12:53:33.4906722Z     acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128], 0.0, tl.float32)
2026-02-21T12:53:33.4914198Z     # src[attention.py:71]: q = q_view[tile_b, tile_m, :]
2026-02-21T12:53:33.4914669Z     q = tl.load(tl.make_block_ptr(q_view, [192, 512, 128], [65536, 128, 1], [offset_0, offset_1, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _SHAPE_DIM], [2, 1, 0]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T12:53:33.4915106Z     # src[attention.py:72]: for tile_n in hl.tile(v_view.size(1)):
2026-02-21T12:53:33.4915323Z     # src[attention.py:73]:     k = k_view[tile_b, :, tile_n]
2026-02-21T12:53:33.4915520Z     # src[attention.py:74]:     qk = torch.bmm(q, k)
2026-02-21T12:53:33.4915687Z     # src[attention.py:72-85]: ...
2026-02-21T12:53:33.4915995Z     for offset_2 in tl.range(0, 512, _BLOCK_SIZE_3, loop_unroll_factor=2, num_stages=1, disallow_acc_multi_buffer=False, flatten=True):
2026-02-21T12:53:33.4916355Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_3).to(tl.int32)
2026-02-21T12:53:33.4916538Z         q_copy = q
2026-02-21T12:53:33.4916652Z         m_i_copy = m_i
2026-02-21T12:53:33.4916770Z         l_i_copy = l_i
2026-02-21T12:53:33.4916886Z         acc_copy = acc
2026-02-21T12:53:33.4917002Z         q_copy_0 = q_copy
2026-02-21T12:53:33.4917132Z         m_i_copy_0 = m_i_copy
2026-02-21T12:53:33.4917265Z         l_i_copy_0 = l_i_copy
2026-02-21T12:53:33.4917393Z         acc_copy_0 = acc_copy
2026-02-21T12:53:33.4917547Z         # src[attention.py:73]: k = k_view[tile_b, :, tile_n]
2026-02-21T12:53:33.4917974Z         k = tl.load(tl.make_block_ptr(k_view, [192, 128, 512], [65536, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_1, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T12:53:33.4918387Z         # src[attention.py:74]: qk = torch.bmm(q, k)
2026-02-21T12:53:33.4918959Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T12:53:33.4919554Z         # src[attention.py:75]: m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale)
2026-02-21T12:53:33.4919790Z         amax = tl.cast(tl.max(qk, 2), tl.bfloat16)
2026-02-21T12:53:33.4919945Z         v_0 = 0.12751743074602467
2026-02-21T12:53:33.4920090Z         v_1 = tl.cast(amax * v_0, tl.bfloat16)
2026-02-21T12:53:33.4920243Z         v_2 = tl.cast(v_1, tl.float32)
2026-02-21T12:53:33.4920408Z         v_3 = triton_helpers.maximum(m_i_copy_0, v_2)
2026-02-21T12:53:33.4920609Z         # src[attention.py:76]: qk = qk * qk_scale - m_ij[:, :, None]
2026-02-21T12:53:33.4920789Z         v_4 = 0.12751743074602467
2026-02-21T12:53:33.4920929Z         v_5 = tl.cast(qk * v_4, tl.bfloat16)
2026-02-21T12:53:33.4921080Z         subscript = v_3[:, :, None]
2026-02-21T12:53:33.4921224Z         v_6 = tl.cast(v_5, tl.float32)
2026-02-21T12:53:33.4921367Z         v_7 = v_6 - subscript
2026-02-21T12:53:33.4921529Z         # src[attention.py:77]: p = torch.exp2(qk)
2026-02-21T12:53:33.4921691Z         v_8 = libdevice.exp2(v_7)
2026-02-21T12:53:33.4921848Z         # src[attention.py:78]: l_ij = torch.sum(p, -1)
2026-02-21T12:53:33.4922028Z         l_ij = tl.cast(tl.sum(v_8, 2), tl.float32)
2026-02-21T12:53:33.4922214Z         # src[attention.py:79]: alpha = torch.exp2(m_i - m_ij)
2026-02-21T12:53:33.4922392Z         v_9 = m_i_copy_0 - v_3
2026-02-21T12:53:33.4922527Z         v_10 = libdevice.exp2(v_9)
2026-02-21T12:53:33.4922764Z         # src[attention.py:80]: l_i = l_i * alpha + l_ij
2026-02-21T12:53:33.4922952Z         v_11 = l_i_copy_0 * v_10
2026-02-21T12:53:33.4923086Z         l_i = v_11 + l_ij
2026-02-21T12:53:33.4923238Z         # src[attention.py:81]: acc = acc * alpha[:, :, None]
2026-02-21T12:53:33.4923411Z         subscript_1 = v_10[:, :, None]
2026-02-21T12:53:33.4923557Z         v_13 = acc_copy_0 * subscript_1
2026-02-21T12:53:33.4923729Z         # src[attention.py:82]: v = v_view[tile_b, tile_n, :]
2026-02-21T12:53:33.4924043Z         v = tl.load(v_view + (indices_0[:, None, None] * 65536 + indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None)
2026-02-21T12:53:33.4924365Z         # src[attention.py:83]: p = p.to(v.dtype)
2026-02-21T12:53:33.4924554Z         v_14 = tl.cast(v_8, tl.bfloat16)
2026-02-21T12:53:33.4924701Z         # src[attention.py:84]: acc = torch.baddbmm(acc, p, v)
2026-02-21T12:53:33.4925164Z         acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128])
2026-02-21T12:53:33.4925605Z         # src[attention.py:85]: m_i = m_ij
2026-02-21T12:53:33.4925716Z         m_i = v_3
2026-02-21T12:53:33.4925829Z     # src[attention.py:87]: acc = acc / l_i[:, :, None]
2026-02-21T12:53:33.4925965Z     subscript_2 = l_i[:, :, None]
2026-02-21T12:53:33.4926076Z     v_15 = acc / subscript_2
2026-02-21T12:53:33.4926221Z     # src[attention.py:88]: out[tile_b, tile_m, :] = acc.to(out.dtype)
2026-02-21T12:53:33.4926375Z     v_16 = tl.cast(v_15, tl.bfloat16)
2026-02-21T12:53:33.4926599Z     tl.store(out + (indices_0[:, None, None] * 65536 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), v_16, None)
2026-02-21T12:53:33.4926785Z 
2026-02-21T12:53:33.4926917Z def attention(q_in: torch.Tensor, k_in: torch.Tensor, v_in: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T12:53:33.4927122Z     """
2026-02-21T12:53:33.4927214Z     Computes scaled dot-product attention.
2026-02-21T12:53:33.4927298Z 
2026-02-21T12:53:33.4927413Z     Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
2026-02-21T12:53:33.4927570Z 
2026-02-21T12:53:33.4927604Z     Args:
2026-02-21T12:53:33.4927708Z         q_in: Query tensor of shape [..., seq_len_q, head_dim]
2026-02-21T12:53:33.4927918Z         k_in: Key tensor of shape [..., seq_len_k, head_dim]
2026-02-21T12:53:33.4928075Z         v_in: Value tensor of shape [..., seq_len_k, head_dim]
2026-02-21T12:53:33.4928178Z 
2026-02-21T12:53:33.4928214Z     Returns:
2026-02-21T12:53:33.4928322Z         Output tensor of shape [..., seq_len_q, head_dim]
2026-02-21T12:53:33.4928450Z     """
2026-02-21T12:53:33.4928546Z     # src[attention.py:56]: m_dim = q_in.size(-2)
2026-02-21T12:53:33.4928673Z     m_dim = q_in.size(-2)
2026-02-21T12:53:33.4928789Z     # src[attention.py:57]: n_dim = k_in.size(-2)
2026-02-21T12:53:33.4928915Z     n_dim = k_in.size(-2)
2026-02-21T12:53:33.4929034Z     # src[attention.py:58]: assert n_dim == v_in.size(-2)
2026-02-21T12:53:33.4929172Z     assert n_dim == v_in.size(-2)
2026-02-21T12:53:33.4929318Z     # src[attention.py:59]: head_dim = hl.specialize(q_in.size(-1))
2026-02-21T12:53:33.4929468Z     head_dim = 128
2026-02-21T12:53:33.4929603Z     # src[attention.py:60]: assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T12:53:33.4929784Z     assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T12:53:33.4929955Z     # src[attention.py:61]: q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T12:53:33.4930123Z     q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T12:53:33.4930285Z     # src[attention.py:62]: v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T12:53:33.4930450Z     v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T12:53:33.4930632Z     # src[attention.py:63]: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T12:53:33.4930837Z     k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T12:53:33.4931023Z     # src[attention.py:64]: out = torch.empty_like(q_view)
2026-02-21T12:53:33.4931163Z     out = torch.empty_like(q_view)
2026-02-21T12:53:33.4931327Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T12:53:33.4931494Z     _BLOCK_SIZE_1 = 128
2026-02-21T12:53:33.4931590Z     _RDIM_SIZE_2 = 128
2026-02-21T12:53:33.4931737Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T12:53:33.4931989Z     # src[attention.py:68]:     m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T12:53:33.4932203Z     # src[attention.py:69]:     l_i = torch.full_like(m_i, 1.0)
2026-02-21T12:53:33.4932380Z     # src[attention.py:67-88]: ...
2026-02-21T12:53:33.4932695Z     _launcher(_helion_attention, (192 * triton.cdiv(512, _BLOCK_SIZE_1),), q_view, k_view, v_view, out, _RDIM_SIZE_2, num_warps=4, num_stages=1, waves_per_eu=2, matrix_instr_nonkdim=16)
2026-02-21T12:53:33.4933028Z     # src[attention.py:89]: return out.view(q_in.size())
2026-02-21T12:53:33.4933166Z     return out.view(q_in.size())
2026-02-21T12:53:34.3074750Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T12:53:34.3076967Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T12:53:34.3079124Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T12:53:34.3079603Z WARNING:tritonbench.utils.triton_op:Completed input ID 2:
2026-02-21T12:53:34.3080033Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T12:53:34.3080368Z ------------------------------------------
2026-02-21T12:53:34.3080683Z (4, 48, 512, 512, 128)
2026-02-21T12:53:34.3080845Z 
2026-02-21T12:53:34.3085208Z  50%|█████     | 3/6 [26:40<30:43, 614.57s/it]WARNING:tritonbench.utils.triton_op:Running input ID 4:
2026-02-21T12:53:34.3085598Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T12:53:34.3086029Z ------------------------------------------
2026-02-21T12:53:34.3086253Z (4, 48, 2048, 2048, 128)
2026-02-21T12:53:34.3087930Z INFO:tritonbench.utils.triton_op:Took 0.09ms to get benchmark function for aten
2026-02-21T12:53:35.3065035Z INFO:tritonbench.utils.triton_op:Took 1.66ms to get benchmark function for flex_attention
2026-02-21T12:53:36.4652620Z WARNING:__main__:Input tensor metadata:
2026-02-21T12:53:36.4652912Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T12:53:36.4653161Z               'dtype': 'torch.bfloat16',
2026-02-21T12:53:36.4653399Z               'shape': (4, 48, 2048, 128),
2026-02-21T12:53:36.4653655Z               'stride': (12582912, 262144, 128, 1)},
2026-02-21T12:53:36.4653899Z             { 'device': 'cuda:0',
2026-02-21T12:53:36.4654123Z               'dtype': 'torch.bfloat16',
2026-02-21T12:53:36.4654351Z               'shape': (4, 48, 2048, 128),
2026-02-21T12:53:36.4654585Z               'stride': (12582912, 262144, 128, 1)},
2026-02-21T12:53:36.4654825Z             { 'device': 'cuda:0',
2026-02-21T12:53:36.4655038Z               'dtype': 'torch.bfloat16',
2026-02-21T12:53:36.4655275Z               'shape': (4, 48, 2048, 128),
2026-02-21T12:53:36.4655509Z               'stride': (12582912, 262144, 128, 1)}),
2026-02-21T12:53:36.4655742Z   'kwargs': {}}
2026-02-21T12:53:36.4700452Z INFO:tritonbench.utils.triton_op:Took 5.19ms to get benchmark function for helion_attention
2026-02-21T12:53:36.7208569Z [0s] Autotune random seed: 2150287535
2026-02-21T12:53:36.7576062Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T12:54:11.3810466Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 64, 2048], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T12:54:12.3835759Z [35s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=4, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, None], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:54:12.7902591Z [36s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T12:54:13.5050724Z [36s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[None, None], range_num_stages=[4, 2], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:54:13.8357473Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 1, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, None], range_num_stages=[1, 2], range_unroll_factors=[1, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:54:14.1845155Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[3, 3], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:54:15.3142081Z [38s] Timeout after 30s compiling Config(block_sizes=[1, 16, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, False], range_num_stages=[2, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:54:16.3358463Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 128], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=1, num_warps=16, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[1, 1], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T12:54:17.1897909Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 1, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[3, 0], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T12:54:17.3794886Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[0, 0], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:54:17.9722667Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 1, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, True], range_num_stages=[3, 1], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:54:18.2071950Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, True], range_num_stages=[4, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:54:20.0148678Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[1, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T12:54:20.4376469Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, True], range_num_stages=[4, 0], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:54:20.6909244Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 32, 2048], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=4, num_warps=4, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, False], range_num_stages=[3, 2], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T12:54:21.3984528Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:54:22.8463258Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 64, 512], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[4, 2], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T12:54:23.1789074Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 512, 1024], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:54:25.7985886Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 256], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:54:25.9733757Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 64, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=8, num_stages=4, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, True], range_num_stages=[1, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T12:54:26.2027796Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 4], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:54:26.8904640Z [50s] Timeout after 30s compiling Config(block_sizes=[1, 8, 1024], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=4, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:54:27.2914717Z [50s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T12:54:27.8084474Z [51s] Timeout after 30s compiling Config(block_sizes=[1, 4, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[0, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:54:28.1682734Z [51s] Timeout after 30s compiling Config(block_sizes=[1, 1, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:54:28.1708097Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.2 configs/s
2026-02-21T12:55:19.5172168Z /tmp/torchinductor_root/3j/c3jqldgdfc2n2y6s5hmlk2le75i7srd7briu3ykyopeiqmr7asfz.py:63:24: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T12:55:19.5173220Z             k = tl.load(tl.make_block_ptr(k_view, [192, 128, 2048], [262144, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_3, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T12:55:19.5174147Z                        ^
2026-02-21T12:55:19.5175479Z /tmp/torchinductor_root/3j/c3jqldgdfc2n2y6s5hmlk2le75i7srd7briu3ykyopeiqmr7asfz.py:65:145: note: - use: %137 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x512xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 8], order = [1, 0, 2]}>>) -> tensor<128x512xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>>
2026-02-21T12:55:19.5176619Z 
2026-02-21T12:55:19.5177281Z             qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T12:55:19.5178172Z                                                                                                                                                 ^
2026-02-21T12:55:19.5178533Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T12:55:19.5179027Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}>
2026-02-21T12:55:19.5179642Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}>
2026-02-21T12:55:19.5180155Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}>
2026-02-21T12:55:19.5180659Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [2, 1, 0]}>
2026-02-21T12:55:19.5181279Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T12:55:19.5181762Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}>
2026-02-21T12:55:19.5182231Z #blocked6 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}>
2026-02-21T12:55:19.5182687Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}>
2026-02-21T12:55:19.5183159Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}>
2026-02-21T12:55:19.5183653Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}>
2026-02-21T12:55:19.5184164Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}>
2026-02-21T12:55:19.5184690Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}>
2026-02-21T12:55:19.5185233Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [0, 1, 2]}>
2026-02-21T12:55:19.5185748Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}>
2026-02-21T12:55:19.5186267Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T12:55:19.5186784Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [0, 1, 2]}>
2026-02-21T12:55:19.5187401Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T12:55:19.5188242Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T12:55:19.5188889Z     %c2048_i64 = arith.constant 2048 : i64
2026-02-21T12:55:19.5189088Z     %c128_i64 = arith.constant 128 : i64
2026-02-21T12:55:19.5189308Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T12:55:19.5189476Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T12:55:19.5189628Z     %c262144_i64 = arith.constant 262144 : i64
2026-02-21T12:55:19.5189775Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T12:55:19.5189924Z     %c262144_i32 = arith.constant 262144 : i32
2026-02-21T12:55:19.5190104Z     %cst = arith.constant dense<128> : tensor<1x1x128xi64, #blocked>
2026-02-21T12:55:19.5190329Z     %cst_0 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked>
2026-02-21T12:55:19.5190564Z     %cst_1 = arith.constant dense<0.000000e+00> : tensor<1x128x512xbf16, #blocked1>
2026-02-21T12:55:19.5190842Z     %cst_2 = arith.constant dense<2048> : tensor<1x1x512xi64, #blocked1>
2026-02-21T12:55:19.5191058Z     %cst_3 = arith.constant dense<0> : tensor<1x1x512xi64, #blocked1>
2026-02-21T12:55:19.5191277Z     %cst_4 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2>
2026-02-21T12:55:19.5191498Z     %cst_5 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2>
2026-02-21T12:55:19.5191718Z     %cst_6 = arith.constant dense<128> : tensor<1x1x512xi64, #blocked1>
2026-02-21T12:55:19.5191896Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T12:55:19.5192043Z     %c2048_i32 = arith.constant 2048 : i32
2026-02-21T12:55:19.5192193Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T12:55:19.5192342Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T12:55:19.5192493Z     %c393216_i32 = arith.constant 393216 : i32
2026-02-21T12:55:19.5192670Z     %cst_7 = arith.constant dense<128> : tensor<1x512x1xi32, #blocked3>
2026-02-21T12:55:19.5192904Z     %cst_8 = arith.constant dense<0.127517432> : tensor<1x1x512xf32, #blocked1>
2026-02-21T12:55:19.5193168Z     %cst_9 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5193414Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<1x512xf32, #blocked5>
2026-02-21T12:55:19.5193615Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T12:55:19.5193799Z     %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked>
2026-02-21T12:55:19.5194044Z     %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5194283Z     %cst_13 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5194479Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T12:55:19.5194612Z     %c192_i32 = arith.constant 192 : i32
2026-02-21T12:55:19.5194757Z     %0 = tt.get_program_id x : i32
2026-02-21T12:55:19.5194954Z     %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked6>
2026-02-21T12:55:19.5195282Z     %2 = ttg.convert_layout %1 : tensor<128xi32, #blocked6> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T12:55:19.5195693Z     %3 = tt.expand_dims %2 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7>
2026-02-21T12:55:19.5196073Z     %4 = ttg.convert_layout %3 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked8>
2026-02-21T12:55:19.5196424Z     %5 = ttg.convert_layout %4 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T12:55:19.5196836Z     %6 = tt.expand_dims %5 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi32, #blocked9>
2026-02-21T12:55:19.5197212Z     %7 = ttg.convert_layout %6 : tensor<1x1x128xi32, #blocked9> -> tensor<1x1x128xi32, #blocked>
2026-02-21T12:55:19.5197499Z     %8 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked>
2026-02-21T12:55:19.5197753Z     %9 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked6>
2026-02-21T12:55:19.5198018Z     %10 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x512x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:55:19.5198282Z     %11 = arith.extsi %1 : tensor<128xi32, #blocked6> to tensor<128xi64, #blocked6>
2026-02-21T12:55:19.5198634Z     %12 = ttg.convert_layout %11 : tensor<128xi64, #blocked6> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T12:55:19.5199056Z     %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi64, #blocked7>
2026-02-21T12:55:19.5199392Z     %14 = ttg.convert_layout %13 : tensor<1x128xi64, #blocked7> -> tensor<1x128xi64, #blocked8>
2026-02-21T12:55:19.5199676Z     %15 = ttg.convert_layout %14 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked10}>>
2026-02-21T12:55:19.5200013Z     %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x128x1xi64, #blocked10>
2026-02-21T12:55:19.5200316Z     %17 = ttg.convert_layout %16 : tensor<1x128x1xi64, #blocked10> -> tensor<1x128x1xi64, #blocked2>
2026-02-21T12:55:19.5200565Z     %18 = tt.broadcast %17 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x512xi64, #blocked2>
2026-02-21T12:55:19.5200818Z     %19 = ttg.convert_layout %18 : tensor<1x128x512xi64, #blocked2> -> tensor<1x128x512xi64, #blocked1>
2026-02-21T12:55:19.5201056Z     %20 = arith.extsi %9 : tensor<512xi32, #blocked6> to tensor<512xi64, #blocked6>
2026-02-21T12:55:19.5201247Z     %21 = arith.cmpi sge, %17, %cst_5 : tensor<1x128x1xi64, #blocked2>
2026-02-21T12:55:19.5201422Z     %22 = arith.cmpi slt, %17, %cst_4 : tensor<1x128x1xi64, #blocked2>
2026-02-21T12:55:19.5201591Z     %23 = arith.andi %21, %22 : tensor<1x128x1xi1, #blocked2>
2026-02-21T12:55:19.5201790Z     %24 = tt.broadcast %7 : tensor<1x1x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked>
2026-02-21T12:55:19.5202041Z     %25 = ttg.convert_layout %24 : tensor<1x512x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked11>
2026-02-21T12:55:19.5202299Z     %26 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x512x128x!tt.ptr<bf16>, #blocked11>
2026-02-21T12:55:19.5202514Z     %27 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked>
2026-02-21T12:55:19.5202785Z     %28 = arith.extsi %1 : tensor<128xi32, #blocked6> to tensor<128xi64, #blocked6>
2026-02-21T12:55:19.5203052Z     %29 = ttg.convert_layout %28 : tensor<128xi64, #blocked6> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T12:55:19.5203387Z     %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi64, #blocked7>
2026-02-21T12:55:19.5203674Z     %31 = ttg.convert_layout %30 : tensor<1x128xi64, #blocked7> -> tensor<1x128xi64, #blocked8>
2026-02-21T12:55:19.5203961Z     %32 = ttg.convert_layout %31 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T12:55:19.5204298Z     %33 = tt.expand_dims %32 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi64, #blocked9>
2026-02-21T12:55:19.5204626Z     %34 = ttg.convert_layout %33 : tensor<1x1x128xi64, #blocked9> -> tensor<1x1x128xi64, #blocked>
2026-02-21T12:55:19.5204844Z     %35 = arith.cmpi sge, %34, %cst_0 : tensor<1x1x128xi64, #blocked>
2026-02-21T12:55:19.5205019Z     %36 = arith.cmpi slt, %34, %cst : tensor<1x1x128xi64, #blocked>
2026-02-21T12:55:19.5205187Z     %37 = arith.andi %35, %36 : tensor<1x1x128xi1, #blocked>
2026-02-21T12:55:19.5205347Z     scf.for %arg4 = %0 to %c393216_i32 step %c2432_i32  : i32 {
2026-02-21T12:55:19.5205520Z       %38 = arith.divsi %arg4, %c131072_i32 : i32
2026-02-21T12:55:19.5205649Z       %39 = arith.muli %38, %c64_i32 : i32
2026-02-21T12:55:19.5205775Z       %40 = arith.subi %c192_i32, %39 : i32
2026-02-21T12:55:19.5205898Z       %41 = arith.minsi %40, %c64_i32 : i32
2026-02-21T12:55:19.5206021Z       %42 = arith.remsi %arg4, %c131072_i32 : i32
2026-02-21T12:55:19.5206151Z       %43 = arith.remsi %42, %41 : i32
2026-02-21T12:55:19.5206266Z       %44 = arith.addi %39, %43 : i32
2026-02-21T12:55:19.5206383Z       %45 = arith.divsi %42, %41 : i32
2026-02-21T12:55:19.5206504Z       %46 = arith.muli %44, %c262144_i32 : i32
2026-02-21T12:55:19.5206628Z       %47 = arith.muli %45, %c128_i32 : i32
2026-02-21T12:55:19.5206763Z       %48 = arith.addi %46, %47 : i32
2026-02-21T12:55:19.5206902Z       %49 = tt.splat %48 : i32 -> tensor<1x1x128xi32, #blocked>
2026-02-21T12:55:19.5207064Z       %50 = arith.addi %49, %7 : tensor<1x1x128xi32, #blocked>
2026-02-21T12:55:19.5207264Z       %51 = tt.addptr %8, %50 : tensor<1x1x128x!tt.ptr<bf16>, #blocked>, tensor<1x1x128xi32, #blocked>
2026-02-21T12:55:19.5207478Z       %52 = tt.load %51 : tensor<1x1x128x!tt.ptr<bf16>, #blocked>
2026-02-21T12:55:19.5207620Z       %53 = arith.extsi %44 : i32 to i64
2026-02-21T12:55:19.5207746Z       %54 = arith.muli %53, %c262144_i64 : i64
2026-02-21T12:55:19.5207888Z       %55 = tt.splat %54 : i64 -> tensor<1x128x512xi64, #blocked1>
2026-02-21T12:55:19.5208038Z       %56 = arith.cmpi sge, %53, %c0_i64 : i64
2026-02-21T12:55:19.5208169Z       %57 = arith.cmpi slt, %53, %c192_i64 : i64
2026-02-21T12:55:19.5208294Z       %58 = arith.andi %56, %57 : i1
2026-02-21T12:55:19.5208433Z       %59 = tt.splat %58 : i1 -> tensor<1x128x1xi1, #blocked2>
2026-02-21T12:55:19.5208591Z       %60 = arith.andi %59, %23 : tensor<1x128x1xi1, #blocked2>
2026-02-21T12:55:19.5208789Z       %61 = tt.broadcast %60 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x512xi1, #blocked2>
2026-02-21T12:55:19.5209039Z       %62 = ttg.convert_layout %61 : tensor<1x128x512xi1, #blocked2> -> tensor<1x128x512xi1, #blocked1>
2026-02-21T12:55:19.5209287Z       %63 = tt.reshape %52 : tensor<1x1x128xbf16, #blocked> -> tensor<1x128xbf16, #blocked8>
2026-02-21T12:55:19.5209488Z       %64 = tt.splat %46 : i32 -> tensor<1x512x1xi32, #blocked3>
2026-02-21T12:55:19.5209886Z       %65:3 = scf.for %arg5 = %c0_i32 to %c2048_i32 step %c512_i32 iter_args(%arg6 = %cst_13, %arg7 = %cst_12, %arg8 = %cst_11) -> (tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked>)  : i32 {
2026-02-21T12:55:19.5210248Z         %91 = tt.splat %arg5 : i32 -> tensor<512xi32, #blocked6>
2026-02-21T12:55:19.5210405Z         %92 = arith.addi %91, %9 : tensor<512xi32, #blocked6>
2026-02-21T12:55:19.5210549Z         %93 = arith.extsi %arg5 : i32 to i64
2026-02-21T12:55:19.5210689Z         %94 = tt.splat %93 : i64 -> tensor<512xi64, #blocked6>
2026-02-21T12:55:19.5210839Z         %95 = arith.addi %94, %20 : tensor<512xi64, #blocked6>
2026-02-21T12:55:19.5211079Z         %96 = ttg.convert_layout %95 : tensor<512xi64, #blocked6> -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T12:55:19.5211413Z         %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x512xi64, #blocked7>
2026-02-21T12:55:19.5211709Z         %98 = ttg.convert_layout %97 : tensor<1x512xi64, #blocked7> -> tensor<1x512xi64, #blocked5>
2026-02-21T12:55:19.5212003Z         %99 = ttg.convert_layout %98 : tensor<1x512xi64, #blocked5> -> tensor<1x512xi64, #ttg.slice<{dim = 1, parent = #blocked12}>>
2026-02-21T12:55:19.5212366Z         %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x512xi64, #ttg.slice<{dim = 1, parent = #blocked12}>> -> tensor<1x1x512xi64, #blocked12>
2026-02-21T12:55:19.5212690Z         %101 = ttg.convert_layout %100 : tensor<1x1x512xi64, #blocked12> -> tensor<1x1x512xi64, #blocked1>
2026-02-21T12:55:19.5212915Z         %102 = arith.muli %101, %cst_6 : tensor<1x1x512xi64, #blocked1>
2026-02-21T12:55:19.5213132Z         %103 = tt.broadcast %102 : tensor<1x1x512xi64, #blocked1> -> tensor<1x128x512xi64, #blocked1>
2026-02-21T12:55:19.5213362Z         %104 = arith.addi %19, %103 : tensor<1x128x512xi64, #blocked1>
2026-02-21T12:55:19.5213532Z         %105 = arith.addi %55, %104 : tensor<1x128x512xi64, #blocked1>
2026-02-21T12:55:19.5213759Z         %106 = tt.addptr %10, %105 : tensor<1x128x512x!tt.ptr<bf16>, #blocked1>, tensor<1x128x512xi64, #blocked1>
2026-02-21T12:55:19.5213993Z         %107 = arith.cmpi sge, %101, %cst_3 : tensor<1x1x512xi64, #blocked1>
2026-02-21T12:55:19.5214186Z         %108 = arith.cmpi slt, %101, %cst_2 : tensor<1x1x512xi64, #blocked1>
2026-02-21T12:55:19.5214368Z         %109 = arith.andi %107, %108 : tensor<1x1x512xi1, #blocked1>
2026-02-21T12:55:19.5214588Z         %110 = tt.broadcast %109 : tensor<1x1x512xi1, #blocked1> -> tensor<1x128x512xi1, #blocked1>
2026-02-21T12:55:19.5214800Z         %111 = arith.andi %62, %110 : tensor<1x128x512xi1, #blocked1>
2026-02-21T12:55:19.5214979Z         %112 = tt.load %106, %111, %cst_1 : tensor<1x128x512x!tt.ptr<bf16>, #blocked1>
2026-02-21T12:55:19.5215209Z         %113 = tt.reshape %112 : tensor<1x128x512xbf16, #blocked1> -> tensor<128x512xbf16, #blocked5>
2026-02-21T12:55:19.5215516Z         %114 = ttg.convert_layout %63 : tensor<1x128xbf16, #blocked8> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>>
2026-02-21T12:55:19.5215876Z         %115 = ttg.convert_layout %113 : tensor<128x512xbf16, #blocked5> -> tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>>
2026-02-21T12:55:19.5216195Z         %116 = ttg.convert_layout %cst_10 : tensor<1x512xf32, #blocked5> -> tensor<1x512xf32, #blocked5>
2026-02-21T12:55:19.5216619Z         %117 = tt.dot %114, %115, %116, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>> * tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>> -> tensor<1x512xf32, #blocked5>
2026-02-21T12:55:19.5217019Z         %118 = tt.reshape %117 : tensor<1x512xf32, #blocked5> -> tensor<1x1x512xf32, #blocked1>
2026-02-21T12:55:19.5217269Z         %119 = arith.truncf %118 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1>
2026-02-21T12:55:19.5217516Z         %120 = arith.extf %119 : tensor<1x1x512xbf16, #blocked1> to tensor<1x1x512xf32, #blocked1>
2026-02-21T12:55:19.5217713Z         %121 = "tt.reduce"(%120) <{axis = 2 : i32}> ({
2026-02-21T12:55:19.5217860Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:55:19.5217988Z           %176 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T12:55:19.5218126Z           tt.reduce.return %176 : f32
2026-02-21T12:55:19.5218318Z         }) : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:55:19.5218617Z         %122 = ttg.convert_layout %121 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5218891Z         %123 = arith.truncf %122 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4>
2026-02-21T12:55:19.5219121Z         %124 = arith.extf %123 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5219319Z         %125 = arith.mulf %124, %cst_9 : tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5219513Z         %126 = arith.truncf %125 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4>
2026-02-21T12:55:19.5219738Z         %127 = arith.extf %126 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5219935Z         %128 = arith.cmpf ogt, %arg6, %127 : tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5230136Z         %129 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5230308Z         %130 = arith.ori %128, %129 : tensor<1x1xi1, #blocked4>
2026-02-21T12:55:19.5230514Z         %131 = arith.select %130, %arg6, %127 : tensor<1x1xi1, #blocked4>, tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5230730Z         %132 = arith.mulf %120, %cst_8 : tensor<1x1x512xf32, #blocked1>
2026-02-21T12:55:19.5230939Z         %133 = arith.truncf %132 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1>
2026-02-21T12:55:19.5231260Z         %134 = ttg.convert_layout %131 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>>
2026-02-21T12:55:19.5231603Z         %135 = tt.expand_dims %134 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13>
2026-02-21T12:55:19.5231912Z         %136 = ttg.convert_layout %135 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T12:55:19.5232167Z         %137 = arith.extf %133 : tensor<1x1x512xbf16, #blocked1> to tensor<1x1x512xf32, #blocked1>
2026-02-21T12:55:19.5232425Z         %138 = tt.broadcast %136 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x512xf32, #blocked14>
2026-02-21T12:55:19.5232684Z         %139 = ttg.convert_layout %138 : tensor<1x1x512xf32, #blocked14> -> tensor<1x1x512xf32, #blocked1>
2026-02-21T12:55:19.5232902Z         %140 = arith.subf %137, %139 : tensor<1x1x512xf32, #blocked1>
2026-02-21T12:55:19.5233215Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1x512xf32, #blocked1>
2026-02-21T12:55:19.5233512Z         %142 = "tt.reduce"(%141) <{axis = 2 : i32}> ({
2026-02-21T12:55:19.5233642Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T12:55:19.5233768Z           %176 = arith.addf %arg9, %arg10 : f32
2026-02-21T12:55:19.5233890Z           tt.reduce.return %176 : f32
2026-02-21T12:55:19.5234086Z         }) : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T12:55:19.5234384Z         %143 = ttg.convert_layout %142 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5234630Z         %144 = arith.subf %arg6, %131 : tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5234923Z         %145 = tt.extern_elementwise %144 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked4>) -> tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5235216Z         %146 = arith.mulf %arg7, %145 : tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5235380Z         %147 = arith.addf %146, %143 : tensor<1x1xf32, #blocked4>
2026-02-21T12:55:19.5235645Z         %148 = ttg.convert_layout %145 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>>
2026-02-21T12:55:19.5235984Z         %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13>
2026-02-21T12:55:19.5236294Z         %150 = ttg.convert_layout %149 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T12:55:19.5236547Z         %151 = tt.broadcast %150 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x128xf32, #blocked14>
2026-02-21T12:55:19.5236804Z         %152 = ttg.convert_layout %151 : tensor<1x1x128xf32, #blocked14> -> tensor<1x1x128xf32, #blocked>
2026-02-21T12:55:19.5237025Z         %153 = arith.mulf %arg8, %152 : tensor<1x1x128xf32, #blocked>
2026-02-21T12:55:19.5237271Z         %154 = ttg.convert_layout %92 : tensor<512xi32, #blocked6> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T12:55:19.5237609Z         %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x512xi32, #blocked7>
2026-02-21T12:55:19.5237910Z         %156 = ttg.convert_layout %155 : tensor<1x512xi32, #blocked7> -> tensor<1x512xi32, #blocked5>
2026-02-21T12:55:19.5238220Z         %157 = ttg.convert_layout %156 : tensor<1x512xi32, #blocked5> -> tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked15}>>
2026-02-21T12:55:19.5238573Z         %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x512x1xi32, #blocked15>
2026-02-21T12:55:19.5238889Z         %159 = ttg.convert_layout %158 : tensor<1x512x1xi32, #blocked15> -> tensor<1x512x1xi32, #blocked3>
2026-02-21T12:55:19.5239129Z         %160 = arith.muli %159, %cst_7 : tensor<1x512x1xi32, #blocked3>
2026-02-21T12:55:19.5239298Z         %161 = arith.addi %64, %160 : tensor<1x512x1xi32, #blocked3>
2026-02-21T12:55:19.5239508Z         %162 = tt.broadcast %161 : tensor<1x512x1xi32, #blocked3> -> tensor<1x512x128xi32, #blocked3>
2026-02-21T12:55:19.5239777Z         %163 = ttg.convert_layout %162 : tensor<1x512x128xi32, #blocked3> -> tensor<1x512x128xi32, #blocked11>
2026-02-21T12:55:19.5240002Z         %164 = arith.addi %163, %25 : tensor<1x512x128xi32, #blocked11>
2026-02-21T12:55:19.5240236Z         %165 = tt.addptr %26, %164 : tensor<1x512x128x!tt.ptr<bf16>, #blocked11>, tensor<1x512x128xi32, #blocked11>
2026-02-21T12:55:19.5240482Z         %166 = tt.load %165 : tensor<1x512x128x!tt.ptr<bf16>, #blocked11>
2026-02-21T12:55:19.5240698Z         %167 = arith.truncf %141 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1>
2026-02-21T12:55:19.5240943Z         %168 = tt.reshape %153 : tensor<1x1x128xf32, #blocked> -> tensor<1x128xf32, #blocked8>
2026-02-21T12:55:19.5241179Z         %169 = tt.reshape %167 : tensor<1x1x512xbf16, #blocked1> -> tensor<1x512xbf16, #blocked5>
2026-02-21T12:55:19.5241428Z         %170 = tt.reshape %166 : tensor<1x512x128xbf16, #blocked11> -> tensor<512x128xbf16, #blocked8>
2026-02-21T12:55:19.5241732Z         %171 = ttg.convert_layout %169 : tensor<1x512xbf16, #blocked5> -> tensor<1x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T12:55:19.5242094Z         %172 = ttg.convert_layout %170 : tensor<512x128xbf16, #blocked8> -> tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T12:55:19.5242406Z         %173 = ttg.convert_layout %168 : tensor<1x128xf32, #blocked8> -> tensor<1x128xf32, #blocked8>
2026-02-21T12:55:19.5242850Z         %174 = tt.dot %171, %172, %173, inputPrecision = tf32 : tensor<1x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x128xf32, #blocked8>
2026-02-21T12:55:19.5243250Z         %175 = tt.reshape %174 : tensor<1x128xf32, #blocked8> -> tensor<1x1x128xf32, #blocked>
2026-02-21T12:55:19.5243523Z         scf.yield %131, %147, %175 : tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked>
2026-02-21T12:55:19.5243745Z       } {tt.flatten, tt.num_stages = 4 : i32}
2026-02-21T12:55:19.5243998Z       %66 = ttg.convert_layout %65#1 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>>
2026-02-21T12:55:19.5244333Z       %67 = tt.expand_dims %66 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13>
2026-02-21T12:55:19.5244632Z       %68 = ttg.convert_layout %67 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T12:55:19.5244881Z       %69 = tt.broadcast %68 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x128xf32, #blocked14>
2026-02-21T12:55:19.5245127Z       %70 = ttg.convert_layout %69 : tensor<1x1x128xf32, #blocked14> -> tensor<1x1x128xf32, #blocked>
2026-02-21T12:55:19.5245342Z       %71 = arith.divf %65#2, %70 : tensor<1x1x128xf32, #blocked>
2026-02-21T12:55:19.5245539Z       %72 = arith.truncf %71 : tensor<1x1x128xf32, #blocked> to tensor<1x1x128xbf16, #blocked>
2026-02-21T12:55:19.5245722Z       %73 = arith.extsi %44 : i32 to i64
2026-02-21T12:55:19.5245844Z       %74 = arith.extsi %45 : i32 to i64
2026-02-21T12:55:19.5245964Z       %75 = arith.muli %73, %c262144_i64 : i64
2026-02-21T12:55:19.5246108Z       %76 = tt.splat %75 : i64 -> tensor<1x1x128xi64, #blocked>
2026-02-21T12:55:19.5246267Z       %77 = arith.muli %74, %c128_i64 : i64
2026-02-21T12:55:19.5246405Z       %78 = tt.splat %77 : i64 -> tensor<1x1x128xi64, #blocked>
2026-02-21T12:55:19.5246559Z       %79 = arith.addi %78, %34 : tensor<1x1x128xi64, #blocked>
2026-02-21T12:55:19.5246714Z       %80 = arith.addi %76, %79 : tensor<1x1x128xi64, #blocked>
2026-02-21T12:55:19.5246914Z       %81 = tt.addptr %27, %80 : tensor<1x1x128x!tt.ptr<bf16>, #blocked>, tensor<1x1x128xi64, #blocked>
2026-02-21T12:55:19.5247122Z       %82 = arith.cmpi sge, %73, %c0_i64 : i64
2026-02-21T12:55:19.5247249Z       %83 = arith.cmpi slt, %73, %c192_i64 : i64
2026-02-21T12:55:19.5247367Z       %84 = arith.andi %82, %83 : i1
2026-02-21T12:55:19.5247483Z       %85 = arith.cmpi sge, %74, %c0_i64 : i64
2026-02-21T12:55:19.5247606Z       %86 = arith.cmpi slt, %74, %c2048_i64 : i64
2026-02-21T12:55:19.5247726Z       %87 = arith.andi %85, %86 : i1
2026-02-21T12:55:19.5247832Z       %88 = arith.andi %84, %87 : i1
2026-02-21T12:55:19.5247965Z       %89 = tt.splat %88 : i1 -> tensor<1x1x128xi1, #blocked>
2026-02-21T12:55:19.5248119Z       %90 = arith.andi %89, %37 : tensor<1x1x128xi1, #blocked>
2026-02-21T12:55:19.5248303Z       tt.store %81, %72, %90 : tensor<1x1x128x!tt.ptr<bf16>, #blocked>
2026-02-21T12:55:19.5248475Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32}
2026-02-21T12:55:19.5248608Z     tt.return
2026-02-21T12:55:19.5248689Z   }
2026-02-21T12:55:19.5248766Z }
2026-02-21T12:55:19.5248810Z 
2026-02-21T12:55:19.5248840Z {-#
2026-02-21T12:55:19.5248919Z   external_resources: {
2026-02-21T12:55:19.5249020Z     mlir_reproducer: {
2026-02-21T12:55:19.5251187Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=1 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T12:55:19.5253400Z       disable_threading: false,
2026-02-21T12:55:19.5253510Z       verify_each: true
2026-02-21T12:55:19.5253598Z     }
2026-02-21T12:55:19.5253672Z   }
2026-02-21T12:55:19.5253741Z #-}
2026-02-21T12:55:19.5254015Z /tmp/torchinductor_root/3j/c3jqldgdfc2n2y6s5hmlk2le75i7srd7briu3ykyopeiqmr7asfz.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T12:55:19.5254719Z /tmp/torchinductor_root/3j/c3jqldgdfc2n2y6s5hmlk2le75i7srd7briu3ykyopeiqmr7asfz.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T12:55:19.5255268Z [102s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T12:55:19.5256058Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 512], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T12:55:19.5256801Z Error: RuntimeError: PassManager::run failed
2026-02-21T12:55:19.5256966Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T12:55:22.2039421Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 3.2 configs/s
2026-02-21T12:55:22.2050302Z [105s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s])
2026-02-21T12:55:22.2321826Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 - configs/s
2026-02-21T12:55:23.2028753Z [106s] Initial random population of 100, 5 starting points: 
2026-02-21T12:55:23.2029252Z error=12
2026-02-21T12:55:23.2029485Z timeout=25
2026-02-21T12:55:23.2029696Z ok=63
2026-02-21T12:55:23.2029919Z min=2.1912
2026-02-21T12:55:23.2030117Z mid=12.2203
2026-02-21T12:55:23.2030327Z max=1458.4796
2026-02-21T12:55:23.2030564Z best={'block_sizes': [1, 256, 16],
2026-02-21T12:55:23.2031248Z  'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:55:23.2031639Z  'l2_groupings': [2],
2026-02-21T12:55:23.2031921Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:55:23.2032237Z  'loop_orders': [[1, 0]],
2026-02-21T12:55:23.2032511Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:55:23.2032801Z  'num_sm_multiplier': 32,
2026-02-21T12:55:23.2033062Z  'num_stages': 2,
2026-02-21T12:55:23.2033286Z  'num_warps': 16,
2026-02-21T12:55:23.2033543Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:55:23.2033875Z  'range_flattens': [False, None],
2026-02-21T12:55:23.2034178Z  'range_multi_buffers': [False, True],
2026-02-21T12:55:23.2034487Z  'range_num_stages': [2, 4],
2026-02-21T12:55:23.2034905Z  'range_unroll_factors': [2, 4],
2026-02-21T12:55:23.2035208Z  'range_warp_specializes': [],
2026-02-21T12:55:23.2035482Z  'waves_per_eu': 4}
2026-02-21T12:55:23.2045507Z [106s] Fitting surrogate: 100 points, 100 targets
2026-02-21T12:55:24.1407214Z [107s] Generation 1 starting: 91 neighbors, 5 active search path(s)
2026-02-21T12:55:47.3197638Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 3.3 configs/s
2026-02-21T12:55:57.1182517Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 92/92 9.5 configs/s
2026-02-21T12:55:57.4135776Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━ 103/103 223.8 configs/s
2026-02-21T12:56:05.1620702Z [148s] Generation 1 complete: 
2026-02-21T12:56:05.1621096Z error=3
2026-02-21T12:56:05.1621301Z ok=93
2026-02-21T12:56:05.1621508Z min=1.8662
2026-02-21T12:56:05.1621718Z mid=3.5739
2026-02-21T12:56:05.1621918Z max=59.7261
2026-02-21T12:56:05.1622154Z best={'block_sizes': [1, 128, 16],
2026-02-21T12:56:05.1622604Z  'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T12:56:05.1623016Z  'l2_groupings': [2],
2026-02-21T12:56:05.1623318Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:56:05.1623633Z  'loop_orders': [[1, 0]],
2026-02-21T12:56:05.1623888Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:56:05.1624140Z  'num_stages': 2,
2026-02-21T12:56:05.1624369Z  'num_warps': 8,
2026-02-21T12:56:05.1624585Z  'pid_type': 'flat',
2026-02-21T12:56:05.1624831Z  'range_flattens': [None, None],
2026-02-21T12:56:05.1625108Z  'range_multi_buffers': [None, True],
2026-02-21T12:56:05.1625394Z  'range_num_stages': [0, 4],
2026-02-21T12:56:05.1625656Z  'range_unroll_factors': [0, 4],
2026-02-21T12:56:05.1625928Z  'range_warp_specializes': [],
2026-02-21T12:56:05.1626179Z  'waves_per_eu': 4}
2026-02-21T12:56:05.1686989Z [148s] Fitting surrogate: 196 points, 196 targets
2026-02-21T12:56:06.9566851Z [150s] Generation 2 starting: 87 neighbors, 5 active search path(s)
2026-02-21T12:56:46.3356644Z [189s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, None], range_num_stages=[0, 1], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:56:49.2656251Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 0.7 configs/s
2026-02-21T12:56:57.2161914Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 11.4 configs/s
2026-02-21T12:57:01.5053412Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━ 116/116 16.4 configs/s
2026-02-21T12:57:10.1456588Z [213s] Generation 2 complete: 
2026-02-21T12:57:10.1456971Z error=6
2026-02-21T12:57:10.1457286Z timeout=1
2026-02-21T12:57:10.1457489Z ok=86
2026-02-21T12:57:10.1457685Z min=1.6856
2026-02-21T12:57:10.1457978Z mid=2.3992
2026-02-21T12:57:10.1458200Z max=31.4057
2026-02-21T12:57:10.1458463Z best={'block_sizes': [1, 256, 32],
2026-02-21T12:57:10.1458938Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T12:57:10.1459375Z  'l2_groupings': [8],
2026-02-21T12:57:10.1459650Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:57:10.1460317Z  'loop_orders': [[0, 1]],
2026-02-21T12:57:10.1460601Z  'matrix_instr_nonkdim': 16,
2026-02-21T12:57:10.1460873Z  'num_sm_multiplier': 64,
2026-02-21T12:57:10.1461131Z  'num_stages': 2,
2026-02-21T12:57:10.1461357Z  'num_warps': 8,
2026-02-21T12:57:10.1461626Z  'pid_type': 'persistent_blocked',
2026-02-21T12:57:10.1461928Z  'range_flattens': [False, None],
2026-02-21T12:57:10.1462231Z  'range_multi_buffers': [True, False],
2026-02-21T12:57:10.1462534Z  'range_num_stages': [1, 1],
2026-02-21T12:57:10.1462806Z  'range_unroll_factors': [2, 0],
2026-02-21T12:57:10.1463098Z  'range_warp_specializes': [],
2026-02-21T12:57:10.1463366Z  'waves_per_eu': 1}
2026-02-21T12:57:10.1578964Z [213s] Fitting surrogate: 289 points, 289 targets
2026-02-21T12:57:11.1272356Z [214s] Generation 3 starting: 89 neighbors, 5 active search path(s)
2026-02-21T12:57:49.4746137Z [252s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[1, 1], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:57:50.1770128Z [253s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, False], range_num_stages=[1, 1], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:57:50.1796579Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 0.8 configs/s
2026-02-21T12:57:57.7009487Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 11.9 configs/s
2026-02-21T12:58:01.6785791Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 136/136 19.8 configs/s
2026-02-21T12:58:08.9558591Z [272s] Generation 3 complete: 
2026-02-21T12:58:08.9558941Z error=2
2026-02-21T12:58:08.9559141Z timeout=2
2026-02-21T12:58:08.9561710Z ok=90
2026-02-21T12:58:08.9561943Z min=1.4115
2026-02-21T12:58:08.9562447Z mid=2.4051
2026-02-21T12:58:08.9562796Z max=33.5141
2026-02-21T12:58:08.9563050Z best={'block_sizes': [1, 128, 64],
2026-02-21T12:58:08.9563488Z  'indexing': ['pointer', 'block_ptr', 'block_ptr', 'pointer'],
2026-02-21T12:58:08.9575325Z  'l2_groupings': [32],
2026-02-21T12:58:08.9575588Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:58:08.9575838Z  'loop_orders': [[0, 1]],
2026-02-21T12:58:08.9576070Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:58:08.9576564Z  'num_sm_multiplier': 16,
2026-02-21T12:58:08.9576766Z  'num_stages': 1,
2026-02-21T12:58:08.9576943Z  'num_warps': 4,
2026-02-21T12:58:08.9577153Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:58:08.9577421Z  'range_flattens': [None, True],
2026-02-21T12:58:08.9577652Z  'range_multi_buffers': [False, True],
2026-02-21T12:58:08.9577892Z  'range_num_stages': [0, 0],
2026-02-21T12:58:08.9578104Z  'range_unroll_factors': [2, 2],
2026-02-21T12:58:08.9578341Z  'range_warp_specializes': [],
2026-02-21T12:58:08.9578555Z  'waves_per_eu': 2}
2026-02-21T12:58:08.9625382Z [272s] Fitting surrogate: 383 points, 383 targets
2026-02-21T12:58:09.9470180Z [273s] Generation 4 starting: 98 neighbors, 5 active search path(s)
2026-02-21T12:58:42.4272084Z [305s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T12:58:42.4296147Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99/99 1.6 configs/s
2026-02-21T12:58:50.7290760Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 99/99 11.9 configs/s
2026-02-21T12:58:54.7380260Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 146/146 20.6 configs/s
2026-02-21T12:59:01.0662361Z [324s] Generation 4 complete: 
2026-02-21T12:59:01.0662828Z error=9
2026-02-21T12:59:01.0663053Z timeout=1
2026-02-21T12:59:01.0663266Z ok=93
2026-02-21T12:59:01.0663470Z min=1.2779
2026-02-21T12:59:01.0663782Z mid=2.3124
2026-02-21T12:59:01.0664107Z max=35.2354
2026-02-21T12:59:01.0665123Z best={'block_sizes': [1, 128, 128],
2026-02-21T12:59:01.0665593Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T12:59:01.0666034Z  'l2_groupings': [32],
2026-02-21T12:59:01.0666310Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:59:01.0666634Z  'loop_orders': [[0, 1]],
2026-02-21T12:59:01.0666908Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:59:01.0667202Z  'num_sm_multiplier': 16,
2026-02-21T12:59:01.0667460Z  'num_stages': 1,
2026-02-21T12:59:01.0667691Z  'num_warps': 4,
2026-02-21T12:59:01.0667951Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:59:01.0668285Z  'range_flattens': [None, True],
2026-02-21T12:59:01.0668601Z  'range_multi_buffers': [False, True],
2026-02-21T12:59:01.0668908Z  'range_num_stages': [0, 0],
2026-02-21T12:59:01.0669190Z  'range_unroll_factors': [2, 2],
2026-02-21T12:59:01.0669482Z  'range_warp_specializes': [],
2026-02-21T12:59:01.0669767Z  'waves_per_eu': 2}
2026-02-21T12:59:01.0752709Z [324s] Fitting surrogate: 486 points, 486 targets
2026-02-21T12:59:02.7439669Z [325s] Generation 5 starting: 73 neighbors, 4 active search path(s)
2026-02-21T12:59:39.4107155Z [362s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, None], range_num_stages=[1, 1], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T12:59:42.5823430Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.6 configs/s
2026-02-21T12:59:48.1121775Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 13.8 configs/s
2026-02-21T12:59:53.2999316Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━ 156/156 19.4 configs/s
2026-02-21T12:59:59.6076155Z [382s] Generation 5 complete: 
2026-02-21T12:59:59.6079576Z error=3
2026-02-21T12:59:59.6079887Z timeout=1
2026-02-21T12:59:59.6080067Z ok=73
2026-02-21T12:59:59.6080239Z min=1.3197
2026-02-21T12:59:59.6080420Z mid=1.8694
2026-02-21T12:59:59.6080907Z max=14.8615
2026-02-21T12:59:59.6081115Z best={'block_sizes': [1, 128, 128],
2026-02-21T12:59:59.6081494Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T12:59:59.6081864Z  'l2_groupings': [32],
2026-02-21T12:59:59.6082109Z  'load_eviction_policies': ['', '', ''],
2026-02-21T12:59:59.6082386Z  'loop_orders': [[0, 1]],
2026-02-21T12:59:59.6082711Z  'matrix_instr_nonkdim': 0,
2026-02-21T12:59:59.6082954Z  'num_sm_multiplier': 16,
2026-02-21T12:59:59.6083191Z  'num_stages': 1,
2026-02-21T12:59:59.6083389Z  'num_warps': 4,
2026-02-21T12:59:59.6083618Z  'pid_type': 'persistent_interleaved',
2026-02-21T12:59:59.6083898Z  'range_flattens': [None, False],
2026-02-21T12:59:59.6084163Z  'range_multi_buffers': [False, True],
2026-02-21T12:59:59.6084431Z  'range_num_stages': [0, 0],
2026-02-21T12:59:59.6084678Z  'range_unroll_factors': [2, 2],
2026-02-21T12:59:59.6084946Z  'range_warp_specializes': [],
2026-02-21T12:59:59.6085173Z  'waves_per_eu': 2}
2026-02-21T12:59:59.6195400Z [382s] Fitting surrogate: 563 points, 563 targets
2026-02-21T13:00:00.4612495Z [383s] Generation 6 starting: 69 neighbors, 4 active search path(s)
2026-02-21T13:00:38.6275748Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 0.4 configs/s
2026-02-21T13:00:44.7417866Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 11.5 configs/s
2026-02-21T13:00:49.1498456Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━ 156/156 21.5 configs/s
2026-02-21T13:00:54.7334375Z [437s] Generation 6 complete: 
2026-02-21T13:00:54.7334790Z error=1
2026-02-21T13:00:54.7335008Z ok=72
2026-02-21T13:00:54.7335209Z min=1.3089
2026-02-21T13:00:54.7335415Z mid=1.8582
2026-02-21T13:00:54.7335610Z max=17.2142
2026-02-21T13:00:54.7335844Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:00:54.7336262Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:00:54.7337251Z  'l2_groupings': [16],
2026-02-21T13:00:54.7337543Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:00:54.7337881Z  'loop_orders': [[0, 1]],
2026-02-21T13:00:54.7338135Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:00:54.7338378Z  'num_sm_multiplier': 16,
2026-02-21T13:00:54.7338611Z  'num_stages': 1,
2026-02-21T13:00:54.7338830Z  'num_warps': 4,
2026-02-21T13:00:54.7339063Z  'pid_type': 'persistent_interleaved',
2026-02-21T13:00:54.7339350Z  'range_flattens': [None, False],
2026-02-21T13:00:54.7339623Z  'range_multi_buffers': [False, True],
2026-02-21T13:00:54.7339895Z  'range_num_stages': [0, 0],
2026-02-21T13:00:54.7340155Z  'range_unroll_factors': [2, 2],
2026-02-21T13:00:54.7340419Z  'range_warp_specializes': [],
2026-02-21T13:00:54.7340659Z  'waves_per_eu': 2}
2026-02-21T13:00:54.7428033Z [437s] Fitting surrogate: 636 points, 636 targets
2026-02-21T13:00:55.5858425Z [438s] Generation 7 starting: 72 neighbors, 4 active search path(s)
2026-02-21T13:01:33.3355248Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.7 configs/s
2026-02-21T13:01:39.3772117Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 12.3 configs/s
2026-02-21T13:01:41.1886259Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━ 156/156 40.3 configs/s
2026-02-21T13:01:46.4675176Z [489s] Generation 7 complete: 
2026-02-21T13:01:46.4675606Z error=4
2026-02-21T13:01:46.4675813Z ok=72
2026-02-21T13:01:46.4676018Z min=1.3046
2026-02-21T13:01:46.4676231Z mid=2.9547
2026-02-21T13:01:46.4676429Z max=15.7700
2026-02-21T13:01:46.4676663Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:01:46.4677557Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:01:46.4677950Z  'l2_groupings': [32],
2026-02-21T13:01:46.4678232Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:01:46.4678543Z  'loop_orders': [[0, 1]],
2026-02-21T13:01:46.4678814Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:01:46.4679090Z  'num_sm_multiplier': 16,
2026-02-21T13:01:46.4679362Z  'num_stages': 1,
2026-02-21T13:01:46.4679588Z  'num_warps': 4,
2026-02-21T13:01:46.4679849Z  'pid_type': 'persistent_interleaved',
2026-02-21T13:01:46.4680335Z  'range_flattens': [None, False],
2026-02-21T13:01:46.4680638Z  'range_multi_buffers': [False, False],
2026-02-21T13:01:46.4680879Z  'range_num_stages': [0, 0],
2026-02-21T13:01:46.4681198Z  'range_unroll_factors': [2, 2],
2026-02-21T13:01:46.4681453Z  'range_warp_specializes': [],
2026-02-21T13:01:46.4681685Z  'waves_per_eu': 2}
2026-02-21T13:01:46.4741248Z [489s] Fitting surrogate: 712 points, 712 targets
2026-02-21T13:01:47.2916362Z [490s] Generation 8 starting: 71 neighbors, 4 active search path(s)
2026-02-21T13:02:26.9471355Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 0.5 configs/s
2026-02-21T13:02:33.0444255Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 12.1 configs/s
2026-02-21T13:02:36.5531818Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━ 156/156 26.3 configs/s
2026-02-21T13:02:42.4930744Z [545s] Generation 8 complete: 
2026-02-21T13:02:42.4931120Z error=3
2026-02-21T13:02:42.4931322Z ok=72
2026-02-21T13:02:42.4931550Z min=1.3343
2026-02-21T13:02:42.4931750Z mid=1.9467
2026-02-21T13:02:42.4931948Z max=27.7946
2026-02-21T13:02:42.4932178Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:02:42.4932619Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:02:42.4933019Z  'l2_groupings': [32],
2026-02-21T13:02:42.4933297Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:02:42.4933606Z  'loop_orders': [[0, 1]],
2026-02-21T13:02:42.4933877Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:02:42.4934170Z  'num_sm_multiplier': 16,
2026-02-21T13:02:42.4934425Z  'num_stages': 1,
2026-02-21T13:02:42.4934659Z  'num_warps': 4,
2026-02-21T13:02:42.4934918Z  'pid_type': 'persistent_interleaved',
2026-02-21T13:02:42.4935223Z  'range_flattens': [None, True],
2026-02-21T13:02:42.4935427Z  'range_multi_buffers': [False, False],
2026-02-21T13:02:42.4935614Z  'range_num_stages': [0, 0],
2026-02-21T13:02:42.4935732Z  'range_unroll_factors': [2, 2],
2026-02-21T13:02:42.4935851Z  'range_warp_specializes': [],
2026-02-21T13:02:42.4935967Z  'waves_per_eu': 2}
2026-02-21T13:02:42.5014326Z [545s] Fitting surrogate: 787 points, 787 targets
2026-02-21T13:02:43.9428163Z [547s] Generation 9 starting: 50 neighbors, 3 active search path(s)
2026-02-21T13:03:15.5639695Z [578s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, None], range_num_stages=[0, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:03:23.3874750Z [586s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:03:24.5797079Z [587s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[False, False], range_num_stages=[0, 0], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:03:24.5818032Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 0.4 configs/s
2026-02-21T13:03:28.0618631Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 14.8 configs/s
2026-02-21T13:03:30.5355412Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━ 156/156 30.1 configs/s
2026-02-21T13:03:35.6671333Z [598s] Generation 9 complete: 
2026-02-21T13:03:35.6671749Z error=2
2026-02-21T13:03:35.6671955Z timeout=3
2026-02-21T13:03:35.6672153Z ok=48
2026-02-21T13:03:35.6672356Z min=1.3454
2026-02-21T13:03:35.6673078Z mid=1.6914
2026-02-21T13:03:35.6673274Z max=8.7120
2026-02-21T13:03:35.6673498Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:03:35.6673913Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:03:35.6674314Z  'l2_groupings': [32],
2026-02-21T13:03:35.6674605Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:03:35.6674922Z  'loop_orders': [[0, 1]],
2026-02-21T13:03:35.6675203Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:03:35.6675478Z  'num_sm_multiplier': 16,
2026-02-21T13:03:35.6675738Z  'num_stages': 1,
2026-02-21T13:03:35.6675969Z  'num_warps': 4,
2026-02-21T13:03:35.6676227Z  'pid_type': 'persistent_blocked',
2026-02-21T13:03:35.6676690Z  'range_flattens': [None, True],
2026-02-21T13:03:35.6676998Z  'range_multi_buffers': [False, False],
2026-02-21T13:03:35.6677321Z  'range_num_stages': [1, 0],
2026-02-21T13:03:35.6677597Z  'range_unroll_factors': [2, 2],
2026-02-21T13:03:35.6677824Z  'range_warp_specializes': [],
2026-02-21T13:03:35.6678037Z  'waves_per_eu': 2}
2026-02-21T13:03:35.6763072Z [598s] Fitting surrogate: 840 points, 840 targets
2026-02-21T13:03:36.3111629Z [599s] Generation 10 starting: 54 neighbors, 3 active search path(s)
2026-02-21T13:04:17.2560734Z [640s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[1, 0], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:04:17.2577582Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55/55 0.4 configs/s
2026-02-21T13:04:21.9837770Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 55/55 11.7 configs/s
2026-02-21T13:04:22.1546602Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━ 156/156 1240.0 configs/s
2026-02-21T13:04:27.5974250Z [650s] Generation 10 complete: 
2026-02-21T13:04:27.5974675Z error=3
2026-02-21T13:04:27.5974896Z timeout=1
2026-02-21T13:04:27.5975476Z ok=53
2026-02-21T13:04:27.5975671Z min=1.2608
2026-02-21T13:04:27.5975868Z mid=2.3358
2026-02-21T13:04:27.5976064Z max=32.6283
2026-02-21T13:04:27.5976288Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:04:27.5976699Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:04:27.5977089Z  'l2_groupings': [32],
2026-02-21T13:04:27.5977363Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:04:27.5977682Z  'loop_orders': [[0, 1]],
2026-02-21T13:04:27.5977951Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:04:27.5978512Z  'num_sm_multiplier': 16,
2026-02-21T13:04:27.5978767Z  'num_stages': 1,
2026-02-21T13:04:27.5978993Z  'num_warps': 4,
2026-02-21T13:04:27.5979252Z  'pid_type': 'persistent_blocked',
2026-02-21T13:04:27.5979571Z  'range_flattens': [None, True],
2026-02-21T13:04:27.5979873Z  'range_multi_buffers': [False, False],
2026-02-21T13:04:27.5980179Z  'range_num_stages': [1, 0],
2026-02-21T13:04:27.5980442Z  'range_unroll_factors': [2, 2],
2026-02-21T13:04:27.5980658Z  'range_warp_specializes': [],
2026-02-21T13:04:27.5980881Z  'waves_per_eu': 2}
2026-02-21T13:04:27.6041218Z [650s] Fitting surrogate: 897 points, 897 targets
2026-02-21T13:04:28.0740638Z [651s] Generation 11 starting: 39 neighbors, 2 active search path(s)
2026-02-21T13:04:42.6914235Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39/39 1.8 configs/s
2026-02-21T13:04:45.8331031Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 39/39 13.0 configs/s
2026-02-21T13:04:45.9818940Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━ 158/158 1448.4 configs/s
2026-02-21T13:04:50.6673922Z [673s] Generation 11 complete: 
2026-02-21T13:04:50.6674533Z ok=41
2026-02-21T13:04:50.6674618Z min=1.3124
2026-02-21T13:04:50.6674699Z mid=2.5480
2026-02-21T13:04:50.6674772Z max=8.9074
2026-02-21T13:04:50.6674863Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:04:50.6675015Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:04:50.6675175Z  'l2_groupings': [32],
2026-02-21T13:04:50.6675276Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:04:50.6675394Z  'loop_orders': [[0, 1]],
2026-02-21T13:04:50.6675495Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:04:50.6675603Z  'num_sm_multiplier': 16,
2026-02-21T13:04:50.6675701Z  'num_stages': 1,
2026-02-21T13:04:50.6675788Z  'num_warps': 4,
2026-02-21T13:04:50.6675996Z  'pid_type': 'persistent_blocked',
2026-02-21T13:04:50.6746413Z  'range_flattens': [False, True],
2026-02-21T13:04:50.6746563Z  'range_multi_buffers': [False, False],
2026-02-21T13:04:50.6746683Z  'range_num_stages': [1, 0],
2026-02-21T13:04:50.6746791Z  'range_unroll_factors': [2, 2],
2026-02-21T13:04:50.6746904Z  'range_warp_specializes': [],
2026-02-21T13:04:50.6747008Z  'waves_per_eu': 2}
2026-02-21T13:04:50.6747117Z [673s] Fitting surrogate: 938 points, 938 targets
2026-02-21T13:04:51.0824411Z [674s] Generation 12 starting: 34 neighbors, 2 active search path(s)
2026-02-21T13:05:22.0534741Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 0.6 configs/s
2026-02-21T13:05:25.0217479Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 11.9 configs/s
2026-02-21T13:05:25.1744209Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━ 158/158 1404.4 configs/s
2026-02-21T13:05:29.9513773Z [713s] Generation 12 complete: 
2026-02-21T13:05:29.9513991Z error=1
2026-02-21T13:05:29.9514117Z ok=35
2026-02-21T13:05:29.9514202Z min=1.2955
2026-02-21T13:05:29.9514293Z mid=1.7441
2026-02-21T13:05:29.9514378Z max=18.3117
2026-02-21T13:05:29.9514471Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:05:29.9514640Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:05:29.9514800Z  'l2_groupings': [64],
2026-02-21T13:05:29.9514912Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:05:29.9515029Z  'loop_orders': [[0, 1]],
2026-02-21T13:05:29.9515528Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:05:29.9515640Z  'num_sm_multiplier': 16,
2026-02-21T13:05:29.9515746Z  'num_stages': 1,
2026-02-21T13:05:29.9515836Z  'num_warps': 4,
2026-02-21T13:05:29.9515940Z  'pid_type': 'persistent_blocked',
2026-02-21T13:05:29.9516064Z  'range_flattens': [False, True],
2026-02-21T13:05:29.9516180Z  'range_multi_buffers': [False, None],
2026-02-21T13:05:29.9516307Z  'range_num_stages': [1, 0],
2026-02-21T13:05:29.9516416Z  'range_unroll_factors': [2, 2],
2026-02-21T13:05:29.9516645Z  'range_warp_specializes': [],
2026-02-21T13:05:29.9516748Z  'waves_per_eu': 2}
2026-02-21T13:05:29.9608364Z [713s] Fitting surrogate: 974 points, 974 targets
2026-02-21T13:05:30.3820961Z [713s] Generation 13 starting: 37 neighbors, 2 active search path(s)
2026-02-21T13:06:06.0610359Z [749s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[0, 1], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:06:06.0626587Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 0.4 configs/s
2026-02-21T13:06:08.6491747Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 14.5 configs/s
2026-02-21T13:06:10.3469480Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━ 158/158 39.3 configs/s
2026-02-21T13:06:15.2284361Z [758s] Generation 13 complete: 
2026-02-21T13:06:15.2284794Z timeout=1
2026-02-21T13:06:15.2285007Z ok=38
2026-02-21T13:06:15.2285748Z min=1.3319
2026-02-21T13:06:15.2285959Z mid=1.4907
2026-02-21T13:06:15.2286169Z max=6.7336
2026-02-21T13:06:15.2286402Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:06:15.2286824Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:06:15.2287240Z  'l2_groupings': [64],
2026-02-21T13:06:15.2287549Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:06:15.2287870Z  'loop_orders': [[0, 1]],
2026-02-21T13:06:15.2288155Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:06:15.2288420Z  'num_stages': 1,
2026-02-21T13:06:15.2288654Z  'num_warps': 4,
2026-02-21T13:06:15.2288890Z  'pid_type': 'flat',
2026-02-21T13:06:15.2289322Z  'range_flattens': [None, True],
2026-02-21T13:06:15.2289628Z  'range_multi_buffers': [None, None],
2026-02-21T13:06:15.2289935Z  'range_num_stages': [0, 0],
2026-02-21T13:06:15.2290239Z  'range_unroll_factors': [0, 2],
2026-02-21T13:06:15.2290532Z  'range_warp_specializes': [],
2026-02-21T13:06:15.2290817Z  'waves_per_eu': 2}
2026-02-21T13:06:15.2352258Z [758s] Fitting surrogate: 1013 points, 1013 targets
2026-02-21T13:06:15.6901462Z [758s] Generation 14 starting: 33 neighbors, 2 active search path(s)
2026-02-21T13:06:28.2485699Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 1.4 configs/s
2026-02-21T13:06:31.1968084Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 33/33 11.7 configs/s
2026-02-21T13:06:31.3073274Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 158/158 - configs/s
2026-02-21T13:06:34.5726699Z [777s] Generation 14 complete: 
2026-02-21T13:06:34.5726909Z ok=35
2026-02-21T13:06:34.5727001Z min=1.3077
2026-02-21T13:06:34.5727085Z mid=3.4467
2026-02-21T13:06:34.5727207Z max=15.4661
2026-02-21T13:06:34.5727301Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:06:34.5727461Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:06:34.5727630Z  'l2_groupings': [32],
2026-02-21T13:06:34.5727742Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:06:34.5727866Z  'loop_orders': [[0, 1]],
2026-02-21T13:06:34.5727981Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:06:34.5728088Z  'num_stages': 1,
2026-02-21T13:06:34.5728178Z  'num_warps': 4,
2026-02-21T13:06:34.5728272Z  'pid_type': 'flat',
2026-02-21T13:06:34.5728375Z  'range_flattens': [None, True],
2026-02-21T13:06:34.5728873Z  'range_multi_buffers': [None, None],
2026-02-21T13:06:34.5728992Z  'range_num_stages': [0, 0],
2026-02-21T13:06:34.5729103Z  'range_unroll_factors': [0, 2],
2026-02-21T13:06:34.5729213Z  'range_warp_specializes': [],
2026-02-21T13:06:34.5729321Z  'waves_per_eu': 2}
2026-02-21T13:06:34.5804107Z [777s] Fitting surrogate: 1048 points, 1048 targets
2026-02-21T13:06:34.9855113Z [778s] Generation 15 starting: 30 neighbors, 2 active search path(s)
2026-02-21T13:07:06.0046037Z [809s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=4, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:07:06.9480921Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0.4 configs/s
2026-02-21T13:07:09.6355275Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 11.7 configs/s
2026-02-21T13:07:09.7011625Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 158/158 - configs/s
2026-02-21T13:07:11.5028984Z [814s] Generation 15 complete: 
2026-02-21T13:07:11.5029418Z error=1
2026-02-21T13:07:11.5029634Z timeout=1
2026-02-21T13:07:11.5029839Z ok=30
2026-02-21T13:07:11.5030079Z min=1.2747
2026-02-21T13:07:11.5030292Z mid=4.0630
2026-02-21T13:07:11.5030518Z max=17.3615
2026-02-21T13:07:11.5030782Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:07:11.5031339Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:07:11.5031772Z  'l2_groupings': [32],
2026-02-21T13:07:11.5032450Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:07:11.5035308Z  'loop_orders': [[0, 1]],
2026-02-21T13:07:11.5035705Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:07:11.5035994Z  'num_stages': 1,
2026-02-21T13:07:11.5036240Z  'num_warps': 4,
2026-02-21T13:07:11.5036519Z  'pid_type': 'flat',
2026-02-21T13:07:11.5036787Z  'range_flattens': [None, True],
2026-02-21T13:07:11.5037104Z  'range_multi_buffers': [None, None],
2026-02-21T13:07:11.5037438Z  'range_num_stages': [0, 0],
2026-02-21T13:07:11.5037722Z  'range_unroll_factors': [0, 1],
2026-02-21T13:07:11.5038039Z  'range_warp_specializes': [],
2026-02-21T13:07:11.5038331Z  'waves_per_eu': 2}
2026-02-21T13:07:11.5079093Z [814s] Fitting surrogate: 1080 points, 1080 targets
2026-02-21T13:07:12.1108981Z [815s] Generation 16 starting: 30 neighbors, 2 active search path(s)
2026-02-21T13:07:22.3720672Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 31/31 0.9 configs/s
2026-02-21T13:07:24.9764846Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 31/31 12.5 configs/s
2026-02-21T13:07:25.0700673Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 158/158 - configs/s
2026-02-21T13:07:27.7744068Z [831s] Generation 16 complete: 
2026-02-21T13:07:27.7744480Z error=1
2026-02-21T13:07:27.7744731Z ok=31
2026-02-21T13:07:27.7744930Z min=1.2896
2026-02-21T13:07:27.7745143Z mid=2.8624
2026-02-21T13:07:27.7745345Z max=19.3822
2026-02-21T13:07:27.7745572Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:07:27.7745984Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:07:27.7746377Z  'l2_groupings': [32],
2026-02-21T13:07:27.7746672Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:07:27.7746988Z  'loop_orders': [[0, 1]],
2026-02-21T13:07:27.7747259Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:07:27.7747960Z  'num_stages': 1,
2026-02-21T13:07:27.7748116Z  'num_warps': 4,
2026-02-21T13:07:27.7748271Z  'pid_type': 'flat',
2026-02-21T13:07:27.7748451Z  'range_flattens': [None, True],
2026-02-21T13:07:27.7748667Z  'range_multi_buffers': [None, None],
2026-02-21T13:07:27.7748873Z  'range_num_stages': [0, 0],
2026-02-21T13:07:27.7749058Z  'range_unroll_factors': [0, 1],
2026-02-21T13:07:27.7749255Z  'range_warp_specializes': [],
2026-02-21T13:07:27.7749444Z  'waves_per_eu': 2}
2026-02-21T13:07:27.7805696Z [831s] Fitting surrogate: 1112 points, 1112 targets
2026-02-21T13:07:28.8536313Z [832s] Generation 17 starting: 16 neighbors, 1 active search path(s)
2026-02-21T13:07:40.6941565Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.7 configs/s
2026-02-21T13:07:42.8255716Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 8.3 configs/s
2026-02-21T13:07:42.8667499Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 158/158 - configs/s
2026-02-21T13:07:44.0774465Z [847s] Generation 17 complete: 
2026-02-21T13:07:44.0774841Z ok=18
2026-02-21T13:07:44.0775045Z min=1.3154
2026-02-21T13:07:44.0775260Z mid=4.0403
2026-02-21T13:07:44.0775460Z max=33.1777
2026-02-21T13:07:44.0776024Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:07:44.0776451Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:07:44.0776849Z  'l2_groupings': [8],
2026-02-21T13:07:44.0777119Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:07:44.0777455Z  'loop_orders': [[1, 0]],
2026-02-21T13:07:44.0777726Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:07:44.0777993Z  'num_stages': 4,
2026-02-21T13:07:44.0778220Z  'num_warps': 4,
2026-02-21T13:07:44.0778448Z  'pid_type': 'flat',
2026-02-21T13:07:44.0778709Z  'range_flattens': [None, False],
2026-02-21T13:07:44.0779012Z  'range_multi_buffers': [None, None],
2026-02-21T13:07:44.0779474Z  'range_num_stages': [0, 1],
2026-02-21T13:07:44.0779754Z  'range_unroll_factors': [0, 1],
2026-02-21T13:07:44.0780050Z  'range_warp_specializes': [],
2026-02-21T13:07:44.0780326Z  'waves_per_eu': 2}
2026-02-21T13:07:44.0815638Z [847s] Fitting surrogate: 1130 points, 1130 targets
2026-02-21T13:07:44.3419716Z [847s] Generation 18 starting: 17 neighbors, 1 active search path(s)
2026-02-21T13:07:53.0910465Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.8 configs/s
2026-02-21T13:07:54.9500898Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 9.7 configs/s
2026-02-21T13:07:54.9916257Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━━━━ 158/158 - configs/s
2026-02-21T13:07:56.1101829Z [859s] Generation 18 complete: 
2026-02-21T13:07:56.1102326Z error=1
2026-02-21T13:07:56.1102653Z ok=17
2026-02-21T13:07:56.1102974Z min=1.3981
2026-02-21T13:07:56.1103283Z mid=2.9325
2026-02-21T13:07:56.1103538Z max=28.4636
2026-02-21T13:07:56.1103775Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:07:56.1104218Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'pointer'],
2026-02-21T13:07:56.1104627Z  'l2_groupings': [8],
2026-02-21T13:07:56.1104921Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:07:56.1105239Z  'loop_orders': [[1, 0]],
2026-02-21T13:07:56.1105516Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:07:56.1105792Z  'num_stages': 4,
2026-02-21T13:07:56.1106024Z  'num_warps': 4,
2026-02-21T13:07:56.1106254Z  'pid_type': 'flat',
2026-02-21T13:07:56.1106516Z  'range_flattens': [None, False],
2026-02-21T13:07:56.1106820Z  'range_multi_buffers': [None, None],
2026-02-21T13:07:56.1107408Z  'range_num_stages': [0, 1],
2026-02-21T13:07:56.1107689Z  'range_unroll_factors': [0, 1],
2026-02-21T13:07:56.1107987Z  'range_warp_specializes': [],
2026-02-21T13:07:56.1108262Z  'waves_per_eu': 2}
2026-02-21T13:07:56.1142210Z [859s] Fitting surrogate: 1148 points, 1148 targets
2026-02-21T13:07:56.2419643Z [859s] Autotuning complete in 859.5s after searching 1065 configs.
2026-02-21T13:07:56.2420051Z One can hardcode the best config and skip autotuning with:
2026-02-21T13:07:56.2421058Z     @helion.kernel(config=helion.Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T13:07:56.2422189Z 
2026-02-21T13:07:56.2422435Z [859s] Code of selected kernel: /tmp/torchinductor_root/et/cettuquifdqfk5ruaus5w25bzjtsigxxq5ms2ayqaiw4dzpt7j5j.py
2026-02-21T13:07:56.2646994Z from __future__ import annotations
2026-02-21T13:07:56.2647221Z 
2026-02-21T13:07:56.2647295Z import torch
2026-02-21T13:07:56.2647477Z import triton
2026-02-21T13:07:56.2647663Z import triton.language as tl
2026-02-21T13:07:56.2647854Z from torch._inductor.runtime import triton_helpers
2026-02-21T13:07:56.2648099Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T13:07:56.2648367Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T13:07:56.2648529Z 
2026-02-21T13:07:56.2648601Z _BLOCK_SIZE_1 = tl.constexpr(128)
2026-02-21T13:07:56.2648760Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T13:07:56.2662149Z _SHAPE_DIM = tl.constexpr(128)
2026-02-21T13:07:56.2662302Z _BLOCK_SIZE_3 = tl.constexpr(128)
2026-02-21T13:07:56.2662445Z _SHAPE_DIM_4 = tl.constexpr(128)
2026-02-21T13:07:56.2662533Z 
2026-02-21T13:07:56.2662578Z @triton.jit
2026-02-21T13:07:56.2662762Z def _helion_attention(q_view, k_view, v_view, out, _RDIM_SIZE_2: tl.constexpr):
2026-02-21T13:07:56.2663048Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T13:07:56.2663267Z     num_pid_m = tl.cdiv(2048, _BLOCK_SIZE_1)
2026-02-21T13:07:56.2663412Z     num_pid_n = 192
2026-02-21T13:07:56.2663530Z     inner_2d_pid = tl.program_id(0)
2026-02-21T13:07:56.2663797Z     num_pid_in_group = 8 * num_pid_n
2026-02-21T13:07:56.2663957Z     group_id = inner_2d_pid // num_pid_in_group
2026-02-21T13:07:56.2664126Z     first_pid_m = group_id * 8
2026-02-21T13:07:56.2664283Z     group_size_m = min(num_pid_m - first_pid_m, 8)
2026-02-21T13:07:56.2664496Z     pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m
2026-02-21T13:07:56.2664725Z     pid_1 = inner_2d_pid % num_pid_in_group // group_size_m
2026-02-21T13:07:56.2664896Z     offset_1 = pid_0 * _BLOCK_SIZE_1
2026-02-21T13:07:56.2665080Z     indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32)
2026-02-21T13:07:56.2665261Z     offset_0 = pid_1
2026-02-21T13:07:56.2665396Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T13:07:56.2665572Z     indices_4 = tl.arange(0, _RDIM_SIZE_2).to(tl.int32)
2026-02-21T13:07:56.2665822Z     # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T13:07:56.2666102Z     m_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], float('-inf'), tl.float32)
2026-02-21T13:07:56.2666290Z     # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0)
2026-02-21T13:07:56.2666460Z     l_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 1.0, tl.float32)
2026-02-21T13:07:56.2666662Z     # src[attention.py:70]: acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32)
2026-02-21T13:07:56.2666889Z     acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128], 0.0, tl.float32)
2026-02-21T13:07:56.2667058Z     # src[attention.py:71]: q = q_view[tile_b, tile_m, :]
2026-02-21T13:07:56.2667393Z     q = tl.load(tl.make_block_ptr(q_view, [192, 2048, 128], [262144, 128, 1], [offset_0, offset_1, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _SHAPE_DIM], [2, 1, 0]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T13:07:56.2667771Z     # src[attention.py:72]: for tile_n in hl.tile(v_view.size(1)):
2026-02-21T13:07:56.2667939Z     # src[attention.py:73]:     k = k_view[tile_b, :, tile_n]
2026-02-21T13:07:56.2668093Z     # src[attention.py:74]:     qk = torch.bmm(q, k)
2026-02-21T13:07:56.2668224Z     # src[attention.py:72-85]: ...
2026-02-21T13:07:56.2668415Z     for offset_2 in tl.range(0, 2048, _BLOCK_SIZE_3, loop_unroll_factor=1, num_stages=1, flatten=False):
2026-02-21T13:07:56.2668677Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_3).to(tl.int32)
2026-02-21T13:07:56.2668822Z         q_copy = q
2026-02-21T13:07:56.2668910Z         m_i_copy = m_i
2026-02-21T13:07:56.2669003Z         l_i_copy = l_i
2026-02-21T13:07:56.2669095Z         acc_copy = acc
2026-02-21T13:07:56.2669187Z         q_copy_0 = q_copy
2026-02-21T13:07:56.2669291Z         m_i_copy_0 = m_i_copy
2026-02-21T13:07:56.2669390Z         l_i_copy_0 = l_i_copy
2026-02-21T13:07:56.2669492Z         acc_copy_0 = acc_copy
2026-02-21T13:07:56.2669614Z         # src[attention.py:73]: k = k_view[tile_b, :, tile_n]
2026-02-21T13:07:56.2669861Z         k = tl.load(k_view + (indices_0[:, None, None] * 262144 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None)
2026-02-21T13:07:56.2670100Z         # src[attention.py:74]: qk = torch.bmm(q, k)
2026-02-21T13:07:56.2670545Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T13:07:56.2671007Z         # src[attention.py:75]: m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale)
2026-02-21T13:07:56.2671192Z         amax = tl.cast(tl.max(qk, 2), tl.bfloat16)
2026-02-21T13:07:56.2671315Z         v_0 = 0.12751743074602467
2026-02-21T13:07:56.2671434Z         v_1 = tl.cast(amax * v_0, tl.bfloat16)
2026-02-21T13:07:56.2671556Z         v_2 = tl.cast(v_1, tl.float32)
2026-02-21T13:07:56.2671688Z         v_3 = triton_helpers.maximum(m_i_copy_0, v_2)
2026-02-21T13:07:56.2671846Z         # src[attention.py:76]: qk = qk * qk_scale - m_ij[:, :, None]
2026-02-21T13:07:56.2671991Z         v_4 = 0.12751743074602467
2026-02-21T13:07:56.2672121Z         v_5 = tl.cast(qk * v_4, tl.bfloat16)
2026-02-21T13:07:56.2672241Z         subscript = v_3[:, :, None]
2026-02-21T13:07:56.2672356Z         v_6 = tl.cast(v_5, tl.float32)
2026-02-21T13:07:56.2672465Z         v_7 = v_6 - subscript
2026-02-21T13:07:56.2672582Z         # src[attention.py:77]: p = torch.exp2(qk)
2026-02-21T13:07:56.2672707Z         v_8 = libdevice.exp2(v_7)
2026-02-21T13:07:56.2672836Z         # src[attention.py:78]: l_ij = torch.sum(p, -1)
2026-02-21T13:07:56.2672971Z         l_ij = tl.cast(tl.sum(v_8, 2), tl.float32)
2026-02-21T13:07:56.2673117Z         # src[attention.py:79]: alpha = torch.exp2(m_i - m_ij)
2026-02-21T13:07:56.2673257Z         v_9 = m_i_copy_0 - v_3
2026-02-21T13:07:56.2673363Z         v_10 = libdevice.exp2(v_9)
2026-02-21T13:07:56.2673489Z         # src[attention.py:80]: l_i = l_i * alpha + l_ij
2026-02-21T13:07:56.2673616Z         v_11 = l_i_copy_0 * v_10
2026-02-21T13:07:56.2673735Z         l_i = v_11 + l_ij
2026-02-21T13:07:56.2673854Z         # src[attention.py:81]: acc = acc * alpha[:, :, None]
2026-02-21T13:07:56.2673991Z         subscript_1 = v_10[:, :, None]
2026-02-21T13:07:56.2674109Z         v_13 = acc_copy_0 * subscript_1
2026-02-21T13:07:56.2674242Z         # src[attention.py:82]: v = v_view[tile_b, tile_n, :]
2026-02-21T13:07:56.2674576Z         v = tl.load(tl.make_block_ptr(v_view, [192, 2048, 128], [262144, 128, 1], [offset_0, offset_2, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_3, _SHAPE_DIM_4], [2, 1, 0]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T13:07:56.2674898Z         # src[attention.py:83]: p = p.to(v.dtype)
2026-02-21T13:07:56.2675041Z         v_14 = tl.cast(v_8, tl.bfloat16)
2026-02-21T13:07:56.2675182Z         # src[attention.py:84]: acc = torch.baddbmm(acc, p, v)
2026-02-21T13:07:56.2675632Z         acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128])
2026-02-21T13:07:56.2676061Z         # src[attention.py:85]: m_i = m_ij
2026-02-21T13:07:56.2676187Z         m_i = v_3
2026-02-21T13:07:56.2676295Z     # src[attention.py:87]: acc = acc / l_i[:, :, None]
2026-02-21T13:07:56.2676421Z     subscript_2 = l_i[:, :, None]
2026-02-21T13:07:56.2676531Z     v_15 = acc / subscript_2
2026-02-21T13:07:56.2676672Z     # src[attention.py:88]: out[tile_b, tile_m, :] = acc.to(out.dtype)
2026-02-21T13:07:56.2676820Z     v_16 = tl.cast(v_15, tl.bfloat16)
2026-02-21T13:07:56.2677069Z     tl.store(out + (indices_0[:, None, None] * 262144 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), v_16, None)
2026-02-21T13:07:56.2677251Z 
2026-02-21T13:07:56.2677383Z def attention(q_in: torch.Tensor, k_in: torch.Tensor, v_in: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T13:07:56.2677582Z     """
2026-02-21T13:07:56.2677675Z     Computes scaled dot-product attention.
2026-02-21T13:07:56.2677758Z 
2026-02-21T13:07:56.2677869Z     Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
2026-02-21T13:07:56.2678022Z 
2026-02-21T13:07:56.2678058Z     Args:
2026-02-21T13:07:56.2678160Z         q_in: Query tensor of shape [..., seq_len_q, head_dim]
2026-02-21T13:07:56.2678332Z         k_in: Key tensor of shape [..., seq_len_k, head_dim]
2026-02-21T13:07:56.2678482Z         v_in: Value tensor of shape [..., seq_len_k, head_dim]
2026-02-21T13:07:56.2678582Z 
2026-02-21T13:07:56.2678615Z     Returns:
2026-02-21T13:07:56.2686776Z         Output tensor of shape [..., seq_len_q, head_dim]
2026-02-21T13:07:56.2686915Z     """
2026-02-21T13:07:56.2687013Z     # src[attention.py:56]: m_dim = q_in.size(-2)
2026-02-21T13:07:56.2687141Z     m_dim = q_in.size(-2)
2026-02-21T13:07:56.2687252Z     # src[attention.py:57]: n_dim = k_in.size(-2)
2026-02-21T13:07:56.2687370Z     n_dim = k_in.size(-2)
2026-02-21T13:07:56.2687489Z     # src[attention.py:58]: assert n_dim == v_in.size(-2)
2026-02-21T13:07:56.2687685Z     assert n_dim == v_in.size(-2)
2026-02-21T13:07:56.2687831Z     # src[attention.py:59]: head_dim = hl.specialize(q_in.size(-1))
2026-02-21T13:07:56.2687978Z     head_dim = 128
2026-02-21T13:07:56.2688113Z     # src[attention.py:60]: assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T13:07:56.2688293Z     assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T13:07:56.2688458Z     # src[attention.py:61]: q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T13:07:56.2688619Z     q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T13:07:56.2688772Z     # src[attention.py:62]: v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T13:07:56.2688932Z     v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T13:07:56.2689107Z     # src[attention.py:63]: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T13:07:56.2689306Z     k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T13:07:56.2689474Z     # src[attention.py:64]: out = torch.empty_like(q_view)
2026-02-21T13:07:56.2689609Z     out = torch.empty_like(q_view)
2026-02-21T13:07:56.2689771Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T13:07:56.2689931Z     _BLOCK_SIZE_1 = 128
2026-02-21T13:07:56.2690026Z     _RDIM_SIZE_2 = 128
2026-02-21T13:07:56.2690166Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T13:07:56.2690394Z     # src[attention.py:68]:     m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T13:07:56.2690605Z     # src[attention.py:69]:     l_i = torch.full_like(m_i, 1.0)
2026-02-21T13:07:56.2690772Z     # src[attention.py:67-88]: ...
2026-02-21T13:07:56.2691078Z     _launcher(_helion_attention, (triton.cdiv(2048, _BLOCK_SIZE_1) * 192,), q_view, k_view, v_view, out, _RDIM_SIZE_2, num_warps=4, num_stages=4, waves_per_eu=2, matrix_instr_nonkdim=0)
2026-02-21T13:07:56.2691402Z     # src[attention.py:89]: return out.view(q_in.size())
2026-02-21T13:07:56.2691540Z     return out.view(q_in.size())
2026-02-21T13:07:57.1850376Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T13:07:57.1852052Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=4, num_warps=4, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T13:07:57.1853201Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T13:07:57.1853442Z WARNING:tritonbench.utils.triton_op:Completed input ID 4:
2026-02-21T13:07:57.1853692Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T13:07:57.1853870Z ------------------------------------------
2026-02-21T13:07:57.1854048Z (4, 48, 2048, 2048, 128)
2026-02-21T13:07:57.1854136Z 
2026-02-21T13:07:57.1854693Z  67%|██████▋   | 4/6 [41:03<23:45, 712.60s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5:
2026-02-21T13:07:57.1854969Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T13:07:57.1855254Z ------------------------------------------
2026-02-21T13:07:57.1855410Z (4, 48, 4096, 4096, 128)
2026-02-21T13:07:57.1855633Z INFO:tritonbench.utils.triton_op:Took 0.09ms to get benchmark function for aten
2026-02-21T13:07:58.2106229Z INFO:tritonbench.utils.triton_op:Took 1.42ms to get benchmark function for flex_attention
2026-02-21T13:07:59.5452789Z WARNING:__main__:Input tensor metadata:
2026-02-21T13:07:59.5453153Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T13:07:59.5453456Z               'dtype': 'torch.bfloat16',
2026-02-21T13:07:59.5453747Z               'shape': (4, 48, 4096, 128),
2026-02-21T13:07:59.5454054Z               'stride': (25165824, 524288, 128, 1)},
2026-02-21T13:07:59.5454701Z             { 'device': 'cuda:0',
2026-02-21T13:07:59.5454968Z               'dtype': 'torch.bfloat16',
2026-02-21T13:07:59.5455268Z               'shape': (4, 48, 4096, 128),
2026-02-21T13:07:59.5455552Z               'stride': (25165824, 524288, 128, 1)},
2026-02-21T13:07:59.5455843Z             { 'device': 'cuda:0',
2026-02-21T13:07:59.5456107Z               'dtype': 'torch.bfloat16',
2026-02-21T13:07:59.5456387Z               'shape': (4, 48, 4096, 128),
2026-02-21T13:07:59.5456671Z               'stride': (25165824, 524288, 128, 1)}),
2026-02-21T13:07:59.5456960Z   'kwargs': {}}
2026-02-21T13:07:59.5496140Z INFO:tritonbench.utils.triton_op:Took 4.67ms to get benchmark function for helion_attention
2026-02-21T13:07:59.7937663Z [0s] Autotune random seed: 2150287535
2026-02-21T13:07:59.8499544Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T13:08:34.0939106Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 64, 2048], indexing=['pointer', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[1, 3], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:08:34.4189717Z [34s] Timeout after 30s compiling Config(block_sizes=[1, 2, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 2], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:08:36.9903501Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 2, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:08:38.0265459Z [38s] Timeout after 30s compiling Config(block_sizes=[1, 32, 2048], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[3, 3], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:08:38.9411927Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 32], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[1, 4], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:08:39.7437267Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:08:41.2780056Z [41s] Timeout after 30s compiling Config(block_sizes=[1, 128, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 2], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:08:42.6114285Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 64, 1024], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:08:43.6194781Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 1, 4096], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:08:43.9810158Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 512], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:08:44.4700243Z [44s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:08:45.0110802Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[1, 0], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:08:45.3342183Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 4, 2048], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:08:45.5504829Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:08:46.2413208Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 16, 4096], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:08:46.4628722Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[4, 4], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:08:46.9268690Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 512, 1024], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:08:49.0999246Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 8, 4096], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:08:49.7181740Z [49s] Timeout after 30s compiling Config(block_sizes=[1, 2, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=4, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 4], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:08:50.0485245Z [50s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 16], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=128, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, True], range_num_stages=[1, 4], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:08:51.1478171Z [51s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['pointer', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:08:51.7537548Z [51s] Timeout after 30s compiling Config(block_sizes=[1, 4, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=2, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[0, 1], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:08:52.0891504Z [52s] Timeout after 30s compiling Config(block_sizes=[1, 1, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[3, 0], range_unroll_factors=[4, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:08:52.0912153Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.2 configs/s
2026-02-21T13:11:50.4219603Z /tmp/torchinductor_root/ih/ciheokr4uxsbdya5nc2jy32y2mmxkj4j7prhpucwzng6ytixyhh4.py:63:24: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T13:11:50.4221107Z             k = tl.load(tl.make_block_ptr(k_view, [192, 128, 4096], [524288, 1, 128], [offset_0, 0, offset_2], [_BLOCK_SIZE_0, _SHAPE_DIM_3, _BLOCK_SIZE_3], [2, 0, 1]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T13:11:50.4221969Z                        ^
2026-02-21T13:11:50.4223653Z /tmp/torchinductor_root/ih/ciheokr4uxsbdya5nc2jy32y2mmxkj4j7prhpucwzng6ytixyhh4.py:65:145: note: - use: %137 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x128x512xbf16, #ttg.blocked<{sizePerThread = [1, 8, 1], threadsPerWarp = [1, 16, 4], warpsPerCTA = [1, 1, 8], order = [1, 0, 2]}>>) -> tensor<128x512xbf16, #ttg.blocked<{sizePerThread = [8, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 8], order = [0, 1]}>>
2026-02-21T13:11:50.4225280Z 
2026-02-21T13:11:50.4226213Z             qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T13:11:50.4227368Z                                                                                                                                                 ^
2026-02-21T13:11:50.4227730Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T13:11:50.4228196Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [2, 1, 0]}>
2026-02-21T13:11:50.4228889Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [2, 1, 0]}>
2026-02-21T13:11:50.4229467Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [2, 1, 0]}>
2026-02-21T13:11:50.4230049Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [2, 1, 0]}>
2026-02-21T13:11:50.4230726Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [8, 1], order = [1, 0]}>
2026-02-21T13:11:50.4231278Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [1, 0]}>
2026-02-21T13:11:50.4231815Z #blocked6 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [8], order = [0]}>
2026-02-21T13:11:50.4232350Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 8], order = [0, 1]}>
2026-02-21T13:11:50.4232878Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [4, 2], order = [1, 0]}>
2026-02-21T13:11:50.4233431Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [4, 1, 2], order = [0, 1, 2]}>
2026-02-21T13:11:50.4234021Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [4, 2, 1], order = [0, 1, 2]}>
2026-02-21T13:11:50.4234601Z #blocked11 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 4, 2], order = [2, 1, 0]}>
2026-02-21T13:11:50.4235259Z #blocked12 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 8], order = [0, 1, 2]}>
2026-02-21T13:11:50.4235829Z #blocked13 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [8, 1, 1], order = [0, 1, 2]}>
2026-02-21T13:11:50.4236398Z #blocked14 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [8, 1, 1], order = [2, 1, 0]}>
2026-02-21T13:11:50.4236968Z #blocked15 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 8, 1], order = [0, 1, 2]}>
2026-02-21T13:11:50.4237601Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 8 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T13:11:50.4238298Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T13:11:50.4238844Z     %c4096_i64 = arith.constant 4096 : i64
2026-02-21T13:11:50.4239008Z     %c128_i64 = arith.constant 128 : i64
2026-02-21T13:11:50.4239167Z     %c192_i64 = arith.constant 192 : i64
2026-02-21T13:11:50.4239321Z     %c0_i64 = arith.constant 0 : i64
2026-02-21T13:11:50.4239485Z     %c524288_i64 = arith.constant 524288 : i64
2026-02-21T13:11:50.4239647Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T13:11:50.4239805Z     %c524288_i32 = arith.constant 524288 : i32
2026-02-21T13:11:50.4240010Z     %cst = arith.constant dense<128> : tensor<1x1x128xi64, #blocked>
2026-02-21T13:11:50.4240248Z     %cst_0 = arith.constant dense<0> : tensor<1x1x128xi64, #blocked>
2026-02-21T13:11:50.4240517Z     %cst_1 = arith.constant dense<0.000000e+00> : tensor<1x128x512xbf16, #blocked1>
2026-02-21T13:11:50.4240785Z     %cst_2 = arith.constant dense<4096> : tensor<1x1x512xi64, #blocked1>
2026-02-21T13:11:50.4241028Z     %cst_3 = arith.constant dense<0> : tensor<1x1x512xi64, #blocked1>
2026-02-21T13:11:50.4241268Z     %cst_4 = arith.constant dense<128> : tensor<1x128x1xi64, #blocked2>
2026-02-21T13:11:50.4241508Z     %cst_5 = arith.constant dense<0> : tensor<1x128x1xi64, #blocked2>
2026-02-21T13:11:50.4241749Z     %cst_6 = arith.constant dense<128> : tensor<1x1x512xi64, #blocked1>
2026-02-21T13:11:50.4241972Z     %c512_i32 = arith.constant 512 : i32
2026-02-21T13:11:50.4242130Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T13:11:50.4242290Z     %c262144_i32 = arith.constant 262144 : i32
2026-02-21T13:11:50.4242453Z     %c2432_i32 = arith.constant 2432 : i32
2026-02-21T13:11:50.4242692Z     %c786432_i32 = arith.constant 786432 : i32
2026-02-21T13:11:50.4242897Z     %cst_7 = arith.constant dense<128> : tensor<1x512x1xi32, #blocked3>
2026-02-21T13:11:50.4243147Z     %cst_8 = arith.constant dense<0.127517432> : tensor<1x1x512xf32, #blocked1>
2026-02-21T13:11:50.4243435Z     %cst_9 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4243699Z     %cst_10 = arith.constant dense<0.000000e+00> : tensor<1x512xf32, #blocked5>
2026-02-21T13:11:50.4243914Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T13:11:50.4244118Z     %cst_11 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked>
2026-02-21T13:11:50.4244379Z     %cst_12 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4244639Z     %cst_13 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4244850Z     %c64_i32 = arith.constant 64 : i32
2026-02-21T13:11:50.4245003Z     %c192_i32 = arith.constant 192 : i32
2026-02-21T13:11:50.4245159Z     %0 = tt.get_program_id x : i32
2026-02-21T13:11:50.4245369Z     %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked6>
2026-02-21T13:11:50.4245732Z     %2 = ttg.convert_layout %1 : tensor<128xi32, #blocked6> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T13:11:50.4246197Z     %3 = tt.expand_dims %2 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi32, #blocked7>
2026-02-21T13:11:50.4246585Z     %4 = ttg.convert_layout %3 : tensor<1x128xi32, #blocked7> -> tensor<1x128xi32, #blocked8>
2026-02-21T13:11:50.4246964Z     %5 = ttg.convert_layout %4 : tensor<1x128xi32, #blocked8> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T13:11:50.4247411Z     %6 = tt.expand_dims %5 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi32, #blocked9>
2026-02-21T13:11:50.4247802Z     %7 = ttg.convert_layout %6 : tensor<1x1x128xi32, #blocked9> -> tensor<1x1x128xi32, #blocked>
2026-02-21T13:11:50.4248066Z     %8 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked>
2026-02-21T13:11:50.4248376Z     %9 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked6>
2026-02-21T13:11:50.4248600Z     %10 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x512x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:11:50.4248824Z     %11 = arith.extsi %1 : tensor<128xi32, #blocked6> to tensor<128xi64, #blocked6>
2026-02-21T13:11:50.4249107Z     %12 = ttg.convert_layout %11 : tensor<128xi64, #blocked6> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T13:11:50.4249456Z     %13 = tt.expand_dims %12 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi64, #blocked7>
2026-02-21T13:11:50.4249763Z     %14 = ttg.convert_layout %13 : tensor<1x128xi64, #blocked7> -> tensor<1x128xi64, #blocked8>
2026-02-21T13:11:50.4250072Z     %15 = ttg.convert_layout %14 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked10}>>
2026-02-21T13:11:50.4250438Z     %16 = tt.expand_dims %15 {axis = 2 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 2, parent = #blocked10}>> -> tensor<1x128x1xi64, #blocked10>
2026-02-21T13:11:50.4250761Z     %17 = ttg.convert_layout %16 : tensor<1x128x1xi64, #blocked10> -> tensor<1x128x1xi64, #blocked2>
2026-02-21T13:11:50.4251032Z     %18 = tt.broadcast %17 : tensor<1x128x1xi64, #blocked2> -> tensor<1x128x512xi64, #blocked2>
2026-02-21T13:11:50.4251301Z     %19 = ttg.convert_layout %18 : tensor<1x128x512xi64, #blocked2> -> tensor<1x128x512xi64, #blocked1>
2026-02-21T13:11:50.4251554Z     %20 = arith.extsi %9 : tensor<512xi32, #blocked6> to tensor<512xi64, #blocked6>
2026-02-21T13:11:50.4251778Z     %21 = arith.cmpi sge, %17, %cst_5 : tensor<1x128x1xi64, #blocked2>
2026-02-21T13:11:50.4251963Z     %22 = arith.cmpi slt, %17, %cst_4 : tensor<1x128x1xi64, #blocked2>
2026-02-21T13:11:50.4252139Z     %23 = arith.andi %21, %22 : tensor<1x128x1xi1, #blocked2>
2026-02-21T13:11:50.4252345Z     %24 = tt.broadcast %7 : tensor<1x1x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked>
2026-02-21T13:11:50.4252610Z     %25 = ttg.convert_layout %24 : tensor<1x512x128xi32, #blocked> -> tensor<1x512x128xi32, #blocked11>
2026-02-21T13:11:50.4252885Z     %26 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x512x128x!tt.ptr<bf16>, #blocked11>
2026-02-21T13:11:50.4253108Z     %27 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked>
2026-02-21T13:11:50.4253324Z     %28 = arith.extsi %1 : tensor<128xi32, #blocked6> to tensor<128xi64, #blocked6>
2026-02-21T13:11:50.4253605Z     %29 = ttg.convert_layout %28 : tensor<128xi64, #blocked6> -> tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T13:11:50.4253950Z     %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<128xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x128xi64, #blocked7>
2026-02-21T13:11:50.4254254Z     %31 = ttg.convert_layout %30 : tensor<1x128xi64, #blocked7> -> tensor<1x128xi64, #blocked8>
2026-02-21T13:11:50.4254560Z     %32 = ttg.convert_layout %31 : tensor<1x128xi64, #blocked8> -> tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>>
2026-02-21T13:11:50.4254921Z     %33 = tt.expand_dims %32 {axis = 1 : i32} : tensor<1x128xi64, #ttg.slice<{dim = 1, parent = #blocked9}>> -> tensor<1x1x128xi64, #blocked9>
2026-02-21T13:11:50.4255257Z     %34 = ttg.convert_layout %33 : tensor<1x1x128xi64, #blocked9> -> tensor<1x1x128xi64, #blocked>
2026-02-21T13:11:50.4255485Z     %35 = arith.cmpi sge, %34, %cst_0 : tensor<1x1x128xi64, #blocked>
2026-02-21T13:11:50.4255664Z     %36 = arith.cmpi slt, %34, %cst : tensor<1x1x128xi64, #blocked>
2026-02-21T13:11:50.4255838Z     %37 = arith.andi %35, %36 : tensor<1x1x128xi1, #blocked>
2026-02-21T13:11:50.4256008Z     scf.for %arg4 = %0 to %c786432_i32 step %c2432_i32  : i32 {
2026-02-21T13:11:50.4256185Z       %38 = arith.divsi %arg4, %c262144_i32 : i32
2026-02-21T13:11:50.4256319Z       %39 = arith.muli %38, %c64_i32 : i32
2026-02-21T13:11:50.4256446Z       %40 = arith.subi %c192_i32, %39 : i32
2026-02-21T13:11:50.4256603Z       %41 = arith.minsi %40, %c64_i32 : i32
2026-02-21T13:11:50.4256734Z       %42 = arith.remsi %arg4, %c262144_i32 : i32
2026-02-21T13:11:50.4256866Z       %43 = arith.remsi %42, %41 : i32
2026-02-21T13:11:50.4256987Z       %44 = arith.addi %39, %43 : i32
2026-02-21T13:11:50.4257105Z       %45 = arith.divsi %42, %41 : i32
2026-02-21T13:11:50.4257225Z       %46 = arith.muli %44, %c524288_i32 : i32
2026-02-21T13:11:50.4257346Z       %47 = arith.muli %45, %c128_i32 : i32
2026-02-21T13:11:50.4257461Z       %48 = arith.addi %46, %47 : i32
2026-02-21T13:11:50.4257594Z       %49 = tt.splat %48 : i32 -> tensor<1x1x128xi32, #blocked>
2026-02-21T13:11:50.4257748Z       %50 = arith.addi %49, %7 : tensor<1x1x128xi32, #blocked>
2026-02-21T13:11:50.4257945Z       %51 = tt.addptr %8, %50 : tensor<1x1x128x!tt.ptr<bf16>, #blocked>, tensor<1x1x128xi32, #blocked>
2026-02-21T13:11:50.4258149Z       %52 = tt.load %51 : tensor<1x1x128x!tt.ptr<bf16>, #blocked>
2026-02-21T13:11:50.4258288Z       %53 = arith.extsi %44 : i32 to i64
2026-02-21T13:11:50.4258409Z       %54 = arith.muli %53, %c524288_i64 : i64
2026-02-21T13:11:50.4258545Z       %55 = tt.splat %54 : i64 -> tensor<1x128x512xi64, #blocked1>
2026-02-21T13:11:50.4258689Z       %56 = arith.cmpi sge, %53, %c0_i64 : i64
2026-02-21T13:11:50.4258810Z       %57 = arith.cmpi slt, %53, %c192_i64 : i64
2026-02-21T13:11:50.4258933Z       %58 = arith.andi %56, %57 : i1
2026-02-21T13:11:50.4259063Z       %59 = tt.splat %58 : i1 -> tensor<1x128x1xi1, #blocked2>
2026-02-21T13:11:50.4259218Z       %60 = arith.andi %59, %23 : tensor<1x128x1xi1, #blocked2>
2026-02-21T13:11:50.4259415Z       %61 = tt.broadcast %60 : tensor<1x128x1xi1, #blocked2> -> tensor<1x128x512xi1, #blocked2>
2026-02-21T13:11:50.4259678Z       %62 = ttg.convert_layout %61 : tensor<1x128x512xi1, #blocked2> -> tensor<1x128x512xi1, #blocked1>
2026-02-21T13:11:50.4259920Z       %63 = tt.reshape %52 : tensor<1x1x128xbf16, #blocked> -> tensor<1x128xbf16, #blocked8>
2026-02-21T13:11:50.4260113Z       %64 = tt.splat %46 : i32 -> tensor<1x512x1xi32, #blocked3>
2026-02-21T13:11:50.4260474Z       %65:3 = scf.for %arg5 = %c0_i32 to %c4096_i32 step %c512_i32 iter_args(%arg6 = %cst_13, %arg7 = %cst_12, %arg8 = %cst_11) -> (tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked>)  : i32 {
2026-02-21T13:11:50.4260859Z         %91 = tt.splat %arg5 : i32 -> tensor<512xi32, #blocked6>
2026-02-21T13:11:50.4261013Z         %92 = arith.addi %91, %9 : tensor<512xi32, #blocked6>
2026-02-21T13:11:50.4261148Z         %93 = arith.extsi %arg5 : i32 to i64
2026-02-21T13:11:50.4261280Z         %94 = tt.splat %93 : i64 -> tensor<512xi64, #blocked6>
2026-02-21T13:11:50.4261430Z         %95 = arith.addi %94, %20 : tensor<512xi64, #blocked6>
2026-02-21T13:11:50.4261662Z         %96 = ttg.convert_layout %95 : tensor<512xi64, #blocked6> -> tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T13:11:50.4261990Z         %97 = tt.expand_dims %96 {axis = 0 : i32} : tensor<512xi64, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x512xi64, #blocked7>
2026-02-21T13:11:50.4262282Z         %98 = ttg.convert_layout %97 : tensor<1x512xi64, #blocked7> -> tensor<1x512xi64, #blocked5>
2026-02-21T13:11:50.4262573Z         %99 = ttg.convert_layout %98 : tensor<1x512xi64, #blocked5> -> tensor<1x512xi64, #ttg.slice<{dim = 1, parent = #blocked12}>>
2026-02-21T13:11:50.4262933Z         %100 = tt.expand_dims %99 {axis = 1 : i32} : tensor<1x512xi64, #ttg.slice<{dim = 1, parent = #blocked12}>> -> tensor<1x1x512xi64, #blocked12>
2026-02-21T13:11:50.4263248Z         %101 = ttg.convert_layout %100 : tensor<1x1x512xi64, #blocked12> -> tensor<1x1x512xi64, #blocked1>
2026-02-21T13:11:50.4263468Z         %102 = arith.muli %101, %cst_6 : tensor<1x1x512xi64, #blocked1>
2026-02-21T13:11:50.4263676Z         %103 = tt.broadcast %102 : tensor<1x1x512xi64, #blocked1> -> tensor<1x128x512xi64, #blocked1>
2026-02-21T13:11:50.4263886Z         %104 = arith.addi %19, %103 : tensor<1x128x512xi64, #blocked1>
2026-02-21T13:11:50.4264067Z         %105 = arith.addi %55, %104 : tensor<1x128x512xi64, #blocked1>
2026-02-21T13:11:50.4264286Z         %106 = tt.addptr %10, %105 : tensor<1x128x512x!tt.ptr<bf16>, #blocked1>, tensor<1x128x512xi64, #blocked1>
2026-02-21T13:11:50.4264512Z         %107 = arith.cmpi sge, %101, %cst_3 : tensor<1x1x512xi64, #blocked1>
2026-02-21T13:11:50.4264694Z         %108 = arith.cmpi slt, %101, %cst_2 : tensor<1x1x512xi64, #blocked1>
2026-02-21T13:11:50.4264867Z         %109 = arith.andi %107, %108 : tensor<1x1x512xi1, #blocked1>
2026-02-21T13:11:50.4265069Z         %110 = tt.broadcast %109 : tensor<1x1x512xi1, #blocked1> -> tensor<1x128x512xi1, #blocked1>
2026-02-21T13:11:50.4265276Z         %111 = arith.andi %62, %110 : tensor<1x128x512xi1, #blocked1>
2026-02-21T13:11:50.4265456Z         %112 = tt.load %106, %111, %cst_1 : tensor<1x128x512x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:11:50.4265687Z         %113 = tt.reshape %112 : tensor<1x128x512xbf16, #blocked1> -> tensor<128x512xbf16, #blocked5>
2026-02-21T13:11:50.4265991Z         %114 = ttg.convert_layout %63 : tensor<1x128xbf16, #blocked8> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>>
2026-02-21T13:11:50.4266350Z         %115 = ttg.convert_layout %113 : tensor<128x512xbf16, #blocked5> -> tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>>
2026-02-21T13:11:50.4266658Z         %116 = ttg.convert_layout %cst_10 : tensor<1x512xf32, #blocked5> -> tensor<1x512xf32, #blocked5>
2026-02-21T13:11:50.4267068Z         %117 = tt.dot %114, %115, %116, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked5}>> * tensor<128x512xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked5}>> -> tensor<1x512xf32, #blocked5>
2026-02-21T13:11:50.4267479Z         %118 = tt.reshape %117 : tensor<1x512xf32, #blocked5> -> tensor<1x1x512xf32, #blocked1>
2026-02-21T13:11:50.4267719Z         %119 = arith.truncf %118 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1>
2026-02-21T13:11:50.4267964Z         %120 = arith.extf %119 : tensor<1x1x512xbf16, #blocked1> to tensor<1x1x512xf32, #blocked1>
2026-02-21T13:11:50.4268155Z         %121 = "tt.reduce"(%120) <{axis = 2 : i32}> ({
2026-02-21T13:11:50.4268296Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:11:50.4268418Z           %176 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:11:50.4268543Z           tt.reduce.return %176 : f32
2026-02-21T13:11:50.4268734Z         }) : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:11:50.4269024Z         %122 = ttg.convert_layout %121 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4269297Z         %123 = arith.truncf %122 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4>
2026-02-21T13:11:50.4269518Z         %124 = arith.extf %123 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4269709Z         %125 = arith.mulf %124, %cst_9 : tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4269899Z         %126 = arith.truncf %125 : tensor<1x1xf32, #blocked4> to tensor<1x1xbf16, #blocked4>
2026-02-21T13:11:50.4270115Z         %127 = arith.extf %126 : tensor<1x1xbf16, #blocked4> to tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4270311Z         %128 = arith.cmpf ogt, %arg6, %127 : tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4270498Z         %129 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4270660Z         %130 = arith.ori %128, %129 : tensor<1x1xi1, #blocked4>
2026-02-21T13:11:50.4270852Z         %131 = arith.select %130, %arg6, %127 : tensor<1x1xi1, #blocked4>, tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4271056Z         %132 = arith.mulf %120, %cst_8 : tensor<1x1x512xf32, #blocked1>
2026-02-21T13:11:50.4271261Z         %133 = arith.truncf %132 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1>
2026-02-21T13:11:50.4271549Z         %134 = ttg.convert_layout %131 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>>
2026-02-21T13:11:50.4271906Z         %135 = tt.expand_dims %134 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13>
2026-02-21T13:11:50.4272210Z         %136 = ttg.convert_layout %135 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T13:11:50.4272458Z         %137 = arith.extf %133 : tensor<1x1x512xbf16, #blocked1> to tensor<1x1x512xf32, #blocked1>
2026-02-21T13:11:50.4272699Z         %138 = tt.broadcast %136 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x512xf32, #blocked14>
2026-02-21T13:11:50.4272954Z         %139 = ttg.convert_layout %138 : tensor<1x1x512xf32, #blocked14> -> tensor<1x1x512xf32, #blocked1>
2026-02-21T13:11:50.4273173Z         %140 = arith.subf %137, %139 : tensor<1x1x512xf32, #blocked1>
2026-02-21T13:11:50.4273483Z         %141 = tt.extern_elementwise %140 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1x512xf32, #blocked1>
2026-02-21T13:11:50.4273778Z         %142 = "tt.reduce"(%141) <{axis = 2 : i32}> ({
2026-02-21T13:11:50.4273905Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:11:50.4274025Z           %176 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:11:50.4274143Z           tt.reduce.return %176 : f32
2026-02-21T13:11:50.4274331Z         }) : (tensor<1x1x512xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:11:50.4274618Z         %143 = ttg.convert_layout %142 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4274861Z         %144 = arith.subf %arg6, %131 : tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4275187Z         %145 = tt.extern_elementwise %144 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked4>) -> tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4275475Z         %146 = arith.mulf %arg7, %145 : tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4275638Z         %147 = arith.addf %146, %143 : tensor<1x1xf32, #blocked4>
2026-02-21T13:11:50.4275878Z         %148 = ttg.convert_layout %145 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>>
2026-02-21T13:11:50.4276235Z         %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13>
2026-02-21T13:11:50.4276535Z         %150 = ttg.convert_layout %149 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T13:11:50.4276782Z         %151 = tt.broadcast %150 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x128xf32, #blocked14>
2026-02-21T13:11:50.4277035Z         %152 = ttg.convert_layout %151 : tensor<1x1x128xf32, #blocked14> -> tensor<1x1x128xf32, #blocked>
2026-02-21T13:11:50.4277245Z         %153 = arith.mulf %arg8, %152 : tensor<1x1x128xf32, #blocked>
2026-02-21T13:11:50.4277486Z         %154 = ttg.convert_layout %92 : tensor<512xi32, #blocked6> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked7}>>
2026-02-21T13:11:50.4277816Z         %155 = tt.expand_dims %154 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked7}>> -> tensor<1x512xi32, #blocked7>
2026-02-21T13:11:50.4278106Z         %156 = ttg.convert_layout %155 : tensor<1x512xi32, #blocked7> -> tensor<1x512xi32, #blocked5>
2026-02-21T13:11:50.4278414Z         %157 = ttg.convert_layout %156 : tensor<1x512xi32, #blocked5> -> tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked15}>>
2026-02-21T13:11:50.4278759Z         %158 = tt.expand_dims %157 {axis = 2 : i32} : tensor<1x512xi32, #ttg.slice<{dim = 2, parent = #blocked15}>> -> tensor<1x512x1xi32, #blocked15>
2026-02-21T13:11:50.4279071Z         %159 = ttg.convert_layout %158 : tensor<1x512x1xi32, #blocked15> -> tensor<1x512x1xi32, #blocked3>
2026-02-21T13:11:50.4279289Z         %160 = arith.muli %159, %cst_7 : tensor<1x512x1xi32, #blocked3>
2026-02-21T13:11:50.4279453Z         %161 = arith.addi %64, %160 : tensor<1x512x1xi32, #blocked3>
2026-02-21T13:11:50.4279671Z         %162 = tt.broadcast %161 : tensor<1x512x1xi32, #blocked3> -> tensor<1x512x128xi32, #blocked3>
2026-02-21T13:11:50.4279930Z         %163 = ttg.convert_layout %162 : tensor<1x512x128xi32, #blocked3> -> tensor<1x512x128xi32, #blocked11>
2026-02-21T13:11:50.4280153Z         %164 = arith.addi %163, %25 : tensor<1x512x128xi32, #blocked11>
2026-02-21T13:11:50.4280377Z         %165 = tt.addptr %26, %164 : tensor<1x512x128x!tt.ptr<bf16>, #blocked11>, tensor<1x512x128xi32, #blocked11>
2026-02-21T13:11:50.4280603Z         %166 = tt.load %165 : tensor<1x512x128x!tt.ptr<bf16>, #blocked11>
2026-02-21T13:11:50.4280813Z         %167 = arith.truncf %141 : tensor<1x1x512xf32, #blocked1> to tensor<1x1x512xbf16, #blocked1>
2026-02-21T13:11:50.4281051Z         %168 = tt.reshape %153 : tensor<1x1x128xf32, #blocked> -> tensor<1x128xf32, #blocked8>
2026-02-21T13:11:50.4281285Z         %169 = tt.reshape %167 : tensor<1x1x512xbf16, #blocked1> -> tensor<1x512xbf16, #blocked5>
2026-02-21T13:11:50.4281529Z         %170 = tt.reshape %166 : tensor<1x512x128xbf16, #blocked11> -> tensor<512x128xbf16, #blocked8>
2026-02-21T13:11:50.4281831Z         %171 = ttg.convert_layout %169 : tensor<1x512xbf16, #blocked5> -> tensor<1x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:11:50.4282196Z         %172 = ttg.convert_layout %170 : tensor<512x128xbf16, #blocked8> -> tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:11:50.4282497Z         %173 = ttg.convert_layout %168 : tensor<1x128xf32, #blocked8> -> tensor<1x128xf32, #blocked8>
2026-02-21T13:11:50.4282937Z         %174 = tt.dot %171, %172, %173, inputPrecision = tf32 : tensor<1x512xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<512x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x128xf32, #blocked8>
2026-02-21T13:11:50.4283351Z         %175 = tt.reshape %174 : tensor<1x128xf32, #blocked8> -> tensor<1x1x128xf32, #blocked>
2026-02-21T13:11:50.4283618Z         scf.yield %131, %147, %175 : tensor<1x1xf32, #blocked4>, tensor<1x1xf32, #blocked4>, tensor<1x1x128xf32, #blocked>
2026-02-21T13:11:50.4283835Z       } {tt.flatten, tt.num_stages = 4 : i32}
2026-02-21T13:11:50.4284082Z       %66 = ttg.convert_layout %65#1 : tensor<1x1xf32, #blocked4> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>>
2026-02-21T13:11:50.4284416Z       %67 = tt.expand_dims %66 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked13}>> -> tensor<1x1x1xf32, #blocked13>
2026-02-21T13:11:50.4284710Z       %68 = ttg.convert_layout %67 : tensor<1x1x1xf32, #blocked13> -> tensor<1x1x1xf32, #blocked14>
2026-02-21T13:11:50.4284951Z       %69 = tt.broadcast %68 : tensor<1x1x1xf32, #blocked14> -> tensor<1x1x128xf32, #blocked14>
2026-02-21T13:11:50.4285192Z       %70 = ttg.convert_layout %69 : tensor<1x1x128xf32, #blocked14> -> tensor<1x1x128xf32, #blocked>
2026-02-21T13:11:50.4285401Z       %71 = arith.divf %65#2, %70 : tensor<1x1x128xf32, #blocked>
2026-02-21T13:11:50.4285594Z       %72 = arith.truncf %71 : tensor<1x1x128xf32, #blocked> to tensor<1x1x128xbf16, #blocked>
2026-02-21T13:11:50.4285773Z       %73 = arith.extsi %44 : i32 to i64
2026-02-21T13:11:50.4285887Z       %74 = arith.extsi %45 : i32 to i64
2026-02-21T13:11:50.4286011Z       %75 = arith.muli %73, %c524288_i64 : i64
2026-02-21T13:11:50.4286152Z       %76 = tt.splat %75 : i64 -> tensor<1x1x128xi64, #blocked>
2026-02-21T13:11:50.4286312Z       %77 = arith.muli %74, %c128_i64 : i64
2026-02-21T13:11:50.4286447Z       %78 = tt.splat %77 : i64 -> tensor<1x1x128xi64, #blocked>
2026-02-21T13:11:50.4286604Z       %79 = arith.addi %78, %34 : tensor<1x1x128xi64, #blocked>
2026-02-21T13:11:50.4286761Z       %80 = arith.addi %76, %79 : tensor<1x1x128xi64, #blocked>
2026-02-21T13:11:50.4286961Z       %81 = tt.addptr %27, %80 : tensor<1x1x128x!tt.ptr<bf16>, #blocked>, tensor<1x1x128xi64, #blocked>
2026-02-21T13:11:50.4287159Z       %82 = arith.cmpi sge, %73, %c0_i64 : i64
2026-02-21T13:11:50.4287285Z       %83 = arith.cmpi slt, %73, %c192_i64 : i64
2026-02-21T13:11:50.4303275Z       %84 = arith.andi %82, %83 : i1
2026-02-21T13:11:50.4303410Z       %85 = arith.cmpi sge, %74, %c0_i64 : i64
2026-02-21T13:11:50.4303541Z       %86 = arith.cmpi slt, %74, %c4096_i64 : i64
2026-02-21T13:11:50.4303672Z       %87 = arith.andi %85, %86 : i1
2026-02-21T13:11:50.4303783Z       %88 = arith.andi %84, %87 : i1
2026-02-21T13:11:50.4303922Z       %89 = tt.splat %88 : i1 -> tensor<1x1x128xi1, #blocked>
2026-02-21T13:11:50.4304079Z       %90 = arith.andi %89, %37 : tensor<1x1x128xi1, #blocked>
2026-02-21T13:11:50.4304247Z       tt.store %81, %72, %90 : tensor<1x1x128x!tt.ptr<bf16>, #blocked>
2026-02-21T13:11:50.4304423Z     } {tt.loop_unroll_factor = 1 : i32, tt.num_stages = 1 : i32}
2026-02-21T13:11:50.4304566Z     tt.return
2026-02-21T13:11:50.4304654Z   }
2026-02-21T13:11:50.4304732Z }
2026-02-21T13:11:50.4304776Z 
2026-02-21T13:11:50.4304813Z {-#
2026-02-21T13:11:50.4304898Z   external_resources: {
2026-02-21T13:11:50.4305003Z     mlir_reproducer: {
2026-02-21T13:11:50.4307154Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=1 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T13:11:50.4309445Z       disable_threading: false,
2026-02-21T13:11:50.4309559Z       verify_each: true
2026-02-21T13:11:50.4309652Z     }
2026-02-21T13:11:50.4309731Z   }
2026-02-21T13:11:50.4309804Z #-}
2026-02-21T13:11:50.4310085Z /tmp/torchinductor_root/ih/ciheokr4uxsbdya5nc2jy32y2mmxkj4j7prhpucwzng6ytixyhh4.py:18:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T13:11:50.4310801Z /tmp/torchinductor_root/ih/ciheokr4uxsbdya5nc2jy32y2mmxkj4j7prhpucwzng6ytixyhh4.py:18:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T13:11:50.4311359Z [230s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T13:11:50.4312184Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 512], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T13:11:50.4312916Z Error: RuntimeError: PassManager::run failed
2026-02-21T13:11:50.4313085Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T13:12:02.0753735Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T13:12:02.0762427Z [242s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s])
2026-02-21T13:12:02.1258382Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 - configs/s
2026-02-21T13:12:03.2288375Z [243s] Initial random population of 100, 5 starting points: 
2026-02-21T13:12:03.2288874Z error=21
2026-02-21T13:12:03.2289081Z timeout=23
2026-02-21T13:12:03.2289283Z ok=56
2026-02-21T13:12:03.2289467Z min=7.3252
2026-02-21T13:12:03.2289662Z mid=46.2729
2026-02-21T13:12:03.2289873Z max=8495.9043
2026-02-21T13:12:03.2290112Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:12:03.2290518Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T13:12:03.2290901Z  'l2_groupings': [2],
2026-02-21T13:12:03.2291180Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:12:03.2291486Z  'loop_orders': [[0, 1]],
2026-02-21T13:12:03.2291753Z  'matrix_instr_nonkdim': 16,
2026-02-21T13:12:03.2292009Z  'num_stages': 1,
2026-02-21T13:12:03.2292239Z  'num_warps': 4,
2026-02-21T13:12:03.2292462Z  'pid_type': 'flat',
2026-02-21T13:12:03.2292717Z  'range_flattens': [None, False],
2026-02-21T13:12:03.2293022Z  'range_multi_buffers': [None, False],
2026-02-21T13:12:03.2293325Z  'range_num_stages': [0, 1],
2026-02-21T13:12:03.2293595Z  'range_unroll_factors': [0, 0],
2026-02-21T13:12:03.2293892Z  'range_warp_specializes': [],
2026-02-21T13:12:03.2294160Z  'waves_per_eu': 1}
2026-02-21T13:12:03.2303622Z [243s] Fitting surrogate: 100 points, 100 targets
2026-02-21T13:12:04.1775317Z [244s] Generation 1 starting: 96 neighbors, 5 active search path(s)
2026-02-21T13:12:28.8260207Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 1.5 configs/s
2026-02-21T13:12:47.1097083Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 98/98 5.4 configs/s
2026-02-21T13:12:47.4716855Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 33/33 - configs/s
2026-02-21T13:12:57.6373515Z [297s] Generation 1 complete: 
2026-02-21T13:12:57.6373936Z ok=101
2026-02-21T13:12:57.6374154Z min=6.0520
2026-02-21T13:12:57.6374369Z mid=12.1041
2026-02-21T13:12:57.6374590Z max=68.3722
2026-02-21T13:12:57.6374873Z best={'block_sizes': [1, 128, 32],
2026-02-21T13:12:57.6375294Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T13:12:57.6376242Z  'l2_groupings': [16],
2026-02-21T13:12:57.6376522Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:12:57.6376847Z  'loop_orders': [[0, 1]],
2026-02-21T13:12:57.6377132Z  'matrix_instr_nonkdim': 16,
2026-02-21T13:12:57.6377421Z  'num_stages': 1,
2026-02-21T13:12:57.6377655Z  'num_warps': 4,
2026-02-21T13:12:57.6377895Z  'pid_type': 'flat',
2026-02-21T13:12:57.6378155Z  'range_flattens': [None, None],
2026-02-21T13:12:57.6378468Z  'range_multi_buffers': [None, False],
2026-02-21T13:12:57.6378830Z  'range_num_stages': [0, 4],
2026-02-21T13:12:57.6379124Z  'range_unroll_factors': [0, 4],
2026-02-21T13:12:57.6379388Z  'range_warp_specializes': [],
2026-02-21T13:12:57.6379629Z  'waves_per_eu': 2}
2026-02-21T13:12:57.6414099Z [297s] Fitting surrogate: 201 points, 201 targets
2026-02-21T13:12:58.5833046Z [298s] Generation 2 starting: 88 neighbors, 5 active search path(s)
2026-02-21T13:13:36.1010180Z [336s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, False], range_num_stages=[3, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:13:49.9628305Z [350s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:13:49.9653660Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 0.4 configs/s
2026-02-21T13:14:05.7976919Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 90/90 5.7 configs/s
2026-02-21T13:14:06.2006992Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 33/33 - configs/s
2026-02-21T13:14:17.4107896Z [377s] Generation 2 complete: 
2026-02-21T13:14:17.4108307Z error=9
2026-02-21T13:14:17.4108500Z timeout=2
2026-02-21T13:14:17.4108688Z ok=82
2026-02-21T13:14:17.4108862Z min=5.8575
2026-02-21T13:14:17.4109053Z mid=8.0320
2026-02-21T13:14:17.4109231Z max=103.1657
2026-02-21T13:14:17.4109471Z best={'block_sizes': [1, 256, 64],
2026-02-21T13:14:17.4109850Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T13:14:17.4110230Z  'l2_groupings': [2],
2026-02-21T13:14:17.4110491Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:14:17.4110786Z  'loop_orders': [[0, 1]],
2026-02-21T13:14:17.4111047Z  'matrix_instr_nonkdim': 16,
2026-02-21T13:14:17.4111303Z  'num_stages': 1,
2026-02-21T13:14:17.4111516Z  'num_warps': 8,
2026-02-21T13:14:17.4111728Z  'pid_type': 'flat',
2026-02-21T13:14:17.4111990Z  'range_flattens': [None, None],
2026-02-21T13:14:17.4112273Z  'range_multi_buffers': [None, True],
2026-02-21T13:14:17.4112551Z  'range_num_stages': [0, 1],
2026-02-21T13:14:17.4112806Z  'range_unroll_factors': [0, 0],
2026-02-21T13:14:17.4113082Z  'range_warp_specializes': [],
2026-02-21T13:14:17.4113333Z  'waves_per_eu': 1}
2026-02-21T13:14:17.4148052Z [377s] Fitting surrogate: 294 points, 294 targets
2026-02-21T13:14:19.3291138Z [379s] Generation 3 starting: 86 neighbors, 5 active search path(s)
2026-02-21T13:14:53.6258322Z [413s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:14:53.8478032Z [413s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[None, False], range_num_stages=[3, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:14:57.5060542Z [417s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[2, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:14:57.5083028Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0.8 configs/s
2026-02-21T13:15:14.1933113Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 87/87 5.2 configs/s
2026-02-21T13:15:14.5767410Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 34/34 - configs/s
2026-02-21T13:15:25.4174523Z [445s] Generation 3 complete: 
2026-02-21T13:15:25.4174894Z error=5
2026-02-21T13:15:25.4175072Z timeout=3
2026-02-21T13:15:25.4175246Z ok=83
2026-02-21T13:15:25.4175424Z min=5.9373
2026-02-21T13:15:25.4175638Z mid=7.1941
2026-02-21T13:15:25.4175805Z max=123.1062
2026-02-21T13:15:25.4176016Z best={'block_sizes': [1, 256, 32],
2026-02-21T13:15:25.4176369Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'pointer'],
2026-02-21T13:15:25.4176713Z  'l2_groupings': [1],
2026-02-21T13:15:25.4176963Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:15:25.4177236Z  'loop_orders': [[1, 0]],
2026-02-21T13:15:25.4177862Z  'matrix_instr_nonkdim': 32,
2026-02-21T13:15:25.4178094Z  'num_stages': 2,
2026-02-21T13:15:25.4178315Z  'num_warps': 8,
2026-02-21T13:15:25.4178514Z  'pid_type': 'flat',
2026-02-21T13:15:25.4178743Z  'range_flattens': [None, False],
2026-02-21T13:15:25.4179005Z  'range_multi_buffers': [None, False],
2026-02-21T13:15:25.4179280Z  'range_num_stages': [0, 2],
2026-02-21T13:15:25.4179519Z  'range_unroll_factors': [0, 2],
2026-02-21T13:15:25.4179775Z  'range_warp_specializes': [],
2026-02-21T13:15:25.4180005Z  'waves_per_eu': 2}
2026-02-21T13:15:25.4216402Z [445s] Fitting surrogate: 385 points, 385 targets
2026-02-21T13:15:26.3676785Z [446s] Generation 4 starting: 85 neighbors, 5 active search path(s)
2026-02-21T13:16:03.6646411Z [483s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:16:08.3220036Z [488s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 16], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:16:08.3245874Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 0.6 configs/s
2026-02-21T13:16:19.3405510Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 87/87 8.0 configs/s
2026-02-21T13:16:19.7435410Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 37/37 - configs/s
2026-02-21T13:16:31.9892110Z [512s] Generation 4 complete: 
2026-02-21T13:16:31.9892349Z error=11
2026-02-21T13:16:31.9892445Z timeout=2
2026-02-21T13:16:31.9892828Z ok=78
2026-02-21T13:16:31.9892938Z min=5.3599
2026-02-21T13:16:31.9893027Z mid=7.0520
2026-02-21T13:16:31.9893118Z max=56.8961
2026-02-21T13:16:31.9893227Z best={'block_sizes': [1, 128, 32],
2026-02-21T13:16:31.9894655Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T13:16:31.9894939Z  'l2_groupings': [8],
2026-02-21T13:16:31.9895131Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:16:31.9895274Z  'loop_orders': [[0, 1]],
2026-02-21T13:16:31.9895403Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:16:31.9895520Z  'num_stages': 1,
2026-02-21T13:16:31.9895622Z  'num_warps': 4,
2026-02-21T13:16:31.9895726Z  'pid_type': 'flat',
2026-02-21T13:16:31.9895866Z  'range_flattens': [None, True],
2026-02-21T13:16:31.9896083Z  'range_multi_buffers': [None, False],
2026-02-21T13:16:31.9896305Z  'range_num_stages': [0, 3],
2026-02-21T13:16:31.9896505Z  'range_unroll_factors': [0, 4],
2026-02-21T13:16:31.9896714Z  'range_warp_specializes': [],
2026-02-21T13:16:31.9896877Z  'waves_per_eu': 2}
2026-02-21T13:16:31.9930691Z [512s] Fitting surrogate: 476 points, 476 targets
2026-02-21T13:16:32.8648038Z [513s] Generation 5 starting: 75 neighbors, 5 active search path(s)
2026-02-21T13:17:11.6563122Z [551s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:17:11.6582829Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 76/76 0.4 configs/s
2026-02-21T13:17:21.0513775Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 76/76 8.1 configs/s
2026-02-21T13:17:21.4037494Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 38/38 - configs/s
2026-02-21T13:17:32.1835423Z [572s] Generation 5 complete: 
2026-02-21T13:17:32.1835724Z error=8
2026-02-21T13:17:32.1835863Z timeout=1
2026-02-21T13:17:32.1835985Z ok=71
2026-02-21T13:17:32.1836133Z min=5.2410
2026-02-21T13:17:32.1836303Z mid=6.8149
2026-02-21T13:17:32.1836436Z max=64.5690
2026-02-21T13:17:32.1836581Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:17:32.1836837Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T13:17:32.1837094Z  'l2_groupings': [8],
2026-02-21T13:17:32.1837284Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:17:32.1837475Z  'loop_orders': [[0, 1]],
2026-02-21T13:17:32.1837649Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:17:32.1837815Z  'num_stages': 1,
2026-02-21T13:17:32.1837958Z  'num_warps': 4,
2026-02-21T13:17:32.1838097Z  'pid_type': 'flat',
2026-02-21T13:17:32.1838263Z  'range_flattens': [None, True],
2026-02-21T13:17:32.1838463Z  'range_multi_buffers': [None, False],
2026-02-21T13:17:32.1838652Z  'range_num_stages': [0, 3],
2026-02-21T13:17:32.1839226Z  'range_unroll_factors': [0, 4],
2026-02-21T13:17:32.1839410Z  'range_warp_specializes': [],
2026-02-21T13:17:32.1839574Z  'waves_per_eu': 2}
2026-02-21T13:17:32.1886924Z [572s] Fitting surrogate: 556 points, 556 targets
2026-02-21T13:17:32.8577626Z [573s] Generation 6 starting: 62 neighbors, 4 active search path(s)
2026-02-21T13:18:07.8462413Z [607s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 32], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:18:07.8478075Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 0.4 configs/s
2026-02-21T13:18:16.6066976Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 63/63 7.2 configs/s
2026-02-21T13:18:16.8540400Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 38/38 - configs/s
2026-02-21T13:18:24.3442272Z [624s] Generation 6 complete: 
2026-02-21T13:18:24.3442674Z error=7
2026-02-21T13:18:24.3443277Z timeout=1
2026-02-21T13:18:24.3443419Z ok=58
2026-02-21T13:18:24.3443559Z min=5.2401
2026-02-21T13:18:24.3443700Z mid=6.9285
2026-02-21T13:18:24.3443843Z max=93.9612
2026-02-21T13:18:24.3444010Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:18:24.3444322Z  'indexing': ['block_ptr', 'block_ptr', 'pointer', 'pointer'],
2026-02-21T13:18:24.3444611Z  'l2_groupings': [8],
2026-02-21T13:18:24.3444807Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:18:24.3445032Z  'loop_orders': [[0, 1]],
2026-02-21T13:18:24.3445230Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:18:24.3445417Z  'num_stages': 1,
2026-02-21T13:18:24.3445581Z  'num_warps': 4,
2026-02-21T13:18:24.3445874Z  'pid_type': 'flat',
2026-02-21T13:18:24.3446060Z  'range_flattens': [None, True],
2026-02-21T13:18:24.3446300Z  'range_multi_buffers': [None, True],
2026-02-21T13:18:24.3446518Z  'range_num_stages': [0, 3],
2026-02-21T13:18:24.3446718Z  'range_unroll_factors': [0, 4],
2026-02-21T13:18:24.3446927Z  'range_warp_specializes': [],
2026-02-21T13:18:24.3447131Z  'waves_per_eu': 2}
2026-02-21T13:18:24.3478797Z [624s] Fitting surrogate: 622 points, 622 targets
2026-02-21T13:18:25.0433786Z [625s] Generation 7 starting: 62 neighbors, 4 active search path(s)
2026-02-21T13:18:48.9709852Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63/63 0.8 configs/s
2026-02-21T13:18:57.6663068Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 63/63 7.3 configs/s
2026-02-21T13:18:57.9267338Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 39/39 - configs/s
2026-02-21T13:19:05.9791778Z [666s] Generation 7 complete: 
2026-02-21T13:19:05.9791999Z error=1
2026-02-21T13:19:05.9792119Z ok=65
2026-02-21T13:19:05.9792200Z min=5.2029
2026-02-21T13:19:05.9792281Z mid=7.0048
2026-02-21T13:19:05.9792356Z max=29.7676
2026-02-21T13:19:05.9792465Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:19:05.9792621Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:19:05.9792780Z  'l2_groupings': [8],
2026-02-21T13:19:05.9792892Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:19:05.9793007Z  'loop_orders': [[0, 1]],
2026-02-21T13:19:05.9793110Z  'matrix_instr_nonkdim': 32,
2026-02-21T13:19:05.9793208Z  'num_stages': 1,
2026-02-21T13:19:05.9793294Z  'num_warps': 4,
2026-02-21T13:19:05.9793724Z  'pid_type': 'flat',
2026-02-21T13:19:05.9793829Z  'range_flattens': [None, True],
2026-02-21T13:19:05.9793940Z  'range_multi_buffers': [None, False],
2026-02-21T13:19:05.9794055Z  'range_num_stages': [0, 1],
2026-02-21T13:19:05.9794155Z  'range_unroll_factors': [0, 1],
2026-02-21T13:19:05.9794263Z  'range_warp_specializes': [],
2026-02-21T13:19:05.9794364Z  'waves_per_eu': 2}
2026-02-21T13:19:05.9830254Z [666s] Fitting surrogate: 688 points, 688 targets
2026-02-21T13:19:06.6628236Z [666s] Generation 8 starting: 61 neighbors, 4 active search path(s)
2026-02-21T13:19:41.1228302Z [701s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[1, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:19:41.1247562Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 0.4 configs/s
2026-02-21T13:19:50.5502382Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 61/61 6.5 configs/s
2026-02-21T13:19:50.7340652Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 40/40 - configs/s
2026-02-21T13:19:56.4205748Z [716s] Generation 8 complete: 
2026-02-21T13:19:56.4206165Z error=4
2026-02-21T13:19:56.4206378Z timeout=1
2026-02-21T13:19:56.4206596Z ok=60
2026-02-21T13:19:56.4206794Z min=4.8733
2026-02-21T13:19:56.4206993Z mid=10.6151
2026-02-21T13:19:56.4207194Z max=81.1952
2026-02-21T13:19:56.4207423Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:19:56.4208290Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:19:56.4208704Z  'l2_groupings': [8],
2026-02-21T13:19:56.4208981Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:19:56.4209297Z  'loop_orders': [[0, 1]],
2026-02-21T13:19:56.4209590Z  'matrix_instr_nonkdim': 32,
2026-02-21T13:19:56.4209858Z  'num_stages': 1,
2026-02-21T13:19:56.4210081Z  'num_warps': 4,
2026-02-21T13:19:56.4210310Z  'pid_type': 'flat',
2026-02-21T13:19:56.4210565Z  'range_flattens': [None, None],
2026-02-21T13:19:56.4210866Z  'range_multi_buffers': [None, False],
2026-02-21T13:19:56.4211180Z  'range_num_stages': [0, 1],
2026-02-21T13:19:56.4211630Z  'range_unroll_factors': [0, 1],
2026-02-21T13:19:56.4211922Z  'range_warp_specializes': [],
2026-02-21T13:19:56.4212204Z  'waves_per_eu': 2}
2026-02-21T13:19:56.4244283Z [716s] Fitting surrogate: 753 points, 753 targets
2026-02-21T13:19:56.9177907Z [717s] Generation 9 starting: 37 neighbors, 2 active search path(s)
2026-02-21T13:20:28.5842498Z [748s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, False], range_num_stages=[1, 4], range_unroll_factors=[2, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:20:31.3808547Z [751s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[True, False], range_num_stages=[1, 4], range_unroll_factors=[3, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:20:33.0081468Z [753s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:20:33.0101190Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 0.4 configs/s
2026-02-21T13:20:39.8516657Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 38/38 5.5 configs/s
2026-02-21T13:20:39.9337274Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:20:42.4237043Z [762s] Generation 9 complete: 
2026-02-21T13:20:42.4237769Z error=2
2026-02-21T13:20:42.4237988Z timeout=3
2026-02-21T13:20:42.4238180Z ok=34
2026-02-21T13:20:42.4238381Z min=4.9134
2026-02-21T13:20:42.4238589Z mid=10.6098
2026-02-21T13:20:42.4238814Z max=99.0830
2026-02-21T13:20:42.4239039Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:20:42.4239461Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:20:42.4239874Z  'l2_groupings': [8],
2026-02-21T13:20:42.4240154Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:20:42.4240488Z  'loop_orders': [[0, 1]],
2026-02-21T13:20:42.4240766Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:20:42.4241031Z  'num_stages': 1,
2026-02-21T13:20:42.4241255Z  'num_warps': 4,
2026-02-21T13:20:42.4241493Z  'pid_type': 'flat',
2026-02-21T13:20:42.4241752Z  'range_flattens': [None, None],
2026-02-21T13:20:42.4242070Z  'range_multi_buffers': [None, False],
2026-02-21T13:20:42.4242450Z  'range_num_stages': [0, 1],
2026-02-21T13:20:42.4242781Z  'range_unroll_factors': [0, 1],
2026-02-21T13:20:42.4243037Z  'range_warp_specializes': [],
2026-02-21T13:20:42.4243266Z  'waves_per_eu': 2}
2026-02-21T13:20:42.4271159Z [762s] Fitting surrogate: 792 points, 792 targets
2026-02-21T13:20:42.8800958Z [763s] Generation 10 starting: 33 neighbors, 2 active search path(s)
2026-02-21T13:21:00.2781264Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 1.4 configs/s
2026-02-21T13:21:06.5214194Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 34/34 5.6 configs/s
2026-02-21T13:21:06.6024371Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:21:09.0120946Z [789s] Generation 10 complete: 
2026-02-21T13:21:09.0124033Z ok=35
2026-02-21T13:21:09.0124690Z min=4.8251
2026-02-21T13:21:09.0124977Z mid=9.9433
2026-02-21T13:21:09.0125178Z max=78.2560
2026-02-21T13:21:09.0125772Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:21:09.0126221Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:21:09.0126661Z  'l2_groupings': [8],
2026-02-21T13:21:09.0126933Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:21:09.0127248Z  'loop_orders': [[0, 1]],
2026-02-21T13:21:09.0127521Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:21:09.0127800Z  'num_stages': 1,
2026-02-21T13:21:09.0128029Z  'num_warps': 4,
2026-02-21T13:21:09.0128263Z  'pid_type': 'flat',
2026-02-21T13:21:09.0128522Z  'range_flattens': [None, None],
2026-02-21T13:21:09.0128825Z  'range_multi_buffers': [None, False],
2026-02-21T13:21:09.0129149Z  'range_num_stages': [0, 1],
2026-02-21T13:21:09.0129422Z  'range_unroll_factors': [0, 1],
2026-02-21T13:21:09.0129714Z  'range_warp_specializes': [],
2026-02-21T13:21:09.0129987Z  'waves_per_eu': 2}
2026-02-21T13:21:09.0154025Z [789s] Fitting surrogate: 827 points, 827 targets
2026-02-21T13:21:09.5060350Z [789s] Generation 11 starting: 33 neighbors, 2 active search path(s)
2026-02-21T13:21:27.2315611Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 1.1 configs/s
2026-02-21T13:21:34.3838769Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 33/33 4.7 configs/s
2026-02-21T13:21:34.4990097Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:21:38.0424536Z [818s] Generation 11 complete: 
2026-02-21T13:21:38.0425139Z ok=35
2026-02-21T13:21:38.0425396Z min=4.8449
2026-02-21T13:21:38.0425604Z mid=7.7070
2026-02-21T13:21:38.0425774Z max=95.6146
2026-02-21T13:21:38.0425970Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:21:38.0426673Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:21:38.0427019Z  'l2_groupings': [8],
2026-02-21T13:21:38.0427245Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:21:38.0427507Z  'loop_orders': [[0, 1]],
2026-02-21T13:21:38.0427744Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:21:38.0427962Z  'num_stages': 1,
2026-02-21T13:21:38.0428148Z  'num_warps': 4,
2026-02-21T13:21:38.0428349Z  'pid_type': 'flat',
2026-02-21T13:21:38.0428568Z  'range_flattens': [None, None],
2026-02-21T13:21:38.0433023Z  'range_multi_buffers': [None, False],
2026-02-21T13:21:38.0433282Z  'range_num_stages': [0, 1],
2026-02-21T13:21:38.0433507Z  'range_unroll_factors': [0, 1],
2026-02-21T13:21:38.0433850Z  'range_warp_specializes': [],
2026-02-21T13:21:38.0434072Z  'waves_per_eu': 2}
2026-02-21T13:21:38.0459837Z [818s] Fitting surrogate: 862 points, 862 targets
2026-02-21T13:21:38.3505516Z [818s] Generation 12 starting: 17 neighbors, 1 active search path(s)
2026-02-21T13:21:56.9153357Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 0.8 configs/s
2026-02-21T13:22:00.7402166Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 4.4 configs/s
2026-02-21T13:22:00.8371681Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:22:03.8106994Z [843s] Generation 12 complete: 
2026-02-21T13:22:03.8107464Z ok=19
2026-02-21T13:22:03.8108113Z min=5.0059
2026-02-21T13:22:03.8108331Z mid=5.4232
2026-02-21T13:22:03.8108533Z max=76.7894
2026-02-21T13:22:03.8108794Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:22:03.8109216Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:22:03.8109633Z  'l2_groupings': [8],
2026-02-21T13:22:03.8109921Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:22:03.8110242Z  'loop_orders': [[0, 1]],
2026-02-21T13:22:03.8110516Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:22:03.8110790Z  'num_stages': 1,
2026-02-21T13:22:03.8111022Z  'num_warps': 4,
2026-02-21T13:22:03.8111263Z  'pid_type': 'flat',
2026-02-21T13:22:03.8111522Z  'range_flattens': [None, None],
2026-02-21T13:22:03.8111828Z  'range_multi_buffers': [None, False],
2026-02-21T13:22:03.8112143Z  'range_num_stages': [0, 1],
2026-02-21T13:22:03.8112422Z  'range_unroll_factors': [0, 1],
2026-02-21T13:22:03.8112715Z  'range_warp_specializes': [],
2026-02-21T13:22:03.8112979Z  'waves_per_eu': 2}
2026-02-21T13:22:03.8142801Z [843s] Fitting surrogate: 881 points, 881 targets
2026-02-21T13:22:04.1023915Z [844s] Generation 13 starting: 16 neighbors, 1 active search path(s)
2026-02-21T13:22:10.3247249Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.5 configs/s
2026-02-21T13:22:12.9072220Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 6.3 configs/s
2026-02-21T13:22:13.0024391Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:22:15.9683027Z [856s] Generation 13 complete: 
2026-02-21T13:22:15.9683454Z ok=18
2026-02-21T13:22:15.9683664Z min=4.9615
2026-02-21T13:22:15.9683904Z mid=5.3730
2026-02-21T13:22:15.9684109Z max=72.5528
2026-02-21T13:22:15.9684351Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:22:15.9684772Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:22:15.9685187Z  'l2_groupings': [8],
2026-02-21T13:22:15.9685458Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:22:15.9685792Z  'loop_orders': [[0, 1]],
2026-02-21T13:22:15.9686068Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:22:15.9686327Z  'num_stages': 1,
2026-02-21T13:22:15.9686935Z  'num_warps': 4,
2026-02-21T13:22:15.9687168Z  'pid_type': 'flat',
2026-02-21T13:22:15.9687429Z  'range_flattens': [None, None],
2026-02-21T13:22:15.9687734Z  'range_multi_buffers': [None, False],
2026-02-21T13:22:15.9688054Z  'range_num_stages': [0, 1],
2026-02-21T13:22:15.9688337Z  'range_unroll_factors': [0, 1],
2026-02-21T13:22:15.9688621Z  'range_warp_specializes': [],
2026-02-21T13:22:15.9688903Z  'waves_per_eu': 2}
2026-02-21T13:22:15.9717888Z [856s] Fitting surrogate: 899 points, 899 targets
2026-02-21T13:22:16.2501575Z [856s] Generation 14 starting: 16 neighbors, 1 active search path(s)
2026-02-21T13:22:24.5072850Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 1.9 configs/s
2026-02-21T13:22:27.0159118Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 6.5 configs/s
2026-02-21T13:22:27.0742385Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:22:28.7860155Z [868s] Generation 14 complete: 
2026-02-21T13:22:28.7860485Z ok=18
2026-02-21T13:22:28.7860674Z min=4.9430
2026-02-21T13:22:28.7860868Z mid=10.0618
2026-02-21T13:22:28.7861051Z max=48.6261
2026-02-21T13:22:28.7861259Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:22:28.7861916Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:22:28.7862291Z  'l2_groupings': [8],
2026-02-21T13:22:28.7862543Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:22:28.7862828Z  'loop_orders': [[0, 1]],
2026-02-21T13:22:28.7863091Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:22:28.7863330Z  'num_stages': 1,
2026-02-21T13:22:28.7863541Z  'num_warps': 4,
2026-02-21T13:22:28.7863748Z  'pid_type': 'flat',
2026-02-21T13:22:28.7863984Z  'range_flattens': [None, None],
2026-02-21T13:22:28.7864257Z  'range_multi_buffers': [None, False],
2026-02-21T13:22:28.7864539Z  'range_num_stages': [0, 1],
2026-02-21T13:22:28.7864951Z  'range_unroll_factors': [0, 1],
2026-02-21T13:22:28.7865215Z  'range_warp_specializes': [],
2026-02-21T13:22:28.7865465Z  'waves_per_eu': 2}
2026-02-21T13:22:28.7894583Z [868s] Fitting surrogate: 917 points, 917 targets
2026-02-21T13:22:29.0815758Z [869s] Generation 15 starting: 16 neighbors, 1 active search path(s)
2026-02-21T13:22:35.8669563Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.2 configs/s
2026-02-21T13:22:38.4805606Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 6.2 configs/s
2026-02-21T13:22:38.5844539Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:22:41.7610292Z [881s] Generation 15 complete: 
2026-02-21T13:22:41.7610759Z ok=18
2026-02-21T13:22:41.7610970Z min=4.9948
2026-02-21T13:22:41.7611181Z mid=5.3809
2026-02-21T13:22:41.7611378Z max=75.6484
2026-02-21T13:22:41.7611611Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:22:41.7612020Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:22:41.7612479Z  'l2_groupings': [8],
2026-02-21T13:22:41.7612751Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:22:41.7613083Z  'loop_orders': [[0, 1]],
2026-02-21T13:22:41.7613367Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:22:41.7613629Z  'num_stages': 1,
2026-02-21T13:22:41.7613853Z  'num_warps': 4,
2026-02-21T13:22:41.7614095Z  'pid_type': 'flat',
2026-02-21T13:22:41.7614355Z  'range_flattens': [None, None],
2026-02-21T13:22:41.7614654Z  'range_multi_buffers': [None, False],
2026-02-21T13:22:41.7614961Z  'range_num_stages': [0, 1],
2026-02-21T13:22:41.7615231Z  'range_unroll_factors': [0, 1],
2026-02-21T13:22:41.7615887Z  'range_warp_specializes': [],
2026-02-21T13:22:41.7616158Z  'waves_per_eu': 2}
2026-02-21T13:22:41.7648497Z [881s] Fitting surrogate: 935 points, 935 targets
2026-02-21T13:22:42.0479863Z [882s] Generation 16 starting: 16 neighbors, 1 active search path(s)
2026-02-21T13:22:47.7240657Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 4.0 configs/s
2026-02-21T13:22:49.7441446Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 8.3 configs/s
2026-02-21T13:22:49.8085401Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:22:51.7640750Z [891s] Generation 16 complete: 
2026-02-21T13:22:51.7641136Z ok=18
2026-02-21T13:22:51.7641339Z min=4.9749
2026-02-21T13:22:51.7641588Z mid=7.2286
2026-02-21T13:22:51.7641792Z max=16.2258
2026-02-21T13:22:51.7642030Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:22:51.7642452Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:22:51.7643021Z  'l2_groupings': [8],
2026-02-21T13:22:51.7643298Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:22:51.7643616Z  'loop_orders': [[0, 1]],
2026-02-21T13:22:51.7643887Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:22:51.7644145Z  'num_stages': 1,
2026-02-21T13:22:51.7644377Z  'num_warps': 4,
2026-02-21T13:22:51.7644605Z  'pid_type': 'flat',
2026-02-21T13:22:51.7644872Z  'range_flattens': [None, None],
2026-02-21T13:22:51.7645174Z  'range_multi_buffers': [None, False],
2026-02-21T13:22:51.7645482Z  'range_num_stages': [0, 1],
2026-02-21T13:22:51.7645780Z  'range_unroll_factors': [0, 1],
2026-02-21T13:22:51.7646071Z  'range_warp_specializes': [],
2026-02-21T13:22:51.7646377Z  'waves_per_eu': 2}
2026-02-21T13:22:51.7677599Z [891s] Fitting surrogate: 953 points, 953 targets
2026-02-21T13:22:52.0570336Z [892s] Generation 17 starting: 16 neighbors, 1 active search path(s)
2026-02-21T13:22:59.2733103Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 2.3 configs/s
2026-02-21T13:23:02.0995523Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 16/16 5.7 configs/s
2026-02-21T13:23:02.2076930Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:23:05.5595055Z [905s] Generation 17 complete: 
2026-02-21T13:23:05.5595415Z ok=18
2026-02-21T13:23:05.5595623Z min=4.9209
2026-02-21T13:23:05.5595885Z mid=5.3017
2026-02-21T13:23:05.5596087Z max=76.3154
2026-02-21T13:23:05.5596819Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:23:05.5597247Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:23:05.5597684Z  'l2_groupings': [8],
2026-02-21T13:23:05.5597963Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:23:05.5598292Z  'loop_orders': [[0, 1]],
2026-02-21T13:23:05.5598589Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:23:05.5598854Z  'num_stages': 1,
2026-02-21T13:23:05.5599094Z  'num_warps': 4,
2026-02-21T13:23:05.5599327Z  'pid_type': 'flat',
2026-02-21T13:23:05.5599590Z  'range_flattens': [None, None],
2026-02-21T13:23:05.5599908Z  'range_multi_buffers': [None, False],
2026-02-21T13:23:05.5600227Z  'range_num_stages': [0, 1],
2026-02-21T13:23:05.5600512Z  'range_unroll_factors': [0, 1],
2026-02-21T13:23:05.5600811Z  'range_warp_specializes': [],
2026-02-21T13:23:05.5601088Z  'waves_per_eu': 2}
2026-02-21T13:23:05.5634519Z [905s] Fitting surrogate: 971 points, 971 targets
2026-02-21T13:23:05.8322106Z [905s] Generation 18 starting: 15 neighbors, 1 active search path(s)
2026-02-21T13:23:16.7857294Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 0.5 configs/s
2026-02-21T13:23:18.8101768Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 15/15 7.7 configs/s
2026-02-21T13:23:18.8854434Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:23:21.1895040Z [921s] Generation 18 complete: 
2026-02-21T13:23:21.1895465Z ok=17
2026-02-21T13:23:21.1895730Z min=4.9914
2026-02-21T13:23:21.1895947Z mid=6.2518
2026-02-21T13:23:21.1896152Z max=19.8900
2026-02-21T13:23:21.1896707Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:23:21.1897136Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:23:21.1897548Z  'l2_groupings': [8],
2026-02-21T13:23:21.1897837Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:23:21.1898158Z  'loop_orders': [[0, 1]],
2026-02-21T13:23:21.1898434Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:23:21.1898720Z  'num_stages': 1,
2026-02-21T13:23:21.1898950Z  'num_warps': 4,
2026-02-21T13:23:21.1899186Z  'pid_type': 'flat',
2026-02-21T13:23:21.1899592Z  'range_flattens': [None, None],
2026-02-21T13:23:21.1899904Z  'range_multi_buffers': [None, False],
2026-02-21T13:23:21.1900214Z  'range_num_stages': [0, 1],
2026-02-21T13:23:21.1900675Z  'range_unroll_factors': [0, 1],
2026-02-21T13:23:21.1900973Z  'range_warp_specializes': [],
2026-02-21T13:23:21.1901262Z  'waves_per_eu': 2}
2026-02-21T13:23:21.1928979Z [921s] Fitting surrogate: 988 points, 988 targets
2026-02-21T13:23:21.4725461Z [921s] Generation 19 starting: 15 neighbors, 1 active search path(s)
2026-02-21T13:23:27.8737019Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 2.3 configs/s
2026-02-21T13:23:30.1402534Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 15/15 6.8 configs/s
2026-02-21T13:23:30.1927649Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:23:31.7391240Z [931s] Generation 19 complete: 
2026-02-21T13:23:31.7391615Z ok=17
2026-02-21T13:23:31.7391739Z min=4.9294
2026-02-21T13:23:31.7391884Z mid=11.1025
2026-02-21T13:23:31.7392061Z max=19.9553
2026-02-21T13:23:31.7392235Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:23:31.7392561Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:23:31.7392891Z  'l2_groupings': [8],
2026-02-21T13:23:31.7393100Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:23:31.7393343Z  'loop_orders': [[0, 1]],
2026-02-21T13:23:31.7393556Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:23:31.7393756Z  'num_stages': 1,
2026-02-21T13:23:31.7393949Z  'num_warps': 4,
2026-02-21T13:23:31.7394130Z  'pid_type': 'flat',
2026-02-21T13:23:31.7394330Z  'range_flattens': [None, None],
2026-02-21T13:23:31.7394569Z  'range_multi_buffers': [None, False],
2026-02-21T13:23:31.7394804Z  'range_num_stages': [0, 1],
2026-02-21T13:23:31.7395029Z  'range_unroll_factors': [0, 1],
2026-02-21T13:23:31.7395249Z  'range_warp_specializes': [],
2026-02-21T13:23:31.7395786Z  'waves_per_eu': 2}
2026-02-21T13:23:31.7427969Z [931s] Fitting surrogate: 1005 points, 1005 targets
2026-02-21T13:23:32.0341152Z [932s] Generation 20 starting: 17 neighbors, 1 active search path(s)
2026-02-21T13:23:39.1135769Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 3.2 configs/s
2026-02-21T13:23:41.7786939Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 6.5 configs/s
2026-02-21T13:23:41.8542441Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 - configs/s
2026-02-21T13:23:44.1736136Z [944s] Generation 20 complete: 
2026-02-21T13:23:44.1736596Z error=1
2026-02-21T13:23:44.1736800Z ok=18
2026-02-21T13:23:44.1737006Z min=4.9479
2026-02-21T13:23:44.1737217Z mid=7.1990
2026-02-21T13:23:44.1737415Z max=75.5786
2026-02-21T13:23:44.1737650Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:23:44.1738066Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:23:44.1738482Z  'l2_groupings': [8],
2026-02-21T13:23:44.1738766Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:23:44.1739084Z  'loop_orders': [[0, 1]],
2026-02-21T13:23:44.1739683Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:23:44.1739948Z  'num_stages': 1,
2026-02-21T13:23:44.1740176Z  'num_warps': 4,
2026-02-21T13:23:44.1740412Z  'pid_type': 'flat',
2026-02-21T13:23:44.1740684Z  'range_flattens': [None, None],
2026-02-21T13:23:44.1740982Z  'range_multi_buffers': [None, False],
2026-02-21T13:23:44.1741296Z  'range_num_stages': [0, 1],
2026-02-21T13:23:44.1741579Z  'range_unroll_factors': [0, 1],
2026-02-21T13:23:44.1741807Z  'range_warp_specializes': [],
2026-02-21T13:23:44.1742068Z  'waves_per_eu': 2}
2026-02-21T13:23:44.1772455Z [944s] Fitting surrogate: 1024 points, 1024 targets
2026-02-21T13:23:44.3047828Z [944s] Autotuning complete in 944.5s after searching 938 configs.
2026-02-21T13:23:44.3048050Z One can hardcode the best config and skip autotuning with:
2026-02-21T13:23:44.3048929Z     @helion.kernel(config=helion.Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T13:23:44.3049572Z 
2026-02-21T13:23:44.3049757Z [944s] Code of selected kernel: /tmp/torchinductor_root/ue/cuel5li36jx6visw4dnehr6crgpmvnmtejhc4lconh7ghkb7a2du.py
2026-02-21T13:23:44.3282774Z from __future__ import annotations
2026-02-21T13:23:44.3283134Z 
2026-02-21T13:23:44.3283453Z import torch
2026-02-21T13:23:44.3283738Z import triton
2026-02-21T13:23:44.3284041Z import triton.language as tl
2026-02-21T13:23:44.3284412Z from torch._inductor.runtime import triton_helpers
2026-02-21T13:23:44.3284892Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T13:23:44.3285754Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T13:23:44.3286072Z 
2026-02-21T13:23:44.3286196Z _BLOCK_SIZE_1 = tl.constexpr(128)
2026-02-21T13:23:44.3286510Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T13:23:44.3286806Z _SHAPE_DIM = tl.constexpr(128)
2026-02-21T13:23:44.3287115Z _BLOCK_SIZE_3 = tl.constexpr(128)
2026-02-21T13:23:44.3287421Z _SHAPE_DIM_4 = tl.constexpr(128)
2026-02-21T13:23:44.3287710Z _SHAPE_DIM_5 = tl.constexpr(128)
2026-02-21T13:23:44.3287901Z 
2026-02-21T13:23:44.3287989Z @triton.jit
2026-02-21T13:23:44.3288368Z def _helion_attention(q_view, k_view, v_view, out, _RDIM_SIZE_2: tl.constexpr):
2026-02-21T13:23:44.3288993Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T13:23:44.3289443Z     num_pid_m = 192
2026-02-21T13:23:44.3289725Z     num_pid_n = tl.cdiv(4096, _BLOCK_SIZE_1)
2026-02-21T13:23:44.3289960Z     inner_2d_pid = tl.program_id(0)
2026-02-21T13:23:44.3290130Z     num_pid_in_group = 8 * num_pid_n
2026-02-21T13:23:44.3290309Z     group_id = inner_2d_pid // num_pid_in_group
2026-02-21T13:23:44.3290494Z     first_pid_m = group_id * 8
2026-02-21T13:23:44.3290671Z     group_size_m = min(num_pid_m - first_pid_m, 8)
2026-02-21T13:23:44.3290907Z     pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m
2026-02-21T13:23:44.3291165Z     pid_1 = inner_2d_pid % num_pid_in_group // group_size_m
2026-02-21T13:23:44.3291360Z     offset_0 = pid_0
2026-02-21T13:23:44.3291513Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T13:23:44.3291711Z     offset_1 = pid_1 * _BLOCK_SIZE_1
2026-02-21T13:23:44.3291887Z     indices_4 = tl.arange(0, _RDIM_SIZE_2).to(tl.int32)
2026-02-21T13:23:44.3292159Z     # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T13:23:44.3292455Z     m_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], float('-inf'), tl.float32)
2026-02-21T13:23:44.3292706Z     # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0)
2026-02-21T13:23:44.3292945Z     l_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 1.0, tl.float32)
2026-02-21T13:23:44.3293210Z     # src[attention.py:70]: acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32)
2026-02-21T13:23:44.3293578Z     acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128], 0.0, tl.float32)
2026-02-21T13:23:44.3293812Z     # src[attention.py:71]: q = q_view[tile_b, tile_m, :]
2026-02-21T13:23:44.3294264Z     q = tl.load(tl.make_block_ptr(q_view, [192, 4096, 128], [524288, 128, 1], [offset_0, offset_1, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _SHAPE_DIM], [2, 1, 0]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T13:23:44.3294796Z     # src[attention.py:72]: for tile_n in hl.tile(v_view.size(1)):
2026-02-21T13:23:44.3295044Z     # src[attention.py:73]:     k = k_view[tile_b, :, tile_n]
2026-02-21T13:23:44.3295252Z     # src[attention.py:74]:     qk = torch.bmm(q, k)
2026-02-21T13:23:44.3295424Z     # src[attention.py:72-85]: ...
2026-02-21T13:23:44.3295725Z     for offset_2 in tl.range(0, 4096, _BLOCK_SIZE_3, loop_unroll_factor=1, num_stages=1, disallow_acc_multi_buffer=True):
2026-02-21T13:23:44.3296077Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_3).to(tl.int32)
2026-02-21T13:23:44.3296277Z         q_copy = q
2026-02-21T13:23:44.3296393Z         m_i_copy = m_i
2026-02-21T13:23:44.3296518Z         l_i_copy = l_i
2026-02-21T13:23:44.3296670Z         acc_copy = acc
2026-02-21T13:23:44.3296796Z         q_copy_0 = q_copy
2026-02-21T13:23:44.3296933Z         m_i_copy_0 = m_i_copy
2026-02-21T13:23:44.3297067Z         l_i_copy_0 = l_i_copy
2026-02-21T13:23:44.3297207Z         acc_copy_0 = acc_copy
2026-02-21T13:23:44.3297374Z         # src[attention.py:73]: k = k_view[tile_b, :, tile_n]
2026-02-21T13:23:44.3297717Z         k = tl.load(k_view + (indices_0[:, None, None] * 524288 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None)
2026-02-21T13:23:44.3298044Z         # src[attention.py:74]: qk = torch.bmm(q, k)
2026-02-21T13:23:44.3298664Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T13:23:44.3299361Z         # src[attention.py:75]: m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale)
2026-02-21T13:23:44.3299613Z         amax = tl.cast(tl.max(qk, 2), tl.bfloat16)
2026-02-21T13:23:44.3299787Z         v_0 = 0.12751743074602467
2026-02-21T13:23:44.3299986Z         v_1 = tl.cast(amax * v_0, tl.bfloat16)
2026-02-21T13:23:44.3300120Z         v_2 = tl.cast(v_1, tl.float32)
2026-02-21T13:23:44.3300255Z         v_3 = triton_helpers.maximum(m_i_copy_0, v_2)
2026-02-21T13:23:44.3300430Z         # src[attention.py:76]: qk = qk * qk_scale - m_ij[:, :, None]
2026-02-21T13:23:44.3300583Z         v_4 = 0.12751743074602467
2026-02-21T13:23:44.3300700Z         v_5 = tl.cast(qk * v_4, tl.bfloat16)
2026-02-21T13:23:44.3300826Z         subscript = v_3[:, :, None]
2026-02-21T13:23:44.3300951Z         v_6 = tl.cast(v_5, tl.float32)
2026-02-21T13:23:44.3301067Z         v_7 = v_6 - subscript
2026-02-21T13:23:44.3301192Z         # src[attention.py:77]: p = torch.exp2(qk)
2026-02-21T13:23:44.3301327Z         v_8 = libdevice.exp2(v_7)
2026-02-21T13:23:44.3301461Z         # src[attention.py:78]: l_ij = torch.sum(p, -1)
2026-02-21T13:23:44.3301613Z         l_ij = tl.cast(tl.sum(v_8, 2), tl.float32)
2026-02-21T13:23:44.3301769Z         # src[attention.py:79]: alpha = torch.exp2(m_i - m_ij)
2026-02-21T13:23:44.3301919Z         v_9 = m_i_copy_0 - v_3
2026-02-21T13:23:44.3302030Z         v_10 = libdevice.exp2(v_9)
2026-02-21T13:23:44.3302169Z         # src[attention.py:80]: l_i = l_i * alpha + l_ij
2026-02-21T13:23:44.3302307Z         v_11 = l_i_copy_0 * v_10
2026-02-21T13:23:44.3302421Z         l_i = v_11 + l_ij
2026-02-21T13:23:44.3302548Z         # src[attention.py:81]: acc = acc * alpha[:, :, None]
2026-02-21T13:23:44.3302694Z         subscript_1 = v_10[:, :, None]
2026-02-21T13:23:44.3302819Z         v_13 = acc_copy_0 * subscript_1
2026-02-21T13:23:44.3302965Z         # src[attention.py:82]: v = v_view[tile_b, tile_n, :]
2026-02-21T13:23:44.3303345Z         v = tl.load(tl.make_block_ptr(v_view, [192, 4096, 128], [524288, 128, 1], [offset_0, offset_2, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_3, _SHAPE_DIM_4], [2, 1, 0]), boundary_check=[0, 1, 2], padding_option='zero')
2026-02-21T13:23:44.3303687Z         # src[attention.py:83]: p = p.to(v.dtype)
2026-02-21T13:23:44.3303824Z         v_14 = tl.cast(v_8, tl.bfloat16)
2026-02-21T13:23:44.3303976Z         # src[attention.py:84]: acc = torch.baddbmm(acc, p, v)
2026-02-21T13:23:44.3304480Z         acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128])
2026-02-21T13:23:44.3304942Z         # src[attention.py:85]: m_i = m_ij
2026-02-21T13:23:44.3305065Z         m_i = v_3
2026-02-21T13:23:44.3305179Z     # src[attention.py:87]: acc = acc / l_i[:, :, None]
2026-02-21T13:23:44.3305322Z     subscript_2 = l_i[:, :, None]
2026-02-21T13:23:44.3305438Z     v_15 = acc / subscript_2
2026-02-21T13:23:44.3305587Z     # src[attention.py:88]: out[tile_b, tile_m, :] = acc.to(out.dtype)
2026-02-21T13:23:44.3305767Z     v_16 = tl.cast(v_15, tl.bfloat16)
2026-02-21T13:23:44.3306071Z     tl.store(tl.make_block_ptr(out, [192, 4096, 128], [524288, 128, 1], [offset_0, offset_1, 0], [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _SHAPE_DIM_5], [2, 1, 0]), v_16, boundary_check=[0, 1, 2])
2026-02-21T13:23:44.3306335Z 
2026-02-21T13:23:44.3306480Z def attention(q_in: torch.Tensor, k_in: torch.Tensor, v_in: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T13:23:44.3306693Z     """
2026-02-21T13:23:44.3306791Z     Computes scaled dot-product attention.
2026-02-21T13:23:44.3306881Z 
2026-02-21T13:23:44.3307016Z     Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
2026-02-21T13:23:44.3307183Z 
2026-02-21T13:23:44.3307216Z     Args:
2026-02-21T13:23:44.3307333Z         q_in: Query tensor of shape [..., seq_len_q, head_dim]
2026-02-21T13:23:44.3307504Z         k_in: Key tensor of shape [..., seq_len_k, head_dim]
2026-02-21T13:23:44.3307674Z         v_in: Value tensor of shape [..., seq_len_k, head_dim]
2026-02-21T13:23:44.3307778Z 
2026-02-21T13:23:44.3307815Z     Returns:
2026-02-21T13:23:44.3307927Z         Output tensor of shape [..., seq_len_q, head_dim]
2026-02-21T13:23:44.3308061Z     """
2026-02-21T13:23:44.3308160Z     # src[attention.py:56]: m_dim = q_in.size(-2)
2026-02-21T13:23:44.3308292Z     m_dim = q_in.size(-2)
2026-02-21T13:23:44.3308413Z     # src[attention.py:57]: n_dim = k_in.size(-2)
2026-02-21T13:23:44.3308545Z     n_dim = k_in.size(-2)
2026-02-21T13:23:44.3308673Z     # src[attention.py:58]: assert n_dim == v_in.size(-2)
2026-02-21T13:23:44.3308815Z     assert n_dim == v_in.size(-2)
2026-02-21T13:23:44.3308973Z     # src[attention.py:59]: head_dim = hl.specialize(q_in.size(-1))
2026-02-21T13:23:44.3309130Z     head_dim = 128
2026-02-21T13:23:44.3309269Z     # src[attention.py:60]: assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T13:23:44.3309460Z     assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T13:23:44.3309641Z     # src[attention.py:61]: q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T13:23:44.3309805Z     q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T13:23:44.3309962Z     # src[attention.py:62]: v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T13:23:44.3310117Z     v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T13:23:44.3310296Z     # src[attention.py:63]: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T13:23:44.3310492Z     k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T13:23:44.3310654Z     # src[attention.py:64]: out = torch.empty_like(q_view)
2026-02-21T13:23:44.3310791Z     out = torch.empty_like(q_view)
2026-02-21T13:23:44.3310947Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T13:23:44.3311129Z     _BLOCK_SIZE_1 = 128
2026-02-21T13:23:44.3311221Z     _RDIM_SIZE_2 = 128
2026-02-21T13:23:44.3311363Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T13:23:44.3311587Z     # src[attention.py:68]:     m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T13:23:44.3311797Z     # src[attention.py:69]:     l_i = torch.full_like(m_i, 1.0)
2026-02-21T13:23:44.3311939Z     # src[attention.py:67-88]: ...
2026-02-21T13:23:44.3312251Z     _launcher(_helion_attention, (192 * triton.cdiv(4096, _BLOCK_SIZE_1),), q_view, k_view, v_view, out, _RDIM_SIZE_2, num_warps=4, num_stages=1, waves_per_eu=2, matrix_instr_nonkdim=0)
2026-02-21T13:23:44.3312574Z     # src[attention.py:89]: return out.view(q_in.size())
2026-02-21T13:23:44.3312707Z     return out.view(q_in.size())
2026-02-21T13:23:45.3577158Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T13:23:45.3579523Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T13:23:45.3581502Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T13:23:45.3581969Z WARNING:tritonbench.utils.triton_op:Completed input ID 5:
2026-02-21T13:23:45.3582389Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T13:23:45.3582718Z ------------------------------------------
2026-02-21T13:23:45.3583022Z (4, 48, 4096, 4096, 128)
2026-02-21T13:23:45.3583203Z 
2026-02-21T13:23:45.3584067Z  83%|████████▎ | 5/6 [56:51<13:17, 797.55s/it]WARNING:tritonbench.utils.triton_op:Running input ID 6:
2026-02-21T13:23:45.3584591Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T13:23:45.3584907Z ------------------------------------------
2026-02-21T13:23:45.3585205Z (4, 48, 8192, 8192, 128)
2026-02-21T13:23:45.3590234Z INFO:tritonbench.utils.triton_op:Took 0.06ms to get benchmark function for aten
2026-02-21T13:23:47.2369488Z INFO:tritonbench.utils.triton_op:Took 1.44ms to get benchmark function for flex_attention
2026-02-21T13:23:48.7369926Z WARNING:__main__:Input tensor metadata:
2026-02-21T13:23:48.7370377Z { 'args': ( { 'device': 'cuda:0',
2026-02-21T13:23:48.7370705Z               'dtype': 'torch.bfloat16',
2026-02-21T13:23:48.7371028Z               'shape': (4, 48, 8192, 128),
2026-02-21T13:23:48.7371357Z               'stride': (50331648, 1048576, 128, 1)},
2026-02-21T13:23:48.7371689Z             { 'device': 'cuda:0',
2026-02-21T13:23:48.7371992Z               'dtype': 'torch.bfloat16',
2026-02-21T13:23:48.7372301Z               'shape': (4, 48, 8192, 128),
2026-02-21T13:23:48.7372615Z               'stride': (50331648, 1048576, 128, 1)},
2026-02-21T13:23:48.7372959Z             { 'device': 'cuda:0',
2026-02-21T13:23:48.7373242Z               'dtype': 'torch.bfloat16',
2026-02-21T13:23:48.7373547Z               'shape': (4, 48, 8192, 128),
2026-02-21T13:23:48.7373873Z               'stride': (50331648, 1048576, 128, 1)}),
2026-02-21T13:23:48.7374184Z   'kwargs': {}}
2026-02-21T13:23:48.7415028Z INFO:tritonbench.utils.triton_op:Took 4.86ms to get benchmark function for helion_attention
2026-02-21T13:23:48.9863428Z [0s] Autotune random seed: 2150287535
2026-02-21T13:23:49.1214909Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0
2026-02-21T13:24:21.6519037Z [32s] Timeout after 30s compiling Config(block_sizes=[1, 512, 16], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:24:24.1257609Z [35s] Timeout after 30s compiling Config(block_sizes=[1, 4096, 128], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=1, num_stages=1, num_warps=8, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:24:24.4183187Z [35s] Timeout after 30s compiling Config(block_sizes=[1, 32, 1024], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, True], range_num_stages=[3, 3], range_unroll_factors=[2, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:24:26.5105472Z [37s] Timeout after 30s compiling Config(block_sizes=[1, 2, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[1, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:24:27.6041856Z [38s] Timeout after 30s compiling Config(block_sizes=[1, 32, 2048], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[False, False], range_num_stages=[3, 3], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:24:28.7854041Z [39s] Timeout after 30s compiling Config(block_sizes=[1, 2048, 32], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[None, True], range_num_stages=[1, 4], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:24:29.7400521Z [40s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[2, 4], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:24:31.4641828Z [42s] Timeout after 30s compiling Config(block_sizes=[1, 128, 1024], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=3, num_warps=8, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, None], range_num_stages=[3, 2], range_unroll_factors=[1, 1], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:24:33.1068618Z [43s] Timeout after 30s compiling Config(block_sizes=[1, 64, 1024], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:24:34.6430573Z [45s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 512], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:24:35.1492342Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:24:35.8291296Z [46s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=4, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, False], range_num_stages=[1, 0], range_unroll_factors=[4, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:24:36.3053733Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 4, 2048], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=64, num_stages=1, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:24:36.5686881Z [47s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:24:37.2166680Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 16, 4096], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:24:37.4821886Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 2, 2048], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=3, num_warps=16, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, True], range_num_stages=[4, 4], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:24:37.7443764Z [48s] Timeout after 30s compiling Config(block_sizes=[1, 256, 2048], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 2], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:24:40.2872079Z [51s] Timeout after 30s compiling Config(block_sizes=[1, 8, 2048], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[None, True], range_num_stages=[2, 1], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:24:40.5592823Z [51s] Timeout after 31s compiling Config(block_sizes=[1, 8, 4096], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:24:42.4304912Z [53s] Timeout after 30s compiling Config(block_sizes=[1, 4096, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[64], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 3], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:24:43.7679023Z [54s] Timeout after 30s compiling Config(block_sizes=[1, 2, 1024], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=4, num_warps=8, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[4, 3], range_unroll_factors=[1, 3], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:24:44.0559501Z [54s] Timeout after 30s compiling Config(block_sizes=[1, 32, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=4, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 3], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:24:44.3558474Z [55s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 3], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=4)
2026-02-21T13:24:44.3591704Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.1 configs/s
2026-02-21T13:28:44.9811843Z /tmp/torchinductor_root/sr/csruq75k5vrnqasgu6rcu5p5ucqjhzcjg6lokhjrwpi27fgp4emv.py:93:135: error: 'tt.load' op operation destroyed but still has uses
2026-02-21T13:28:44.9812729Z             v = tl.load(v_view + (indices_0[:, None, None] * 1048576 + indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None)
2026-02-21T13:28:44.9813266Z                                                                                                                                       ^
2026-02-21T13:28:44.9814749Z /tmp/torchinductor_root/sr/csruq75k5vrnqasgu6rcu5p5ucqjhzcjg6lokhjrwpi27fgp4emv.py:97:144: note: - use: %122 = "tt.reshape"(<<UNKNOWN SSA VALUE>>) : (tensor<1x4096x128xbf16, #ttg.blocked<{sizePerThread = [1, 1, 8], threadsPerWarp = [1, 4, 16], warpsPerCTA = [1, 2, 1], order = [2, 0, 1]}>>) -> tensor<4096x128xbf16, #ttg.linear<{register = [[0, 1], [0, 2], [0, 4], [8, 0], [16, 0], [32, 0], [64, 0], [128, 0], [256, 0], [512, 0], [1024, 0], [2048, 0]], lane = [[0, 8], [0, 16], [0, 32], [0, 64], [1, 0], [2, 0]], warp = [[4, 0]], block = []}>>
2026-02-21T13:28:44.9816051Z 
2026-02-21T13:28:44.9816750Z             acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128])
2026-02-21T13:28:44.9817966Z                                                                                                                                                ^
2026-02-21T13:28:44.9818323Z LLVM ERROR: operation destroyed but still has uses
2026-02-21T13:28:44.9840748Z #blocked = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 2, 1], order = [2, 1, 0]}>
2026-02-21T13:28:44.9841395Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [2, 1, 0]}>
2026-02-21T13:28:44.9841847Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}>
2026-02-21T13:28:44.9842289Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T13:28:44.9842790Z #blocked4 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}>
2026-02-21T13:28:44.9843200Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}>
2026-02-21T13:28:44.9843735Z #blocked6 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 1, 64], warpsPerCTA = [1, 1, 2], order = [0, 1, 2]}>
2026-02-21T13:28:44.9844202Z #blocked7 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [1, 64, 1], warpsPerCTA = [1, 2, 1], order = [0, 1, 2]}>
2026-02-21T13:28:44.9844648Z #blocked8 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}>
2026-02-21T13:28:44.9845104Z #blocked9 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [2, 1, 1], order = [0, 1, 2]}>
2026-02-21T13:28:44.9845687Z #blocked10 = #ttg.blocked<{sizePerThread = [1, 1, 1], threadsPerWarp = [64, 1, 1], warpsPerCTA = [2, 1, 1], order = [2, 1, 0]}>
2026-02-21T13:28:44.9846191Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} {
2026-02-21T13:28:44.9846957Z   tt.func public @_helion_attention(%arg0: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<bf16> {tt.divisibility = 16 : i32}, %arg3: !tt.ptr<bf16> {tt.divisibility = 16 : i32}) attributes {noinline = false} {
2026-02-21T13:28:44.9847555Z     %c128_i32 = arith.constant 128 : i32
2026-02-21T13:28:44.9847746Z     %c1048576_i32 = arith.constant 1048576 : i32
2026-02-21T13:28:44.9847936Z     %c4096_i32 = arith.constant 4096 : i32
2026-02-21T13:28:44.9848116Z     %c8192_i32 = arith.constant 8192 : i32
2026-02-21T13:28:44.9848280Z     %c0_i32 = arith.constant 0 : i32
2026-02-21T13:28:44.9848462Z     %c131072_i32 = arith.constant 131072 : i32
2026-02-21T13:28:44.9848638Z     %c1_i32 = arith.constant 1 : i32
2026-02-21T13:28:44.9848852Z     %cst = arith.constant dense<128> : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9849149Z     %cst_0 = arith.constant dense<0.127517432> : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9849453Z     %cst_1 = arith.constant dense<0.127517432> : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9849747Z     %cst_2 = arith.constant dense<0.000000e+00> : tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:44.9850025Z     %cst_3 = arith.constant dense<128> : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9850318Z     %cst_4 = arith.constant dense<0.000000e+00> : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9850605Z     %cst_5 = arith.constant dense<1.000000e+00> : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9850887Z     %cst_6 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9851120Z     %c16_i32 = arith.constant 16 : i32
2026-02-21T13:28:44.9851294Z     %c192_i32 = arith.constant 192 : i32
2026-02-21T13:28:44.9851472Z     %c1572864_i32 = arith.constant 1572864 : i32
2026-02-21T13:28:44.9851653Z     %c324_i32 = arith.constant 324 : i32
2026-02-21T13:28:44.9851821Z     %0 = tt.get_program_id x : i32
2026-02-21T13:28:44.9852026Z     %1 = arith.muli %0, %c324_i32 : i32
2026-02-21T13:28:44.9852191Z     %2 = arith.addi %1, %c324_i32 : i32
2026-02-21T13:28:44.9852370Z     %3 = arith.minsi %2, %c1572864_i32 : i32
2026-02-21T13:28:44.9852620Z     %4 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked4>
2026-02-21T13:28:44.9853020Z     %5 = ttg.convert_layout %4 : tensor<128xi32, #blocked4> -> tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:44.9853542Z     %6 = tt.expand_dims %5 {axis = 0 : i32} : tensor<128xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x128xi32, #blocked5>
2026-02-21T13:28:44.9853992Z     %7 = ttg.convert_layout %6 : tensor<1x128xi32, #blocked5> -> tensor<1x128xi32, #blocked3>
2026-02-21T13:28:44.9854435Z     %8 = ttg.convert_layout %7 : tensor<1x128xi32, #blocked3> -> tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:44.9854830Z     %9 = tt.expand_dims %8 {axis = 1 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x128xi32, #blocked6>
2026-02-21T13:28:44.9855183Z     %10 = ttg.convert_layout %9 : tensor<1x1x128xi32, #blocked6> -> tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:44.9855476Z     %11 = tt.splat %arg0 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9855723Z     %12 = tt.make_range {end = 4096 : i32, start = 0 : i32} : tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9856046Z     %13 = ttg.convert_layout %7 : tensor<1x128xi32, #blocked3> -> tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:44.9856435Z     %14 = tt.expand_dims %13 {axis = 2 : i32} : tensor<1x128xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x128x1xi32, #blocked7>
2026-02-21T13:28:44.9856790Z     %15 = ttg.convert_layout %14 : tensor<1x128x1xi32, #blocked7> -> tensor<1x128x1xi32, #blocked>
2026-02-21T13:28:44.9857087Z     %16 = tt.splat %arg1 : !tt.ptr<bf16> -> tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9857420Z     %17 = tt.broadcast %10 : tensor<1x1x128xi32, #blocked1> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9857696Z     %18 = tt.splat %arg2 : !tt.ptr<bf16> -> tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9857935Z     %19 = tt.splat %arg3 : !tt.ptr<bf16> -> tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9858126Z     %20 = arith.subi %3, %1 : i32
2026-02-21T13:28:44.9858265Z     %c1_i32_7 = arith.constant 1 : i32
2026-02-21T13:28:44.9858413Z     %21 = arith.subi %c1_i32, %c1_i32_7 : i32
2026-02-21T13:28:44.9858550Z     %22 = arith.addi %20, %21 : i32
2026-02-21T13:28:44.9858681Z     %23 = arith.divui %22, %c1_i32 : i32
2026-02-21T13:28:44.9858820Z     %c2_i32 = arith.constant 2 : i32
2026-02-21T13:28:44.9858947Z     %24 = arith.remsi %23, %c2_i32 : i32
2026-02-21T13:28:44.9859083Z     %25 = arith.subi %23, %24 : i32
2026-02-21T13:28:44.9859209Z     %26 = arith.muli %25, %c1_i32 : i32
2026-02-21T13:28:44.9859345Z     %27 = arith.addi %1, %26 : i32
2026-02-21T13:28:44.9859478Z     %28 = arith.muli %c1_i32, %c2_i32 : i32
2026-02-21T13:28:44.9859627Z     scf.for %arg4 = %1 to %27 step %28  : i32 {
2026-02-21T13:28:44.9859791Z       %29 = arith.divsi %arg4, %c131072_i32 : i32
2026-02-21T13:28:44.9859935Z       %30 = arith.muli %29, %c16_i32 : i32
2026-02-21T13:28:44.9860071Z       %31 = arith.subi %c192_i32, %30 : i32
2026-02-21T13:28:44.9860206Z       %32 = arith.minsi %31, %c16_i32 : i32
2026-02-21T13:28:44.9860347Z       %33 = arith.remsi %arg4, %c131072_i32 : i32
2026-02-21T13:28:44.9860484Z       %34 = arith.remsi %33, %32 : i32
2026-02-21T13:28:44.9860620Z       %35 = arith.addi %30, %34 : i32
2026-02-21T13:28:44.9860744Z       %36 = arith.divsi %33, %32 : i32
2026-02-21T13:28:44.9860883Z       %37 = arith.muli %35, %c1048576_i32 : i32
2026-02-21T13:28:44.9861027Z       %38 = arith.muli %36, %c128_i32 : i32
2026-02-21T13:28:44.9861154Z       %39 = arith.addi %37, %38 : i32
2026-02-21T13:28:44.9861321Z       %40 = tt.splat %39 : i32 -> tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:44.9861532Z       %41 = arith.addi %40, %10 : tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:44.9861786Z       %42 = tt.addptr %11, %41 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:44.9862030Z       %43 = tt.load %42 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9862218Z       %44 = tt.splat %37 : i32 -> tensor<1x128x1xi32, #blocked>
2026-02-21T13:28:44.9862408Z       %45 = arith.addi %44, %15 : tensor<1x128x1xi32, #blocked>
2026-02-21T13:28:44.9862661Z       %46 = tt.broadcast %45 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x4096xi32, #blocked>
2026-02-21T13:28:44.9862943Z       %47 = ttg.convert_layout %46 : tensor<1x128x4096xi32, #blocked> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9863194Z       %48 = tt.reshape %43 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3>
2026-02-21T13:28:44.9863387Z       %49 = tt.splat %37 : i32 -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9863527Z       %c0_i32_8 = arith.constant 0 : i32
2026-02-21T13:28:44.9863648Z       %c16384_i32 = arith.constant 16384 : i32
2026-02-21T13:28:44.9864002Z       %50:3 = scf.for %arg5 = %c0_i32 to %c0_i32_8 step %c16384_i32 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T13:28:44.9864352Z         %93 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9864533Z         %94 = arith.addi %93, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9864772Z         %95 = ttg.convert_layout %94 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:44.9865126Z         %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:44.9865421Z         %97 = ttg.convert_layout %96 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:44.9865722Z         %98 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:44.9866070Z         %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:44.9866380Z         %100 = ttg.convert_layout %99 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9866601Z         %101 = arith.muli %100, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9866811Z         %102 = tt.broadcast %101 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9867026Z         %103 = arith.addi %47, %102 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9867253Z         %104 = tt.addptr %16, %103 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9867476Z         %105 = tt.load %104 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9867693Z         %106 = tt.reshape %105 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:44.9867996Z         %107 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:44.9868354Z         %108 = ttg.convert_layout %106 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:44.9868672Z         %109 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9869089Z         %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9869497Z         %111 = ttg.convert_layout %110 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:44.9869759Z         %112 = tt.reshape %111 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9870002Z         %113 = arith.truncf %112 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9870253Z         %114 = arith.extf %113 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9870444Z         %115 = "tt.reduce"(%114) <{axis = 2 : i32}> ({
2026-02-21T13:28:44.9870586Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:44.9870706Z           %395 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:44.9870833Z           tt.reduce.return %395 : f32
2026-02-21T13:28:44.9871027Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:44.9871319Z         %116 = ttg.convert_layout %115 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9871598Z         %117 = arith.truncf %116 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9871844Z         %118 = arith.extf %117 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9872038Z         %119 = arith.mulf %118, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9872230Z         %120 = arith.truncf %119 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9872444Z         %121 = arith.extf %120 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9872642Z         %122 = arith.cmpf ogt, %arg6, %121 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9872813Z         %123 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9872982Z         %124 = arith.ori %122, %123 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:44.9873188Z         %125 = arith.select %124, %arg6, %121 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9873398Z         %126 = arith.mulf %114, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9873609Z         %127 = arith.truncf %126 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9873896Z         %128 = ttg.convert_layout %125 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9874231Z         %129 = tt.expand_dims %128 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9874531Z         %130 = ttg.convert_layout %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9874776Z         %131 = arith.extf %127 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9875025Z         %132 = tt.broadcast %130 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:44.9875280Z         %133 = ttg.convert_layout %132 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9875498Z         %134 = arith.subf %131, %133 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9875813Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9876107Z         %136 = "tt.reduce"(%135) <{axis = 2 : i32}> ({
2026-02-21T13:28:44.9876233Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:44.9876351Z           %395 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:44.9876477Z           tt.reduce.return %395 : f32
2026-02-21T13:28:44.9876663Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:44.9876958Z         %137 = ttg.convert_layout %136 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9877201Z         %138 = arith.subf %arg6, %125 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9877504Z         %139 = tt.extern_elementwise %138 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9877792Z         %140 = arith.mulf %arg7, %139 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9877949Z         %141 = arith.addf %140, %137 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9878191Z         %142 = ttg.convert_layout %139 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9878541Z         %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9878840Z         %144 = ttg.convert_layout %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9879090Z         %145 = tt.broadcast %144 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:44.9879341Z         %146 = ttg.convert_layout %145 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9879556Z         %147 = arith.mulf %arg8, %146 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9879821Z         %148 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:44.9880169Z         %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:44.9880479Z         %150 = ttg.convert_layout %149 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9880692Z         %151 = arith.muli %150, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9880853Z         %152 = arith.addi %49, %151 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9881072Z         %153 = tt.broadcast %152 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:44.9881333Z         %154 = ttg.convert_layout %153 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9881558Z         %155 = arith.addi %154, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9881785Z         %156 = tt.addptr %18, %155 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9882014Z         %157 = tt.load %156 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9882227Z         %158 = arith.truncf %135 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9882467Z         %159 = tt.reshape %147 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9882750Z         %160 = tt.reshape %158 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:44.9882996Z         %161 = tt.reshape %157 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:44.9883307Z         %162 = ttg.convert_layout %160 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:44.9883675Z         %163 = ttg.convert_layout %161 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:44.9883977Z         %164 = ttg.convert_layout %159 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9884386Z         %165 = tt.dot %162, %163, %164, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9884784Z         %166 = tt.reshape %165 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9884964Z         %c1_i32_12 = arith.constant 1 : i32
2026-02-21T13:28:44.9885095Z         %167 = arith.muli %c4096_i32, %c1_i32_12 : i32
2026-02-21T13:28:44.9885242Z         %168 = arith.addi %arg5, %167 : i32
2026-02-21T13:28:44.9885381Z         %169 = tt.splat %168 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9885539Z         %170 = arith.addi %169, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9885784Z         %171 = ttg.convert_layout %170 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:44.9886130Z         %172 = tt.expand_dims %171 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:44.9886446Z         %173 = ttg.convert_layout %172 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:44.9886740Z         %174 = ttg.convert_layout %173 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:44.9887091Z         %175 = tt.expand_dims %174 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:44.9887404Z         %176 = ttg.convert_layout %175 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9887646Z         %177 = arith.muli %176, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9887853Z         %178 = tt.broadcast %177 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9888064Z         %179 = arith.addi %47, %178 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9888288Z         %180 = tt.addptr %16, %179 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9888514Z         %181 = tt.load %180 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9888726Z         %182 = tt.reshape %181 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:44.9889049Z         %183 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:44.9889408Z         %184 = ttg.convert_layout %182 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:44.9889721Z         %185 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9890126Z         %186 = tt.dot %183, %184, %185, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9890536Z         %187 = ttg.convert_layout %186 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:44.9890780Z         %188 = tt.reshape %187 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9891023Z         %189 = arith.truncf %188 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9891272Z         %190 = arith.extf %189 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9891461Z         %191 = "tt.reduce"(%190) <{axis = 2 : i32}> ({
2026-02-21T13:28:44.9891588Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:44.9891708Z           %395 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:44.9891833Z           tt.reduce.return %395 : f32
2026-02-21T13:28:44.9892022Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:44.9892314Z         %192 = ttg.convert_layout %191 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9892584Z         %193 = arith.truncf %192 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9892804Z         %194 = arith.extf %193 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9892997Z         %195 = arith.mulf %194, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9893203Z         %196 = arith.truncf %195 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9893418Z         %197 = arith.extf %196 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9893615Z         %198 = arith.cmpf ogt, %125, %197 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9893779Z         %199 = arith.cmpf une, %125, %125 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9893952Z         %200 = arith.ori %198, %199 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:44.9894142Z         %201 = arith.select %200, %125, %197 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9894345Z         %202 = arith.mulf %190, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9894556Z         %203 = arith.truncf %202 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9894846Z         %204 = ttg.convert_layout %201 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9895183Z         %205 = tt.expand_dims %204 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9895490Z         %206 = ttg.convert_layout %205 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9895734Z         %207 = arith.extf %203 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9895980Z         %208 = tt.broadcast %206 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:44.9896235Z         %209 = ttg.convert_layout %208 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9896453Z         %210 = arith.subf %207, %209 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9896779Z         %211 = tt.extern_elementwise %210 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9897070Z         %212 = "tt.reduce"(%211) <{axis = 2 : i32}> ({
2026-02-21T13:28:44.9897197Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:44.9897314Z           %395 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:44.9897436Z           tt.reduce.return %395 : f32
2026-02-21T13:28:44.9897620Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:44.9897914Z         %213 = ttg.convert_layout %212 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9898155Z         %214 = arith.subf %125, %201 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9898437Z         %215 = tt.extern_elementwise %214 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9898722Z         %216 = arith.mulf %141, %215 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9898880Z         %217 = arith.addf %216, %213 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9899119Z         %218 = ttg.convert_layout %215 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9899450Z         %219 = tt.expand_dims %218 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9899742Z         %220 = ttg.convert_layout %219 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9899991Z         %221 = tt.broadcast %220 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:44.9900242Z         %222 = ttg.convert_layout %221 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9900457Z         %223 = arith.mulf %166, %222 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9900707Z         %224 = ttg.convert_layout %173 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:44.9901071Z         %225 = tt.expand_dims %224 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:44.9901383Z         %226 = ttg.convert_layout %225 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9901596Z         %227 = arith.muli %226, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9901781Z         %228 = arith.addi %49, %227 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9901983Z         %229 = tt.broadcast %228 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:44.9902242Z         %230 = ttg.convert_layout %229 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9902469Z         %231 = arith.addi %230, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9902691Z         %232 = tt.addptr %18, %231 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9902919Z         %233 = tt.load %232 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9903148Z         %234 = arith.truncf %211 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9903388Z         %235 = tt.reshape %223 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9903625Z         %236 = tt.reshape %234 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:44.9903871Z         %237 = tt.reshape %233 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:44.9904174Z         %238 = ttg.convert_layout %236 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:44.9916563Z         %239 = ttg.convert_layout %237 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:44.9916892Z         %240 = ttg.convert_layout %235 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9917302Z         %241 = tt.dot %238, %239, %240, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9917700Z         %242 = tt.reshape %241 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9917885Z         %c2_i32_13 = arith.constant 2 : i32
2026-02-21T13:28:44.9918015Z         %243 = arith.muli %c4096_i32, %c2_i32_13 : i32
2026-02-21T13:28:44.9918139Z         %244 = arith.addi %arg5, %243 : i32
2026-02-21T13:28:44.9918280Z         %245 = tt.splat %244 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9918438Z         %246 = arith.addi %245, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9918681Z         %247 = ttg.convert_layout %246 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:44.9919024Z         %248 = tt.expand_dims %247 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:44.9919320Z         %249 = ttg.convert_layout %248 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:44.9919616Z         %250 = ttg.convert_layout %249 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:44.9919970Z         %251 = tt.expand_dims %250 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:44.9920279Z         %252 = ttg.convert_layout %251 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9920502Z         %253 = arith.muli %252, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9920709Z         %254 = tt.broadcast %253 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9920945Z         %255 = arith.addi %47, %254 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9921164Z         %256 = tt.addptr %16, %255 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9921388Z         %257 = tt.load %256 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9921604Z         %258 = tt.reshape %257 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:44.9921922Z         %259 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:44.9922283Z         %260 = ttg.convert_layout %258 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:44.9922634Z         %261 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9923073Z         %262 = tt.dot %259, %260, %261, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9923476Z         %263 = ttg.convert_layout %262 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:44.9923715Z         %264 = tt.reshape %263 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9923962Z         %265 = arith.truncf %264 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9924215Z         %266 = arith.extf %265 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9924405Z         %267 = "tt.reduce"(%266) <{axis = 2 : i32}> ({
2026-02-21T13:28:44.9924552Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:44.9924672Z           %395 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:44.9924802Z           tt.reduce.return %395 : f32
2026-02-21T13:28:44.9924990Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:44.9925279Z         %268 = ttg.convert_layout %267 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9925550Z         %269 = arith.truncf %268 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9925771Z         %270 = arith.extf %269 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9925964Z         %271 = arith.mulf %270, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9926154Z         %272 = arith.truncf %271 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9926375Z         %273 = arith.extf %272 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9926569Z         %274 = arith.cmpf ogt, %201, %273 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9926733Z         %275 = arith.cmpf une, %201, %201 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9926894Z         %276 = arith.ori %274, %275 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:44.9927085Z         %277 = arith.select %276, %201, %273 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9927288Z         %278 = arith.mulf %266, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9927498Z         %279 = arith.truncf %278 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9927782Z         %280 = ttg.convert_layout %277 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9928114Z         %281 = tt.expand_dims %280 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9928406Z         %282 = ttg.convert_layout %281 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9928678Z         %283 = arith.extf %279 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9928925Z         %284 = tt.broadcast %282 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:44.9929182Z         %285 = ttg.convert_layout %284 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9929401Z         %286 = arith.subf %283, %285 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9929728Z         %287 = tt.extern_elementwise %286 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9930017Z         %288 = "tt.reduce"(%287) <{axis = 2 : i32}> ({
2026-02-21T13:28:44.9930144Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:44.9930263Z           %395 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:44.9930383Z           tt.reduce.return %395 : f32
2026-02-21T13:28:44.9930572Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:44.9930878Z         %289 = ttg.convert_layout %288 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9931116Z         %290 = arith.subf %201, %277 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9931397Z         %291 = tt.extern_elementwise %290 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9931685Z         %292 = arith.mulf %217, %291 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9931841Z         %293 = arith.addf %292, %289 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9932101Z         %294 = ttg.convert_layout %291 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9932437Z         %295 = tt.expand_dims %294 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9932735Z         %296 = ttg.convert_layout %295 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9932987Z         %297 = tt.broadcast %296 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:44.9933243Z         %298 = ttg.convert_layout %297 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9933464Z         %299 = arith.mulf %242, %298 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9933724Z         %300 = ttg.convert_layout %249 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:44.9934072Z         %301 = tt.expand_dims %300 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:44.9934387Z         %302 = ttg.convert_layout %301 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9934603Z         %303 = arith.muli %302, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9934773Z         %304 = arith.addi %49, %303 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9934980Z         %305 = tt.broadcast %304 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:44.9935244Z         %306 = ttg.convert_layout %305 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9935472Z         %307 = arith.addi %306, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9935695Z         %308 = tt.addptr %18, %307 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9935928Z         %309 = tt.load %308 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9936149Z         %310 = arith.truncf %287 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9936416Z         %311 = tt.reshape %299 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9936657Z         %312 = tt.reshape %310 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:44.9936905Z         %313 = tt.reshape %309 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:44.9937214Z         %314 = ttg.convert_layout %312 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:44.9937596Z         %315 = ttg.convert_layout %313 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:44.9937902Z         %316 = ttg.convert_layout %311 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9938318Z         %317 = tt.dot %314, %315, %316, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9938718Z         %318 = tt.reshape %317 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9938916Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T13:28:44.9939047Z         %319 = arith.muli %c4096_i32, %c3_i32 : i32
2026-02-21T13:28:44.9939174Z         %320 = arith.addi %arg5, %319 : i32
2026-02-21T13:28:44.9939317Z         %321 = tt.splat %320 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9939477Z         %322 = arith.addi %321, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9939722Z         %323 = ttg.convert_layout %322 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:44.9940078Z         %324 = tt.expand_dims %323 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:44.9940377Z         %325 = ttg.convert_layout %324 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:44.9940678Z         %326 = ttg.convert_layout %325 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:44.9941026Z         %327 = tt.expand_dims %326 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:44.9941345Z         %328 = ttg.convert_layout %327 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9941570Z         %329 = arith.muli %328, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9941781Z         %330 = tt.broadcast %329 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9942000Z         %331 = arith.addi %47, %330 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9942223Z         %332 = tt.addptr %16, %331 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9942459Z         %333 = tt.load %332 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9942680Z         %334 = tt.reshape %333 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:44.9942983Z         %335 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:44.9943349Z         %336 = ttg.convert_layout %334 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:44.9943662Z         %337 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9944082Z         %338 = tt.dot %335, %336, %337, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9944488Z         %339 = ttg.convert_layout %338 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:44.9944772Z         %340 = tt.reshape %339 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9945022Z         %341 = arith.truncf %340 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9945275Z         %342 = arith.extf %341 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9945482Z         %343 = "tt.reduce"(%342) <{axis = 2 : i32}> ({
2026-02-21T13:28:44.9945615Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:44.9945738Z           %395 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:44.9945871Z           tt.reduce.return %395 : f32
2026-02-21T13:28:44.9946224Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:44.9946529Z         %344 = ttg.convert_layout %343 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9946809Z         %345 = arith.truncf %344 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9947051Z         %346 = arith.extf %345 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9947250Z         %347 = arith.mulf %346, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9947443Z         %348 = arith.truncf %347 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9947670Z         %349 = arith.extf %348 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9947869Z         %350 = arith.cmpf ogt, %277, %349 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9948036Z         %351 = arith.cmpf une, %277, %277 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9948219Z         %352 = arith.ori %350, %351 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:44.9948412Z         %353 = arith.select %352, %277, %349 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9948624Z         %354 = arith.mulf %342, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9948839Z         %355 = arith.truncf %354 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9949130Z         %356 = ttg.convert_layout %353 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9949468Z         %357 = tt.expand_dims %356 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9949768Z         %358 = ttg.convert_layout %357 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9950025Z         %359 = arith.extf %355 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9950281Z         %360 = tt.broadcast %358 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:44.9950540Z         %361 = ttg.convert_layout %360 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9950763Z         %362 = arith.subf %359, %361 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9951071Z         %363 = tt.extern_elementwise %362 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9951369Z         %364 = "tt.reduce"(%363) <{axis = 2 : i32}> ({
2026-02-21T13:28:44.9951502Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:44.9951623Z           %395 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:44.9951749Z           tt.reduce.return %395 : f32
2026-02-21T13:28:44.9951937Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:44.9952235Z         %365 = ttg.convert_layout %364 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9952498Z         %366 = arith.subf %277, %353 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9952785Z         %367 = tt.extern_elementwise %366 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9953077Z         %368 = arith.mulf %293, %367 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9953237Z         %369 = arith.addf %368, %365 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9953495Z         %370 = ttg.convert_layout %367 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9953829Z         %371 = tt.expand_dims %370 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9954125Z         %372 = ttg.convert_layout %371 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9954376Z         %373 = tt.broadcast %372 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:44.9954632Z         %374 = ttg.convert_layout %373 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9954868Z         %375 = arith.mulf %318, %374 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9955124Z         %376 = ttg.convert_layout %325 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:44.9955475Z         %377 = tt.expand_dims %376 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:44.9955792Z         %378 = ttg.convert_layout %377 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9956005Z         %379 = arith.muli %378, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9956195Z         %380 = arith.addi %49, %379 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9956401Z         %381 = tt.broadcast %380 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:44.9956662Z         %382 = ttg.convert_layout %381 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9956890Z         %383 = arith.addi %382, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9957117Z         %384 = tt.addptr %18, %383 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9957349Z         %385 = tt.load %384 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9957567Z         %386 = arith.truncf %363 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9957812Z         %387 = tt.reshape %375 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9958056Z         %388 = tt.reshape %386 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:44.9958309Z         %389 = tt.reshape %385 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:44.9958622Z         %390 = ttg.convert_layout %388 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:44.9958990Z         %391 = ttg.convert_layout %389 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:44.9959294Z         %392 = ttg.convert_layout %387 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9959705Z         %393 = tt.dot %390, %391, %392, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9960103Z         %394 = tt.reshape %393 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9960377Z         scf.yield %353, %369, %394 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9960621Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T13:28:44.9960956Z       %51:3 = scf.for %arg5 = %c0_i32_8 to %c8192_i32 step %c4096_i32 iter_args(%arg6 = %50#0, %arg7 = %50#1, %arg8 = %50#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T13:28:44.9961306Z         %93 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9961488Z         %94 = arith.addi %93, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9961729Z         %95 = ttg.convert_layout %94 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:44.9962068Z         %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:44.9962364Z         %97 = ttg.convert_layout %96 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:44.9962711Z         %98 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:44.9963087Z         %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:44.9963397Z         %100 = ttg.convert_layout %99 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9963624Z         %101 = arith.muli %100, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9963837Z         %102 = tt.broadcast %101 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9964054Z         %103 = arith.addi %47, %102 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9964309Z         %104 = tt.addptr %16, %103 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9964536Z         %105 = tt.load %104 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9964758Z         %106 = tt.reshape %105 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:44.9965062Z         %107 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:44.9965425Z         %108 = ttg.convert_layout %106 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:44.9965747Z         %109 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9966168Z         %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9966580Z         %111 = ttg.convert_layout %110 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:44.9966832Z         %112 = tt.reshape %111 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9967082Z         %113 = arith.truncf %112 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9967340Z         %114 = arith.extf %113 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9967536Z         %115 = "tt.reduce"(%114) <{axis = 2 : i32}> ({
2026-02-21T13:28:44.9967670Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:44.9967792Z           %167 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:44.9967923Z           tt.reduce.return %167 : f32
2026-02-21T13:28:44.9968118Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:44.9968423Z         %116 = ttg.convert_layout %115 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9968717Z         %117 = arith.truncf %116 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9968938Z         %118 = arith.extf %117 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9969136Z         %119 = arith.mulf %118, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9969331Z         %120 = arith.truncf %119 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9969550Z         %121 = arith.extf %120 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9969767Z         %122 = arith.cmpf ogt, %arg6, %121 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9969939Z         %123 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9970109Z         %124 = arith.ori %122, %123 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:44.9970305Z         %125 = arith.select %124, %arg6, %121 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9970519Z         %126 = arith.mulf %114, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9970736Z         %127 = arith.truncf %126 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9971041Z         %128 = ttg.convert_layout %125 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9971377Z         %129 = tt.expand_dims %128 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9971675Z         %130 = ttg.convert_layout %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9971924Z         %131 = arith.extf %127 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9972190Z         %132 = tt.broadcast %130 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:44.9972445Z         %133 = ttg.convert_layout %132 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9972667Z         %134 = arith.subf %131, %133 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9972983Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9973273Z         %136 = "tt.reduce"(%135) <{axis = 2 : i32}> ({
2026-02-21T13:28:44.9973401Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:44.9973523Z           %167 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:44.9973644Z           tt.reduce.return %167 : f32
2026-02-21T13:28:44.9973831Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:44.9974122Z         %137 = ttg.convert_layout %136 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9974365Z         %138 = arith.subf %arg6, %125 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9974650Z         %139 = tt.extern_elementwise %138 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9974940Z         %140 = arith.mulf %arg7, %139 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9975098Z         %141 = arith.addf %140, %137 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9975337Z         %142 = ttg.convert_layout %139 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9975670Z         %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9975965Z         %144 = ttg.convert_layout %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9976212Z         %145 = tt.broadcast %144 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:44.9976477Z         %146 = ttg.convert_layout %145 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9976692Z         %147 = arith.mulf %arg8, %146 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9976943Z         %148 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:44.9977288Z         %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:44.9977612Z         %150 = ttg.convert_layout %149 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9977826Z         %151 = arith.muli %150, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9977992Z         %152 = arith.addi %49, %151 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9978197Z         %153 = tt.broadcast %152 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:44.9978458Z         %154 = ttg.convert_layout %153 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9978696Z         %155 = arith.addi %154, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9978917Z         %156 = tt.addptr %18, %155 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:44.9979146Z         %157 = tt.load %156 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9979359Z         %158 = arith.truncf %135 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9979600Z         %159 = tt.reshape %147 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9979835Z         %160 = tt.reshape %158 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:44.9980093Z         %161 = tt.reshape %157 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:44.9980399Z         %162 = ttg.convert_layout %160 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:44.9980762Z         %163 = ttg.convert_layout %161 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:44.9981062Z         %164 = ttg.convert_layout %159 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9981469Z         %165 = tt.dot %162, %163, %164, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:44.9981870Z         %166 = tt.reshape %165 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9982139Z         scf.yield %125, %141, %166 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9982358Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T13:28:44.9982580Z       %52 = ttg.convert_layout %51#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9982910Z       %53 = tt.expand_dims %52 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9983203Z       %54 = ttg.convert_layout %53 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9983441Z       %55 = tt.broadcast %54 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:44.9983683Z       %56 = ttg.convert_layout %55 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9983891Z       %57 = arith.divf %51#2, %56 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:44.9984092Z       %58 = arith.truncf %57 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1>
2026-02-21T13:28:44.9984342Z       %59 = tt.addptr %19, %41 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:44.9984571Z       tt.store %59, %58 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9984718Z       %c1_i32_9 = arith.constant 1 : i32
2026-02-21T13:28:44.9984841Z       %60 = arith.muli %c1_i32, %c1_i32_9 : i32
2026-02-21T13:28:44.9984962Z       %61 = arith.addi %arg4, %60 : i32
2026-02-21T13:28:44.9985079Z       %62 = arith.divsi %61, %c131072_i32 : i32
2026-02-21T13:28:44.9985217Z       %63 = arith.muli %62, %c16_i32 : i32
2026-02-21T13:28:44.9985334Z       %64 = arith.subi %c192_i32, %63 : i32
2026-02-21T13:28:44.9985446Z       %65 = arith.minsi %64, %c16_i32 : i32
2026-02-21T13:28:44.9985566Z       %66 = arith.remsi %61, %c131072_i32 : i32
2026-02-21T13:28:44.9985682Z       %67 = arith.remsi %66, %65 : i32
2026-02-21T13:28:44.9985793Z       %68 = arith.addi %63, %67 : i32
2026-02-21T13:28:44.9985901Z       %69 = arith.divsi %66, %65 : i32
2026-02-21T13:28:44.9986018Z       %70 = arith.muli %68, %c1048576_i32 : i32
2026-02-21T13:28:44.9986137Z       %71 = arith.muli %69, %c128_i32 : i32
2026-02-21T13:28:44.9986249Z       %72 = arith.addi %70, %71 : i32
2026-02-21T13:28:44.9986397Z       %73 = tt.splat %72 : i32 -> tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:44.9986552Z       %74 = arith.addi %73, %10 : tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:44.9986758Z       %75 = tt.addptr %11, %74 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:44.9986967Z       %76 = tt.load %75 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9987125Z       %77 = tt.splat %70 : i32 -> tensor<1x128x1xi32, #blocked>
2026-02-21T13:28:44.9987277Z       %78 = arith.addi %77, %15 : tensor<1x128x1xi32, #blocked>
2026-02-21T13:28:44.9987473Z       %79 = tt.broadcast %78 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x4096xi32, #blocked>
2026-02-21T13:28:44.9987741Z       %80 = ttg.convert_layout %79 : tensor<1x128x4096xi32, #blocked> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9987987Z       %81 = tt.reshape %76 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3>
2026-02-21T13:28:44.9988178Z       %82 = tt.splat %70 : i32 -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:44.9988315Z       %c0_i32_10 = arith.constant 0 : i32
2026-02-21T13:28:44.9988439Z       %c16384_i32_11 = arith.constant 16384 : i32
2026-02-21T13:28:44.9988781Z       %83:3 = scf.for %arg5 = %c0_i32 to %c0_i32_10 step %c16384_i32_11 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T13:28:44.9989135Z         %93 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9989293Z         %94 = arith.addi %93, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:44.9989529Z         %95 = ttg.convert_layout %94 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:44.9989861Z         %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:44.9990157Z         %97 = ttg.convert_layout %96 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:44.9990446Z         %98 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:44.9990791Z         %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:44.9991098Z         %100 = ttg.convert_layout %99 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9991316Z         %101 = arith.muli %100, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:44.9991532Z         %102 = tt.broadcast %101 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9991741Z         %103 = arith.addi %80, %102 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9991986Z         %104 = tt.addptr %16, %103 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:44.9992211Z         %105 = tt.load %104 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:44.9992425Z         %106 = tt.reshape %105 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:44.9992725Z         %107 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:44.9993097Z         %108 = ttg.convert_layout %106 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:44.9993413Z         %109 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9993823Z         %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:44.9994239Z         %111 = ttg.convert_layout %110 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:44.9994483Z         %112 = tt.reshape %111 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9994725Z         %113 = arith.truncf %112 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9994975Z         %114 = arith.extf %113 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9995169Z         %115 = "tt.reduce"(%114) <{axis = 2 : i32}> ({
2026-02-21T13:28:44.9995294Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:44.9995417Z           %395 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:44.9995555Z           tt.reduce.return %395 : f32
2026-02-21T13:28:44.9995743Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:44.9996036Z         %116 = ttg.convert_layout %115 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9996308Z         %117 = arith.truncf %116 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9996528Z         %118 = arith.extf %117 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9996719Z         %119 = arith.mulf %118, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9996909Z         %120 = arith.truncf %119 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:44.9997125Z         %121 = arith.extf %120 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9997321Z         %122 = arith.cmpf ogt, %arg6, %121 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9997494Z         %123 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9997661Z         %124 = arith.ori %122, %123 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:44.9997854Z         %125 = arith.select %124, %arg6, %121 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:44.9998057Z         %126 = arith.mulf %114, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9998267Z         %127 = arith.truncf %126 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:44.9998554Z         %128 = ttg.convert_layout %125 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:44.9998884Z         %129 = tt.expand_dims %128 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:44.9999179Z         %130 = ttg.convert_layout %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:44.9999425Z         %131 = arith.extf %127 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:44.9999690Z         %132 = tt.broadcast %130 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:44.9999951Z         %133 = ttg.convert_layout %132 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0000169Z         %134 = arith.subf %131, %133 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0000474Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0000776Z         %136 = "tt.reduce"(%135) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0000905Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0001023Z           %395 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:45.0001140Z           tt.reduce.return %395 : f32
2026-02-21T13:28:45.0001328Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0001620Z         %137 = ttg.convert_layout %136 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0001884Z         %138 = arith.subf %arg6, %125 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0002169Z         %139 = tt.extern_elementwise %138 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0002456Z         %140 = arith.mulf %arg7, %139 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0002655Z         %141 = arith.addf %140, %137 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0002891Z         %142 = ttg.convert_layout %139 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0003243Z         %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0003539Z         %144 = ttg.convert_layout %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0003784Z         %145 = tt.broadcast %144 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0004038Z         %146 = ttg.convert_layout %145 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0004251Z         %147 = arith.mulf %arg8, %146 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0004504Z         %148 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:45.0004850Z         %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:45.0005162Z         %150 = ttg.convert_layout %149 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0005377Z         %151 = arith.muli %150, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0005541Z         %152 = arith.addi %82, %151 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0005746Z         %153 = tt.broadcast %152 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:45.0006008Z         %154 = ttg.convert_layout %153 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0006228Z         %155 = arith.addi %154, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0006456Z         %156 = tt.addptr %18, %155 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0006680Z         %157 = tt.load %156 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0006896Z         %158 = arith.truncf %135 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0007144Z         %159 = tt.reshape %147 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0007399Z         %160 = tt.reshape %158 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:45.0007647Z         %161 = tt.reshape %157 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:45.0007949Z         %162 = ttg.convert_layout %160 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:45.0008310Z         %163 = ttg.convert_layout %161 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:45.0008629Z         %164 = ttg.convert_layout %159 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0009032Z         %165 = tt.dot %162, %163, %164, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0009424Z         %166 = tt.reshape %165 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0009603Z         %c1_i32_12 = arith.constant 1 : i32
2026-02-21T13:28:45.0009749Z         %167 = arith.muli %c4096_i32, %c1_i32_12 : i32
2026-02-21T13:28:45.0009877Z         %168 = arith.addi %arg5, %167 : i32
2026-02-21T13:28:45.0010011Z         %169 = tt.splat %168 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0010173Z         %170 = arith.addi %169, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0010414Z         %171 = ttg.convert_layout %170 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:45.0010752Z         %172 = tt.expand_dims %171 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:45.0011066Z         %173 = ttg.convert_layout %172 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:45.0011359Z         %174 = ttg.convert_layout %173 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:45.0011707Z         %175 = tt.expand_dims %174 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:45.0012015Z         %176 = ttg.convert_layout %175 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0012234Z         %177 = arith.muli %176, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0012444Z         %178 = tt.broadcast %177 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0012656Z         %179 = arith.addi %80, %178 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0012877Z         %180 = tt.addptr %16, %179 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0013103Z         %181 = tt.load %180 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0013321Z         %182 = tt.reshape %181 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:45.0013622Z         %183 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:45.0013977Z         %184 = ttg.convert_layout %182 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:45.0014287Z         %185 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0014693Z         %186 = tt.dot %183, %184, %185, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0015095Z         %187 = ttg.convert_layout %186 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:45.0015356Z         %188 = tt.reshape %187 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0015598Z         %189 = arith.truncf %188 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0015848Z         %190 = arith.extf %189 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0016036Z         %191 = "tt.reduce"(%190) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0016163Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0016299Z           %395 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:45.0016421Z           tt.reduce.return %395 : f32
2026-02-21T13:28:45.0016610Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0016900Z         %192 = ttg.convert_layout %191 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0017170Z         %193 = arith.truncf %192 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0017398Z         %194 = arith.extf %193 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0017603Z         %195 = arith.mulf %194, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0017795Z         %196 = arith.truncf %195 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0018013Z         %197 = arith.extf %196 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0018208Z         %198 = arith.cmpf ogt, %125, %197 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0018375Z         %199 = arith.cmpf une, %125, %125 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0018538Z         %200 = arith.ori %198, %199 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:45.0018749Z         %201 = arith.select %200, %125, %197 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0018950Z         %202 = arith.mulf %190, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0019160Z         %203 = arith.truncf %202 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0028918Z         %204 = ttg.convert_layout %201 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0029274Z         %205 = tt.expand_dims %204 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0029576Z         %206 = ttg.convert_layout %205 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0029829Z         %207 = arith.extf %203 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0030075Z         %208 = tt.broadcast %206 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:45.0030337Z         %209 = ttg.convert_layout %208 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0030561Z         %210 = arith.subf %207, %209 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0030871Z         %211 = tt.extern_elementwise %210 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0031166Z         %212 = "tt.reduce"(%211) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0031293Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0031420Z           %395 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:45.0031545Z           tt.reduce.return %395 : f32
2026-02-21T13:28:45.0031736Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0032030Z         %213 = ttg.convert_layout %212 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0032274Z         %214 = arith.subf %125, %201 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0032559Z         %215 = tt.extern_elementwise %214 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0032896Z         %216 = arith.mulf %141, %215 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0033050Z         %217 = arith.addf %216, %213 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0033287Z         %218 = ttg.convert_layout %215 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0033636Z         %219 = tt.expand_dims %218 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0033933Z         %220 = ttg.convert_layout %219 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0034183Z         %221 = tt.broadcast %220 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0034433Z         %222 = ttg.convert_layout %221 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0034651Z         %223 = arith.mulf %166, %222 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0034918Z         %224 = ttg.convert_layout %173 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:45.0035268Z         %225 = tt.expand_dims %224 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:45.0035580Z         %226 = ttg.convert_layout %225 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0035792Z         %227 = arith.muli %226, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0035958Z         %228 = arith.addi %82, %227 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0036173Z         %229 = tt.broadcast %228 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:45.0036436Z         %230 = ttg.convert_layout %229 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0036664Z         %231 = arith.addi %230, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0036886Z         %232 = tt.addptr %18, %231 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0037114Z         %233 = tt.load %232 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0037325Z         %234 = arith.truncf %211 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0037571Z         %235 = tt.reshape %223 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0037809Z         %236 = tt.reshape %234 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:45.0038052Z         %237 = tt.reshape %233 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:45.0038357Z         %238 = ttg.convert_layout %236 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:45.0038721Z         %239 = ttg.convert_layout %237 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:45.0039025Z         %240 = ttg.convert_layout %235 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0039436Z         %241 = tt.dot %238, %239, %240, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0039831Z         %242 = tt.reshape %241 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0040013Z         %c2_i32_13 = arith.constant 2 : i32
2026-02-21T13:28:45.0040141Z         %243 = arith.muli %c4096_i32, %c2_i32_13 : i32
2026-02-21T13:28:45.0040269Z         %244 = arith.addi %arg5, %243 : i32
2026-02-21T13:28:45.0040425Z         %245 = tt.splat %244 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0040582Z         %246 = arith.addi %245, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0040828Z         %247 = ttg.convert_layout %246 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:45.0041163Z         %248 = tt.expand_dims %247 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:45.0041478Z         %249 = ttg.convert_layout %248 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:45.0041774Z         %250 = ttg.convert_layout %249 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:45.0042124Z         %251 = tt.expand_dims %250 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:45.0042440Z         %252 = ttg.convert_layout %251 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0042707Z         %253 = arith.muli %252, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0042938Z         %254 = tt.broadcast %253 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0043150Z         %255 = arith.addi %80, %254 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0043371Z         %256 = tt.addptr %16, %255 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0043602Z         %257 = tt.load %256 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0043812Z         %258 = tt.reshape %257 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:45.0044127Z         %259 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:45.0044486Z         %260 = ttg.convert_layout %258 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:45.0044797Z         %261 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0045208Z         %262 = tt.dot %259, %260, %261, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0045613Z         %263 = ttg.convert_layout %262 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:45.0045852Z         %264 = tt.reshape %263 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0046097Z         %265 = arith.truncf %264 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0046341Z         %266 = arith.extf %265 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0046538Z         %267 = "tt.reduce"(%266) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0046664Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0046783Z           %395 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:45.0046911Z           tt.reduce.return %395 : f32
2026-02-21T13:28:45.0047100Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0047396Z         %268 = ttg.convert_layout %267 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0047671Z         %269 = arith.truncf %268 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0047895Z         %270 = arith.extf %269 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0048089Z         %271 = arith.mulf %270, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0048276Z         %272 = arith.truncf %271 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0048513Z         %273 = arith.extf %272 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0048705Z         %274 = arith.cmpf ogt, %201, %273 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0048876Z         %275 = arith.cmpf une, %201, %201 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0049040Z         %276 = arith.ori %274, %275 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:45.0049245Z         %277 = arith.select %276, %201, %273 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0049449Z         %278 = arith.mulf %266, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0049659Z         %279 = arith.truncf %278 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0049948Z         %280 = ttg.convert_layout %277 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0050281Z         %281 = tt.expand_dims %280 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0050596Z         %282 = ttg.convert_layout %281 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0050842Z         %283 = arith.extf %279 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0051087Z         %284 = tt.broadcast %282 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:45.0051341Z         %285 = ttg.convert_layout %284 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0051560Z         %286 = arith.subf %283, %285 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0051881Z         %287 = tt.extern_elementwise %286 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0052173Z         %288 = "tt.reduce"(%287) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0052296Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0052416Z           %395 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:45.0052536Z           tt.reduce.return %395 : f32
2026-02-21T13:28:45.0052724Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0053018Z         %289 = ttg.convert_layout %288 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0053257Z         %290 = arith.subf %201, %277 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0053546Z         %291 = tt.extern_elementwise %290 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0053835Z         %292 = arith.mulf %217, %291 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0053989Z         %293 = arith.addf %292, %289 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0054228Z         %294 = ttg.convert_layout %291 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0054565Z         %295 = tt.expand_dims %294 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0054861Z         %296 = ttg.convert_layout %295 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0055108Z         %297 = tt.broadcast %296 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0055358Z         %298 = ttg.convert_layout %297 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0055570Z         %299 = arith.mulf %242, %298 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0055820Z         %300 = ttg.convert_layout %249 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:45.0056421Z         %301 = tt.expand_dims %300 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:45.0056730Z         %302 = ttg.convert_layout %301 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0056941Z         %303 = arith.muli %302, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0057105Z         %304 = arith.addi %82, %303 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0057325Z         %305 = tt.broadcast %304 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:45.0057584Z         %306 = ttg.convert_layout %305 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0057806Z         %307 = arith.addi %306, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0058027Z         %308 = tt.addptr %18, %307 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0058258Z         %309 = tt.load %308 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0058489Z         %310 = arith.truncf %287 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0058736Z         %311 = tt.reshape %299 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0058969Z         %312 = tt.reshape %310 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:45.0059215Z         %313 = tt.reshape %309 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:45.0059524Z         %314 = ttg.convert_layout %312 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:45.0059911Z         %315 = ttg.convert_layout %313 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:45.0060218Z         %316 = ttg.convert_layout %311 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0060647Z         %317 = tt.dot %314, %315, %316, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0061056Z         %318 = tt.reshape %317 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0061250Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T13:28:45.0061375Z         %319 = arith.muli %c4096_i32, %c3_i32 : i32
2026-02-21T13:28:45.0061509Z         %320 = arith.addi %arg5, %319 : i32
2026-02-21T13:28:45.0061648Z         %321 = tt.splat %320 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0061804Z         %322 = arith.addi %321, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0062048Z         %323 = ttg.convert_layout %322 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:45.0062383Z         %324 = tt.expand_dims %323 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:45.0062684Z         %325 = ttg.convert_layout %324 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:45.0062987Z         %326 = ttg.convert_layout %325 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:45.0063339Z         %327 = tt.expand_dims %326 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:45.0063652Z         %328 = ttg.convert_layout %327 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0063870Z         %329 = arith.muli %328, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0064083Z         %330 = tt.broadcast %329 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0064319Z         %331 = arith.addi %80, %330 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0064539Z         %332 = tt.addptr %16, %331 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0064766Z         %333 = tt.load %332 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0064976Z         %334 = tt.reshape %333 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:45.0065293Z         %335 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:45.0065649Z         %336 = ttg.convert_layout %334 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:45.0065963Z         %337 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0066379Z         %338 = tt.dot %335, %336, %337, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0066797Z         %339 = ttg.convert_layout %338 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:45.0067035Z         %340 = tt.reshape %339 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0067283Z         %341 = arith.truncf %340 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0067529Z         %342 = arith.extf %341 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0067721Z         %343 = "tt.reduce"(%342) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0067844Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0067980Z           %395 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:45.0068106Z           tt.reduce.return %395 : f32
2026-02-21T13:28:45.0068291Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0068585Z         %344 = ttg.convert_layout %343 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0068854Z         %345 = arith.truncf %344 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0069076Z         %346 = arith.extf %345 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0069270Z         %347 = arith.mulf %346, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0069456Z         %348 = arith.truncf %347 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0069672Z         %349 = arith.extf %348 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0069864Z         %350 = arith.cmpf ogt, %277, %349 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0070033Z         %351 = arith.cmpf une, %277, %277 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0070193Z         %352 = arith.ori %350, %351 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:45.0070388Z         %353 = arith.select %352, %277, %349 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0070588Z         %354 = arith.mulf %342, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0070795Z         %355 = arith.truncf %354 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0071080Z         %356 = ttg.convert_layout %353 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0071410Z         %357 = tt.expand_dims %356 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0071703Z         %358 = ttg.convert_layout %357 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0071944Z         %359 = arith.extf %355 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0072204Z         %360 = tt.broadcast %358 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:45.0072456Z         %361 = ttg.convert_layout %360 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0072670Z         %362 = arith.subf %359, %361 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0072988Z         %363 = tt.extern_elementwise %362 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0073276Z         %364 = "tt.reduce"(%363) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0073398Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0073514Z           %395 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:45.0073633Z           tt.reduce.return %395 : f32
2026-02-21T13:28:45.0073816Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0074118Z         %365 = ttg.convert_layout %364 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0074354Z         %366 = arith.subf %277, %353 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0074632Z         %367 = tt.extern_elementwise %366 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0074915Z         %368 = arith.mulf %293, %367 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0075067Z         %369 = arith.addf %368, %365 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0075299Z         %370 = ttg.convert_layout %367 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0075642Z         %371 = tt.expand_dims %370 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0075933Z         %372 = ttg.convert_layout %371 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0076178Z         %373 = tt.broadcast %372 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0076426Z         %374 = ttg.convert_layout %373 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0076636Z         %375 = arith.mulf %318, %374 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0076885Z         %376 = ttg.convert_layout %325 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:45.0077228Z         %377 = tt.expand_dims %376 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:45.0077534Z         %378 = ttg.convert_layout %377 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0077745Z         %379 = arith.muli %378, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0077905Z         %380 = arith.addi %82, %379 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0078104Z         %381 = tt.broadcast %380 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:45.0078360Z         %382 = ttg.convert_layout %381 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0078580Z         %383 = arith.addi %382, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0078799Z         %384 = tt.addptr %18, %383 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0079021Z         %385 = tt.load %384 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0079230Z         %386 = arith.truncf %363 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0079470Z         %387 = tt.reshape %375 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0079721Z         %388 = tt.reshape %386 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:45.0079964Z         %389 = tt.reshape %385 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:45.0080263Z         %390 = ttg.convert_layout %388 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:45.0080637Z         %391 = ttg.convert_layout %389 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:45.0080936Z         %392 = ttg.convert_layout %387 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0081342Z         %393 = tt.dot %390, %391, %392, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0081738Z         %394 = tt.reshape %393 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0082019Z         scf.yield %353, %369, %394 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0082235Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T13:28:45.0082607Z       %84:3 = scf.for %arg5 = %c0_i32_10 to %c8192_i32 step %c4096_i32 iter_args(%arg6 = %83#0, %arg7 = %83#1, %arg8 = %83#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T13:28:45.0082953Z         %93 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0083110Z         %94 = arith.addi %93, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0083363Z         %95 = ttg.convert_layout %94 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:45.0083692Z         %96 = tt.expand_dims %95 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:45.0083987Z         %97 = ttg.convert_layout %96 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:45.0084271Z         %98 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:45.0084607Z         %99 = tt.expand_dims %98 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:45.0084912Z         %100 = ttg.convert_layout %99 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0085130Z         %101 = arith.muli %100, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0085339Z         %102 = tt.broadcast %101 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0085547Z         %103 = arith.addi %80, %102 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0085766Z         %104 = tt.addptr %16, %103 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0085991Z         %105 = tt.load %104 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0086201Z         %106 = tt.reshape %105 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:45.0086500Z         %107 = ttg.convert_layout %81 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:45.0086852Z         %108 = ttg.convert_layout %106 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:45.0087159Z         %109 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0087566Z         %110 = tt.dot %107, %108, %109, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0087985Z         %111 = ttg.convert_layout %110 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:45.0088225Z         %112 = tt.reshape %111 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0088465Z         %113 = arith.truncf %112 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0088733Z         %114 = arith.extf %113 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0088922Z         %115 = "tt.reduce"(%114) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0089042Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0089161Z           %167 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:45.0089282Z           tt.reduce.return %167 : f32
2026-02-21T13:28:45.0089467Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0089756Z         %116 = ttg.convert_layout %115 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0090039Z         %117 = arith.truncf %116 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0090255Z         %118 = arith.extf %117 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0090445Z         %119 = arith.mulf %118, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0090637Z         %120 = arith.truncf %119 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0090856Z         %121 = arith.extf %120 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0091067Z         %122 = arith.cmpf ogt, %arg6, %121 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0091235Z         %123 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0091395Z         %124 = arith.ori %122, %123 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:45.0091586Z         %125 = arith.select %124, %arg6, %121 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0091787Z         %126 = arith.mulf %114, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0091993Z         %127 = arith.truncf %126 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0092277Z         %128 = ttg.convert_layout %125 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0092611Z         %129 = tt.expand_dims %128 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0092908Z         %130 = ttg.convert_layout %129 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0093149Z         %131 = arith.extf %127 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0093392Z         %132 = tt.broadcast %130 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:45.0093645Z         %133 = ttg.convert_layout %132 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0093862Z         %134 = arith.subf %131, %133 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0094165Z         %135 = tt.extern_elementwise %134 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0094451Z         %136 = "tt.reduce"(%135) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0094576Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0094690Z           %167 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:45.0094808Z           tt.reduce.return %167 : f32
2026-02-21T13:28:45.0094995Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0095297Z         %137 = ttg.convert_layout %136 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0095534Z         %138 = arith.subf %arg6, %125 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0095814Z         %139 = tt.extern_elementwise %138 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0096113Z         %140 = arith.mulf %arg7, %139 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0096270Z         %141 = arith.addf %140, %137 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0096505Z         %142 = ttg.convert_layout %139 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0096836Z         %143 = tt.expand_dims %142 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0097130Z         %144 = ttg.convert_layout %143 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0097377Z         %145 = tt.broadcast %144 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0097643Z         %146 = ttg.convert_layout %145 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0097852Z         %147 = arith.mulf %arg8, %146 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0098102Z         %148 = ttg.convert_layout %97 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:45.0098447Z         %149 = tt.expand_dims %148 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:45.0098770Z         %150 = ttg.convert_layout %149 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0098981Z         %151 = arith.muli %150, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0099141Z         %152 = arith.addi %82, %151 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0099340Z         %153 = tt.broadcast %152 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:45.0099597Z         %154 = ttg.convert_layout %153 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0099816Z         %155 = arith.addi %154, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0100036Z         %156 = tt.addptr %18, %155 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0100257Z         %157 = tt.load %156 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0100465Z         %158 = arith.truncf %135 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0100704Z         %159 = tt.reshape %147 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0100935Z         %160 = tt.reshape %158 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:45.0101179Z         %161 = tt.reshape %157 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:45.0101480Z         %162 = ttg.convert_layout %160 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:45.0101835Z         %163 = ttg.convert_layout %161 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:45.0102134Z         %164 = ttg.convert_layout %159 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0102539Z         %165 = tt.dot %162, %163, %164, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0102933Z         %166 = tt.reshape %165 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0103211Z         scf.yield %125, %141, %166 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0103426Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T13:28:45.0103645Z       %85 = ttg.convert_layout %84#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0103967Z       %86 = tt.expand_dims %85 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0104266Z       %87 = ttg.convert_layout %86 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0104501Z       %88 = tt.broadcast %87 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0104741Z       %89 = ttg.convert_layout %88 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0104951Z       %90 = arith.divf %84#2, %89 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0105144Z       %91 = arith.truncf %90 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1>
2026-02-21T13:28:45.0105401Z       %92 = tt.addptr %19, %74 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:45.0105607Z       tt.store %92, %91 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0105782Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T13:28:45.0105948Z     scf.for %arg4 = %27 to %3 step %c1_i32  : i32 {
2026-02-21T13:28:45.0106078Z       %29 = arith.divsi %arg4, %c131072_i32 : i32
2026-02-21T13:28:45.0106201Z       %30 = arith.muli %29, %c16_i32 : i32
2026-02-21T13:28:45.0106314Z       %31 = arith.subi %c192_i32, %30 : i32
2026-02-21T13:28:45.0106445Z       %32 = arith.minsi %31, %c16_i32 : i32
2026-02-21T13:28:45.0106561Z       %33 = arith.remsi %arg4, %c131072_i32 : i32
2026-02-21T13:28:45.0106678Z       %34 = arith.remsi %33, %32 : i32
2026-02-21T13:28:45.0106784Z       %35 = arith.addi %30, %34 : i32
2026-02-21T13:28:45.0106890Z       %36 = arith.divsi %33, %32 : i32
2026-02-21T13:28:45.0107003Z       %37 = arith.muli %35, %c1048576_i32 : i32
2026-02-21T13:28:45.0107118Z       %38 = arith.muli %36, %c128_i32 : i32
2026-02-21T13:28:45.0107227Z       %39 = arith.addi %37, %38 : i32
2026-02-21T13:28:45.0107353Z       %40 = tt.splat %39 : i32 -> tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:45.0107509Z       %41 = arith.addi %40, %10 : tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:45.0107707Z       %42 = tt.addptr %11, %41 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:45.0107911Z       %43 = tt.load %42 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0108065Z       %44 = tt.splat %37 : i32 -> tensor<1x128x1xi32, #blocked>
2026-02-21T13:28:45.0108215Z       %45 = arith.addi %44, %15 : tensor<1x128x1xi32, #blocked>
2026-02-21T13:28:45.0108409Z       %46 = tt.broadcast %45 : tensor<1x128x1xi32, #blocked> -> tensor<1x128x4096xi32, #blocked>
2026-02-21T13:28:45.0108660Z       %47 = ttg.convert_layout %46 : tensor<1x128x4096xi32, #blocked> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0108905Z       %48 = tt.reshape %43 : tensor<1x1x128xbf16, #blocked1> -> tensor<1x128xbf16, #blocked3>
2026-02-21T13:28:45.0109093Z       %49 = tt.splat %37 : i32 -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0109228Z       %c0_i32_8 = arith.constant 0 : i32
2026-02-21T13:28:45.0109348Z       %c16384_i32 = arith.constant 16384 : i32
2026-02-21T13:28:45.0109684Z       %50:3 = scf.for %arg5 = %c0_i32 to %c0_i32_8 step %c16384_i32 iter_args(%arg6 = %cst_6, %arg7 = %cst_5, %arg8 = %cst_4) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T13:28:45.0110038Z         %60 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0110191Z         %61 = arith.addi %60, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0110442Z         %62 = ttg.convert_layout %61 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:45.0110770Z         %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:45.0111058Z         %64 = ttg.convert_layout %63 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:45.0111368Z         %65 = ttg.convert_layout %64 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:45.0111706Z         %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:45.0112009Z         %67 = ttg.convert_layout %66 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0112221Z         %68 = arith.muli %67, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0112425Z         %69 = tt.broadcast %68 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0112640Z         %70 = arith.addi %47, %69 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0112755Z         %71 = tt.addptr %16, %70 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0112821Z         %72 = tt.load %71 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0112923Z         %73 = tt.reshape %72 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:45.0113074Z         %74 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:45.0113251Z         %75 = ttg.convert_layout %73 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:45.0113357Z         %76 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0113615Z         %77 = tt.dot %74, %75, %76, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0113716Z         %78 = ttg.convert_layout %77 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:45.0113809Z         %79 = tt.reshape %78 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0113908Z         %80 = arith.truncf %79 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0114005Z         %81 = arith.extf %80 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0114054Z         %82 = "tt.reduce"(%81) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0114095Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0114144Z           %362 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:45.0114184Z           tt.reduce.return %362 : f32
2026-02-21T13:28:45.0114298Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0114431Z         %83 = ttg.convert_layout %82 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0114519Z         %84 = arith.truncf %83 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0114603Z         %85 = arith.extf %84 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0114661Z         %86 = arith.mulf %85, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0114747Z         %87 = arith.truncf %86 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0114829Z         %88 = arith.extf %87 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0114916Z         %89 = arith.cmpf ogt, %arg6, %88 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0114983Z         %90 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0115039Z         %91 = arith.ori %89, %90 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:45.0115132Z         %92 = arith.select %91, %arg6, %88 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0115195Z         %93 = arith.mulf %81, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0115310Z         %94 = arith.truncf %93 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0115447Z         %95 = ttg.convert_layout %92 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0115595Z         %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0115696Z         %97 = ttg.convert_layout %96 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0115793Z         %98 = arith.extf %94 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0115904Z         %99 = tt.broadcast %97 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:45.0116014Z         %100 = ttg.convert_layout %99 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0116077Z         %101 = arith.subf %98, %100 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0116284Z         %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0116333Z         %103 = "tt.reduce"(%102) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0116387Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0116432Z           %362 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:45.0116473Z           tt.reduce.return %362 : f32
2026-02-21T13:28:45.0116586Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0116726Z         %104 = ttg.convert_layout %103 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0116788Z         %105 = arith.subf %arg6, %92 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0116974Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0117035Z         %107 = arith.mulf %arg7, %106 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0117096Z         %108 = arith.addf %107, %104 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0117235Z         %109 = ttg.convert_layout %106 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0117384Z         %110 = tt.expand_dims %109 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0117494Z         %111 = ttg.convert_layout %110 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0117593Z         %112 = tt.broadcast %111 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0117702Z         %113 = ttg.convert_layout %112 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0117769Z         %114 = arith.mulf %arg8, %113 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0117918Z         %115 = ttg.convert_layout %64 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:45.0118078Z         %116 = tt.expand_dims %115 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:45.0118192Z         %117 = ttg.convert_layout %116 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0118272Z         %118 = arith.muli %117, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0118335Z         %119 = arith.addi %49, %118 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0118439Z         %120 = tt.broadcast %119 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:45.0118553Z         %121 = ttg.convert_layout %120 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0118632Z         %122 = arith.addi %121, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0118751Z         %123 = tt.addptr %18, %122 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0118817Z         %124 = tt.load %123 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0118921Z         %125 = arith.truncf %102 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0119018Z         %126 = tt.reshape %114 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0119129Z         %127 = tt.reshape %125 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:45.0119232Z         %128 = tt.reshape %124 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:45.0119389Z         %129 = ttg.convert_layout %127 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:45.0119551Z         %130 = ttg.convert_layout %128 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:45.0119652Z         %131 = ttg.convert_layout %126 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0119930Z         %132 = tt.dot %129, %130, %131, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0120024Z         %133 = tt.reshape %132 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0120068Z         %c1_i32_9 = arith.constant 1 : i32
2026-02-21T13:28:45.0120120Z         %134 = arith.muli %c4096_i32, %c1_i32_9 : i32
2026-02-21T13:28:45.0120161Z         %135 = arith.addi %arg5, %134 : i32
2026-02-21T13:28:45.0120221Z         %136 = tt.splat %135 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0120281Z         %137 = arith.addi %136, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0120425Z         %138 = ttg.convert_layout %137 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:45.0120575Z         %139 = tt.expand_dims %138 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:45.0120680Z         %140 = ttg.convert_layout %139 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:45.0120829Z         %141 = ttg.convert_layout %140 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:45.0120985Z         %142 = tt.expand_dims %141 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:45.0121095Z         %143 = ttg.convert_layout %142 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0121163Z         %144 = arith.muli %143, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0121267Z         %145 = tt.broadcast %144 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0121331Z         %146 = arith.addi %47, %145 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0121449Z         %147 = tt.addptr %16, %146 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0121533Z         %148 = tt.load %147 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0121638Z         %149 = tt.reshape %148 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:45.0121792Z         %150 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:45.0121953Z         %151 = ttg.convert_layout %149 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:45.0122073Z         %152 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0122334Z         %153 = tt.dot %150, %151, %152, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0122437Z         %154 = ttg.convert_layout %153 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:45.0122545Z         %155 = tt.reshape %154 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0122692Z         %156 = arith.truncf %155 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0122792Z         %157 = arith.extf %156 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0122841Z         %158 = "tt.reduce"(%157) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0122882Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0122930Z           %362 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:45.0122970Z           tt.reduce.return %362 : f32
2026-02-21T13:28:45.0123100Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0123236Z         %159 = ttg.convert_layout %158 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0123326Z         %160 = arith.truncf %159 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0123416Z         %161 = arith.extf %160 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0123478Z         %162 = arith.mulf %161, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0123567Z         %163 = arith.truncf %162 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0123654Z         %164 = arith.extf %163 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0123717Z         %165 = arith.cmpf ogt, %92, %164 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0123779Z         %166 = arith.cmpf une, %92, %92 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0123838Z         %167 = arith.ori %165, %166 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:45.0123933Z         %168 = arith.select %167, %92, %164 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0123999Z         %169 = arith.mulf %157, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0124103Z         %170 = arith.truncf %169 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0124244Z         %171 = ttg.convert_layout %168 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0124392Z         %172 = tt.expand_dims %171 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0124497Z         %173 = ttg.convert_layout %172 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0124598Z         %174 = arith.extf %170 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0124700Z         %175 = tt.broadcast %173 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:45.0124830Z         %176 = ttg.convert_layout %175 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0124896Z         %177 = arith.subf %174, %176 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0125101Z         %178 = tt.extern_elementwise %177 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0125150Z         %179 = "tt.reduce"(%178) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0125211Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0125257Z           %362 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:45.0125297Z           tt.reduce.return %362 : f32
2026-02-21T13:28:45.0125409Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0125546Z         %180 = ttg.convert_layout %179 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0125607Z         %181 = arith.subf %92, %168 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0125814Z         %182 = tt.extern_elementwise %181 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0125874Z         %183 = arith.mulf %108, %182 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0125932Z         %184 = arith.addf %183, %180 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0126078Z         %185 = ttg.convert_layout %182 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0126227Z         %186 = tt.expand_dims %185 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0126351Z         %187 = ttg.convert_layout %186 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0126454Z         %188 = tt.broadcast %187 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0126564Z         %189 = ttg.convert_layout %188 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0126628Z         %190 = arith.mulf %133, %189 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0126781Z         %191 = ttg.convert_layout %140 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:45.0126939Z         %192 = tt.expand_dims %191 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:45.0127049Z         %193 = ttg.convert_layout %192 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0127116Z         %194 = arith.muli %193, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0127179Z         %195 = arith.addi %49, %194 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0127282Z         %196 = tt.broadcast %195 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:45.0127404Z         %197 = ttg.convert_layout %196 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0127470Z         %198 = arith.addi %197, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0127589Z         %199 = tt.addptr %18, %198 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0127658Z         %200 = tt.load %199 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0127765Z         %201 = arith.truncf %178 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0127859Z         %202 = tt.reshape %190 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0127958Z         %203 = tt.reshape %201 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:45.0128065Z         %204 = tt.reshape %200 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:45.0128239Z         %205 = ttg.convert_layout %203 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:45.0128403Z         %206 = ttg.convert_layout %204 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:45.0128507Z         %207 = ttg.convert_layout %202 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0128781Z         %208 = tt.dot %205, %206, %207, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0128876Z         %209 = tt.reshape %208 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0128923Z         %c2_i32_10 = arith.constant 2 : i32
2026-02-21T13:28:45.0128973Z         %210 = arith.muli %c4096_i32, %c2_i32_10 : i32
2026-02-21T13:28:45.0129016Z         %211 = arith.addi %arg5, %210 : i32
2026-02-21T13:28:45.0129081Z         %212 = tt.splat %211 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0129153Z         %213 = arith.addi %212, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0129297Z         %214 = ttg.convert_layout %213 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:45.0129452Z         %215 = tt.expand_dims %214 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:45.0129557Z         %216 = ttg.convert_layout %215 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:45.0129718Z         %217 = ttg.convert_layout %216 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:45.0129879Z         %218 = tt.expand_dims %217 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:45.0129994Z         %219 = ttg.convert_layout %218 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0130060Z         %220 = arith.muli %219, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0130168Z         %221 = tt.broadcast %220 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0130234Z         %222 = arith.addi %47, %221 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0130351Z         %223 = tt.addptr %16, %222 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0130419Z         %224 = tt.load %223 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0130525Z         %225 = tt.reshape %224 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:45.0130678Z         %226 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:45.0130846Z         %227 = ttg.convert_layout %225 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:45.0130953Z         %228 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0131212Z         %229 = tt.dot %226, %227, %228, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0131319Z         %230 = ttg.convert_layout %229 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:45.0131416Z         %231 = tt.reshape %230 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0131521Z         %232 = arith.truncf %231 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0131638Z         %233 = arith.extf %232 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0131687Z         %234 = "tt.reduce"(%233) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0131727Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0131779Z           %362 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:45.0131820Z           tt.reduce.return %362 : f32
2026-02-21T13:28:45.0131946Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0132086Z         %235 = ttg.convert_layout %234 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0132176Z         %236 = arith.truncf %235 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0132263Z         %237 = arith.extf %236 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0132328Z         %238 = arith.mulf %237, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0132418Z         %239 = arith.truncf %238 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0132518Z         %240 = arith.extf %239 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0132584Z         %241 = arith.cmpf ogt, %168, %240 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0132650Z         %242 = arith.cmpf une, %168, %168 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0132710Z         %243 = arith.ori %241, %242 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:45.0132805Z         %244 = arith.select %243, %168, %240 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0132874Z         %245 = arith.mulf %233, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0132991Z         %246 = arith.truncf %245 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0133132Z         %247 = ttg.convert_layout %244 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0133285Z         %248 = tt.expand_dims %247 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0133389Z         %249 = ttg.convert_layout %248 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0133488Z         %250 = arith.extf %246 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0133592Z         %251 = tt.broadcast %249 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:45.0133705Z         %252 = ttg.convert_layout %251 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0133770Z         %253 = arith.subf %250, %252 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0133975Z         %254 = tt.extern_elementwise %253 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0134025Z         %255 = "tt.reduce"(%254) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0134065Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0134113Z           %362 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:45.0134154Z           tt.reduce.return %362 : f32
2026-02-21T13:28:45.0134268Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0134406Z         %256 = ttg.convert_layout %255 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0134469Z         %257 = arith.subf %168, %244 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0134655Z         %258 = tt.extern_elementwise %257 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0134733Z         %259 = arith.mulf %184, %258 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0134793Z         %260 = arith.addf %259, %256 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0134932Z         %261 = ttg.convert_layout %258 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0135082Z         %262 = tt.expand_dims %261 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0135202Z         %263 = ttg.convert_layout %262 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0135301Z         %264 = tt.broadcast %263 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0135410Z         %265 = ttg.convert_layout %264 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0135477Z         %266 = arith.mulf %209, %265 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0135628Z         %267 = ttg.convert_layout %216 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:45.0135800Z         %268 = tt.expand_dims %267 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:45.0135913Z         %269 = ttg.convert_layout %268 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0135977Z         %270 = arith.muli %269, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0136040Z         %271 = arith.addi %49, %270 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0136145Z         %272 = tt.broadcast %271 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:45.0136272Z         %273 = ttg.convert_layout %272 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0136336Z         %274 = arith.addi %273, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0136455Z         %275 = tt.addptr %18, %274 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0136524Z         %276 = tt.load %275 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0136629Z         %277 = arith.truncf %254 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0136728Z         %278 = tt.reshape %266 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0136827Z         %279 = tt.reshape %277 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:45.0136932Z         %280 = tt.reshape %276 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:45.0137093Z         %281 = ttg.convert_layout %279 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:45.0137253Z         %282 = ttg.convert_layout %280 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:45.0137355Z         %283 = ttg.convert_layout %278 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0137621Z         %284 = tt.dot %281, %282, %283, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0137718Z         %285 = tt.reshape %284 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0137761Z         %c3_i32 = arith.constant 3 : i32
2026-02-21T13:28:45.0137813Z         %286 = arith.muli %c4096_i32, %c3_i32 : i32
2026-02-21T13:28:45.0137856Z         %287 = arith.addi %arg5, %286 : i32
2026-02-21T13:28:45.0137917Z         %288 = tt.splat %287 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0137980Z         %289 = arith.addi %288, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0138143Z         %290 = ttg.convert_layout %289 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:45.0138295Z         %291 = tt.expand_dims %290 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:45.0138402Z         %292 = ttg.convert_layout %291 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:45.0138563Z         %293 = ttg.convert_layout %292 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:45.0138719Z         %294 = tt.expand_dims %293 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:45.0138834Z         %295 = ttg.convert_layout %294 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0138900Z         %296 = arith.muli %295, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0139006Z         %297 = tt.broadcast %296 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0139091Z         %298 = arith.addi %47, %297 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0139210Z         %299 = tt.addptr %16, %298 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0139276Z         %300 = tt.load %299 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0139382Z         %301 = tt.reshape %300 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:45.0139537Z         %302 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:45.0139712Z         %303 = ttg.convert_layout %301 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:45.0139822Z         %304 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0140084Z         %305 = tt.dot %302, %303, %304, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0140187Z         %306 = ttg.convert_layout %305 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:45.0140286Z         %307 = tt.reshape %306 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0140391Z         %308 = arith.truncf %307 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0140490Z         %309 = arith.extf %308 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0140541Z         %310 = "tt.reduce"(%309) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0140582Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0140630Z           %362 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:45.0140670Z           tt.reduce.return %362 : f32
2026-02-21T13:28:45.0140785Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0140922Z         %311 = ttg.convert_layout %310 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0141014Z         %312 = arith.truncf %311 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0141105Z         %313 = arith.extf %312 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0141167Z         %314 = arith.mulf %313, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0141260Z         %315 = arith.truncf %314 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0141348Z         %316 = arith.extf %315 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0141426Z         %317 = arith.cmpf ogt, %244, %316 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0141490Z         %318 = arith.cmpf une, %244, %244 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0141550Z         %319 = arith.ori %317, %318 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:45.0141646Z         %320 = arith.select %319, %244, %316 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0141711Z         %321 = arith.mulf %309, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0141832Z         %322 = arith.truncf %321 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0141972Z         %323 = ttg.convert_layout %320 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0142122Z         %324 = tt.expand_dims %323 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0142230Z         %325 = ttg.convert_layout %324 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0142343Z         %326 = arith.extf %322 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0142445Z         %327 = tt.broadcast %325 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:45.0142561Z         %328 = ttg.convert_layout %327 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0142625Z         %329 = arith.subf %326, %328 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0142827Z         %330 = tt.extern_elementwise %329 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0142878Z         %331 = "tt.reduce"(%330) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0142930Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0142976Z           %362 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:45.0143017Z           tt.reduce.return %362 : f32
2026-02-21T13:28:45.0143131Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0143268Z         %332 = ttg.convert_layout %331 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0143328Z         %333 = arith.subf %244, %320 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0143521Z         %334 = tt.extern_elementwise %333 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0143580Z         %335 = arith.mulf %260, %334 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0143638Z         %336 = arith.addf %335, %332 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0143779Z         %337 = ttg.convert_layout %334 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0143929Z         %338 = tt.expand_dims %337 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0144034Z         %339 = ttg.convert_layout %338 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0144136Z         %340 = tt.broadcast %339 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0144244Z         %341 = ttg.convert_layout %340 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0144309Z         %342 = arith.mulf %285, %341 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0144460Z         %343 = ttg.convert_layout %292 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:45.0144617Z         %344 = tt.expand_dims %343 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:45.0144743Z         %345 = ttg.convert_layout %344 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0144810Z         %346 = arith.muli %345, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0144871Z         %347 = arith.addi %49, %346 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0144973Z         %348 = tt.broadcast %347 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:45.0145091Z         %349 = ttg.convert_layout %348 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0145171Z         %350 = arith.addi %349, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0145289Z         %351 = tt.addptr %18, %350 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0145358Z         %352 = tt.load %351 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0145463Z         %353 = arith.truncf %330 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0145560Z         %354 = tt.reshape %342 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0145672Z         %355 = tt.reshape %353 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:45.0145776Z         %356 = tt.reshape %352 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:45.0145932Z         %357 = ttg.convert_layout %355 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:45.0146097Z         %358 = ttg.convert_layout %356 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:45.0146214Z         %359 = ttg.convert_layout %354 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0146475Z         %360 = tt.dot %357, %358, %359, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0146575Z         %361 = tt.reshape %360 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0146708Z         scf.yield %320, %336, %361 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0146754Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T13:28:45.0147006Z       %51:3 = scf.for %arg5 = %c0_i32_8 to %c8192_i32 step %c4096_i32 iter_args(%arg6 = %50#0, %arg7 = %50#1, %arg8 = %50#2) -> (tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>)  : i32 {
2026-02-21T13:28:45.0147067Z         %60 = tt.splat %arg5 : i32 -> tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0147128Z         %61 = arith.addi %60, %12 : tensor<4096xi32, #blocked4>
2026-02-21T13:28:45.0147273Z         %62 = ttg.convert_layout %61 : tensor<4096xi32, #blocked4> -> tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>>
2026-02-21T13:28:45.0147425Z         %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<4096xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4096xi32, #blocked5>
2026-02-21T13:28:45.0147527Z         %64 = ttg.convert_layout %63 : tensor<1x4096xi32, #blocked5> -> tensor<1x4096xi32, #blocked3>
2026-02-21T13:28:45.0147673Z         %65 = ttg.convert_layout %64 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>>
2026-02-21T13:28:45.0147828Z         %66 = tt.expand_dims %65 {axis = 1 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 1, parent = #blocked6}>> -> tensor<1x1x4096xi32, #blocked6>
2026-02-21T13:28:45.0147937Z         %67 = ttg.convert_layout %66 : tensor<1x1x4096xi32, #blocked6> -> tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0148006Z         %68 = arith.muli %67, %cst_3 : tensor<1x1x4096xi32, #blocked1>
2026-02-21T13:28:45.0148107Z         %69 = tt.broadcast %68 : tensor<1x1x4096xi32, #blocked1> -> tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0148184Z         %70 = arith.addi %47, %69 : tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0148303Z         %71 = tt.addptr %16, %70 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>, tensor<1x128x4096xi32, #blocked1>
2026-02-21T13:28:45.0148368Z         %72 = tt.load %71 : tensor<1x128x4096x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0148469Z         %73 = tt.reshape %72 : tensor<1x128x4096xbf16, #blocked1> -> tensor<128x4096xbf16, #blocked3>
2026-02-21T13:28:45.0148634Z         %74 = ttg.convert_layout %48 : tensor<1x128xbf16, #blocked3> -> tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>>
2026-02-21T13:28:45.0148794Z         %75 = ttg.convert_layout %73 : tensor<128x4096xbf16, #blocked3> -> tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>>
2026-02-21T13:28:45.0148900Z         %76 = ttg.convert_layout %cst_2 : tensor<1x4096xf32, #blocked3> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0149173Z         %77 = tt.dot %74, %75, %76, inputPrecision = tf32 : tensor<1x128xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked8}>> * tensor<128x4096xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked8}>> -> tensor<1x4096xf32, #blocked8>
2026-02-21T13:28:45.0149276Z         %78 = ttg.convert_layout %77 : tensor<1x4096xf32, #blocked8> -> tensor<1x4096xf32, #blocked3>
2026-02-21T13:28:45.0149371Z         %79 = tt.reshape %78 : tensor<1x4096xf32, #blocked3> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0149474Z         %80 = arith.truncf %79 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0149572Z         %81 = arith.extf %80 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0149624Z         %82 = "tt.reduce"(%81) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0149678Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0149729Z           %134 = arith.maxnumf %arg9, %arg10 : f32
2026-02-21T13:28:45.0149770Z           tt.reduce.return %134 : f32
2026-02-21T13:28:45.0149883Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0150020Z         %83 = ttg.convert_layout %82 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0150108Z         %84 = arith.truncf %83 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0150191Z         %85 = arith.extf %84 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0150256Z         %86 = arith.mulf %85, %cst_1 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0150340Z         %87 = arith.truncf %86 : tensor<1x1xf32, #blocked2> to tensor<1x1xbf16, #blocked2>
2026-02-21T13:28:45.0150423Z         %88 = arith.extf %87 : tensor<1x1xbf16, #blocked2> to tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0150491Z         %89 = arith.cmpf ogt, %arg6, %88 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0150558Z         %90 = arith.cmpf une, %arg6, %arg6 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0150616Z         %91 = arith.ori %89, %90 : tensor<1x1xi1, #blocked2>
2026-02-21T13:28:45.0150714Z         %92 = arith.select %91, %arg6, %88 : tensor<1x1xi1, #blocked2>, tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0150777Z         %93 = arith.mulf %81, %cst_0 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0150877Z         %94 = arith.truncf %93 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0151017Z         %95 = ttg.convert_layout %92 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0151165Z         %96 = tt.expand_dims %95 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0151266Z         %97 = ttg.convert_layout %96 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0151363Z         %98 = arith.extf %94 : tensor<1x1x4096xbf16, #blocked1> to tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0151486Z         %99 = tt.broadcast %97 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked10>
2026-02-21T13:28:45.0151598Z         %100 = ttg.convert_layout %99 : tensor<1x1x4096xf32, #blocked10> -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0151661Z         %101 = arith.subf %98, %100 : tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0151866Z         %102 = tt.extern_elementwise %101 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1x4096xf32, #blocked1>
2026-02-21T13:28:45.0151934Z         %103 = "tt.reduce"(%102) <{axis = 2 : i32}> ({
2026-02-21T13:28:45.0151974Z         ^bb0(%arg9: f32, %arg10: f32):
2026-02-21T13:28:45.0152021Z           %134 = arith.addf %arg9, %arg10 : f32
2026-02-21T13:28:45.0152063Z           tt.reduce.return %134 : f32
2026-02-21T13:28:45.0152174Z         }) : (tensor<1x1x4096xf32, #blocked1>) -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>>
2026-02-21T13:28:45.0152330Z         %104 = ttg.convert_layout %103 : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked1}>> -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0152392Z         %105 = arith.subf %arg6, %92 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0152577Z         %106 = tt.extern_elementwise %105 {libname = "", libpath = "", pure = true, symbol = "__ocml_exp2_f32"} : (tensor<1x1xf32, #blocked2>) -> tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0152643Z         %107 = arith.mulf %arg7, %106 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0152701Z         %108 = arith.addf %107, %104 : tensor<1x1xf32, #blocked2>
2026-02-21T13:28:45.0152840Z         %109 = ttg.convert_layout %106 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0153005Z         %110 = tt.expand_dims %109 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0153111Z         %111 = ttg.convert_layout %110 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0153211Z         %112 = tt.broadcast %111 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0153325Z         %113 = ttg.convert_layout %112 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0153390Z         %114 = arith.mulf %arg8, %113 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0153539Z         %115 = ttg.convert_layout %64 : tensor<1x4096xi32, #blocked3> -> tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>>
2026-02-21T13:28:45.0153700Z         %116 = tt.expand_dims %115 {axis = 2 : i32} : tensor<1x4096xi32, #ttg.slice<{dim = 2, parent = #blocked7}>> -> tensor<1x4096x1xi32, #blocked7>
2026-02-21T13:28:45.0153812Z         %117 = ttg.convert_layout %116 : tensor<1x4096x1xi32, #blocked7> -> tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0153876Z         %118 = arith.muli %117, %cst : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0153942Z         %119 = arith.addi %49, %118 : tensor<1x4096x1xi32, #blocked>
2026-02-21T13:28:45.0154048Z         %120 = tt.broadcast %119 : tensor<1x4096x1xi32, #blocked> -> tensor<1x4096x128xi32, #blocked>
2026-02-21T13:28:45.0154163Z         %121 = ttg.convert_layout %120 : tensor<1x4096x128xi32, #blocked> -> tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0154230Z         %122 = arith.addi %121, %17 : tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0154349Z         %123 = tt.addptr %18, %122 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>, tensor<1x4096x128xi32, #blocked1>
2026-02-21T13:28:45.0154416Z         %124 = tt.load %123 : tensor<1x4096x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0154522Z         %125 = arith.truncf %102 : tensor<1x1x4096xf32, #blocked1> to tensor<1x1x4096xbf16, #blocked1>
2026-02-21T13:28:45.0154617Z         %126 = tt.reshape %114 : tensor<1x1x128xf32, #blocked1> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0154736Z         %127 = tt.reshape %125 : tensor<1x1x4096xbf16, #blocked1> -> tensor<1x4096xbf16, #blocked3>
2026-02-21T13:28:45.0154840Z         %128 = tt.reshape %124 : tensor<1x4096x128xbf16, #blocked1> -> tensor<4096x128xbf16, #blocked3>
2026-02-21T13:28:45.0154999Z         %129 = ttg.convert_layout %127 : tensor<1x4096xbf16, #blocked3> -> tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>>
2026-02-21T13:28:45.0155172Z         %130 = ttg.convert_layout %128 : tensor<4096x128xbf16, #blocked3> -> tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>>
2026-02-21T13:28:45.0155274Z         %131 = ttg.convert_layout %126 : tensor<1x128xf32, #blocked3> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0155537Z         %132 = tt.dot %129, %130, %131, inputPrecision = tf32 : tensor<1x4096xbf16, #ttg.dot_op<{opIdx = 0, parent = #blocked3}>> * tensor<4096x128xbf16, #ttg.dot_op<{opIdx = 1, parent = #blocked3}>> -> tensor<1x128xf32, #blocked3>
2026-02-21T13:28:45.0155633Z         %133 = tt.reshape %132 : tensor<1x128xf32, #blocked3> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0155780Z         scf.yield %92, %108, %133 : tensor<1x1xf32, #blocked2>, tensor<1x1xf32, #blocked2>, tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0155829Z       } {tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T13:28:45.0155969Z       %52 = ttg.convert_layout %51#1 : tensor<1x1xf32, #blocked2> -> tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>>
2026-02-21T13:28:45.0156118Z       %53 = tt.expand_dims %52 {axis = 2 : i32} : tensor<1x1xf32, #ttg.slice<{dim = 2, parent = #blocked9}>> -> tensor<1x1x1xf32, #blocked9>
2026-02-21T13:28:45.0156220Z       %54 = ttg.convert_layout %53 : tensor<1x1x1xf32, #blocked9> -> tensor<1x1x1xf32, #blocked10>
2026-02-21T13:28:45.0156329Z       %55 = tt.broadcast %54 : tensor<1x1x1xf32, #blocked10> -> tensor<1x1x128xf32, #blocked10>
2026-02-21T13:28:45.0156436Z       %56 = ttg.convert_layout %55 : tensor<1x1x128xf32, #blocked10> -> tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0156501Z       %57 = arith.divf %51#2, %56 : tensor<1x1x128xf32, #blocked1>
2026-02-21T13:28:45.0156602Z       %58 = arith.truncf %57 : tensor<1x1x128xf32, #blocked1> to tensor<1x1x128xbf16, #blocked1>
2026-02-21T13:28:45.0156707Z       %59 = tt.addptr %19, %41 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>, tensor<1x1x128xi32, #blocked1>
2026-02-21T13:28:45.0156773Z       tt.store %59, %58 : tensor<1x1x128x!tt.ptr<bf16>, #blocked1>
2026-02-21T13:28:45.0156850Z     } {tt.disallow_acc_multi_buffer, tt.flatten, tt.num_stages = 1 : i32}
2026-02-21T13:28:45.0156885Z     tt.return
2026-02-21T13:28:45.0156918Z   }
2026-02-21T13:28:45.0156951Z }
2026-02-21T13:28:45.0156954Z 
2026-02-21T13:28:45.0156984Z {-#
2026-02-21T13:28:45.0157029Z   external_resources: {
2026-02-21T13:28:45.0157066Z     mlir_reproducer: {
2026-02-21T13:28:45.0159197Z       pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{  max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T13:28:45.0159259Z       disable_threading: false,
2026-02-21T13:28:45.0159296Z       verify_each: true
2026-02-21T13:28:45.0159329Z     }
2026-02-21T13:28:45.0159359Z   }
2026-02-21T13:28:45.0159387Z #-}
2026-02-21T13:28:45.0159623Z /tmp/torchinductor_root/sr/csruq75k5vrnqasgu6rcu5p5ucqjhzcjg6lokhjrwpi27fgp4emv.py:16:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T13:28:45.0160084Z /tmp/torchinductor_root/sr/csruq75k5vrnqasgu6rcu5p5ucqjhzcjg6lokhjrwpi27fgp4emv.py:16:0: note: Pipeline failed while executing [`TritonAMDGPUStreamPipeline` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T13:28:45.0160197Z [295s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T13:28:45.0160844Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 1, 4096], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=4, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, True], range_multi_buffers=[False, True], range_num_stages=[2, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=4), static_shapes=True)
2026-02-21T13:28:45.0160901Z Error: RuntimeError: PassManager::run failed
2026-02-21T13:28:45.0160982Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T13:30:17.7946870Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 0.7 configs/s
2026-02-21T13:30:17.7956089Z [388s] Adaptive compile timeout: 30s (90% percentile=30.0s, bounds=[30.0s, 30s])
2026-02-21T13:30:17.9633019Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6/6 - configs/s
2026-02-21T13:30:18.9425916Z [389s] Initial random population of 100, 5 starting points: 
2026-02-21T13:30:18.9426349Z error=19
2026-02-21T13:30:18.9426559Z timeout=23
2026-02-21T13:30:18.9429403Z ok=58
2026-02-21T13:30:18.9429683Z min=30.2500
2026-02-21T13:30:18.9429930Z mid=215.0913
2026-02-21T13:30:18.9430151Z max=11076.6172
2026-02-21T13:30:18.9430410Z best={'block_sizes': [1, 128, 32],
2026-02-21T13:30:18.9430832Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'pointer'],
2026-02-21T13:30:18.9431253Z  'l2_groupings': [1],
2026-02-21T13:30:18.9431525Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:30:18.9431836Z  'loop_orders': [[1, 0]],
2026-02-21T13:30:18.9432115Z  'matrix_instr_nonkdim': 32,
2026-02-21T13:30:18.9432371Z  'num_sm_multiplier': 1,
2026-02-21T13:30:18.9432610Z  'num_stages': 2,
2026-02-21T13:30:18.9432822Z  'num_warps': 4,
2026-02-21T13:30:18.9433076Z  'pid_type': 'persistent_interleaved',
2026-02-21T13:30:18.9433381Z  'range_flattens': [False, False],
2026-02-21T13:30:18.9433676Z  'range_multi_buffers': [True, None],
2026-02-21T13:30:18.9433959Z  'range_num_stages': [3, 2],
2026-02-21T13:30:18.9434217Z  'range_unroll_factors': [3, 2],
2026-02-21T13:30:18.9434498Z  'range_warp_specializes': [],
2026-02-21T13:30:18.9434751Z  'waves_per_eu': 2}
2026-02-21T13:30:18.9438327Z [389s] Fitting surrogate: 100 points, 100 targets
2026-02-21T13:30:19.8744244Z [390s] Generation 1 starting: 97 neighbors, 5 active search path(s)
2026-02-21T13:30:55.1616011Z [426s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[3, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:31:00.1148647Z [430s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[True, False], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:31:00.1168262Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 0.5 configs/s
2026-02-21T13:31:48.7014165Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 97/97 2.2 configs/s
2026-02-21T13:31:49.7143689Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 8/8 - configs/s
2026-02-21T13:31:57.7471634Z [488s] Generation 1 complete: 
2026-02-21T13:31:57.7471977Z error=7
2026-02-21T13:31:57.7472169Z timeout=2
2026-02-21T13:31:57.7472327Z ok=93
2026-02-21T13:31:57.7472474Z min=22.4960
2026-02-21T13:31:57.7472670Z mid=34.5427
2026-02-21T13:31:57.7472816Z max=387.1271
2026-02-21T13:31:57.7473000Z best={'block_sizes': [1, 128, 32],
2026-02-21T13:31:57.7473315Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:31:57.7474018Z  'l2_groupings': [8],
2026-02-21T13:31:57.7474243Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:31:57.7474476Z  'loop_orders': [[0, 1]],
2026-02-21T13:31:57.7474683Z  'matrix_instr_nonkdim': 16,
2026-02-21T13:31:57.7474891Z  'num_sm_multiplier': 64,
2026-02-21T13:31:57.7475092Z  'num_stages': 2,
2026-02-21T13:31:57.7475257Z  'num_warps': 4,
2026-02-21T13:31:57.7475447Z  'pid_type': 'persistent_blocked',
2026-02-21T13:31:57.7475677Z  'range_flattens': [False, None],
2026-02-21T13:31:57.7475916Z  'range_multi_buffers': [True, False],
2026-02-21T13:31:57.7476148Z  'range_num_stages': [1, 1],
2026-02-21T13:31:57.7476484Z  'range_unroll_factors': [3, 0],
2026-02-21T13:31:57.7476703Z  'range_warp_specializes': [],
2026-02-21T13:31:57.7476904Z  'waves_per_eu': 2}
2026-02-21T13:31:57.7488012Z [488s] Fitting surrogate: 202 points, 202 targets
2026-02-21T13:31:58.6968296Z [489s] Generation 2 starting: 94 neighbors, 5 active search path(s)
2026-02-21T13:32:32.4735275Z [523s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=1, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, True], range_num_stages=[3, 2], range_unroll_factors=[3, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:32:42.2950616Z [533s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=64, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[1, 1], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:32:45.8884961Z [536s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 2], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:32:47.2913170Z [538s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 3], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:32:47.2935914Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 0.6 configs/s
2026-02-21T13:34:01.0895446Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 97/97 1.9 configs/s
2026-02-21T13:34:02.2113841Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T13:34:12.1397203Z [623s] Generation 2 complete: 
2026-02-21T13:34:12.1397557Z error=3
2026-02-21T13:34:12.1397791Z timeout=4
2026-02-21T13:34:12.1397978Z ok=93
2026-02-21T13:34:12.1398162Z min=20.9861
2026-02-21T13:34:12.1398357Z mid=40.1549
2026-02-21T13:34:12.1398538Z max=846.0421
2026-02-21T13:34:12.1398767Z best={'block_sizes': [1, 128, 32],
2026-02-21T13:34:12.1399158Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:34:12.1399541Z  'l2_groupings': [8],
2026-02-21T13:34:12.1399815Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:34:12.1400125Z  'loop_orders': [[0, 1]],
2026-02-21T13:34:12.1400403Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:34:12.1400660Z  'num_sm_multiplier': 64,
2026-02-21T13:34:12.1400898Z  'num_stages': 2,
2026-02-21T13:34:12.1401110Z  'num_warps': 4,
2026-02-21T13:34:12.1401689Z  'pid_type': 'persistent_blocked',
2026-02-21T13:34:12.1401990Z  'range_flattens': [None, None],
2026-02-21T13:34:12.1402246Z  'range_multi_buffers': [True, False],
2026-02-21T13:34:12.1402504Z  'range_num_stages': [1, 1],
2026-02-21T13:34:12.1402848Z  'range_unroll_factors': [3, 0],
2026-02-21T13:34:12.1403099Z  'range_warp_specializes': [],
2026-02-21T13:34:12.1403333Z  'waves_per_eu': 2}
2026-02-21T13:34:12.1415519Z [623s] Fitting surrogate: 302 points, 302 targets
2026-02-21T13:34:13.0621156Z [623s] Generation 3 starting: 90 neighbors, 5 active search path(s)
2026-02-21T13:34:52.4150616Z [663s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:34:53.0860030Z [663s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:34:54.3226095Z [665s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:34:59.8661242Z [670s] Timeout after 30s compiling Config(block_sizes=[1, 512, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=2, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:35:00.7833549Z [671s] Timeout after 30s compiling Config(block_sizes=[1, 256, 32], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[2, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:35:01.0361915Z [671s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:35:01.3144519Z [672s] Timeout after 30s compiling Config(block_sizes=[1, 256, 512], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:35:01.5669841Z [672s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=2, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:35:01.8162959Z [672s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 128], indexing=['pointer', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:35:02.2360763Z [673s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['pointer', 'block_ptr', 'block_ptr', 'pointer'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, None], range_num_stages=[0, 1], range_unroll_factors=[0, 2], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:35:02.2375291Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 0.6 configs/s
2026-02-21T13:35:52.0009222Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 92/92 2.3 configs/s
2026-02-21T13:35:52.8628796Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━━━ 9/9 - configs/s
2026-02-21T13:36:00.4781909Z [731s] Generation 3 complete: 
2026-02-21T13:36:00.4782299Z error=5
2026-02-21T13:36:00.4782557Z timeout=10
2026-02-21T13:36:00.4782765Z ok=80
2026-02-21T13:36:00.4782996Z min=20.6766
2026-02-21T13:36:00.4783198Z mid=36.8367
2026-02-21T13:36:00.4783401Z max=509.9584
2026-02-21T13:36:00.4783640Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:36:00.4784074Z  'indexing': ['pointer', 'block_ptr', 'pointer', 'block_ptr'],
2026-02-21T13:36:00.4784480Z  'l2_groupings': [4],
2026-02-21T13:36:00.4784757Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:36:00.4785073Z  'loop_orders': [[0, 1]],
2026-02-21T13:36:00.4785354Z  'matrix_instr_nonkdim': 16,
2026-02-21T13:36:00.4785994Z  'num_stages': 1,
2026-02-21T13:36:00.4786228Z  'num_warps': 4,
2026-02-21T13:36:00.4786477Z  'pid_type': 'flat',
2026-02-21T13:36:00.4786735Z  'range_flattens': [None, False],
2026-02-21T13:36:00.4787041Z  'range_multi_buffers': [None, None],
2026-02-21T13:36:00.4787343Z  'range_num_stages': [0, 0],
2026-02-21T13:36:00.4787624Z  'range_unroll_factors': [0, 1],
2026-02-21T13:36:00.4787922Z  'range_warp_specializes': [],
2026-02-21T13:36:00.4788197Z  'waves_per_eu': 2}
2026-02-21T13:36:00.4803858Z [731s] Fitting surrogate: 397 points, 397 targets
2026-02-21T13:36:01.3636670Z [732s] Generation 4 starting: 90 neighbors, 5 active search path(s)
2026-02-21T13:36:39.5966180Z [770s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:36:41.2048889Z [772s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[True, None], range_multi_buffers=[False, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:36:46.8361681Z [777s] Timeout after 30s compiling Config(block_sizes=[1, 256, 512], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:36:49.5804329Z [780s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[True, None], range_num_stages=[3, 1], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:36:49.5826211Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 93/93 0.7 configs/s
2026-02-21T13:37:56.6514439Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 93/93 1.6 configs/s
2026-02-21T13:37:57.2990840Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:38:03.6454095Z [854s] Generation 4 complete: 
2026-02-21T13:38:03.6454441Z error=7
2026-02-21T13:38:03.6454555Z timeout=4
2026-02-21T13:38:03.6454662Z ok=84
2026-02-21T13:38:03.6454771Z min=19.5046
2026-02-21T13:38:03.6454882Z mid=49.8971
2026-02-21T13:38:03.6454991Z max=860.8808
2026-02-21T13:38:03.6455124Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:38:03.6455354Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:38:03.6455587Z  'l2_groupings': [8],
2026-02-21T13:38:03.6455766Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:38:03.6455956Z  'loop_orders': [[0, 1]],
2026-02-21T13:38:03.6456104Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:38:03.6456250Z  'num_sm_multiplier': 64,
2026-02-21T13:38:03.6456391Z  'num_stages': 2,
2026-02-21T13:38:03.6456521Z  'num_warps': 4,
2026-02-21T13:38:03.6456664Z  'pid_type': 'persistent_interleaved',
2026-02-21T13:38:03.6456839Z  'range_flattens': [None, None],
2026-02-21T13:38:03.6456997Z  'range_multi_buffers': [True, None],
2026-02-21T13:38:03.6457162Z  'range_num_stages': [1, 1],
2026-02-21T13:38:03.6457629Z  'range_unroll_factors': [3, 0],
2026-02-21T13:38:03.6457786Z  'range_warp_specializes': [],
2026-02-21T13:38:03.6457932Z  'waves_per_eu': 2}
2026-02-21T13:38:03.6479913Z [854s] Fitting surrogate: 492 points, 492 targets
2026-02-21T13:38:04.5664349Z [855s] Generation 5 starting: 87 neighbors, 5 active search path(s)
2026-02-21T13:38:45.5797964Z [896s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=1, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:38:45.8866876Z [896s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=3)
2026-02-21T13:38:45.8892743Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 0.8 configs/s
2026-02-21T13:40:03.6952776Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 90/90 1.5 configs/s
2026-02-21T13:40:04.1398990Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:40:08.4734116Z [979s] Generation 5 complete: 
2026-02-21T13:40:08.4734300Z error=4
2026-02-21T13:40:08.4734433Z timeout=2
2026-02-21T13:40:08.4734546Z ok=86
2026-02-21T13:40:08.4734670Z min=19.3694
2026-02-21T13:40:08.4734794Z mid=63.8328
2026-02-21T13:40:08.4734908Z max=712.0135
2026-02-21T13:40:08.4735047Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:40:08.4735232Z  'indexing': ['block_ptr', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:40:08.4735385Z  'l2_groupings': [8],
2026-02-21T13:40:08.4735490Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:40:08.4735608Z  'loop_orders': [[0, 1]],
2026-02-21T13:40:08.4735723Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:40:08.4735878Z  'num_sm_multiplier': 64,
2026-02-21T13:40:08.4736029Z  'num_stages': 2,
2026-02-21T13:40:08.4736490Z  'num_warps': 4,
2026-02-21T13:40:08.4736620Z  'pid_type': 'persistent_interleaved',
2026-02-21T13:40:08.4736754Z  'range_flattens': [None, None],
2026-02-21T13:40:08.4736865Z  'range_multi_buffers': [True, None],
2026-02-21T13:40:08.4736977Z  'range_num_stages': [0, 1],
2026-02-21T13:40:08.4737080Z  'range_unroll_factors': [3, 0],
2026-02-21T13:40:08.4737188Z  'range_warp_specializes': [],
2026-02-21T13:40:08.4737291Z  'waves_per_eu': 2}
2026-02-21T13:40:08.4759352Z [979s] Fitting surrogate: 584 points, 584 targets
2026-02-21T13:40:09.3577892Z [980s] Generation 6 starting: 85 neighbors, 5 active search path(s)
2026-02-21T13:40:48.4249562Z [1019s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:40:50.3988460Z [1021s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=16, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:40:50.6973977Z [1021s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:40:56.9370432Z [1027s] Timeout after 30s compiling Config(block_sizes=[1, 512, 32], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=2, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[2, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:40:56.9393577Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 0.6 configs/s
2026-02-21T13:42:21.9709715Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 88/88 1.4 configs/s
2026-02-21T13:42:22.4823141Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:42:27.4848736Z [1118s] Generation 6 complete: 
2026-02-21T13:42:27.4849069Z error=4
2026-02-21T13:42:27.4849615Z timeout=4
2026-02-21T13:42:27.4849824Z ok=82
2026-02-21T13:42:27.4850027Z min=19.4487
2026-02-21T13:42:27.4850255Z mid=67.1629
2026-02-21T13:42:27.4850452Z max=927.1636
2026-02-21T13:42:27.4850693Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:42:27.4851121Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:42:27.4851524Z  'l2_groupings': [8],
2026-02-21T13:42:27.4851793Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:42:27.4852113Z  'loop_orders': [[0, 1]],
2026-02-21T13:42:27.4852383Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:42:27.4852670Z  'num_sm_multiplier': 64,
2026-02-21T13:42:27.4852927Z  'num_stages': 2,
2026-02-21T13:42:27.4853153Z  'num_warps': 4,
2026-02-21T13:42:27.4853411Z  'pid_type': 'persistent_interleaved',
2026-02-21T13:42:27.4853727Z  'range_flattens': [None, None],
2026-02-21T13:42:27.4854028Z  'range_multi_buffers': [True, None],
2026-02-21T13:42:27.4854331Z  'range_num_stages': [0, 1],
2026-02-21T13:42:27.4854615Z  'range_unroll_factors': [3, 0],
2026-02-21T13:42:27.4854905Z  'range_warp_specializes': [],
2026-02-21T13:42:27.4855184Z  'waves_per_eu': 2}
2026-02-21T13:42:27.4876156Z [1118s] Fitting surrogate: 674 points, 674 targets
2026-02-21T13:42:28.3689740Z [1119s] Generation 7 starting: 87 neighbors, 5 active search path(s)
2026-02-21T13:43:08.1647245Z [1159s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:43:08.4463672Z [1159s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:43:09.6894435Z [1160s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'block_ptr', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:43:15.4250030Z [1166s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, True], range_multi_buffers=[False, None], range_num_stages=[3, 1], range_unroll_factors=[3, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:43:16.1828219Z [1167s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['pointer', 'block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=16, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[3, 1], range_unroll_factors=[3, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:43:16.1841661Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 0.7 configs/s
2026-02-21T13:44:05.6333782Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 89/89 1.7 configs/s
2026-02-21T13:44:06.3911763Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:44:13.8170851Z [1224s] Generation 7 complete: 
2026-02-21T13:44:13.8171323Z error=5
2026-02-21T13:44:13.8171533Z timeout=5
2026-02-21T13:44:13.8171739Z ok=82
2026-02-21T13:44:13.8171944Z min=19.0203
2026-02-21T13:44:13.8172151Z mid=39.8274
2026-02-21T13:44:13.8172365Z max=318.4178
2026-02-21T13:44:13.8173071Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:44:13.8173494Z  'indexing': ['pointer', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T13:44:13.8173897Z  'l2_groupings': [8],
2026-02-21T13:44:13.8174190Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:44:13.8174529Z  'loop_orders': [[0, 1]],
2026-02-21T13:44:13.8174814Z  'matrix_instr_nonkdim': 32,
2026-02-21T13:44:13.8175082Z  'num_stages': 1,
2026-02-21T13:44:13.8175314Z  'num_warps': 4,
2026-02-21T13:44:13.8175551Z  'pid_type': 'flat',
2026-02-21T13:44:13.8175813Z  'range_flattens': [None, False],
2026-02-21T13:44:13.8176136Z  'range_multi_buffers': [None, None],
2026-02-21T13:44:13.8176444Z  'range_num_stages': [0, 0],
2026-02-21T13:44:13.8176731Z  'range_unroll_factors': [0, 1],
2026-02-21T13:44:13.8177035Z  'range_warp_specializes': [],
2026-02-21T13:44:13.8177316Z  'waves_per_eu': 2}
2026-02-21T13:44:13.8202272Z [1224s] Fitting surrogate: 766 points, 766 targets
2026-02-21T13:44:14.7694909Z [1225s] Generation 8 starting: 94 neighbors, 5 active search path(s)
2026-02-21T13:44:51.7541650Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 1.0 configs/s
2026-02-21T13:45:36.6453066Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 95/95 2.2 configs/s
2026-02-21T13:45:37.4974833Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:45:45.8829631Z [1316s] Generation 8 complete: 
2026-02-21T13:45:45.8830064Z error=2
2026-02-21T13:45:45.8830289Z ok=97
2026-02-21T13:45:45.8830496Z min=19.0580
2026-02-21T13:45:45.8830711Z mid=33.2071
2026-02-21T13:45:45.8830909Z max=422.5741
2026-02-21T13:45:45.8831208Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:45:45.8831633Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T13:45:45.8832493Z  'l2_groupings': [8],
2026-02-21T13:45:45.8832772Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:45:45.8833099Z  'loop_orders': [[0, 1]],
2026-02-21T13:45:45.8833391Z  'matrix_instr_nonkdim': 32,
2026-02-21T13:45:45.8833666Z  'num_stages': 1,
2026-02-21T13:45:45.8833903Z  'num_warps': 4,
2026-02-21T13:45:45.8834134Z  'pid_type': 'flat',
2026-02-21T13:45:45.8834400Z  'range_flattens': [None, False],
2026-02-21T13:45:45.8834844Z  'range_multi_buffers': [None, None],
2026-02-21T13:45:45.8835156Z  'range_num_stages': [0, 0],
2026-02-21T13:45:45.8835439Z  'range_unroll_factors': [0, 1],
2026-02-21T13:45:45.8835744Z  'range_warp_specializes': [],
2026-02-21T13:45:45.8836021Z  'waves_per_eu': 2}
2026-02-21T13:45:45.8861012Z [1316s] Fitting surrogate: 865 points, 865 targets
2026-02-21T13:45:46.7928407Z [1317s] Generation 9 starting: 93 neighbors, 5 active search path(s)
2026-02-21T13:46:25.1383190Z [1356s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:46:26.8188853Z [1357s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:46:27.1426323Z [1358s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:46:27.7400835Z [1358s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, False], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:46:28.0303843Z [1358s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[3, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:46:28.7871041Z [1359s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:46:28.7889206Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 1.1 configs/s
2026-02-21T13:47:20.2529252Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━━ 94/94 2.1 configs/s
2026-02-21T13:47:20.9006361Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:47:27.2599212Z [1418s] Generation 9 complete: 
2026-02-21T13:47:27.2599650Z error=6
2026-02-21T13:47:27.2599863Z timeout=6
2026-02-21T13:47:27.2600063Z ok=86
2026-02-21T13:47:27.2600273Z min=18.9024
2026-02-21T13:47:27.2600480Z mid=38.0107
2026-02-21T13:47:27.2600683Z max=423.7712
2026-02-21T13:47:27.2600946Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:47:27.2601355Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T13:47:27.2601742Z  'l2_groupings': [2],
2026-02-21T13:47:27.2602029Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:47:27.2602349Z  'loop_orders': [[0, 1]],
2026-02-21T13:47:27.2602734Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:47:27.2603017Z  'num_stages': 2,
2026-02-21T13:47:27.2603248Z  'num_warps': 4,
2026-02-21T13:47:27.2603487Z  'pid_type': 'flat',
2026-02-21T13:47:27.2603765Z  'range_flattens': [None, False],
2026-02-21T13:47:27.2604078Z  'range_multi_buffers': [None, None],
2026-02-21T13:47:27.2604386Z  'range_num_stages': [0, 1],
2026-02-21T13:47:27.2605120Z  'range_unroll_factors': [0, 0],
2026-02-21T13:47:27.2605427Z  'range_warp_specializes': [],
2026-02-21T13:47:27.2605671Z  'waves_per_eu': 2}
2026-02-21T13:47:27.2633797Z [1418s] Fitting surrogate: 963 points, 963 targets
2026-02-21T13:47:29.1371522Z [1420s] Generation 10 starting: 89 neighbors, 5 active search path(s)
2026-02-21T13:48:08.4480901Z [1459s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:48:08.8543250Z [1459s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:48:10.1276413Z [1461s] Timeout after 30s compiling Config(block_sizes=[1, 512, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:48:14.2861753Z [1465s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[True, None], range_multi_buffers=[True, None], range_num_stages=[0, 1], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:48:15.2814967Z [1466s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['pointer', 'pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=32, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, None], range_num_stages=[0, 1], range_unroll_factors=[4, 0], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:48:15.2842189Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 0.6 configs/s
2026-02-21T13:48:56.9780261Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 2.0 configs/s
2026-02-21T13:48:57.7437420Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:49:05.2365804Z [1516s] Generation 10 complete: 
2026-02-21T13:49:05.2366235Z error=4
2026-02-21T13:49:05.2366440Z timeout=5
2026-02-21T13:49:05.2366648Z ok=85
2026-02-21T13:49:05.2368457Z min=19.0507
2026-02-21T13:49:05.2368720Z mid=33.1354
2026-02-21T13:49:05.2368922Z max=485.2544
2026-02-21T13:49:05.2369172Z best={'block_sizes': [1, 128, 128],
2026-02-21T13:49:05.2369620Z  'indexing': ['pointer', 'pointer', 'block_ptr', 'block_ptr'],
2026-02-21T13:49:05.2370031Z  'l2_groupings': [16],
2026-02-21T13:49:05.2370309Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:49:05.2370646Z  'loop_orders': [[0, 1]],
2026-02-21T13:49:05.2370929Z  'matrix_instr_nonkdim': 32,
2026-02-21T13:49:05.2371193Z  'num_stages': 2,
2026-02-21T13:49:05.2371451Z  'num_warps': 4,
2026-02-21T13:49:05.2371683Z  'pid_type': 'flat',
2026-02-21T13:49:05.2371942Z  'range_flattens': [None, True],
2026-02-21T13:49:05.2372247Z  'range_multi_buffers': [None, None],
2026-02-21T13:49:05.2373057Z  'range_num_stages': [0, 1],
2026-02-21T13:49:05.2373322Z  'range_unroll_factors': [0, 1],
2026-02-21T13:49:05.2373560Z  'range_warp_specializes': [],
2026-02-21T13:49:05.2373787Z  'waves_per_eu': 2}
2026-02-21T13:49:05.2405414Z [1516s] Fitting surrogate: 1057 points, 1057 targets
2026-02-21T13:49:06.2267714Z [1517s] Generation 11 starting: 88 neighbors, 5 active search path(s)
2026-02-21T13:49:47.1022254Z [1557s] Timeout after 30s compiling Config(block_sizes=[1, 256, 128], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:49:48.5313545Z [1559s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[0, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:49:50.7857753Z [1561s] Timeout after 30s compiling Config(block_sizes=[1, 256, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, False], range_multi_buffers=[True, None], range_num_stages=[1, 4], range_unroll_factors=[2, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:49:50.7882323Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 0.7 configs/s
2026-02-21T13:50:43.7562733Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 1.8 configs/s
2026-02-21T13:50:44.3550086Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:50:50.2217888Z [1621s] Generation 11 complete: 
2026-02-21T13:50:50.2220896Z error=1
2026-02-21T13:50:50.2221170Z timeout=3
2026-02-21T13:50:50.2221406Z ok=89
2026-02-21T13:50:50.2221609Z min=19.0933
2026-02-21T13:50:50.2221814Z mid=36.2580
2026-02-21T13:50:50.2222019Z max=807.0919
2026-02-21T13:50:50.2222274Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:50:50.2222713Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T13:50:50.2223148Z  'l2_groupings': [8],
2026-02-21T13:50:50.2223436Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:50:50.2223748Z  'loop_orders': [[0, 1]],
2026-02-21T13:50:50.2224330Z  'matrix_instr_nonkdim': 32,
2026-02-21T13:50:50.2224605Z  'num_stages': 1,
2026-02-21T13:50:50.2224836Z  'num_warps': 4,
2026-02-21T13:50:50.2225065Z  'pid_type': 'flat',
2026-02-21T13:50:50.2225333Z  'range_flattens': [None, True],
2026-02-21T13:50:50.2225641Z  'range_multi_buffers': [None, None],
2026-02-21T13:50:50.2225946Z  'range_num_stages': [0, 0],
2026-02-21T13:50:50.2226223Z  'range_unroll_factors': [0, 0],
2026-02-21T13:50:50.2226528Z  'range_warp_specializes': [],
2026-02-21T13:50:50.2226807Z  'waves_per_eu': 2}
2026-02-21T13:50:50.2259671Z [1621s] Fitting surrogate: 1150 points, 1150 targets
2026-02-21T13:50:50.9775262Z [1621s] Generation 12 starting: 72 neighbors, 4 active search path(s)
2026-02-21T13:51:27.5000586Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73/73 0.8 configs/s
2026-02-21T13:52:09.7679305Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 73/73 1.4 configs/s
2026-02-21T13:52:10.4277527Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:52:16.8999293Z [1707s] Generation 12 complete: 
2026-02-21T13:52:16.8999722Z error=2
2026-02-21T13:52:16.8999875Z ok=74
2026-02-21T13:52:16.9000513Z min=19.1221
2026-02-21T13:52:16.9000670Z mid=31.8995
2026-02-21T13:52:16.9000822Z max=437.4912
2026-02-21T13:52:16.9001004Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:52:16.9001317Z  'indexing': ['block_ptr', 'pointer', 'pointer', 'block_ptr'],
2026-02-21T13:52:16.9001635Z  'l2_groupings': [8],
2026-02-21T13:52:16.9001844Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:52:16.9002084Z  'loop_orders': [[0, 1]],
2026-02-21T13:52:16.9002289Z  'matrix_instr_nonkdim': 32,
2026-02-21T13:52:16.9002488Z  'num_stages': 1,
2026-02-21T13:52:16.9002732Z  'num_warps': 4,
2026-02-21T13:52:16.9002906Z  'pid_type': 'flat',
2026-02-21T13:52:16.9003238Z  'range_flattens': [None, True],
2026-02-21T13:52:16.9003465Z  'range_multi_buffers': [None, None],
2026-02-21T13:52:16.9003691Z  'range_num_stages': [0, 0],
2026-02-21T13:52:16.9003911Z  'range_unroll_factors': [0, 0],
2026-02-21T13:52:16.9004129Z  'range_warp_specializes': [],
2026-02-21T13:52:16.9004335Z  'waves_per_eu': 2}
2026-02-21T13:52:16.9038816Z [1707s] Fitting surrogate: 1226 points, 1226 targets
2026-02-21T13:52:17.5161449Z [1708s] Generation 13 starting: 55 neighbors, 3 active search path(s)
2026-02-21T13:52:54.9981454Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56/56 0.6 configs/s
2026-02-21T13:53:26.2005643Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 56/56 1.7 configs/s
2026-02-21T13:53:26.7422059Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:53:32.0480352Z [1782s] Generation 13 complete: 
2026-02-21T13:53:32.0480667Z error=1
2026-02-21T13:53:32.0480804Z ok=58
2026-02-21T13:53:32.0480944Z min=19.1422
2026-02-21T13:53:32.0481115Z mid=33.1211
2026-02-21T13:53:32.0481257Z max=303.2963
2026-02-21T13:53:32.0481421Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:53:32.0482052Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T13:53:32.0482319Z  'l2_groupings': [4],
2026-02-21T13:53:32.0482506Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:53:32.0482816Z  'loop_orders': [[0, 1]],
2026-02-21T13:53:32.0482999Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:53:32.0483184Z  'num_stages': 1,
2026-02-21T13:53:32.0483338Z  'num_warps': 4,
2026-02-21T13:53:32.0483498Z  'pid_type': 'flat',
2026-02-21T13:53:32.0483766Z  'range_flattens': [None, None],
2026-02-21T13:53:32.0483975Z  'range_multi_buffers': [None, True],
2026-02-21T13:53:32.0484186Z  'range_num_stages': [0, 1],
2026-02-21T13:53:32.0484373Z  'range_unroll_factors': [0, 1],
2026-02-21T13:53:32.0484576Z  'range_warp_specializes': [],
2026-02-21T13:53:32.0484764Z  'waves_per_eu': 2}
2026-02-21T13:53:32.0517820Z [1782s] Fitting surrogate: 1285 points, 1285 targets
2026-02-21T13:53:32.6488580Z [1783s] Generation 14 starting: 57 neighbors, 3 active search path(s)
2026-02-21T13:54:08.7621776Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58/58 0.6 configs/s
2026-02-21T13:54:46.7511337Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 58/58 1.4 configs/s
2026-02-21T13:54:47.2402681Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:54:52.0391793Z [1862s] Generation 14 complete: 
2026-02-21T13:54:52.0392142Z error=4
2026-02-21T13:54:52.0392343Z ok=56
2026-02-21T13:54:52.0392541Z min=19.1220
2026-02-21T13:54:52.0392775Z mid=30.6319
2026-02-21T13:54:52.0392978Z max=558.1230
2026-02-21T13:54:52.0393222Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:54:52.0393613Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T13:54:52.0393995Z  'l2_groupings': [4],
2026-02-21T13:54:52.0394266Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:54:52.0394585Z  'loop_orders': [[0, 1]],
2026-02-21T13:54:52.0395180Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:54:52.0395455Z  'num_stages': 2,
2026-02-21T13:54:52.0395699Z  'num_warps': 4,
2026-02-21T13:54:52.0395926Z  'pid_type': 'flat',
2026-02-21T13:54:52.0396189Z  'range_flattens': [None, None],
2026-02-21T13:54:52.0396486Z  'range_multi_buffers': [None, True],
2026-02-21T13:54:52.0396801Z  'range_num_stages': [0, 1],
2026-02-21T13:54:52.0397072Z  'range_unroll_factors': [0, 1],
2026-02-21T13:54:52.0397376Z  'range_warp_specializes': [],
2026-02-21T13:54:52.0397651Z  'waves_per_eu': 2}
2026-02-21T13:54:52.0430632Z [1862s] Fitting surrogate: 1345 points, 1345 targets
2026-02-21T13:54:52.6093480Z [1863s] Generation 15 starting: 51 neighbors, 3 active search path(s)
2026-02-21T13:55:28.0111546Z [1898s] Timeout after 30s compiling Config(block_sizes=[1, 256, 256], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', '', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=1, num_warps=2, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, None], range_num_stages=[1, 4], range_unroll_factors=[1, 4], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:55:32.7062428Z [1903s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 0], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:55:32.9950429Z [1903s] Timeout after 30s compiling Config(block_sizes=[1, 128, 512], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 0], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:55:33.2832304Z [1904s] Timeout after 30s compiling Config(block_sizes=[1, 128, 128], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 1], range_unroll_factors=[4, 2], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:55:33.2854769Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52/52 0.6 configs/s
2026-02-21T13:56:02.8412710Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 52/52 1.7 configs/s
2026-02-21T13:56:03.1670160Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:56:06.3474700Z [1937s] Generation 15 complete: 
2026-02-21T13:56:06.3477872Z error=2
2026-02-21T13:56:06.3478436Z timeout=4
2026-02-21T13:56:06.3478671Z ok=48
2026-02-21T13:56:06.3478843Z min=19.1412
2026-02-21T13:56:06.3479019Z mid=36.5238
2026-02-21T13:56:06.3479188Z max=587.0950
2026-02-21T13:56:06.3479773Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:56:06.3480122Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T13:56:06.3480449Z  'l2_groupings': [4],
2026-02-21T13:56:06.3480683Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:56:06.3480949Z  'loop_orders': [[0, 1]],
2026-02-21T13:56:06.3481190Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:56:06.3481414Z  'num_stages': 2,
2026-02-21T13:56:06.3481602Z  'num_warps': 4,
2026-02-21T13:56:06.3481783Z  'pid_type': 'flat',
2026-02-21T13:56:06.3481990Z  'range_flattens': [None, None],
2026-02-21T13:56:06.3482233Z  'range_multi_buffers': [None, False],
2026-02-21T13:56:06.3482484Z  'range_num_stages': [0, 1],
2026-02-21T13:56:06.3482935Z  'range_unroll_factors': [0, 1],
2026-02-21T13:56:06.3483169Z  'range_warp_specializes': [],
2026-02-21T13:56:06.3483396Z  'waves_per_eu': 2}
2026-02-21T13:56:06.3515034Z [1937s] Fitting surrogate: 1399 points, 1399 targets
2026-02-21T13:56:06.9069687Z [1937s] Generation 16 starting: 50 neighbors, 3 active search path(s)
2026-02-21T13:56:41.2799343Z [1972s] Timeout after 30s compiling Config(block_sizes=[1, 1024, 64], indexing=['block_ptr', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=2, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, None], range_num_stages=[0, 4], range_unroll_factors=[0, 4], range_warp_specializes=[], waves_per_eu=1)
2026-02-21T13:56:48.6075261Z [1979s] Timeout after 30s compiling Config(block_sizes=[1, 128, 256], indexing=['pointer', 'pointer', 'block_ptr', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=3, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, True], range_num_stages=[1, 0], range_unroll_factors=[4, 1], range_warp_specializes=[], waves_per_eu=2)
2026-02-21T13:56:48.6097028Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 0.6 configs/s
2026-02-21T13:57:15.9899001Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 1.8 configs/s
2026-02-21T13:57:16.4288625Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:57:20.7108524Z [2011s] Generation 16 complete: 
2026-02-21T13:57:20.7111110Z error=5
2026-02-21T13:57:20.7111721Z timeout=2
2026-02-21T13:57:20.7112000Z ok=46
2026-02-21T13:57:20.7112203Z min=19.0803
2026-02-21T13:57:20.7112414Z mid=33.1198
2026-02-21T13:57:20.7112610Z max=373.0114
2026-02-21T13:57:20.7112870Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:57:20.7113312Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T13:57:20.7113711Z  'l2_groupings': [8],
2026-02-21T13:57:20.7113980Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:57:20.7114632Z  'loop_orders': [[0, 1]],
2026-02-21T13:57:20.7114901Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:57:20.7115166Z  'num_stages': 2,
2026-02-21T13:57:20.7115419Z  'num_warps': 4,
2026-02-21T13:57:20.7115651Z  'pid_type': 'flat',
2026-02-21T13:57:20.7115915Z  'range_flattens': [None, None],
2026-02-21T13:57:20.7116219Z  'range_multi_buffers': [None, False],
2026-02-21T13:57:20.7116531Z  'range_num_stages': [0, 1],
2026-02-21T13:57:20.7116813Z  'range_unroll_factors': [0, 1],
2026-02-21T13:57:20.7117107Z  'range_warp_specializes': [],
2026-02-21T13:57:20.7117381Z  'waves_per_eu': 2}
2026-02-21T13:57:20.7157581Z [2011s] Fitting surrogate: 1452 points, 1452 targets
2026-02-21T13:57:21.1262447Z [2012s] Generation 17 starting: 27 neighbors, 2 active search path(s)
2026-02-21T13:57:47.9311784Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28 0.4 configs/s
2026-02-21T13:58:15.4921273Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 28/28 1.0 configs/s
2026-02-21T13:58:15.6866429Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:58:17.5634240Z [2068s] Generation 17 complete: 
2026-02-21T13:58:17.5634896Z error=3
2026-02-21T13:58:17.5635103Z ok=26
2026-02-21T13:58:17.5635308Z min=19.0784
2026-02-21T13:58:17.5635520Z mid=39.6130
2026-02-21T13:58:17.5635724Z max=527.7296
2026-02-21T13:58:17.5635968Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:58:17.5636369Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T13:58:17.5636780Z  'l2_groupings': [8],
2026-02-21T13:58:17.5637074Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:58:17.5637434Z  'loop_orders': [[0, 1]],
2026-02-21T13:58:17.5637716Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:58:17.5637980Z  'num_stages': 2,
2026-02-21T13:58:17.5638227Z  'num_warps': 4,
2026-02-21T13:58:17.5638607Z  'pid_type': 'flat',
2026-02-21T13:58:17.5638881Z  'range_flattens': [None, None],
2026-02-21T13:58:17.5639191Z  'range_multi_buffers': [None, False],
2026-02-21T13:58:17.5639509Z  'range_num_stages': [0, 1],
2026-02-21T13:58:17.5639791Z  'range_unroll_factors': [0, 1],
2026-02-21T13:58:17.5640082Z  'range_warp_specializes': [],
2026-02-21T13:58:17.5640366Z  'waves_per_eu': 2}
2026-02-21T13:58:17.5678136Z [2068s] Fitting surrogate: 1481 points, 1481 targets
2026-02-21T13:58:17.9807255Z [2068s] Generation 18 starting: 26 neighbors, 2 active search path(s)
2026-02-21T13:58:27.1496546Z Generation 18: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 4.7 configs/s
2026-02-21T13:58:43.3872278Z Generation 18: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 27/27 1.6 configs/s
2026-02-21T13:58:43.5265109Z Generation 18: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:58:44.8692404Z [2095s] Generation 18 complete: 
2026-02-21T13:58:44.8692742Z ok=28
2026-02-21T13:58:44.8692947Z min=19.1416
2026-02-21T13:58:44.8693190Z mid=39.7519
2026-02-21T13:58:44.8693396Z max=293.1759
2026-02-21T13:58:44.8693645Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:58:44.8694067Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T13:58:44.8694457Z  'l2_groupings': [8],
2026-02-21T13:58:44.8694749Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:58:44.8695064Z  'loop_orders': [[0, 1]],
2026-02-21T13:58:44.8695338Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:58:44.8695603Z  'num_stages': 3,
2026-02-21T13:58:44.8695837Z  'num_warps': 4,
2026-02-21T13:58:44.8696063Z  'pid_type': 'flat',
2026-02-21T13:58:44.8696664Z  'range_flattens': [None, None],
2026-02-21T13:58:44.8696969Z  'range_multi_buffers': [None, False],
2026-02-21T13:58:44.8697276Z  'range_num_stages': [0, 1],
2026-02-21T13:58:44.8697557Z  'range_unroll_factors': [0, 1],
2026-02-21T13:58:44.8697848Z  'range_warp_specializes': [],
2026-02-21T13:58:44.8698121Z  'waves_per_eu': 2}
2026-02-21T13:58:44.8740684Z [2095s] Fitting surrogate: 1509 points, 1509 targets
2026-02-21T13:58:45.2962404Z [2096s] Generation 19 starting: 29 neighbors, 2 active search path(s)
2026-02-21T13:59:03.2334327Z Generation 19: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 0.8 configs/s
2026-02-21T13:59:24.9810332Z Generation 19: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 1.4 configs/s
2026-02-21T13:59:25.2245401Z Generation 19: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T13:59:27.4198901Z [2138s] Generation 19 complete: 
2026-02-21T13:59:27.4202279Z error=2
2026-02-21T13:59:27.4202664Z ok=29
2026-02-21T13:59:27.4202934Z min=19.1087
2026-02-21T13:59:27.4203157Z mid=39.6798
2026-02-21T13:59:27.4203356Z max=495.8887
2026-02-21T13:59:27.4203618Z best={'block_sizes': [1, 128, 64],
2026-02-21T13:59:27.4204027Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T13:59:27.4204422Z  'l2_groupings': [8],
2026-02-21T13:59:27.4204702Z  'load_eviction_policies': ['', '', ''],
2026-02-21T13:59:27.4205034Z  'loop_orders': [[0, 1]],
2026-02-21T13:59:27.4205309Z  'matrix_instr_nonkdim': 0,
2026-02-21T13:59:27.4205569Z  'num_stages': 3,
2026-02-21T13:59:27.4205818Z  'num_warps': 4,
2026-02-21T13:59:27.4206044Z  'pid_type': 'flat',
2026-02-21T13:59:27.4206302Z  'range_flattens': [None, True],
2026-02-21T13:59:27.4206604Z  'range_multi_buffers': [None, False],
2026-02-21T13:59:27.4207273Z  'range_num_stages': [0, 1],
2026-02-21T13:59:27.4207498Z  'range_unroll_factors': [0, 1],
2026-02-21T13:59:27.4207742Z  'range_warp_specializes': [],
2026-02-21T13:59:27.4207967Z  'waves_per_eu': 2}
2026-02-21T13:59:27.4250477Z [2138s] Fitting surrogate: 1540 points, 1540 targets
2026-02-21T13:59:27.8306377Z [2138s] Generation 20 starting: 29 neighbors, 2 active search path(s)
2026-02-21T13:59:59.9885210Z Generation 20: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 0.3 configs/s
2026-02-21T14:00:20.0269271Z Generation 20: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 30/30 1.5 configs/s
2026-02-21T14:00:20.2273461Z Generation 20: verifying top configs 100% ━━━━━━━━━━━━━━━━━━━━ 10/10 - configs/s
2026-02-21T14:00:22.1641661Z [2193s] Generation 20 complete: 
2026-02-21T14:00:22.1642050Z error=3
2026-02-21T14:00:22.1642247Z ok=28
2026-02-21T14:00:22.1642460Z min=19.0642
2026-02-21T14:00:22.1642747Z mid=36.9077
2026-02-21T14:00:22.1642947Z max=419.0839
2026-02-21T14:00:22.1643209Z best={'block_sizes': [1, 128, 64],
2026-02-21T14:00:22.1643609Z  'indexing': ['pointer', 'pointer', 'pointer', 'pointer'],
2026-02-21T14:00:22.1643996Z  'l2_groupings': [8],
2026-02-21T14:00:22.1644270Z  'load_eviction_policies': ['', '', ''],
2026-02-21T14:00:22.1644602Z  'loop_orders': [[0, 1]],
2026-02-21T14:00:22.1644876Z  'matrix_instr_nonkdim': 0,
2026-02-21T14:00:22.1645142Z  'num_stages': 3,
2026-02-21T14:00:22.1645369Z  'num_warps': 4,
2026-02-21T14:00:22.1645606Z  'pid_type': 'flat',
2026-02-21T14:00:22.1645897Z  'range_flattens': [None, True],
2026-02-21T14:00:22.1646197Z  'range_multi_buffers': [None, False],
2026-02-21T14:00:22.1646520Z  'range_num_stages': [0, 1],
2026-02-21T14:00:22.1646791Z  'range_unroll_factors': [0, 1],
2026-02-21T14:00:22.1647088Z  'range_warp_specializes': [],
2026-02-21T14:00:22.1647369Z  'waves_per_eu': 2}
2026-02-21T14:00:22.1685573Z [2193s] Fitting surrogate: 1571 points, 1571 targets
2026-02-21T14:00:22.3076215Z [2193s] Autotuning complete in 2193.2s after searching 1445 configs.
2026-02-21T14:00:22.3076814Z One can hardcode the best config and skip autotuning with:
2026-02-21T14:00:22.3078703Z     @helion.kernel(config=helion.Config(block_sizes=[1, 128, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T14:00:22.3080695Z 
2026-02-21T14:00:22.3081092Z [2193s] Code of selected kernel: /tmp/torchinductor_root/w7/cw7ecwu3gro2xz2hdr4lasw34crvsiptntuzvcf2ojdxnm47fxts.py
2026-02-21T14:00:22.3308070Z from __future__ import annotations
2026-02-21T14:00:22.3308461Z 
2026-02-21T14:00:22.3308528Z import torch
2026-02-21T14:00:22.3308677Z import triton
2026-02-21T14:00:22.3308840Z import triton.language as tl
2026-02-21T14:00:22.3309145Z from torch._inductor.runtime import triton_helpers
2026-02-21T14:00:22.3309449Z from torch._inductor.runtime.triton_compat import libdevice
2026-02-21T14:00:22.3309775Z from helion.runtime import default_launcher as _default_launcher
2026-02-21T14:00:22.3309989Z 
2026-02-21T14:00:22.3310067Z _BLOCK_SIZE_1 = tl.constexpr(128)
2026-02-21T14:00:22.3310273Z _BLOCK_SIZE_0 = tl.constexpr(1)
2026-02-21T14:00:22.3310467Z _BLOCK_SIZE_3 = tl.constexpr(64)
2026-02-21T14:00:22.3310591Z 
2026-02-21T14:00:22.3310653Z @triton.jit
2026-02-21T14:00:22.3310896Z def _helion_attention(q_view, k_view, v_view, out, _RDIM_SIZE_2: tl.constexpr):
2026-02-21T14:00:22.3311313Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T14:00:22.3311604Z     num_pid_m = 192
2026-02-21T14:00:22.3311784Z     num_pid_n = tl.cdiv(8192, _BLOCK_SIZE_1)
2026-02-21T14:00:22.3312003Z     inner_2d_pid = tl.program_id(0)
2026-02-21T14:00:22.3312212Z     num_pid_in_group = 8 * num_pid_n
2026-02-21T14:00:22.3312439Z     group_id = inner_2d_pid // num_pid_in_group
2026-02-21T14:00:22.3312665Z     first_pid_m = group_id * 8
2026-02-21T14:00:22.3312887Z     group_size_m = min(num_pid_m - first_pid_m, 8)
2026-02-21T14:00:22.3313184Z     pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m
2026-02-21T14:00:22.3313504Z     pid_1 = inner_2d_pid % num_pid_in_group // group_size_m
2026-02-21T14:00:22.3313739Z     offset_0 = pid_0
2026-02-21T14:00:22.3313930Z     indices_0 = offset_0 + tl.zeros([1], tl.int32)
2026-02-21T14:00:22.3314156Z     offset_1 = pid_1 * _BLOCK_SIZE_1
2026-02-21T14:00:22.3314411Z     indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32)
2026-02-21T14:00:22.3314774Z     indices_4 = tl.arange(0, _RDIM_SIZE_2).to(tl.int32)
2026-02-21T14:00:22.3315122Z     # src[attention.py:68]: m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T14:00:22.3315515Z     m_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], float('-inf'), tl.float32)
2026-02-21T14:00:22.3315838Z     # src[attention.py:69]: l_i = torch.full_like(m_i, 1.0)
2026-02-21T14:00:22.3316136Z     l_i = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 1.0, tl.float32)
2026-02-21T14:00:22.3316497Z     # src[attention.py:70]: acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32)
2026-02-21T14:00:22.3316870Z     acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128], 0.0, tl.float32)
2026-02-21T14:00:22.3317181Z     # src[attention.py:71]: q = q_view[tile_b, tile_m, :]
2026-02-21T14:00:22.3317619Z     q = tl.load(q_view + (indices_0[:, None, None] * 1048576 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), None)
2026-02-21T14:00:22.3318087Z     # src[attention.py:72]: for tile_n in hl.tile(v_view.size(1)):
2026-02-21T14:00:22.3318395Z     # src[attention.py:73]:     k = k_view[tile_b, :, tile_n]
2026-02-21T14:00:22.3318673Z     # src[attention.py:74]:     qk = torch.bmm(q, k)
2026-02-21T14:00:22.3318912Z     # src[attention.py:72-85]: ...
2026-02-21T14:00:22.3319361Z     for offset_2 in tl.range(0, 8192, _BLOCK_SIZE_3, loop_unroll_factor=1, num_stages=1, disallow_acc_multi_buffer=True, flatten=True):
2026-02-21T14:00:22.3319862Z         indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_3).to(tl.int32)
2026-02-21T14:00:22.3320127Z         q_copy = q
2026-02-21T14:00:22.3320318Z         m_i_copy = m_i
2026-02-21T14:00:22.3320489Z         l_i_copy = l_i
2026-02-21T14:00:22.3320654Z         acc_copy = acc
2026-02-21T14:00:22.3320822Z         q_copy_0 = q_copy
2026-02-21T14:00:22.3321001Z         m_i_copy_0 = m_i_copy
2026-02-21T14:00:22.3321165Z         l_i_copy_0 = l_i_copy
2026-02-21T14:00:22.3321301Z         acc_copy_0 = acc_copy
2026-02-21T14:00:22.3321474Z         # src[attention.py:73]: k = k_view[tile_b, :, tile_n]
2026-02-21T14:00:22.3321806Z         k = tl.load(k_view + (indices_0[:, None, None] * 1048576 + indices_4[None, :, None] * 1 + indices_2[None, None, :] * 128), None)
2026-02-21T14:00:22.3322148Z         # src[attention.py:74]: qk = torch.bmm(q, k)
2026-02-21T14:00:22.3322865Z         qk = tl.cast(tl.reshape(tl.dot(tl.reshape(tl.cast(q_copy_0, tl.bfloat16), [_BLOCK_SIZE_1, 128]), tl.reshape(tl.cast(k, tl.bfloat16), [128, _BLOCK_SIZE_3]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.bfloat16)
2026-02-21T14:00:22.3323490Z         # src[attention.py:75]: m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale)
2026-02-21T14:00:22.3323738Z         amax = tl.cast(tl.max(qk, 2), tl.bfloat16)
2026-02-21T14:00:22.3323911Z         v_0 = 0.12751743074602467
2026-02-21T14:00:22.3324066Z         v_1 = tl.cast(amax * v_0, tl.bfloat16)
2026-02-21T14:00:22.3324235Z         v_2 = tl.cast(v_1, tl.float32)
2026-02-21T14:00:22.3324411Z         v_3 = triton_helpers.maximum(m_i_copy_0, v_2)
2026-02-21T14:00:22.3324630Z         # src[attention.py:76]: qk = qk * qk_scale - m_ij[:, :, None]
2026-02-21T14:00:22.3324828Z         v_4 = 0.12751743074602467
2026-02-21T14:00:22.3324981Z         v_5 = tl.cast(qk * v_4, tl.bfloat16)
2026-02-21T14:00:22.3325149Z         subscript = v_3[:, :, None]
2026-02-21T14:00:22.3325299Z         v_6 = tl.cast(v_5, tl.float32)
2026-02-21T14:00:22.3325452Z         v_7 = v_6 - subscript
2026-02-21T14:00:22.3325608Z         # src[attention.py:77]: p = torch.exp2(qk)
2026-02-21T14:00:22.3325781Z         v_8 = libdevice.exp2(v_7)
2026-02-21T14:00:22.3325954Z         # src[attention.py:78]: l_ij = torch.sum(p, -1)
2026-02-21T14:00:22.3326147Z         l_ij = tl.cast(tl.sum(v_8, 2), tl.float32)
2026-02-21T14:00:22.3326345Z         # src[attention.py:79]: alpha = torch.exp2(m_i - m_ij)
2026-02-21T14:00:22.3326535Z         v_9 = m_i_copy_0 - v_3
2026-02-21T14:00:22.3326685Z         v_10 = libdevice.exp2(v_9)
2026-02-21T14:00:22.3326882Z         # src[attention.py:80]: l_i = l_i * alpha + l_ij
2026-02-21T14:00:22.3327062Z         v_11 = l_i_copy_0 * v_10
2026-02-21T14:00:22.3327207Z         l_i = v_11 + l_ij
2026-02-21T14:00:22.3327370Z         # src[attention.py:81]: acc = acc * alpha[:, :, None]
2026-02-21T14:00:22.3327555Z         subscript_1 = v_10[:, :, None]
2026-02-21T14:00:22.3327715Z         v_13 = acc_copy_0 * subscript_1
2026-02-21T14:00:22.3327896Z         # src[attention.py:82]: v = v_view[tile_b, tile_n, :]
2026-02-21T14:00:22.3328226Z         v = tl.load(v_view + (indices_0[:, None, None] * 1048576 + indices_2[None, :, None] * 128 + indices_4[None, None, :] * 1), None)
2026-02-21T14:00:22.3328548Z         # src[attention.py:83]: p = p.to(v.dtype)
2026-02-21T14:00:22.3328719Z         v_14 = tl.cast(v_8, tl.bfloat16)
2026-02-21T14:00:22.3328909Z         # src[attention.py:84]: acc = torch.baddbmm(acc, p, v)
2026-02-21T14:00:22.3329522Z         acc = tl.reshape(tl.dot(tl.reshape(tl.cast(v_14, tl.bfloat16), [_BLOCK_SIZE_1, _BLOCK_SIZE_3]), tl.reshape(tl.cast(v, tl.bfloat16), [_BLOCK_SIZE_3, 128]), acc=tl.reshape(v_13, [_BLOCK_SIZE_1, 128]), input_precision='tf32', out_dtype=tl.float32), [_BLOCK_SIZE_0, _BLOCK_SIZE_1, 128])
2026-02-21T14:00:22.3330111Z         # src[attention.py:85]: m_i = m_ij
2026-02-21T14:00:22.3330267Z         m_i = v_3
2026-02-21T14:00:22.3330412Z     # src[attention.py:87]: acc = acc / l_i[:, :, None]
2026-02-21T14:00:22.3330597Z     subscript_2 = l_i[:, :, None]
2026-02-21T14:00:22.3330746Z     v_15 = acc / subscript_2
2026-02-21T14:00:22.3330940Z     # src[attention.py:88]: out[tile_b, tile_m, :] = acc.to(out.dtype)
2026-02-21T14:00:22.3331170Z     v_16 = tl.cast(v_15, tl.bfloat16)
2026-02-21T14:00:22.3331470Z     tl.store(out + (indices_0[:, None, None] * 1048576 + indices_1[None, :, None] * 128 + indices_4[None, None, :] * 1), v_16, None)
2026-02-21T14:00:22.3331721Z 
2026-02-21T14:00:22.3331906Z def attention(q_in: torch.Tensor, k_in: torch.Tensor, v_in: torch.Tensor, *, _launcher=_default_launcher):
2026-02-21T14:00:22.3332131Z     """
2026-02-21T14:00:22.3332234Z     Computes scaled dot-product attention.
2026-02-21T14:00:22.3332347Z 
2026-02-21T14:00:22.3332467Z     Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
2026-02-21T14:00:22.3332635Z 
2026-02-21T14:00:22.3332671Z     Args:
2026-02-21T14:00:22.3332805Z         q_in: Query tensor of shape [..., seq_len_q, head_dim]
2026-02-21T14:00:22.3332975Z         k_in: Key tensor of shape [..., seq_len_k, head_dim]
2026-02-21T14:00:22.3333143Z         v_in: Value tensor of shape [..., seq_len_k, head_dim]
2026-02-21T14:00:22.3333250Z 
2026-02-21T14:00:22.3333295Z     Returns:
2026-02-21T14:00:22.3333461Z         Output tensor of shape [..., seq_len_q, head_dim]
2026-02-21T14:00:22.3333596Z     """
2026-02-21T14:00:22.3333701Z     # src[attention.py:56]: m_dim = q_in.size(-2)
2026-02-21T14:00:22.3333840Z     m_dim = q_in.size(-2)
2026-02-21T14:00:22.3333959Z     # src[attention.py:57]: n_dim = k_in.size(-2)
2026-02-21T14:00:22.3334095Z     n_dim = k_in.size(-2)
2026-02-21T14:00:22.3334221Z     # src[attention.py:58]: assert n_dim == v_in.size(-2)
2026-02-21T14:00:22.3334373Z     assert n_dim == v_in.size(-2)
2026-02-21T14:00:22.3334529Z     # src[attention.py:59]: head_dim = hl.specialize(q_in.size(-1))
2026-02-21T14:00:22.3334691Z     head_dim = 128
2026-02-21T14:00:22.3334831Z     # src[attention.py:60]: assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T14:00:22.3335025Z     assert head_dim == k_in.size(-1) == v_in.size(-1)
2026-02-21T14:00:22.3335207Z     # src[attention.py:61]: q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T14:00:22.3335382Z     q_view = q_in.reshape([-1, m_dim, head_dim])
2026-02-21T14:00:22.3335555Z     # src[attention.py:62]: v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T14:00:22.3335725Z     v_view = v_in.reshape([-1, n_dim, head_dim])
2026-02-21T14:00:22.3335916Z     # src[attention.py:63]: k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T14:00:22.3336155Z     k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
2026-02-21T14:00:22.3336339Z     # src[attention.py:64]: out = torch.empty_like(q_view)
2026-02-21T14:00:22.3336492Z     out = torch.empty_like(q_view)
2026-02-21T14:00:22.3336666Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T14:00:22.3336845Z     _BLOCK_SIZE_1 = 128
2026-02-21T14:00:22.3336946Z     _RDIM_SIZE_2 = 128
2026-02-21T14:00:22.3337101Z     # src[attention.py:67]: for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
2026-02-21T14:00:22.3337345Z     # src[attention.py:68]:     m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
2026-02-21T14:00:22.3337574Z     # src[attention.py:69]:     l_i = torch.full_like(m_i, 1.0)
2026-02-21T14:00:22.3337730Z     # src[attention.py:67-88]: ...
2026-02-21T14:00:22.3338053Z     _launcher(_helion_attention, (192 * triton.cdiv(8192, _BLOCK_SIZE_1),), q_view, k_view, v_view, out, _RDIM_SIZE_2, num_warps=4, num_stages=3, waves_per_eu=2, matrix_instr_nonkdim=0)
2026-02-21T14:00:22.3338403Z     # src[attention.py:89]: return out.view(q_in.size())
2026-02-21T14:00:22.3338549Z     return out.view(q_in.size())
2026-02-21T14:00:23.5614666Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T14:00:23.5616878Z TritonBench accuracy check failed with Helion kernel config: @helion.kernel(config=helion.Config(block_sizes=[1, 128, 64], indexing=['pointer', 'pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', '', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=4, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 1], range_unroll_factors=[0, 1], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T14:00:23.5619212Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
2026-02-21T14:00:23.5619689Z WARNING:tritonbench.utils.triton_op:Completed input ID 6:
2026-02-21T14:00:23.5620113Z (Batch, Heads, SeqLen, SeqLen_KV, Dhead)
2026-02-21T14:00:23.5620460Z ------------------------------------------
2026-02-21T14:00:23.5620864Z (4, 48, 8192, 8192, 128)
2026-02-21T14:00:23.5621046Z 
2026-02-21T14:00:23.5621466Z 100%|██████████| 6/6 [1:33:29<00:00, 1273.77s/it]
2026-02-21T14:00:23.5621935Z 100%|██████████| 6/6 [1:33:29<00:00, 934.99s/it] 
2026-02-21T14:00:23.5623588Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmp01habu_d.csv
2026-02-21T14:00:26.5181086Z   (Batch, Heads, SeqLen, SeqLen_KV, Dhead)    flex_attention-speedup    flex_attention-accuracy    helion_attention-speedup    helion_attention-accuracy
2026-02-21T14:00:26.5182149Z ------------------------------------------  ------------------------  -------------------------  --------------------------  ---------------------------
2026-02-21T14:00:26.5182901Z                     (4, 48, 128, 128, 128)                   2.0022                    0                            3.29499                            0
2026-02-21T14:00:26.5183513Z                     (4, 48, 256, 256, 128)                   2.17004                   0                            3.2996                             0
2026-02-21T14:00:26.5184139Z                     (4, 48, 512, 512, 128)                   2.75028                   1                            3.67232                            0
2026-02-21T14:00:26.5184746Z                   (4, 48, 2048, 2048, 128)                   3.51329                   1                            5.09774                            0
2026-02-21T14:00:26.5185368Z                   (4, 48, 4096, 4096, 128)                   3.66077                   1                            5.69485                            0
2026-02-21T14:00:26.5185980Z                   (4, 48, 8192, 8192, 128)                   3.88718                   1                            6.07271                            0
2026-02-21T14:00:26.5186984Z                                    average                   2.99729                   0.666667                     4.52203                            0
2026-02-21T14:00:34.0719105Z ✅ Completed benchmark for kernel: flash_attention
2026-02-21T14:00:34.0726436Z [
2026-02-21T14:00:34.0726670Z   {
2026-02-21T14:00:34.0726892Z     "benchmark": {
2026-02-21T14:00:34.0727291Z       "name": "Helion Benchmark",
2026-02-21T14:00:34.0727600Z       "extra_info": {
2026-02-21T14:00:34.0727946Z         "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-"
2026-02-21T14:00:34.0728333Z       }
2026-02-21T14:00:34.0728536Z     },
2026-02-21T14:00:34.0728733Z     "model": {
2026-02-21T14:00:34.0728939Z       "name": "flash_attention"
2026-02-21T14:00:34.0729156Z     },
2026-02-21T14:00:34.0729313Z     "metric": {
2026-02-21T14:00:34.0729522Z       "name": "torch_compile_speedup",
2026-02-21T14:00:34.0729850Z       "benchmark_values": [
2026-02-21T14:00:34.0730073Z         2.0021993947171195,
2026-02-21T14:00:34.0730279Z         2.1700365612654573,
2026-02-21T14:00:34.0730517Z         2.7502829234205497,
2026-02-21T14:00:34.0730731Z         3.513288970276924,
2026-02-21T14:00:34.0731297Z         3.660765895251711,
2026-02-21T14:00:34.0731500Z         3.8871759075278574
2026-02-21T14:00:34.0731688Z       ]
2026-02-21T14:00:34.0731847Z     },
2026-02-21T14:00:34.0732050Z     "shape": [
2026-02-21T14:00:34.0732237Z       "(4, 48, 128, 128, 128)",
2026-02-21T14:00:34.0732444Z       "(4, 48, 256, 256, 128)",
2026-02-21T14:00:34.0732650Z       "(4, 48, 512, 512, 128)",
2026-02-21T14:00:34.0732894Z       "(4, 48, 2048, 2048, 128)",
2026-02-21T14:00:34.0733116Z       "(4, 48, 4096, 4096, 128)",
2026-02-21T14:00:34.0733444Z       "(4, 48, 8192, 8192, 128)"
2026-02-21T14:00:34.0733684Z     ]
2026-02-21T14:00:34.0733838Z   },
2026-02-21T14:00:34.0733987Z   {
2026-02-21T14:00:34.0734146Z     "benchmark": {
2026-02-21T14:00:34.0734378Z       "name": "Helion Benchmark",
2026-02-21T14:00:34.0734601Z       "extra_info": {
2026-02-21T14:00:34.0734861Z         "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-"
2026-02-21T14:00:34.0735182Z       }
2026-02-21T14:00:34.0735329Z     },
2026-02-21T14:00:34.0735491Z     "model": {
2026-02-21T14:00:34.0735668Z       "name": "flash_attention"
2026-02-21T14:00:34.0735908Z     },
2026-02-21T14:00:34.0736136Z     "metric": {
2026-02-21T14:00:34.0736486Z       "name": "torch_compile_accuracy",
2026-02-21T14:00:34.0736740Z       "benchmark_values": [
2026-02-21T14:00:34.0736999Z         0.0,
2026-02-21T14:00:34.0737162Z         0.0,
2026-02-21T14:00:34.0737317Z         1.0,
2026-02-21T14:00:34.0737476Z         1.0,
2026-02-21T14:00:34.0737631Z         1.0,
2026-02-21T14:00:34.0737800Z         1.0
2026-02-21T14:00:34.0737953Z       ]
2026-02-21T14:00:34.0738107Z     },
2026-02-21T14:00:34.0738289Z     "shape": [
2026-02-21T14:00:34.0738466Z       "(4, 48, 128, 128, 128)",
2026-02-21T14:00:34.0738668Z       "(4, 48, 256, 256, 128)",
2026-02-21T14:00:34.0738821Z       "(4, 48, 512, 512, 128)",
2026-02-21T14:00:34.0738992Z       "(4, 48, 2048, 2048, 128)",
2026-02-21T14:00:34.0739149Z       "(4, 48, 4096, 4096, 128)",
2026-02-21T14:00:34.0739306Z       "(4, 48, 8192, 8192, 128)"
2026-02-21T14:00:34.0739453Z     ]
2026-02-21T14:00:34.0739562Z   },
2026-02-21T14:00:34.0739695Z   {
2026-02-21T14:00:34.0739805Z     "benchmark": {
2026-02-21T14:00:34.0739947Z       "name": "Helion Benchmark",
2026-02-21T14:00:34.0740108Z       "extra_info": {
2026-02-21T14:00:34.0740290Z         "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-"
2026-02-21T14:00:34.0740518Z       }
2026-02-21T14:00:34.0740624Z     },
2026-02-21T14:00:34.0740732Z     "model": {
2026-02-21T14:00:34.0740875Z       "name": "flash_attention"
2026-02-21T14:00:34.0741018Z     },
2026-02-21T14:00:34.0741163Z     "metric": {
2026-02-21T14:00:34.0741291Z       "name": "helion_speedup",
2026-02-21T14:00:34.0741450Z       "benchmark_values": [
2026-02-21T14:00:34.0741596Z         3.2949895320648634,
2026-02-21T14:00:34.0741742Z         3.2995974823628726,
2026-02-21T14:00:34.0742014Z         3.6723164512582986,
2026-02-21T14:00:34.0742160Z         5.097737134414277,
2026-02-21T14:00:34.0742307Z         5.694853907742621,
2026-02-21T14:00:34.0742447Z         6.072705940751335
2026-02-21T14:00:34.0742640Z       ]
2026-02-21T14:00:34.0742772Z     },
2026-02-21T14:00:34.0742881Z     "shape": [
2026-02-21T14:00:34.0743012Z       "(4, 48, 128, 128, 128)",
2026-02-21T14:00:34.0743160Z       "(4, 48, 256, 256, 128)",
2026-02-21T14:00:34.0743306Z       "(4, 48, 512, 512, 128)",
2026-02-21T14:00:34.0743477Z       "(4, 48, 2048, 2048, 128)",
2026-02-21T14:00:34.0743637Z       "(4, 48, 4096, 4096, 128)",
2026-02-21T14:00:34.0743795Z       "(4, 48, 8192, 8192, 128)"
2026-02-21T14:00:34.0743943Z     ]
2026-02-21T14:00:34.0744046Z   },
2026-02-21T14:00:34.0744151Z   {
2026-02-21T14:00:34.0744257Z     "benchmark": {
2026-02-21T14:00:34.0744398Z       "name": "Helion Benchmark",
2026-02-21T14:00:34.0744553Z       "extra_info": {
2026-02-21T14:00:34.0744727Z         "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-"
2026-02-21T14:00:34.0744926Z       }
2026-02-21T14:00:34.0745033Z     },
2026-02-21T14:00:34.0745142Z     "model": {
2026-02-21T14:00:34.0745305Z       "name": "flash_attention"
2026-02-21T14:00:34.0745481Z     },
2026-02-21T14:00:34.0745587Z     "metric": {
2026-02-21T14:00:34.0745720Z       "name": "helion_accuracy",
2026-02-21T14:00:34.0745881Z       "benchmark_values": [
2026-02-21T14:00:34.0746022Z         0.0,
2026-02-21T14:00:34.0746157Z         0.0,
2026-02-21T14:00:34.0746272Z         0.0,
2026-02-21T14:00:34.0746387Z         0.0,
2026-02-21T14:00:34.0746505Z         0.0,
2026-02-21T14:00:34.0746612Z         0.0
2026-02-21T14:00:34.0746755Z       ]
2026-02-21T14:00:34.0746872Z     },
2026-02-21T14:00:34.0746999Z     "shape": [
2026-02-21T14:00:34.0747129Z       "(4, 48, 128, 128, 128)",
2026-02-21T14:00:34.0747338Z       "(4, 48, 256, 256, 128)",
2026-02-21T14:00:34.0747487Z       "(4, 48, 512, 512, 128)",
2026-02-21T14:00:34.0747635Z       "(4, 48, 2048, 2048, 128)",
2026-02-21T14:00:34.0747789Z       "(4, 48, 4096, 4096, 128)",
2026-02-21T14:00:34.0747940Z       "(4, 48, 8192, 8192, 128)"
2026-02-21T14:00:34.0748083Z     ]
2026-02-21T14:00:34.0748190Z   }
2026-02-21T14:00:34.0767182Z ]
2026-02-21T14:00:34.0823565Z ##[group]Run pytorch/test-infra/.github/actions/gather-benchmark-metadata@main
2026-02-21T14:00:34.0823771Z with:
2026-02-21T14:00:34.0824158Z   github-token: ***
2026-02-21T14:00:34.0824256Z   venv: .venv/bin/activate
2026-02-21T14:00:34.0824353Z   schema-version: v3
2026-02-21T14:00:34.0824443Z env:
2026-02-21T14:00:34.0824528Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:34.0824661Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.0824829Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:34.0824990Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.0825129Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.0825269Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.0825414Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:34.0825572Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:34.0825726Z ##[endgroup]
2026-02-21T14:00:34.0872755Z ##[group]Run set -eux
2026-02-21T14:00:34.0872860Z set -eux
2026-02-21T14:00:34.0872943Z 
2026-02-21T14:00:34.0873033Z if [[ -z "${GITHUB_TOKEN}" ]]; then
2026-02-21T14:00:34.0873167Z   echo "Missing github-token input"
2026-02-21T14:00:34.0873279Z   exit 1
2026-02-21T14:00:34.0873362Z fi
2026-02-21T14:00:34.0873996Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T14:00:34.0874124Z env:
2026-02-21T14:00:34.0874213Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:34.0874339Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.0874498Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:34.0874649Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.0874783Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.0874919Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.0875061Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:34.0875215Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:34.0875437Z   GITHUB_TOKEN: ***
2026-02-21T14:00:34.0875527Z ##[endgroup]
2026-02-21T14:00:34.1259011Z + [[ -z *** ]]
2026-02-21T14:00:34.1338360Z ##[group]Run pytorch/test-infra/.github/actions/get-workflow-job-id@main
2026-02-21T14:00:34.1338529Z with:
2026-02-21T14:00:34.1338696Z   github-token: ***
2026-02-21T14:00:34.1338793Z env:
2026-02-21T14:00:34.1338878Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:34.1339011Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.1339172Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:34.1339330Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.1339465Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.1339605Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.1339748Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:34.1339999Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:34.1340140Z ##[endgroup]
2026-02-21T14:00:34.1347291Z ##[group]Run set -eux
2026-02-21T14:00:34.1347412Z set -eux
2026-02-21T14:00:34.1347495Z 
2026-02-21T14:00:34.1347677Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}"
2026-02-21T14:00:34.1348058Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T14:00:34.1348183Z env:
2026-02-21T14:00:34.1348270Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:34.1348396Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.1348549Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:34.1348703Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.1348834Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.1348967Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:34.1349229Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:34.1349387Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:34.1349590Z   GITHUB_TOKEN: ***
2026-02-21T14:00:34.1349674Z ##[endgroup]
2026-02-21T14:00:34.2615755Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 22253280836 linux.rocm.gpu.gfx942.2-n2gvb-runner-2vdnn
2026-02-21T14:00:37.2159368Z setting job-id=64380329937
2026-02-21T14:00:37.2160125Z setting job-name=run-mi325x (int4_gemm,flash_attention) / benchmark-rocm6.4-int4_gemm,flash_attention-py3.12-mi325x
2026-02-21T14:00:37.2330944Z ##[group]Run set -eux
2026-02-21T14:00:37.2331081Z set -eux
2026-02-21T14:00:37.2331164Z 
2026-02-21T14:00:37.2331266Z if [[ -n ".venv/bin/activate" ]]; then
2026-02-21T14:00:37.2331398Z   source ".venv/bin/activate"
2026-02-21T14:00:37.2331511Z fi
2026-02-21T14:00:37.2331601Z 
2026-02-21T14:00:37.2331759Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \
2026-02-21T14:00:37.2331965Z   --schema-version "${SCHEMA_VERSION}" \
2026-02-21T14:00:37.2332096Z   --repo "${REPO}" \
2026-02-21T14:00:37.2332212Z   --head-branch "${HEAD_BRANCH}" \
2026-02-21T14:00:37.2332333Z   --head-sha "${HEAD_SHA}" \
2026-02-21T14:00:37.2332456Z   --workflow-id "${WORKFLOW_RUN_ID}" \
2026-02-21T14:00:37.2332589Z   --run-attempt "${RUN_ATTEMPT}" \
2026-02-21T14:00:37.2332712Z   --job-id "${JOB_ID}" \
2026-02-21T14:00:37.2332822Z   --job-name "${JOB_NAME}"
2026-02-21T14:00:37.2333023Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T14:00:37.2333232Z env:
2026-02-21T14:00:37.2333315Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:37.2333442Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.2333598Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:37.2333760Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.2333894Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.2334025Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.2334166Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:37.2334325Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:37.2334471Z   SCHEMA_VERSION: v3
2026-02-21T14:00:37.2334565Z   REPO: pytorch/helion
2026-02-21T14:00:37.2334669Z   HEAD_BRANCH: refs/heads/main
2026-02-21T14:00:37.2334793Z   HEAD_SHA: 874a7d0cadab18218a84ad3579d329dc95c51820
2026-02-21T14:00:37.2334926Z   WORKFLOW_RUN_ID: 22253280836
2026-02-21T14:00:37.2335032Z   RUN_ATTEMPT: 1
2026-02-21T14:00:37.2335119Z   JOB_ID: 64380329937
2026-02-21T14:00:37.2335329Z   JOB_NAME: run-mi325x (int4_gemm,flash_attention) / benchmark-rocm6.4-int4_gemm,flash_attention-py3.12-mi325x
2026-02-21T14:00:37.2335697Z ##[endgroup]
2026-02-21T14:00:37.3177297Z + [[ -n .venv/bin/activate ]]
2026-02-21T14:00:37.3178032Z + source .venv/bin/activate
2026-02-21T14:00:37.3179358Z ++ '[' -z '' ']'
2026-02-21T14:00:37.3179609Z ++ '[' -n x ']'
2026-02-21T14:00:37.3179898Z ++ SCRIPT_PATH=.venv/bin/activate
2026-02-21T14:00:37.3180371Z ++ '[' .venv/bin/activate = /__w/_temp/aa47f69b-53c5-4181-bfd7-b91eff01d549.sh ']'
2026-02-21T14:00:37.3180862Z ++ deactivate nondestructive
2026-02-21T14:00:37.3181444Z ++ unset -f pydoc
2026-02-21T14:00:37.3181672Z ++ '[' -z '' ']'
2026-02-21T14:00:37.3181890Z ++ '[' -z '' ']'
2026-02-21T14:00:37.3182107Z ++ hash -r
2026-02-21T14:00:37.3182319Z ++ '[' -z '' ']'
2026-02-21T14:00:37.3182551Z ++ unset VIRTUAL_ENV
2026-02-21T14:00:37.3182814Z ++ unset VIRTUAL_ENV_PROMPT
2026-02-21T14:00:37.3183125Z ++ '[' '!' nondestructive = nondestructive ']'
2026-02-21T14:00:37.3183484Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv
2026-02-21T14:00:37.3183849Z ++ '[' linux-gnu = cygwin ']'
2026-02-21T14:00:37.3184164Z ++ '[' linux-gnu = msys ']'
2026-02-21T14:00:37.3184634Z ++ export VIRTUAL_ENV
2026-02-21T14:00:37.3184894Z ++ '[' -z '' ']'
2026-02-21T14:00:37.3185128Z ++ unset SCRIPT_PATH
2026-02-21T14:00:37.3186117Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T14:00:37.3187884Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T14:00:37.3188621Z ++ export PATH
2026-02-21T14:00:37.3188733Z ++ '[' xhelion '!=' x ']'
2026-02-21T14:00:37.3188874Z ++ VIRTUAL_ENV_PROMPT=helion
2026-02-21T14:00:37.3189014Z ++ export VIRTUAL_ENV_PROMPT
2026-02-21T14:00:37.3189146Z ++ '[' -z '' ']'
2026-02-21T14:00:37.3189259Z ++ '[' -z '' ']'
2026-02-21T14:00:37.3189371Z ++ _OLD_VIRTUAL_PS1=
2026-02-21T14:00:37.3189497Z ++ PS1='(helion) '
2026-02-21T14:00:37.3189609Z ++ export PS1
2026-02-21T14:00:37.3189727Z ++ alias pydoc
2026-02-21T14:00:37.3189840Z ++ true
2026-02-21T14:00:37.3189942Z ++ hash -r
2026-02-21T14:00:37.3190844Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-benchmark-metadata/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/helion --head-branch refs/heads/main --head-sha 874a7d0cadab18218a84ad3579d329dc95c51820 --workflow-id 22253280836 --run-attempt 1 --job-id 64380329937 --job-name 'run-mi325x (int4_gemm,flash_attention) / benchmark-rocm6.4-int4_gemm,flash_attention-py3.12-mi325x'
2026-02-21T14:00:37.3470108Z ##[group]Run pytorch/test-infra/.github/actions/gather-runners-info@main
2026-02-21T14:00:37.3470291Z with:
2026-02-21T14:00:37.3470387Z   venv: .venv/bin/activate
2026-02-21T14:00:37.3470486Z env:
2026-02-21T14:00:37.3470575Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:37.3470725Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.3470883Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:37.3471045Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.3471182Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.3471324Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.3471461Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:37.3471628Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:37.3471769Z ##[endgroup]
2026-02-21T14:00:37.3478783Z ##[group]Run set -eux
2026-02-21T14:00:37.3478896Z set -eux
2026-02-21T14:00:37.3478993Z 
2026-02-21T14:00:37.3479079Z if command -v nvidia-smi; then
2026-02-21T14:00:37.3479197Z   DEVICE_NAME=cuda
2026-02-21T14:00:37.3479296Z   nvidia-smi
2026-02-21T14:00:37.3479397Z elif command -v rocm-smi; then
2026-02-21T14:00:37.3479603Z   DEVICE_NAME=rocm
2026-02-21T14:00:37.3479700Z   rocm-smi
2026-02-21T14:00:37.3479797Z elif command -v hl-smi; then
2026-02-21T14:00:37.3479910Z   DEVICE_NAME=hpu
2026-02-21T14:00:37.3480007Z   hl-smi
2026-02-21T14:00:37.3480085Z else
2026-02-21T14:00:37.3480171Z   arch=$(uname -m)
2026-02-21T14:00:37.3480264Z 
2026-02-21T14:00:37.3480343Z   case "$arch" in
2026-02-21T14:00:37.3480490Z     aarch64|arm64)
2026-02-21T14:00:37.3480596Z       DEVICE_NAME=arm64-cpu
2026-02-21T14:00:37.3480700Z       ;;
2026-02-21T14:00:37.3480779Z     *)
2026-02-21T14:00:37.3480866Z       DEVICE_NAME=cpu
2026-02-21T14:00:37.3480965Z       ;;
2026-02-21T14:00:37.3481043Z   esac
2026-02-21T14:00:37.3481121Z   lscpu
2026-02-21T14:00:37.3481203Z fi
2026-02-21T14:00:37.3481306Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV
2026-02-21T14:00:37.3481544Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T14:00:37.3481665Z env:
2026-02-21T14:00:37.3481750Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:37.3481874Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.3482025Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:37.3482177Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.3482308Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.3482442Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.3482632Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:37.3482787Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:37.3482930Z ##[endgroup]
2026-02-21T14:00:37.4374231Z + command -v nvidia-smi
2026-02-21T14:00:37.4374504Z + command -v rocm-smi
2026-02-21T14:00:37.4381104Z /usr/bin/rocm-smi
2026-02-21T14:00:37.4381290Z + DEVICE_NAME=rocm
2026-02-21T14:00:37.4381445Z + rocm-smi
2026-02-21T14:00:37.5990972Z 
2026-02-21T14:00:37.5991025Z 
2026-02-21T14:00:37.5991489Z ============================================ ROCm System Management Interface ============================================
2026-02-21T14:00:37.5992221Z ====================================================== Concise Info ======================================================
2026-02-21T14:00:37.5992964Z Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf  PwrCap   VRAM%  GPU%  
2026-02-21T14:00:37.5994043Z               (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
2026-02-21T14:00:37.5995030Z ==========================================================================================================================
2026-02-21T14:00:37.5996243Z 0       3     0x74a5,   51110  33.0°C      125.0W    NPS1, SPX, 0        134Mhz  900Mhz  0%   auto  1000.0W  0%     0%    
2026-02-21T14:00:37.5997123Z 1       5     0x74a5,   2987   32.0°C      132.0W    NPS1, SPX, 0        133Mhz  900Mhz  0%   auto  1000.0W  0%     0%    
2026-02-21T14:00:37.5997949Z 2       4     0x74a5,   61326  32.0°C      133.0W    NPS1, SPX, 0        133Mhz  900Mhz  0%   auto  1000.0W  0%     0%    
2026-02-21T14:00:37.5998712Z 3       2     0x74a5,   9091   48.0°C      144.0W    NPS1, SPX, 0        133Mhz  900Mhz  0%   auto  1000.0W  0%     0%    
2026-02-21T14:00:37.5999270Z 4       7     0x74a5,   26567  31.0°C      129.0W    NPS1, SPX, 0        133Mhz  900Mhz  0%   auto  1000.0W  0%     0%    
2026-02-21T14:00:37.5999805Z 5       9     0x74a5,   43978  36.0°C      137.0W    NPS1, SPX, 0        133Mhz  900Mhz  0%   auto  1000.0W  0%     0%    
2026-02-21T14:00:37.6000321Z 6       8     0x74a5,   20463  31.0°C      129.0W    NPS1, SPX, 0        134Mhz  900Mhz  0%   auto  1000.0W  0%     0%    
2026-02-21T14:00:37.6000836Z 7       6     0x74a5,   33762  34.0°C      135.0W    NPS1, SPX, 0        133Mhz  900Mhz  0%   auto  1000.0W  0%     0%    
2026-02-21T14:00:37.6001296Z ==========================================================================================================================
2026-02-21T14:00:37.6001659Z ================================================== End of ROCm SMI Log ===================================================
2026-02-21T14:00:37.6054374Z + echo DEVICE_NAME=rocm
2026-02-21T14:00:37.6091413Z ##[group]Run set -eux
2026-02-21T14:00:37.6091548Z set -eux
2026-02-21T14:00:37.6091695Z 
2026-02-21T14:00:37.6091793Z if [[ "${DEVICE_NAME}" == "cuda" ]]; then
2026-02-21T14:00:37.6091936Z   # Return the same device name as PyTorch
2026-02-21T14:00:37.6092158Z   DEVICE_TYPE=$(nvidia-smi -i 0 --query-gpu=name --format=csv,noheader)
2026-02-21T14:00:37.6092346Z elif [[ "${DEVICE_NAME}" == "rocm" ]]; then
2026-02-21T14:00:37.6092552Z   DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs)
2026-02-21T14:00:37.6092779Z elif [[ "${DEVICE_NAME}" == "hpu" ]]; then
2026-02-21T14:00:37.6093026Z   DEVICE_TYPE="Intel Gaudi3 "$(hl-smi -q | grep "Product Name" | head -n 1 | awk -F ':' '{print $2}' | sed 's/^ *//')
2026-02-21T14:00:37.6093262Z elif [[ "${DEVICE_NAME}" == "cpu" ]]; then
2026-02-21T14:00:37.6093738Z   DEVICE_TYPE="$(lscpu | grep "Model name" | sed -E 's/.*Model name:[[:space:]]*//; s/Intel\(R\)//g; s/\(R\)//g; s/\(TM\)//g; s/CPU//g; s/Processor//g; s/[[:space:]]+/ /g; s/^ //; s/ $//; s/ /_/g')_$(awk -F: '/Core\(s\) per socket/ {c=$2} /Socket\(s\)/ {s=$2} END {gsub(/ /,"",c); gsub(/ /,"",s); printf "%sc", c*s}' < <(lscpu))"
2026-02-21T14:00:37.6094198Z elif [[ "${DEVICE_NAME}" == "arm64-cpu" ]]; then
2026-02-21T14:00:37.6094424Z   DEVICE_TYPE=$(lscpu | grep 'Vendor ID' | cut -f 2 -d ":" | awk '{$1=$1}1' | cut -f 2 -d " ")
2026-02-21T14:00:37.6094611Z fi
2026-02-21T14:00:37.6094716Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV
2026-02-21T14:00:37.6095049Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T14:00:37.6095171Z env:
2026-02-21T14:00:37.6095285Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:37.6095410Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.6095567Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:37.6095723Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.6095855Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.6095990Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.6096128Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:37.6096389Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:37.6096522Z   DEVICE_NAME: rocm
2026-02-21T14:00:37.6096628Z ##[endgroup]
2026-02-21T14:00:37.6814587Z + [[ rocm == \c\u\d\a ]]
2026-02-21T14:00:37.6814795Z + [[ rocm == \r\o\c\m ]]
2026-02-21T14:00:37.6819326Z ++ rocminfo
2026-02-21T14:00:37.6832743Z ++ grep 'Marketing Name'
2026-02-21T14:00:37.6832906Z ++ tail -n1
2026-02-21T14:00:37.6833034Z ++ awk -F: '{print $2}'
2026-02-21T14:00:37.6833157Z ++ xargs
2026-02-21T14:00:37.8702643Z + DEVICE_TYPE='AMD Instinct MI325X'
2026-02-21T14:00:37.8703369Z + echo 'DEVICE_TYPE=AMD Instinct MI325X'
2026-02-21T14:00:37.8745293Z ##[group]Run set -eux
2026-02-21T14:00:37.8745507Z set -eux
2026-02-21T14:00:37.8745687Z 
2026-02-21T14:00:37.8745868Z if [[ -n ".venv/bin/activate" ]]; then
2026-02-21T14:00:37.8746102Z   source ".venv/bin/activate"
2026-02-21T14:00:37.8746321Z fi
2026-02-21T14:00:37.8746485Z 
2026-02-21T14:00:37.8746711Z python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82
2026-02-21T14:00:37.8747100Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py"
2026-02-21T14:00:37.8747485Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T14:00:37.8747715Z env:
2026-02-21T14:00:37.8747983Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:37.8748197Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.8748421Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:37.8748625Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.8748828Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.8749015Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:37.8749207Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:37.8749491Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:37.8749681Z   DEVICE_NAME: rocm
2026-02-21T14:00:37.8749928Z   DEVICE_TYPE: AMD Instinct MI325X
2026-02-21T14:00:37.8750086Z ##[endgroup]
2026-02-21T14:00:37.9736568Z + [[ -n .venv/bin/activate ]]
2026-02-21T14:00:37.9737006Z + source .venv/bin/activate
2026-02-21T14:00:37.9737283Z ++ '[' -z '' ']'
2026-02-21T14:00:37.9737497Z ++ '[' -n x ']'
2026-02-21T14:00:37.9737763Z ++ SCRIPT_PATH=.venv/bin/activate
2026-02-21T14:00:37.9738162Z ++ '[' .venv/bin/activate = /__w/_temp/4f24e081-2409-4924-8914-94b31dabd3e3.sh ']'
2026-02-21T14:00:37.9738587Z ++ deactivate nondestructive
2026-02-21T14:00:37.9738821Z ++ unset -f pydoc
2026-02-21T14:00:37.9739028Z ++ '[' -z '' ']'
2026-02-21T14:00:37.9739249Z ++ '[' -z '' ']'
2026-02-21T14:00:37.9739432Z ++ hash -r
2026-02-21T14:00:37.9739663Z ++ '[' -z '' ']'
2026-02-21T14:00:37.9739862Z ++ unset VIRTUAL_ENV
2026-02-21T14:00:37.9740102Z ++ unset VIRTUAL_ENV_PROMPT
2026-02-21T14:00:37.9740383Z ++ '[' '!' nondestructive = nondestructive ']'
2026-02-21T14:00:37.9740695Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv
2026-02-21T14:00:37.9741074Z ++ '[' linux-gnu = cygwin ']'
2026-02-21T14:00:37.9741362Z ++ '[' linux-gnu = msys ']'
2026-02-21T14:00:37.9742856Z ++ export VIRTUAL_ENV
2026-02-21T14:00:37.9743113Z ++ '[' -z '' ']'
2026-02-21T14:00:37.9743402Z ++ unset SCRIPT_PATH
2026-02-21T14:00:37.9744166Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T14:00:37.9745467Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2026-02-21T14:00:37.9746291Z ++ export PATH
2026-02-21T14:00:37.9746517Z ++ '[' xhelion '!=' x ']'
2026-02-21T14:00:37.9746765Z ++ VIRTUAL_ENV_PROMPT=helion
2026-02-21T14:00:37.9747062Z ++ export VIRTUAL_ENV_PROMPT
2026-02-21T14:00:37.9747286Z ++ '[' -z '' ']'
2026-02-21T14:00:37.9747682Z ++ '[' -z '' ']'
2026-02-21T14:00:37.9747905Z ++ _OLD_VIRTUAL_PS1=
2026-02-21T14:00:37.9748132Z ++ PS1='(helion) '
2026-02-21T14:00:37.9748330Z ++ export PS1
2026-02-21T14:00:37.9748561Z ++ alias pydoc
2026-02-21T14:00:37.9748756Z ++ true
2026-02-21T14:00:37.9748964Z ++ hash -r
2026-02-21T14:00:37.9749264Z + python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82
2026-02-21T14:00:38.5017821Z Collecting psutil==7.0.0
2026-02-21T14:00:38.5530433Z   Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
2026-02-21T14:00:38.5680216Z Collecting nvidia-ml-py==13.580.82
2026-02-21T14:00:38.5735666Z   Downloading nvidia_ml_py-13.580.82-py3-none-any.whl.metadata (9.6 kB)
2026-02-21T14:00:38.5810555Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB)
2026-02-21T14:00:38.6123171Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl (49 kB)
2026-02-21T14:00:38.6755561Z Installing collected packages: nvidia-ml-py, psutil
2026-02-21T14:00:38.6760539Z   Attempting uninstall: nvidia-ml-py
2026-02-21T14:00:38.6774759Z     Found existing installation: nvidia-ml-py 13.590.48
2026-02-21T14:00:38.6784068Z     Uninstalling nvidia-ml-py-13.590.48:
2026-02-21T14:00:38.7615284Z       Successfully uninstalled nvidia-ml-py-13.590.48
2026-02-21T14:00:38.7889776Z   Attempting uninstall: psutil
2026-02-21T14:00:38.7906240Z     Found existing installation: psutil 7.2.2
2026-02-21T14:00:38.7918843Z     Uninstalling psutil-7.2.2:
2026-02-21T14:00:38.7922305Z       Successfully uninstalled psutil-7.2.2
2026-02-21T14:00:38.8773520Z 
2026-02-21T14:00:38.8778315Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
2026-02-21T14:00:38.8779568Z tritonbench 0.0.1 requires triton, which is not installed.
2026-02-21T14:00:38.8797703Z Successfully installed nvidia-ml-py-13.580.82 psutil-7.0.0
2026-02-21T14:00:38.9777953Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-runners-info/../../scripts/benchmarks/gather_runners_info.py
2026-02-21T14:00:47.7336037Z ##[group]Run pytorch/test-infra/.github/actions/gather-dependencies@main
2026-02-21T14:00:47.7336225Z with:
2026-02-21T14:00:47.7336310Z   venv: .venv/bin/activate
2026-02-21T14:00:47.7336407Z env:
2026-02-21T14:00:47.7336489Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:47.7336618Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.7336775Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:47.7336927Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.7337062Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.7337201Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.7337342Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:47.7337496Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:47.7337631Z   DEVICE_NAME: rocm
2026-02-21T14:00:47.7337727Z   DEVICE_TYPE: AMD Instinct MI325X
2026-02-21T14:00:47.7337832Z ##[endgroup]
2026-02-21T14:00:47.7344415Z ##[group]Run set -eux
2026-02-21T14:00:47.7344527Z set -eux
2026-02-21T14:00:47.7344616Z 
2026-02-21T14:00:47.7344715Z # TODO (huydhn): Implement this part
2026-02-21T14:00:47.7344857Z echo "dependencies={}" >> "${GITHUB_OUTPUT}"
2026-02-21T14:00:47.7345065Z shell: bash --noprofile --norc -e -o pipefail {0}
2026-02-21T14:00:47.7345185Z env:
2026-02-21T14:00:47.7345272Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:47.7345396Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.7345549Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:47.7345707Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.7345839Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.7345975Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.7346112Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:47.7346272Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:47.7346408Z   DEVICE_NAME: rocm
2026-02-21T14:00:47.7346508Z   DEVICE_TYPE: AMD Instinct MI325X
2026-02-21T14:00:47.7346618Z ##[endgroup]
2026-02-21T14:00:47.8097460Z + echo 'dependencies={}'
2026-02-21T14:00:47.8157351Z ##[group]Run actions/upload-artifact@v6
2026-02-21T14:00:47.8157487Z with:
2026-02-21T14:00:47.8157611Z   name: benchmark-results-mi325x-int4_gemm,flash_attention
2026-02-21T14:00:47.8157759Z   path: test/test-reports
2026-02-21T14:00:47.8157861Z   if-no-files-found: warn
2026-02-21T14:00:47.8157965Z   compression-level: 6
2026-02-21T14:00:47.8158064Z   overwrite: false
2026-02-21T14:00:47.8158158Z   include-hidden-files: false
2026-02-21T14:00:47.8158256Z env:
2026-02-21T14:00:47.8158339Z   HELION_AUTOTUNE_LOG_LEVEL: INFO
2026-02-21T14:00:47.8158463Z   pythonLocation: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.8158618Z   PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig
2026-02-21T14:00:47.8158773Z   Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.8158906Z   Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.8159171Z   Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64
2026-02-21T14:00:47.8159319Z   LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib
2026-02-21T14:00:47.8159477Z   UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python
2026-02-21T14:00:47.8159614Z   DEVICE_NAME: rocm
2026-02-21T14:00:47.8159705Z   DEVICE_TYPE: AMD Instinct MI325X
2026-02-21T14:00:47.8159813Z ##[endgroup]
2026-02-21T14:00:47.8161576Z ##[command]/usr/bin/docker exec  9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T14:00:47.9673928Z With the provided path, there will be 1 file uploaded
2026-02-21T14:00:47.9676122Z Artifact name is valid!
2026-02-21T14:00:47.9676253Z Root directory input is valid!
2026-02-21T14:00:53.2440670Z Beginning upload of artifact content to blob storage
2026-02-21T14:00:55.6810262Z Uploaded bytes 586
2026-02-21T14:00:55.7160646Z Finished uploading artifact content to blob storage!
2026-02-21T14:00:55.7161372Z SHA256 digest of uploaded artifact zip is 42ed98c5c494960b63600f8bc3f969c4bb6eaafe9f6a44cc13b7347ea28c1d8c
2026-02-21T14:00:55.7162035Z Finalizing artifact upload
2026-02-21T14:00:55.8682943Z Artifact benchmark-results-mi325x-int4_gemm,flash_attention.zip successfully finalized. Artifact ID 5601585868
2026-02-21T14:00:55.8684036Z Artifact benchmark-results-mi325x-int4_gemm,flash_attention has been successfully uploaded! Final size is 586 bytes. Artifact ID is 5601585868
2026-02-21T14:00:55.8685091Z Artifact download URL: https://github.com/pytorch/helion/actions/runs/22253280836/artifacts/5601585868
2026-02-21T14:00:55.8852819Z Post job cleanup.
2026-02-21T14:00:55.8855628Z ##[command]/usr/bin/docker exec  9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T14:00:56.0269098Z UV_PYTHON_INSTALL_DIR is already set to /github/home/.local/share/uv/python
2026-02-21T14:00:56.0270751Z (node:871989) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
2026-02-21T14:00:56.0271215Z (Use `node --trace-deprecation ...` to show where the warning was created)
2026-02-21T14:00:56.0406283Z Post job cleanup.
2026-02-21T14:00:56.0408320Z ##[command]/usr/bin/docker exec  9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T14:00:56.2314033Z Post job cleanup.
2026-02-21T14:00:56.2316190Z ##[command]/usr/bin/docker exec  9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54 sh -c "cat /etc/*release | grep ^ID"
2026-02-21T14:00:56.3761231Z [command]/usr/bin/git version
2026-02-21T14:00:56.3789584Z git version 2.43.0
2026-02-21T14:00:56.3821747Z Temporarily overriding HOME='/__w/_temp/87c4a125-c580-4df0-9e3f-e3635800cd14' before making global git config changes
2026-02-21T14:00:56.3822122Z Adding repository directory to the temporary git global config as a safe directory
2026-02-21T14:00:56.3824325Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion
2026-02-21T14:00:56.3843318Z Removing SSH command configuration
2026-02-21T14:00:56.3846121Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2026-02-21T14:00:56.3862728Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2026-02-21T14:00:56.4071493Z Removing HTTP extra header
2026-02-21T14:00:56.4073650Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2026-02-21T14:00:56.4093109Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2026-02-21T14:00:56.4244860Z Removing includeIf entries pointing to credentials config files
2026-02-21T14:00:56.4247605Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir:
2026-02-21T14:00:56.4265787Z includeif.gitdir:/__w/helion/helion/.git.path
2026-02-21T14:00:56.4266085Z includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path
2026-02-21T14:00:56.4266349Z includeif.gitdir:/github/workspace/.git.path
2026-02-21T14:00:56.4266603Z includeif.gitdir:/github/workspace/.git/worktrees/*.path
2026-02-21T14:00:56.4270388Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git.path
2026-02-21T14:00:56.4288882Z /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T14:00:56.4302310Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T14:00:56.4321481Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path
2026-02-21T14:00:56.4333159Z /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T14:00:56.4337706Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T14:00:56.4351714Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git.path
2026-02-21T14:00:56.4365196Z /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T14:00:56.4372045Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T14:00:56.4387985Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git/worktrees/*.path
2026-02-21T14:00:56.4400037Z /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T14:00:56.4407018Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config
2026-02-21T14:00:56.4423851Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url
2026-02-21T14:00:56.4572829Z Removing credentials config '/__w/_temp/git-credentials-da3b5bf7-5ea6-4760-a997-bbf447a0295b.config'
2026-02-21T14:00:56.4691996Z Stop and remove container: c6efec98bce142318ef5a57c93d78ef5_rocmdevubuntu2404644complete_87ce82
2026-02-21T14:00:56.4694556Z ##[command]/usr/bin/docker rm --force 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54
2026-02-21T14:00:58.4963592Z 9ec7733a29ba0fdf15e8b2285d656f06133a96975307e58e1fac690eb0b7ed54
2026-02-21T14:00:58.4994906Z Remove container network: github_network_2fc8484199d94410beae10a38ccd998e
2026-02-21T14:00:58.4997175Z ##[command]/usr/bin/docker network rm github_network_2fc8484199d94410beae10a38ccd998e
2026-02-21T14:00:59.0057460Z github_network_2fc8484199d94410beae10a38ccd998e
2026-02-21T14:00:59.0117045Z Evaluate and set job outputs
2026-02-21T14:00:59.0121147Z Set output 'benchmark-metadata'
2026-02-21T14:00:59.0122225Z Set output 'runners-info'
2026-02-21T14:00:59.0122491Z Set output 'dependencies'
2026-02-21T14:00:59.0122817Z Cleaning up orphan processes