2026-02-21T14:05:51.4129545Z Current runner version: '2.331.0' 2026-02-21T14:05:51.4132671Z Runner name: 'linux.rocm.gpu.gfx942.2-n2gvb-runner-kpc8w' 2026-02-21T14:05:51.4133062Z Runner group name: 'default' 2026-02-21T14:05:51.4133474Z Machine name: 'linux' 2026-02-21T14:05:51.4134589Z ##[group]GITHUB_TOKEN Permissions 2026-02-21T14:05:51.4135626Z Contents: read 2026-02-21T14:05:51.4135861Z Metadata: read 2026-02-21T14:05:51.4136119Z ##[endgroup] 2026-02-21T14:05:51.4137198Z Secret source: Actions 2026-02-21T14:05:51.4137588Z Prepare workflow directory 2026-02-21T14:05:51.4405269Z Prepare all required actions 2026-02-21T14:05:51.4424618Z Getting action download info 2026-02-21T14:05:51.8076101Z Download action repository 'actions/checkout@v6' (SHA:de0fac2e4500dabe0009e67214ff5f5447ce83dd) 2026-02-21T14:05:52.2560908Z Download action repository 'actions/setup-python@v6' (SHA:a309ff8b426b58ec0e2a45f0f869d46889d02405) 2026-02-21T14:05:52.7389588Z Download action repository 'astral-sh/setup-uv@v7' (SHA:eac588ad8def6316056a12d4907a9d4d84ff7a3b) 2026-02-21T14:05:53.2651009Z Download action repository 'pytorch/test-infra@main' (SHA:bb8f04ff3961233c844fde6533c7c6c5f0857909) 2026-02-21T14:05:54.0186936Z Download action repository 'actions/upload-artifact@v6' (SHA:b7c566a772e6b6bfb58ed0dc250532a479d7789f) 2026-02-21T14:05:54.5809933Z Getting action download info 2026-02-21T14:05:54.7562484Z Uses: pytorch/helion/.github/workflows/benchmark.yml@refs/heads/main (874a7d0cadab18218a84ad3579d329dc95c51820) 2026-02-21T14:05:54.7564608Z ##[group] Inputs 2026-02-21T14:05:54.7564748Z runner: linux.rocm.gpu.gfx942.2 2026-02-21T14:05:54.7564875Z python-version: 3.12 2026-02-21T14:05:54.7564999Z image: rocm/dev-ubuntu-24.04:6.4.4-complete 2026-02-21T14:05:54.7565126Z runtime-version: rocm6.4 2026-02-21T14:05:54.7565283Z container-options: --device=/dev/kfd --device=/dev/dri 2026-02-21T14:05:54.7565431Z alias: mi325x 2026-02-21T14:05:54.7565529Z kernels: layer_norm,gemm 
2026-02-21T14:05:54.7565643Z env-vars: 2026-02-21T14:05:54.7565750Z custom-args: 2026-02-21T14:05:54.7566016Z run_h100: true 2026-02-21T14:05:54.7566102Z run_b200: true 2026-02-21T14:05:54.7566198Z run_mi325x: true 2026-02-21T14:05:54.7566293Z ##[endgroup] 2026-02-21T14:05:54.7566485Z Complete job name: run-mi325x (layer_norm,gemm) / benchmark-rocm6.4-layer_norm,gemm-py3.12-mi325x 2026-02-21T14:05:54.7782514Z ##[group]Checking docker version 2026-02-21T14:05:54.7789161Z ##[command]/usr/bin/docker version --format '{{.Server.APIVersion}}' 2026-02-21T14:05:54.7943104Z '1.51' 2026-02-21T14:05:54.7950191Z Docker daemon API version: '1.51' 2026-02-21T14:05:54.7950464Z ##[command]/usr/bin/docker version --format '{{.Client.APIVersion}}' 2026-02-21T14:05:54.8100320Z '1.51' 2026-02-21T14:05:54.8108620Z Docker client API version: '1.51' 2026-02-21T14:05:54.8111256Z ##[endgroup] 2026-02-21T14:05:54.8112379Z ##[group]Clean up resources from previous jobs 2026-02-21T14:05:54.8114565Z ##[command]/usr/bin/docker ps --all --quiet --no-trunc --filter "label=d525be" 2026-02-21T14:05:54.8252375Z ##[command]/usr/bin/docker network prune --force --filter "label=d525be" 2026-02-21T14:05:54.8371353Z ##[endgroup] 2026-02-21T14:05:54.8371486Z ##[group]Create local container network 2026-02-21T14:05:54.8375848Z ##[command]/usr/bin/docker network create --label d525be github_network_52abf34bb8e94476be1b321dcdd72aa0 2026-02-21T14:05:54.8678941Z d262add1f3a38a5291583e229142964e744f001aebeff11b077cdc0afd76bef1 2026-02-21T14:05:54.8688981Z ##[endgroup] 2026-02-21T14:05:54.8706728Z ##[group]Starting job container 2026-02-21T14:05:54.8717766Z ##[command]/usr/bin/docker pull rocm/dev-ubuntu-24.04:6.4.4-complete 2026-02-21T14:05:55.2983476Z 6.4.4-complete: Pulling from rocm/dev-ubuntu-24.04 2026-02-21T14:05:55.2983983Z 953cdd413371: Pulling fs layer 2026-02-21T14:05:55.2984286Z 3a9c27801271: Pulling fs layer 2026-02-21T14:05:55.2984697Z 4c8a4cb43e3b: Pulling fs layer 2026-02-21T14:05:55.2984967Z 
624e685c2697: Pulling fs layer 2026-02-21T14:05:55.2985234Z 624e685c2697: Waiting 2026-02-21T14:05:55.4319106Z 3a9c27801271: Verifying Checksum 2026-02-21T14:05:55.4319457Z 3a9c27801271: Download complete 2026-02-21T14:05:55.5452153Z 624e685c2697: Verifying Checksum 2026-02-21T14:05:55.5452394Z 624e685c2697: Download complete 2026-02-21T14:05:55.6089485Z 953cdd413371: Verifying Checksum 2026-02-21T14:05:55.6089795Z 953cdd413371: Download complete 2026-02-21T14:05:56.1325503Z 953cdd413371: Pull complete 2026-02-21T14:05:56.2211277Z 3a9c27801271: Pull complete 2026-02-21T14:06:21.1353883Z 4c8a4cb43e3b: Verifying Checksum 2026-02-21T14:06:21.1354363Z 4c8a4cb43e3b: Download complete 2026-02-21T14:07:00.4755719Z 4c8a4cb43e3b: Pull complete 2026-02-21T14:07:00.5634015Z 624e685c2697: Pull complete 2026-02-21T14:07:00.5653589Z Digest: sha256:31418ac10a3769a71eaef330c07280d1d999d7074621339b8f93c484c35f6078 2026-02-21T14:07:00.5658155Z Status: Downloaded newer image for rocm/dev-ubuntu-24.04:6.4.4-complete 2026-02-21T14:07:00.5667537Z docker.io/rocm/dev-ubuntu-24.04:6.4.4-complete 2026-02-21T14:07:00.5726537Z ##[command]/usr/bin/docker create --name 516247e566694a1eb38ddd3134e58d44_rocmdevubuntu2404644complete_8093c4 --label d525be --workdir /__w/helion/helion --network github_network_52abf34bb8e94476be1b321dcdd72aa0 --device=/dev/kfd --device=/dev/dri -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/_work":"/__w" -v "/home/runner/externals":"/__e":ro -v "/home/runner/_work/_temp":"/__w/_temp" -v "/home/runner/_work/_actions":"/__w/_actions" -v "/home/runner/_work/_tool":"/__w/_tool" -v "/home/runner/_work/_temp/_github_home":"/github/home" -v "/home/runner/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" rocm/dev-ubuntu-24.04:6.4.4-complete "-f" "/dev/null" 2026-02-21T14:07:00.7530673Z bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b 2026-02-21T14:07:00.7554313Z 
##[command]/usr/bin/docker start bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b 2026-02-21T14:07:00.9143199Z bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b 2026-02-21T14:07:00.9167352Z ##[command]/usr/bin/docker ps --all --filter id=bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b --filter status=running --no-trunc --format "{{.ID}} {{.Status}}" 2026-02-21T14:07:00.9289508Z bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b Up Less than a second 2026-02-21T14:07:00.9316037Z ##[command]/usr/bin/docker inspect --format "{{range .Config.Env}}{{println .}}{{end}}" bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b 2026-02-21T14:07:00.9437435Z HOME=/github/home 2026-02-21T14:07:00.9447287Z GITHUB_ACTIONS=true 2026-02-21T14:07:00.9447593Z CI=true 2026-02-21T14:07:00.9447811Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T14:07:00.9468031Z ##[endgroup] 2026-02-21T14:07:00.9474272Z ##[group]Waiting for all services to be ready 2026-02-21T14:07:00.9475208Z ##[endgroup] 2026-02-21T14:07:00.9574569Z ##[group]Run echo "Detected ROCm image" 2026-02-21T14:07:00.9574765Z echo "Detected ROCm image" 2026-02-21T14:07:00.9574913Z rocminfo || echo "rocminfo not found" 2026-02-21T14:07:00.9576470Z shell: bash -l {0} 2026-02-21T14:07:00.9576635Z env: 2026-02-21T14:07:00.9576724Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:07:00.9576848Z ##[endgroup] 2026-02-21T14:07:00.9957311Z Detected ROCm image 2026-02-21T14:07:01.1556513Z ROCk module version 6.12.12 is loaded 2026-02-21T14:07:01.1556973Z ===================== 2026-02-21T14:07:01.1557301Z HSA System Attributes 2026-02-21T14:07:01.1557570Z ===================== 2026-02-21T14:07:01.1557850Z Runtime Version: 1.15 2026-02-21T14:07:01.1558154Z Runtime Ext Version: 1.7 2026-02-21T14:07:01.1558489Z System Timestamp Freq.: 1000.000000MHz 2026-02-21T14:07:01.1558821Z Sig. 
Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) 2026-02-21T14:07:01.1559054Z Machine Model: LARGE 2026-02-21T14:07:01.1559231Z System Endianness: LITTLE 2026-02-21T14:07:01.1559592Z Mwaitx: DISABLED 2026-02-21T14:07:01.1559707Z XNACK enabled: NO 2026-02-21T14:07:01.1559816Z DMAbuf Support: YES 2026-02-21T14:07:01.1559940Z VMM Support: YES 2026-02-21T14:07:01.1560010Z 2026-02-21T14:07:01.1560051Z ========== 2026-02-21T14:07:01.1560151Z HSA Agents 2026-02-21T14:07:01.1560246Z ========== 2026-02-21T14:07:01.1560341Z ******* 2026-02-21T14:07:01.1560433Z Agent 1 2026-02-21T14:07:01.1560526Z ******* 2026-02-21T14:07:01.1560657Z Name: AMD EPYC 9575F 64-Core Processor 2026-02-21T14:07:01.1560824Z Uuid: CPU-XX 2026-02-21T14:07:01.1561040Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2026-02-21T14:07:01.1561213Z Vendor Name: CPU 2026-02-21T14:07:01.1561408Z Feature: None specified 2026-02-21T14:07:01.1561574Z Profile: FULL_PROFILE 2026-02-21T14:07:01.1561753Z Float Round Mode: NEAR 2026-02-21T14:07:01.1561924Z Max Queue Number: 0(0x0) 2026-02-21T14:07:01.1562087Z Queue Min Size: 0(0x0) 2026-02-21T14:07:01.1562263Z Queue Max Size: 0(0x0) 2026-02-21T14:07:01.1562437Z Queue Type: MULTI 2026-02-21T14:07:01.1562595Z Node: 0 2026-02-21T14:07:01.1562754Z Device Type: CPU 2026-02-21T14:07:01.1562898Z Cache Info: 2026-02-21T14:07:01.1563072Z L1: 49152(0xc000) KB 2026-02-21T14:07:01.1563242Z Chip ID: 0(0x0) 2026-02-21T14:07:01.1563427Z ASIC Revision: 0(0x0) 2026-02-21T14:07:01.1563613Z Cacheline Size: 64(0x40) 2026-02-21T14:07:01.1563786Z Max Clock Freq. (MHz): 3300 2026-02-21T14:07:01.1563943Z BDFID: 0 2026-02-21T14:07:01.1564102Z Internal Node ID: 0 2026-02-21T14:07:01.1564280Z Compute Unit: 128 2026-02-21T14:07:01.1564448Z SIMDs per CU: 0 2026-02-21T14:07:01.1564616Z Shader Engines: 0 2026-02-21T14:07:01.1564782Z Shader Arrs. per Eng.: 0 2026-02-21T14:07:01.1565116Z WatchPts on Addr. 
Ranges:1 2026-02-21T14:07:01.1565371Z Memory Properties: 2026-02-21T14:07:01.1565507Z Features: None 2026-02-21T14:07:01.1565622Z Pool Info: 2026-02-21T14:07:01.1565726Z Pool 1 2026-02-21T14:07:01.1565895Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T14:07:01.1566066Z Size: 1584642144(0x5e73b860) KB 2026-02-21T14:07:01.1566238Z Allocatable: TRUE 2026-02-21T14:07:01.1566420Z Alloc Granule: 4KB 2026-02-21T14:07:01.1566603Z Alloc Recommended Granule:4KB 2026-02-21T14:07:01.1566793Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1566970Z Accessible by all: TRUE 2026-02-21T14:07:01.1567118Z Pool 2 2026-02-21T14:07:01.1567265Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T14:07:01.1567432Z Size: 1584642144(0x5e73b860) KB 2026-02-21T14:07:01.1567636Z Allocatable: TRUE 2026-02-21T14:07:01.1568228Z Alloc Granule: 4KB 2026-02-21T14:07:01.1568403Z Alloc Recommended Granule:4KB 2026-02-21T14:07:01.1568579Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1568751Z Accessible by all: TRUE 2026-02-21T14:07:01.1568892Z Pool 3 2026-02-21T14:07:01.1569031Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2026-02-21T14:07:01.1569194Z Size: 1584642144(0x5e73b860) KB 2026-02-21T14:07:01.1569355Z Allocatable: TRUE 2026-02-21T14:07:01.1569527Z Alloc Granule: 4KB 2026-02-21T14:07:01.1569701Z Alloc Recommended Granule:4KB 2026-02-21T14:07:01.1569886Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1570055Z Accessible by all: TRUE 2026-02-21T14:07:01.1570195Z Pool 4 2026-02-21T14:07:01.1570330Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T14:07:01.1570490Z Size: 1584642144(0x5e73b860) KB 2026-02-21T14:07:01.1570650Z Allocatable: TRUE 2026-02-21T14:07:01.1570820Z Alloc Granule: 4KB 2026-02-21T14:07:01.1571004Z Alloc Recommended Granule:4KB 2026-02-21T14:07:01.1571179Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1571357Z Accessible by all: TRUE 2026-02-21T14:07:01.1571501Z ISA Info: 2026-02-21T14:07:01.1571607Z ******* 2026-02-21T14:07:01.1571713Z Agent 2 2026-02-21T14:07:01.1571809Z ******* 
2026-02-21T14:07:01.1571939Z Name: AMD EPYC 9575F 64-Core Processor 2026-02-21T14:07:01.1572110Z Uuid: CPU-XX 2026-02-21T14:07:01.1572284Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2026-02-21T14:07:01.1572455Z Vendor Name: CPU 2026-02-21T14:07:01.1572642Z Feature: None specified 2026-02-21T14:07:01.1572811Z Profile: FULL_PROFILE 2026-02-21T14:07:01.1572984Z Float Round Mode: NEAR 2026-02-21T14:07:01.1573156Z Max Queue Number: 0(0x0) 2026-02-21T14:07:01.1573367Z Queue Min Size: 0(0x0) 2026-02-21T14:07:01.1573559Z Queue Max Size: 0(0x0) 2026-02-21T14:07:01.1573722Z Queue Type: MULTI 2026-02-21T14:07:01.1573916Z Node: 1 2026-02-21T14:07:01.1574094Z Device Type: CPU 2026-02-21T14:07:01.1574232Z Cache Info: 2026-02-21T14:07:01.1574367Z L1: 49152(0xc000) KB 2026-02-21T14:07:01.1574521Z Chip ID: 0(0x0) 2026-02-21T14:07:01.1574687Z ASIC Revision: 0(0x0) 2026-02-21T14:07:01.1574856Z Cacheline Size: 64(0x40) 2026-02-21T14:07:01.1575034Z Max Clock Freq. (MHz): 3300 2026-02-21T14:07:01.1575197Z BDFID: 0 2026-02-21T14:07:01.1575360Z Internal Node ID: 1 2026-02-21T14:07:01.1575566Z Compute Unit: 128 2026-02-21T14:07:01.1575733Z SIMDs per CU: 0 2026-02-21T14:07:01.1575900Z Shader Engines: 0 2026-02-21T14:07:01.1576067Z Shader Arrs. per Eng.: 0 2026-02-21T14:07:01.1576246Z WatchPts on Addr. 
Ranges:1 2026-02-21T14:07:01.1576392Z Memory Properties: 2026-02-21T14:07:01.1576502Z Features: None 2026-02-21T14:07:01.1576613Z Pool Info: 2026-02-21T14:07:01.1576717Z Pool 1 2026-02-21T14:07:01.1576860Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T14:07:01.1577029Z Size: 1581146132(0x5e3e6014) KB 2026-02-21T14:07:01.1577204Z Allocatable: TRUE 2026-02-21T14:07:01.1577380Z Alloc Granule: 4KB 2026-02-21T14:07:01.1577562Z Alloc Recommended Granule:4KB 2026-02-21T14:07:01.1577741Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1577919Z Accessible by all: TRUE 2026-02-21T14:07:01.1578065Z Pool 2 2026-02-21T14:07:01.1578207Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T14:07:01.1578379Z Size: 1581146132(0x5e3e6014) KB 2026-02-21T14:07:01.1578543Z Allocatable: TRUE 2026-02-21T14:07:01.1578716Z Alloc Granule: 4KB 2026-02-21T14:07:01.1578894Z Alloc Recommended Granule:4KB 2026-02-21T14:07:01.1579083Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1579266Z Accessible by all: TRUE 2026-02-21T14:07:01.1579410Z Pool 3 2026-02-21T14:07:01.1579560Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2026-02-21T14:07:01.1579731Z Size: 1581146132(0x5e3e6014) KB 2026-02-21T14:07:01.1579903Z Allocatable: TRUE 2026-02-21T14:07:01.1580077Z Alloc Granule: 4KB 2026-02-21T14:07:01.1580265Z Alloc Recommended Granule:4KB 2026-02-21T14:07:01.1580450Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1580630Z Accessible by all: TRUE 2026-02-21T14:07:01.1580782Z Pool 4 2026-02-21T14:07:01.1580961Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T14:07:01.1581134Z Size: 1581146132(0x5e3e6014) KB 2026-02-21T14:07:01.1581304Z Allocatable: TRUE 2026-02-21T14:07:01.1581481Z Alloc Granule: 4KB 2026-02-21T14:07:01.1581666Z Alloc Recommended Granule:4KB 2026-02-21T14:07:01.1581875Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1582077Z Accessible by all: TRUE 2026-02-21T14:07:01.1582240Z ISA Info: 2026-02-21T14:07:01.1582348Z ******* 2026-02-21T14:07:01.1582450Z Agent 3 2026-02-21T14:07:01.1582617Z ******* 
2026-02-21T14:07:01.1582744Z Name: gfx942 2026-02-21T14:07:01.1582927Z Uuid: GPU-1d5ad0d1ef228a4f 2026-02-21T14:07:01.1583113Z Marketing Name: AMD Instinct MI325X 2026-02-21T14:07:01.1583369Z Vendor Name: AMD 2026-02-21T14:07:01.1583546Z Feature: KERNEL_DISPATCH 2026-02-21T14:07:01.1583718Z Profile: BASE_PROFILE 2026-02-21T14:07:01.1583915Z Float Round Mode: NEAR 2026-02-21T14:07:01.1584102Z Max Queue Number: 128(0x80) 2026-02-21T14:07:01.1584280Z Queue Min Size: 64(0x40) 2026-02-21T14:07:01.1584449Z Queue Max Size: 131072(0x20000) 2026-02-21T14:07:01.1584615Z Queue Type: MULTI 2026-02-21T14:07:01.1584775Z Node: 2 2026-02-21T14:07:01.1584935Z Device Type: GPU 2026-02-21T14:07:01.1585089Z Cache Info: 2026-02-21T14:07:01.1585221Z L1: 32(0x20) KB 2026-02-21T14:07:01.1585396Z L2: 4096(0x1000) KB 2026-02-21T14:07:01.1585559Z L3: 262144(0x40000) KB 2026-02-21T14:07:01.1585708Z Chip ID: 29861(0x74a5) 2026-02-21T14:07:01.1585869Z ASIC Revision: 1(0x1) 2026-02-21T14:07:01.1586031Z Cacheline Size: 128(0x80) 2026-02-21T14:07:01.1586197Z Max Clock Freq. (MHz): 2100 2026-02-21T14:07:01.1586353Z BDFID: 29952 2026-02-21T14:07:01.1586513Z Internal Node ID: 2 2026-02-21T14:07:01.1586676Z Compute Unit: 304 2026-02-21T14:07:01.1586858Z SIMDs per CU: 4 2026-02-21T14:07:01.1587021Z Shader Engines: 32 2026-02-21T14:07:01.1587193Z Shader Arrs. per Eng.: 1 2026-02-21T14:07:01.1587364Z WatchPts on Addr. 
Ranges:4 2026-02-21T14:07:01.1587542Z Coherent Host Access: FALSE 2026-02-21T14:07:01.1587689Z Memory Properties: 2026-02-21T14:07:01.1587809Z Features: KERNEL_DISPATCH 2026-02-21T14:07:01.1587962Z Fast F16 Operation: TRUE 2026-02-21T14:07:01.1588128Z Wavefront Size: 64(0x40) 2026-02-21T14:07:01.1588385Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1588533Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1588671Z x 1024(0x400) 2026-02-21T14:07:01.1588858Z y 1024(0x400) 2026-02-21T14:07:01.1588999Z z 1024(0x400) 2026-02-21T14:07:01.1589156Z Max Waves Per CU: 32(0x20) 2026-02-21T14:07:01.1589324Z Max Work-item Per CU: 2048(0x800) 2026-02-21T14:07:01.1589491Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1589631Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1589757Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1589904Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1590048Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1590208Z Max fbarriers/Workgrp: 32 2026-02-21T14:07:01.1595700Z Packet Processor uCode:: 185 2026-02-21T14:07:01.1596073Z SDMA engine uCode:: 24 2026-02-21T14:07:01.1596245Z IOMMU Support:: None 2026-02-21T14:07:01.1596483Z Pool Info: 2026-02-21T14:07:01.1596594Z Pool 1 2026-02-21T14:07:01.1596731Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T14:07:01.1596903Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1597069Z Allocatable: TRUE 2026-02-21T14:07:01.1597242Z Alloc Granule: 4KB 2026-02-21T14:07:01.1597416Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1597593Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1597765Z Accessible by all: FALSE 2026-02-21T14:07:01.1597905Z Pool 2 2026-02-21T14:07:01.1598046Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T14:07:01.1598210Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1598378Z Allocatable: TRUE 2026-02-21T14:07:01.1598543Z Alloc Granule: 4KB 2026-02-21T14:07:01.1598719Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1598894Z Alloc Alignment: 4KB 
2026-02-21T14:07:01.1599062Z Accessible by all: FALSE 2026-02-21T14:07:01.1599205Z Pool 3 2026-02-21T14:07:01.1599338Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T14:07:01.1599498Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1599659Z Allocatable: TRUE 2026-02-21T14:07:01.1599828Z Alloc Granule: 4KB 2026-02-21T14:07:01.1599999Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1600174Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1600342Z Accessible by all: FALSE 2026-02-21T14:07:01.1600483Z Pool 4 2026-02-21T14:07:01.1600615Z Segment: GROUP 2026-02-21T14:07:01.1600772Z Size: 64(0x40) KB 2026-02-21T14:07:01.1600932Z Allocatable: FALSE 2026-02-21T14:07:01.1601096Z Alloc Granule: 0KB 2026-02-21T14:07:01.1601269Z Alloc Recommended Granule:0KB 2026-02-21T14:07:01.1601444Z Alloc Alignment: 0KB 2026-02-21T14:07:01.1601644Z Accessible by all: FALSE 2026-02-21T14:07:01.1601789Z ISA Info: 2026-02-21T14:07:01.1601893Z ISA 1 2026-02-21T14:07:01.1602042Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T14:07:01.1602243Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1602418Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1602590Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1602762Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1602931Z Fast f16: TRUE 2026-02-21T14:07:01.1603107Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1603259Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1603411Z x 1024(0x400) 2026-02-21T14:07:01.1603580Z y 1024(0x400) 2026-02-21T14:07:01.1603732Z z 1024(0x400) 2026-02-21T14:07:01.1623915Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1624064Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1624199Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1624351Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1624517Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1624695Z FBarrier Max Size: 32 2026-02-21T14:07:01.1624849Z ISA 2 2026-02-21T14:07:01.1625003Z Name: 
amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T14:07:01.1625203Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1625382Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1625556Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1625734Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1625900Z Fast f16: TRUE 2026-02-21T14:07:01.1626067Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1626217Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1626367Z x 1024(0x400) 2026-02-21T14:07:01.1626512Z y 1024(0x400) 2026-02-21T14:07:01.1626672Z z 1024(0x400) 2026-02-21T14:07:01.1626830Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1626976Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1627107Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1627257Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1627407Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1627565Z FBarrier Max Size: 32 2026-02-21T14:07:01.1627704Z ******* 2026-02-21T14:07:01.1627803Z Agent 4 2026-02-21T14:07:01.1627906Z ******* 2026-02-21T14:07:01.1628029Z Name: gfx942 2026-02-21T14:07:01.1628251Z Uuid: GPU-4bb3e96989faecb6 2026-02-21T14:07:01.1628423Z Marketing Name: AMD Instinct MI325X 2026-02-21T14:07:01.1628593Z Vendor Name: AMD 2026-02-21T14:07:01.1628756Z Feature: KERNEL_DISPATCH 2026-02-21T14:07:01.1628968Z Profile: BASE_PROFILE 2026-02-21T14:07:01.1629135Z Float Round Mode: NEAR 2026-02-21T14:07:01.1629304Z Max Queue Number: 128(0x80) 2026-02-21T14:07:01.1629470Z Queue Min Size: 64(0x40) 2026-02-21T14:07:01.1629631Z Queue Max Size: 131072(0x20000) 2026-02-21T14:07:01.1629795Z Queue Type: MULTI 2026-02-21T14:07:01.1629950Z Node: 3 2026-02-21T14:07:01.1630104Z Device Type: GPU 2026-02-21T14:07:01.1630244Z Cache Info: 2026-02-21T14:07:01.1630370Z L1: 32(0x20) KB 2026-02-21T14:07:01.1630519Z L2: 4096(0x1000) KB 2026-02-21T14:07:01.1630663Z L3: 262144(0x40000) KB 2026-02-21T14:07:01.1630822Z Chip ID: 29861(0x74a5) 2026-02-21T14:07:01.1630984Z ASIC Revision: 1(0x1) 
2026-02-21T14:07:01.1631200Z Cacheline Size: 128(0x80) 2026-02-21T14:07:01.1631365Z Max Clock Freq. (MHz): 2100 2026-02-21T14:07:01.1631520Z BDFID: 1280 2026-02-21T14:07:01.1631676Z Internal Node ID: 3 2026-02-21T14:07:01.1631836Z Compute Unit: 304 2026-02-21T14:07:01.1631994Z SIMDs per CU: 4 2026-02-21T14:07:01.1632154Z Shader Engines: 32 2026-02-21T14:07:01.1632317Z Shader Arrs. per Eng.: 1 2026-02-21T14:07:01.1632488Z WatchPts on Addr. Ranges:4 2026-02-21T14:07:01.1632660Z Coherent Host Access: FALSE 2026-02-21T14:07:01.1632818Z Memory Properties: 2026-02-21T14:07:01.1632939Z Features: KERNEL_DISPATCH 2026-02-21T14:07:01.1633099Z Fast F16 Operation: TRUE 2026-02-21T14:07:01.1633303Z Wavefront Size: 64(0x40) 2026-02-21T14:07:01.1633464Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1633659Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1633791Z x 1024(0x400) 2026-02-21T14:07:01.1633957Z y 1024(0x400) 2026-02-21T14:07:01.1634094Z z 1024(0x400) 2026-02-21T14:07:01.1634256Z Max Waves Per CU: 32(0x20) 2026-02-21T14:07:01.1634420Z Max Work-item Per CU: 2048(0x800) 2026-02-21T14:07:01.1634595Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1634739Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1634863Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1635017Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1635156Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1635314Z Max fbarriers/Workgrp: 32 2026-02-21T14:07:01.1635489Z Packet Processor uCode:: 185 2026-02-21T14:07:01.1635684Z SDMA engine uCode:: 24 2026-02-21T14:07:01.1635850Z IOMMU Support:: None 2026-02-21T14:07:01.1635987Z Pool Info: 2026-02-21T14:07:01.1636092Z Pool 1 2026-02-21T14:07:01.1636228Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T14:07:01.1636459Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1636623Z Allocatable: TRUE 2026-02-21T14:07:01.1636792Z Alloc Granule: 4KB 2026-02-21T14:07:01.1636966Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1637137Z Alloc Alignment: 
4KB 2026-02-21T14:07:01.1637306Z Accessible by all: FALSE 2026-02-21T14:07:01.1637447Z Pool 2 2026-02-21T14:07:01.1637585Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T14:07:01.1637744Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1637906Z Allocatable: TRUE 2026-02-21T14:07:01.1638069Z Alloc Granule: 4KB 2026-02-21T14:07:01.1638243Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1638416Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1638618Z Accessible by all: FALSE 2026-02-21T14:07:01.1638761Z Pool 3 2026-02-21T14:07:01.1638891Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T14:07:01.1639052Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1639209Z Allocatable: TRUE 2026-02-21T14:07:01.1639371Z Alloc Granule: 4KB 2026-02-21T14:07:01.1639544Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1639713Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1639885Z Accessible by all: FALSE 2026-02-21T14:07:01.1640030Z Pool 4 2026-02-21T14:07:01.1640158Z Segment: GROUP 2026-02-21T14:07:01.1640315Z Size: 64(0x40) KB 2026-02-21T14:07:01.1640473Z Allocatable: FALSE 2026-02-21T14:07:01.1640639Z Alloc Granule: 0KB 2026-02-21T14:07:01.1640806Z Alloc Recommended Granule:0KB 2026-02-21T14:07:01.1640979Z Alloc Alignment: 0KB 2026-02-21T14:07:01.1641144Z Accessible by all: FALSE 2026-02-21T14:07:01.1641286Z ISA Info: 2026-02-21T14:07:01.1641387Z ISA 1 2026-02-21T14:07:01.1641528Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T14:07:01.1641709Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1641880Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1642049Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1642226Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1642390Z Fast f16: TRUE 2026-02-21T14:07:01.1642553Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1642705Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1642841Z x 1024(0x400) 2026-02-21T14:07:01.1642985Z y 1024(0x400) 2026-02-21T14:07:01.1643134Z z 1024(0x400) 
2026-02-21T14:07:01.1643289Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1643459Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1643786Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1643935Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1644100Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1644267Z FBarrier Max Size: 32 2026-02-21T14:07:01.1644447Z ISA 2 2026-02-21T14:07:01.1644598Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T14:07:01.1644789Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1644957Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1645157Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1645332Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1645494Z Fast f16: TRUE 2026-02-21T14:07:01.1645663Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1645842Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1646013Z x 1024(0x400) 2026-02-21T14:07:01.1646154Z y 1024(0x400) 2026-02-21T14:07:01.1646299Z z 1024(0x400) 2026-02-21T14:07:01.1646458Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1646608Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1646739Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1646885Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1647032Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1647212Z FBarrier Max Size: 32 2026-02-21T14:07:01.1647352Z ******* 2026-02-21T14:07:01.1647464Z Agent 5 2026-02-21T14:07:01.1647561Z ******* 2026-02-21T14:07:01.1647685Z Name: gfx942 2026-02-21T14:07:01.1647842Z Uuid: GPU-f772da076bfdab2e 2026-02-21T14:07:01.1648010Z Marketing Name: AMD Instinct MI325X 2026-02-21T14:07:01.1648172Z Vendor Name: AMD 2026-02-21T14:07:01.1648335Z Feature: KERNEL_DISPATCH 2026-02-21T14:07:01.1648494Z Profile: BASE_PROFILE 2026-02-21T14:07:01.1648657Z Float Round Mode: NEAR 2026-02-21T14:07:01.1648821Z Max Queue Number: 128(0x80) 2026-02-21T14:07:01.1648979Z Queue Min Size: 64(0x40) 2026-02-21T14:07:01.1649139Z Queue Max Size: 131072(0x20000) 
2026-02-21T14:07:01.1649296Z Queue Type: MULTI 2026-02-21T14:07:01.1649452Z Node: 4 2026-02-21T14:07:01.1649603Z Device Type: GPU 2026-02-21T14:07:01.1649740Z Cache Info: 2026-02-21T14:07:01.1649860Z L1: 32(0x20) KB 2026-02-21T14:07:01.1650007Z L2: 4096(0x1000) KB 2026-02-21T14:07:01.1650153Z L3: 262144(0x40000) KB 2026-02-21T14:07:01.1650301Z Chip ID: 29861(0x74a5) 2026-02-21T14:07:01.1650457Z ASIC Revision: 1(0x1) 2026-02-21T14:07:01.1650619Z Cacheline Size: 128(0x80) 2026-02-21T14:07:01.1650782Z Max Clock Freq. (MHz): 2100 2026-02-21T14:07:01.1650969Z BDFID: 25856 2026-02-21T14:07:01.1651126Z Internal Node ID: 4 2026-02-21T14:07:01.1651288Z Compute Unit: 304 2026-02-21T14:07:01.1651443Z SIMDs per CU: 4 2026-02-21T14:07:01.1651605Z Shader Engines: 32 2026-02-21T14:07:01.1651768Z Shader Arrs. per Eng.: 1 2026-02-21T14:07:01.1651938Z WatchPts on Addr. Ranges:4 2026-02-21T14:07:01.1652104Z Coherent Host Access: FALSE 2026-02-21T14:07:01.1652249Z Memory Properties: 2026-02-21T14:07:01.1652368Z Features: KERNEL_DISPATCH 2026-02-21T14:07:01.1652514Z Fast F16 Operation: TRUE 2026-02-21T14:07:01.1652679Z Wavefront Size: 64(0x40) 2026-02-21T14:07:01.1652846Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1653028Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1653157Z x 1024(0x400) 2026-02-21T14:07:01.1653329Z y 1024(0x400) 2026-02-21T14:07:01.1653469Z z 1024(0x400) 2026-02-21T14:07:01.1653617Z Max Waves Per CU: 32(0x20) 2026-02-21T14:07:01.1653783Z Max Work-item Per CU: 2048(0x800) 2026-02-21T14:07:01.1653948Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1654088Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1654208Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1654351Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1654494Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1654655Z Max fbarriers/Workgrp: 32 2026-02-21T14:07:01.1654832Z Packet Processor uCode:: 185 2026-02-21T14:07:01.1655002Z SDMA engine uCode:: 24 2026-02-21T14:07:01.1655170Z IOMMU 
Support:: None 2026-02-21T14:07:01.1655320Z Pool Info: 2026-02-21T14:07:01.1655427Z Pool 1 2026-02-21T14:07:01.1655584Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T14:07:01.1655765Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1655930Z Allocatable: TRUE 2026-02-21T14:07:01.1656118Z Alloc Granule: 4KB 2026-02-21T14:07:01.1656292Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1656465Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1656636Z Accessible by all: FALSE 2026-02-21T14:07:01.1656777Z Pool 2 2026-02-21T14:07:01.1656918Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T14:07:01.1657114Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1657274Z Allocatable: TRUE 2026-02-21T14:07:01.1657438Z Alloc Granule: 4KB 2026-02-21T14:07:01.1657608Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1657781Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1657947Z Accessible by all: FALSE 2026-02-21T14:07:01.1658088Z Pool 3 2026-02-21T14:07:01.1658251Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T14:07:01.1658424Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1658596Z Allocatable: TRUE 2026-02-21T14:07:01.1658812Z Alloc Granule: 4KB 2026-02-21T14:07:01.1658990Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1659161Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1659329Z Accessible by all: FALSE 2026-02-21T14:07:01.1659469Z Pool 4 2026-02-21T14:07:01.1659597Z Segment: GROUP 2026-02-21T14:07:01.1659752Z Size: 64(0x40) KB 2026-02-21T14:07:01.1659906Z Allocatable: FALSE 2026-02-21T14:07:01.1660071Z Alloc Granule: 0KB 2026-02-21T14:07:01.1660239Z Alloc Recommended Granule:0KB 2026-02-21T14:07:01.1660438Z Alloc Alignment: 0KB 2026-02-21T14:07:01.1660608Z Accessible by all: FALSE 2026-02-21T14:07:01.1660747Z ISA Info: 2026-02-21T14:07:01.1660853Z ISA 1 2026-02-21T14:07:01.1660990Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T14:07:01.1661169Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1661339Z Profiles: HSA_PROFILE_BASE 
2026-02-21T14:07:01.1661511Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1661681Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1661845Z Fast f16: TRUE 2026-02-21T14:07:01.1662009Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1662156Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1662296Z x 1024(0x400) 2026-02-21T14:07:01.1662439Z y 1024(0x400) 2026-02-21T14:07:01.1662583Z z 1024(0x400) 2026-02-21T14:07:01.1662737Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1662881Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1663012Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1663157Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1663304Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1663462Z FBarrier Max Size: 32 2026-02-21T14:07:01.1663604Z ISA 2 2026-02-21T14:07:01.1663750Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T14:07:01.1663940Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1664111Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1664277Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1664449Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1664609Z Fast f16: TRUE 2026-02-21T14:07:01.1664772Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1664920Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1665058Z x 1024(0x400) 2026-02-21T14:07:01.1665219Z y 1024(0x400) 2026-02-21T14:07:01.1665411Z z 1024(0x400) 2026-02-21T14:07:01.1665567Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1665711Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1665843Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1665989Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1666135Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1666293Z FBarrier Max Size: 32 2026-02-21T14:07:01.1666455Z ******* 2026-02-21T14:07:01.1666570Z Agent 6 2026-02-21T14:07:01.1666685Z ******* 2026-02-21T14:07:01.1666804Z Name: gfx942 2026-02-21T14:07:01.1666969Z Uuid: GPU-cbf8e3fcdaa6450e 2026-02-21T14:07:01.1667156Z Marketing Name: 
AMD Instinct MI325X 2026-02-21T14:07:01.1667329Z Vendor Name: AMD 2026-02-21T14:07:01.1667557Z Feature: KERNEL_DISPATCH 2026-02-21T14:07:01.1667718Z Profile: BASE_PROFILE 2026-02-21T14:07:01.1667892Z Float Round Mode: NEAR 2026-02-21T14:07:01.1668056Z Max Queue Number: 128(0x80) 2026-02-21T14:07:01.1668257Z Queue Min Size: 64(0x40) 2026-02-21T14:07:01.1668443Z Queue Max Size: 131072(0x20000) 2026-02-21T14:07:01.1668604Z Queue Type: MULTI 2026-02-21T14:07:01.1668806Z Node: 5 2026-02-21T14:07:01.1668966Z Device Type: GPU 2026-02-21T14:07:01.1669101Z Cache Info: 2026-02-21T14:07:01.1669227Z L1: 32(0x20) KB 2026-02-21T14:07:01.1669372Z L2: 4096(0x1000) KB 2026-02-21T14:07:01.1669522Z L3: 262144(0x40000) KB 2026-02-21T14:07:01.1669667Z Chip ID: 29861(0x74a5) 2026-02-21T14:07:01.1669824Z ASIC Revision: 1(0x1) 2026-02-21T14:07:01.1669986Z Cacheline Size: 128(0x80) 2026-02-21T14:07:01.1670147Z Max Clock Freq. (MHz): 2100 2026-02-21T14:07:01.1670304Z BDFID: 5376 2026-02-21T14:07:01.1670458Z Internal Node ID: 5 2026-02-21T14:07:01.1670619Z Compute Unit: 304 2026-02-21T14:07:01.1670776Z SIMDs per CU: 4 2026-02-21T14:07:01.1670941Z Shader Engines: 32 2026-02-21T14:07:01.1671102Z Shader Arrs. per Eng.: 1 2026-02-21T14:07:01.1671274Z WatchPts on Addr. 
Ranges:4 2026-02-21T14:07:01.1671442Z Coherent Host Access: FALSE 2026-02-21T14:07:01.1671584Z Memory Properties: 2026-02-21T14:07:01.1671702Z Features: KERNEL_DISPATCH 2026-02-21T14:07:01.1671848Z Fast F16 Operation: TRUE 2026-02-21T14:07:01.1672017Z Wavefront Size: 64(0x40) 2026-02-21T14:07:01.1672179Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1672326Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1672457Z x 1024(0x400) 2026-02-21T14:07:01.1672596Z y 1024(0x400) 2026-02-21T14:07:01.1672771Z z 1024(0x400) 2026-02-21T14:07:01.1672920Z Max Waves Per CU: 32(0x20) 2026-02-21T14:07:01.1673087Z Max Work-item Per CU: 2048(0x800) 2026-02-21T14:07:01.1673249Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1673392Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1673517Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1673658Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1673800Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1673955Z Max fbarriers/Workgrp: 32 2026-02-21T14:07:01.1674131Z Packet Processor uCode:: 185 2026-02-21T14:07:01.1674299Z SDMA engine uCode:: 24 2026-02-21T14:07:01.1674467Z IOMMU Support:: None 2026-02-21T14:07:01.1674607Z Pool Info: 2026-02-21T14:07:01.1674710Z Pool 1 2026-02-21T14:07:01.1674882Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T14:07:01.1675045Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1675207Z Allocatable: TRUE 2026-02-21T14:07:01.1675375Z Alloc Granule: 4KB 2026-02-21T14:07:01.1675549Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1675723Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1675894Z Accessible by all: FALSE 2026-02-21T14:07:01.1676037Z Pool 2 2026-02-21T14:07:01.1676295Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T14:07:01.1676504Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1676703Z Allocatable: TRUE 2026-02-21T14:07:01.1676884Z Alloc Granule: 4KB 2026-02-21T14:07:01.1677100Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1677345Z Alloc Alignment: 4KB 
2026-02-21T14:07:01.1677541Z Accessible by all: FALSE 2026-02-21T14:07:01.1677759Z Pool 3 2026-02-21T14:07:01.1678035Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T14:07:01.1678215Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1678421Z Allocatable: TRUE 2026-02-21T14:07:01.1678657Z Alloc Granule: 4KB 2026-02-21T14:07:01.1678847Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1679042Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1679246Z Accessible by all: FALSE 2026-02-21T14:07:01.1679401Z Pool 4 2026-02-21T14:07:01.1679570Z Segment: GROUP 2026-02-21T14:07:01.1679752Z Size: 64(0x40) KB 2026-02-21T14:07:01.1679928Z Allocatable: FALSE 2026-02-21T14:07:01.1680126Z Alloc Granule: 0KB 2026-02-21T14:07:01.1680311Z Alloc Recommended Granule:0KB 2026-02-21T14:07:01.1680629Z Alloc Alignment: 0KB 2026-02-21T14:07:01.1680812Z Accessible by all: FALSE 2026-02-21T14:07:01.1680993Z ISA Info: 2026-02-21T14:07:01.1681158Z ISA 1 2026-02-21T14:07:01.1681321Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T14:07:01.1681533Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1681719Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1681918Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1682112Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1682299Z Fast f16: TRUE 2026-02-21T14:07:01.1682493Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1682783Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1682954Z x 1024(0x400) 2026-02-21T14:07:01.1683109Z y 1024(0x400) 2026-02-21T14:07:01.1683300Z z 1024(0x400) 2026-02-21T14:07:01.1683471Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1683671Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1683834Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1683998Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1684164Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1684346Z FBarrier Max Size: 32 2026-02-21T14:07:01.1699657Z ISA 2 2026-02-21T14:07:01.1699827Z Name: 
amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T14:07:01.1700021Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1700197Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1700370Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1700540Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1700721Z Fast f16: TRUE 2026-02-21T14:07:01.1700890Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1701061Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1701221Z x 1024(0x400) 2026-02-21T14:07:01.1701365Z y 1024(0x400) 2026-02-21T14:07:01.1701534Z z 1024(0x400) 2026-02-21T14:07:01.1701687Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1701828Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1701953Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1702096Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1702236Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1702393Z FBarrier Max Size: 32 2026-02-21T14:07:01.1702526Z ******* 2026-02-21T14:07:01.1702627Z Agent 7 2026-02-21T14:07:01.1702722Z ******* 2026-02-21T14:07:01.1702840Z Name: gfx942 2026-02-21T14:07:01.1702993Z Uuid: GPU-2b9f2e9f4d8c917e 2026-02-21T14:07:01.1703155Z Marketing Name: AMD Instinct MI325X 2026-02-21T14:07:01.1703316Z Vendor Name: AMD 2026-02-21T14:07:01.1703476Z Feature: KERNEL_DISPATCH 2026-02-21T14:07:01.1703633Z Profile: BASE_PROFILE 2026-02-21T14:07:01.1703864Z Float Round Mode: NEAR 2026-02-21T14:07:01.1704023Z Max Queue Number: 128(0x80) 2026-02-21T14:07:01.1704180Z Queue Min Size: 64(0x40) 2026-02-21T14:07:01.1704334Z Queue Max Size: 131072(0x20000) 2026-02-21T14:07:01.1704486Z Queue Type: MULTI 2026-02-21T14:07:01.1704633Z Node: 6 2026-02-21T14:07:01.1704782Z Device Type: GPU 2026-02-21T14:07:01.1704915Z Cache Info: 2026-02-21T14:07:01.1705048Z L1: 32(0x20) KB 2026-02-21T14:07:01.1705191Z L2: 4096(0x1000) KB 2026-02-21T14:07:01.1705330Z L3: 262144(0x40000) KB 2026-02-21T14:07:01.1705480Z Chip ID: 29861(0x74a5) 2026-02-21T14:07:01.1705641Z ASIC Revision: 1(0x1) 
2026-02-21T14:07:01.1705807Z Cacheline Size: 128(0x80) 2026-02-21T14:07:01.1706002Z Max Clock Freq. (MHz): 2100 2026-02-21T14:07:01.1706162Z BDFID: 62720 2026-02-21T14:07:01.1706323Z Internal Node ID: 6 2026-02-21T14:07:01.1706480Z Compute Unit: 304 2026-02-21T14:07:01.1706643Z SIMDs per CU: 4 2026-02-21T14:07:01.1706806Z Shader Engines: 32 2026-02-21T14:07:01.1706971Z Shader Arrs. per Eng.: 1 2026-02-21T14:07:01.1707139Z WatchPts on Addr. Ranges:4 2026-02-21T14:07:01.1707312Z Coherent Host Access: FALSE 2026-02-21T14:07:01.1707511Z Memory Properties: 2026-02-21T14:07:01.1707627Z Features: KERNEL_DISPATCH 2026-02-21T14:07:01.1707810Z Fast F16 Operation: TRUE 2026-02-21T14:07:01.1707990Z Wavefront Size: 64(0x40) 2026-02-21T14:07:01.1708208Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1708352Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1708485Z x 1024(0x400) 2026-02-21T14:07:01.1708625Z y 1024(0x400) 2026-02-21T14:07:01.1708769Z z 1024(0x400) 2026-02-21T14:07:01.1708925Z Max Waves Per CU: 32(0x20) 2026-02-21T14:07:01.1709090Z Max Work-item Per CU: 2048(0x800) 2026-02-21T14:07:01.1709259Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1709405Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1709533Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1709682Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1709830Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1709992Z Max fbarriers/Workgrp: 32 2026-02-21T14:07:01.1710174Z Packet Processor uCode:: 185 2026-02-21T14:07:01.1710350Z SDMA engine uCode:: 24 2026-02-21T14:07:01.1710517Z IOMMU Support:: None 2026-02-21T14:07:01.1710665Z Pool Info: 2026-02-21T14:07:01.1710773Z Pool 1 2026-02-21T14:07:01.1710920Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T14:07:01.1711092Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1711301Z Allocatable: TRUE 2026-02-21T14:07:01.1711469Z Alloc Granule: 4KB 2026-02-21T14:07:01.1711648Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1711826Z Alloc 
Alignment: 4KB 2026-02-21T14:07:01.1711996Z Accessible by all: FALSE 2026-02-21T14:07:01.1712139Z Pool 2 2026-02-21T14:07:01.1712277Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T14:07:01.1712441Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1712607Z Allocatable: TRUE 2026-02-21T14:07:01.1712769Z Alloc Granule: 4KB 2026-02-21T14:07:01.1712967Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1713137Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1713307Z Accessible by all: FALSE 2026-02-21T14:07:01.1713484Z Pool 3 2026-02-21T14:07:01.1713638Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T14:07:01.1713800Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1713973Z Allocatable: TRUE 2026-02-21T14:07:01.1714139Z Alloc Granule: 4KB 2026-02-21T14:07:01.1714308Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1714484Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1714652Z Accessible by all: FALSE 2026-02-21T14:07:01.1714790Z Pool 4 2026-02-21T14:07:01.1714923Z Segment: GROUP 2026-02-21T14:07:01.1715076Z Size: 64(0x40) KB 2026-02-21T14:07:01.1715240Z Allocatable: FALSE 2026-02-21T14:07:01.1715401Z Alloc Granule: 0KB 2026-02-21T14:07:01.1715573Z Alloc Recommended Granule:0KB 2026-02-21T14:07:01.1715742Z Alloc Alignment: 0KB 2026-02-21T14:07:01.1715913Z Accessible by all: FALSE 2026-02-21T14:07:01.1716056Z ISA Info: 2026-02-21T14:07:01.1716159Z ISA 1 2026-02-21T14:07:01.1716303Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T14:07:01.1716481Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1716653Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1716824Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1716998Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1717167Z Fast f16: TRUE 2026-02-21T14:07:01.1717330Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1717481Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1717616Z x 1024(0x400) 2026-02-21T14:07:01.1717761Z y 1024(0x400) 2026-02-21T14:07:01.1717903Z z 1024(0x400) 
2026-02-21T14:07:01.1718059Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1718205Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1718332Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1718513Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1718656Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1718813Z FBarrier Max Size: 32 2026-02-21T14:07:01.1718946Z ISA 2 2026-02-21T14:07:01.1719093Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T14:07:01.1719282Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1719452Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1719621Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1719812Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1719975Z Fast f16: TRUE 2026-02-21T14:07:01.1720141Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1720295Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1720468Z x 1024(0x400) 2026-02-21T14:07:01.1720649Z y 1024(0x400) 2026-02-21T14:07:01.1720829Z z 1024(0x400) 2026-02-21T14:07:01.1720979Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1721147Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1721304Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1721451Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1721595Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1721755Z FBarrier Max Size: 32 2026-02-21T14:07:01.1721912Z ******* 2026-02-21T14:07:01.1722009Z Agent 8 2026-02-21T14:07:01.1722108Z ******* 2026-02-21T14:07:01.1722225Z Name: gfx942 2026-02-21T14:07:01.1722403Z Uuid: GPU-fdcb91c1de594571 2026-02-21T14:07:01.1722603Z Marketing Name: AMD Instinct MI325X 2026-02-21T14:07:01.1722766Z Vendor Name: AMD 2026-02-21T14:07:01.1722927Z Feature: KERNEL_DISPATCH 2026-02-21T14:07:01.1723085Z Profile: BASE_PROFILE 2026-02-21T14:07:01.1723255Z Float Round Mode: NEAR 2026-02-21T14:07:01.1723450Z Max Queue Number: 128(0x80) 2026-02-21T14:07:01.1723625Z Queue Min Size: 64(0x40) 2026-02-21T14:07:01.1723783Z Queue Max Size: 131072(0x20000) 
2026-02-21T14:07:01.1723990Z Queue Type: MULTI 2026-02-21T14:07:01.1724140Z Node: 7 2026-02-21T14:07:01.1724298Z Device Type: GPU 2026-02-21T14:07:01.1724437Z Cache Info: 2026-02-21T14:07:01.1724559Z L1: 32(0x20) KB 2026-02-21T14:07:01.1724707Z L2: 4096(0x1000) KB 2026-02-21T14:07:01.1724848Z L3: 262144(0x40000) KB 2026-02-21T14:07:01.1724998Z Chip ID: 29861(0x74a5) 2026-02-21T14:07:01.1725155Z ASIC Revision: 1(0x1) 2026-02-21T14:07:01.1725317Z Cacheline Size: 128(0x80) 2026-02-21T14:07:01.1725484Z Max Clock Freq. (MHz): 2100 2026-02-21T14:07:01.1725637Z BDFID: 34048 2026-02-21T14:07:01.1725947Z Internal Node ID: 7 2026-02-21T14:07:01.1726107Z Compute Unit: 304 2026-02-21T14:07:01.1726270Z SIMDs per CU: 4 2026-02-21T14:07:01.1726428Z Shader Engines: 32 2026-02-21T14:07:01.1726596Z Shader Arrs. per Eng.: 1 2026-02-21T14:07:01.1726766Z WatchPts on Addr. Ranges:4 2026-02-21T14:07:01.1726932Z Coherent Host Access: FALSE 2026-02-21T14:07:01.1727076Z Memory Properties: 2026-02-21T14:07:01.1727192Z Features: KERNEL_DISPATCH 2026-02-21T14:07:01.1727340Z Fast F16 Operation: TRUE 2026-02-21T14:07:01.1727502Z Wavefront Size: 64(0x40) 2026-02-21T14:07:01.1727667Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1727816Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1727944Z x 1024(0x400) 2026-02-21T14:07:01.1728120Z y 1024(0x400) 2026-02-21T14:07:01.1728257Z z 1024(0x400) 2026-02-21T14:07:01.1728409Z Max Waves Per CU: 32(0x20) 2026-02-21T14:07:01.1728571Z Max Work-item Per CU: 2048(0x800) 2026-02-21T14:07:01.1728736Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1728875Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1728999Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1729142Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1729278Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1729438Z Max fbarriers/Workgrp: 32 2026-02-21T14:07:01.1729612Z Packet Processor uCode:: 185 2026-02-21T14:07:01.1729787Z SDMA engine uCode:: 24 2026-02-21T14:07:01.1729950Z IOMMU 
Support:: None 2026-02-21T14:07:01.1730088Z Pool Info: 2026-02-21T14:07:01.1730218Z Pool 1 2026-02-21T14:07:01.1730355Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T14:07:01.1730515Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1730687Z Allocatable: TRUE 2026-02-21T14:07:01.1730865Z Alloc Granule: 4KB 2026-02-21T14:07:01.1731039Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1731207Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1731373Z Accessible by all: FALSE 2026-02-21T14:07:01.1731510Z Pool 2 2026-02-21T14:07:01.1731647Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T14:07:01.1731803Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1731959Z Allocatable: TRUE 2026-02-21T14:07:01.1732118Z Alloc Granule: 4KB 2026-02-21T14:07:01.1732286Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1732475Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1732637Z Accessible by all: FALSE 2026-02-21T14:07:01.1732788Z Pool 3 2026-02-21T14:07:01.1732917Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T14:07:01.1733117Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1733270Z Allocatable: TRUE 2026-02-21T14:07:01.1733435Z Alloc Granule: 4KB 2026-02-21T14:07:01.1733604Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1733771Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1733935Z Accessible by all: FALSE 2026-02-21T14:07:01.1734071Z Pool 4 2026-02-21T14:07:01.1734198Z Segment: GROUP 2026-02-21T14:07:01.1734347Z Size: 64(0x40) KB 2026-02-21T14:07:01.1734501Z Allocatable: FALSE 2026-02-21T14:07:01.1734662Z Alloc Granule: 0KB 2026-02-21T14:07:01.1734829Z Alloc Recommended Granule:0KB 2026-02-21T14:07:01.1734997Z Alloc Alignment: 0KB 2026-02-21T14:07:01.1735191Z Accessible by all: FALSE 2026-02-21T14:07:01.1735330Z ISA Info: 2026-02-21T14:07:01.1735429Z ISA 1 2026-02-21T14:07:01.1735565Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T14:07:01.1735742Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1735906Z Profiles: HSA_PROFILE_BASE 
2026-02-21T14:07:01.1736072Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1736239Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1736419Z Fast f16: TRUE 2026-02-21T14:07:01.1736584Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1736734Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1736867Z x 1024(0x400) 2026-02-21T14:07:01.1737009Z y 1024(0x400) 2026-02-21T14:07:01.1737150Z z 1024(0x400) 2026-02-21T14:07:01.1737300Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1737441Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1737568Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1737713Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1737855Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1738007Z FBarrier Max Size: 32 2026-02-21T14:07:01.1738144Z ISA 2 2026-02-21T14:07:01.1738292Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T14:07:01.1738477Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1738648Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1738814Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1738981Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1739139Z Fast f16: TRUE 2026-02-21T14:07:01.1739297Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1739441Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1739574Z x 1024(0x400) 2026-02-21T14:07:01.1739712Z y 1024(0x400) 2026-02-21T14:07:01.1739852Z z 1024(0x400) 2026-02-21T14:07:01.1740037Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1740177Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1740302Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1740443Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1740585Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1740736Z FBarrier Max Size: 32 2026-02-21T14:07:01.1740871Z ******* 2026-02-21T14:07:01.1740964Z Agent 9 2026-02-21T14:07:01.1741058Z ******* 2026-02-21T14:07:01.1741173Z Name: gfx942 2026-02-21T14:07:01.1741325Z Uuid: GPU-8ebd5562b1fbc7ab 2026-02-21T14:07:01.1741487Z Marketing Name: 
AMD Instinct MI325X 2026-02-21T14:07:01.1741648Z Vendor Name: AMD 2026-02-21T14:07:01.1741805Z Feature: KERNEL_DISPATCH 2026-02-21T14:07:01.1741995Z Profile: BASE_PROFILE 2026-02-21T14:07:01.1742153Z Float Round Mode: NEAR 2026-02-21T14:07:01.1742310Z Max Queue Number: 128(0x80) 2026-02-21T14:07:01.1742466Z Queue Min Size: 64(0x40) 2026-02-21T14:07:01.1742620Z Queue Max Size: 131072(0x20000) 2026-02-21T14:07:01.1742773Z Queue Type: MULTI 2026-02-21T14:07:01.1742921Z Node: 8 2026-02-21T14:07:01.1743069Z Device Type: GPU 2026-02-21T14:07:01.1743202Z Cache Info: 2026-02-21T14:07:01.1743320Z L1: 32(0x20) KB 2026-02-21T14:07:01.1743465Z L2: 4096(0x1000) KB 2026-02-21T14:07:01.1743604Z L3: 262144(0x40000) KB 2026-02-21T14:07:01.1743748Z Chip ID: 29861(0x74a5) 2026-02-21T14:07:01.1743902Z ASIC Revision: 1(0x1) 2026-02-21T14:07:01.1744057Z Cacheline Size: 128(0x80) 2026-02-21T14:07:01.1744214Z Max Clock Freq. (MHz): 2100 2026-02-21T14:07:01.1744363Z BDFID: 58624 2026-02-21T14:07:01.1744515Z Internal Node ID: 8 2026-02-21T14:07:01.1744670Z Compute Unit: 304 2026-02-21T14:07:01.1744822Z SIMDs per CU: 4 2026-02-21T14:07:01.1744977Z Shader Engines: 32 2026-02-21T14:07:01.1745136Z Shader Arrs. per Eng.: 1 2026-02-21T14:07:01.1745298Z WatchPts on Addr. 
Ranges:4 2026-02-21T14:07:01.1745462Z Coherent Host Access: FALSE 2026-02-21T14:07:01.1745603Z Memory Properties: 2026-02-21T14:07:01.1745717Z Features: KERNEL_DISPATCH 2026-02-21T14:07:01.1745860Z Fast F16 Operation: TRUE 2026-02-21T14:07:01.1746019Z Wavefront Size: 64(0x40) 2026-02-21T14:07:01.1746178Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1746321Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1746447Z x 1024(0x400) 2026-02-21T14:07:01.1746584Z y 1024(0x400) 2026-02-21T14:07:01.1746719Z z 1024(0x400) 2026-02-21T14:07:01.1746903Z Max Waves Per CU: 32(0x20) 2026-02-21T14:07:01.1747065Z Max Work-item Per CU: 2048(0x800) 2026-02-21T14:07:01.1747225Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1747363Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1747481Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1747623Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1747757Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1747911Z Max fbarriers/Workgrp: 32 2026-02-21T14:07:01.1748081Z Packet Processor uCode:: 185 2026-02-21T14:07:01.1748285Z SDMA engine uCode:: 24 2026-02-21T14:07:01.1748447Z IOMMU Support:: None 2026-02-21T14:07:01.1748581Z Pool Info: 2026-02-21T14:07:01.1748684Z Pool 1 2026-02-21T14:07:01.1748816Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T14:07:01.1749016Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1749174Z Allocatable: TRUE 2026-02-21T14:07:01.1749337Z Alloc Granule: 4KB 2026-02-21T14:07:01.1749506Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1749674Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1749838Z Accessible by all: FALSE 2026-02-21T14:07:01.1749974Z Pool 2 2026-02-21T14:07:01.1750107Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T14:07:01.1750317Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1750472Z Allocatable: TRUE 2026-02-21T14:07:01.1750632Z Alloc Granule: 4KB 2026-02-21T14:07:01.1750802Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1750969Z Alloc Alignment: 4KB 
2026-02-21T14:07:01.1751132Z Accessible by all: FALSE 2026-02-21T14:07:01.1751269Z Pool 3 2026-02-21T14:07:01.1751398Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T14:07:01.1751551Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1751705Z Allocatable: TRUE 2026-02-21T14:07:01.1751863Z Alloc Granule: 4KB 2026-02-21T14:07:01.1752029Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1752199Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1752362Z Accessible by all: FALSE 2026-02-21T14:07:01.1752501Z Pool 4 2026-02-21T14:07:01.1752625Z Segment: GROUP 2026-02-21T14:07:01.1752774Z Size: 64(0x40) KB 2026-02-21T14:07:01.1752926Z Allocatable: FALSE 2026-02-21T14:07:01.1753086Z Alloc Granule: 0KB 2026-02-21T14:07:01.1753251Z Alloc Recommended Granule:0KB 2026-02-21T14:07:01.1753418Z Alloc Alignment: 0KB 2026-02-21T14:07:01.1753579Z Accessible by all: FALSE 2026-02-21T14:07:01.1753716Z ISA Info: 2026-02-21T14:07:01.1753817Z ISA 1 2026-02-21T14:07:01.1753987Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T14:07:01.1754162Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1754330Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1754495Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1754662Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1754821Z Fast f16: TRUE 2026-02-21T14:07:01.1754979Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1755123Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1755256Z x 1024(0x400) 2026-02-21T14:07:01.1755395Z y 1024(0x400) 2026-02-21T14:07:01.1755534Z z 1024(0x400) 2026-02-21T14:07:01.1755687Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1755827Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1755993Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1756135Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1756276Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1756428Z FBarrier Max Size: 32 2026-02-21T14:07:01.1756564Z ISA 2 2026-02-21T14:07:01.1756709Z Name: 
amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T14:07:01.1756892Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1757058Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1757222Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1757392Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1757550Z Fast f16: TRUE 2026-02-21T14:07:01.1757710Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1757855Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1757987Z x 1024(0x400) 2026-02-21T14:07:01.1758128Z y 1024(0x400) 2026-02-21T14:07:01.1758266Z z 1024(0x400) 2026-02-21T14:07:01.1758416Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1758554Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1758680Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1758820Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1758964Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1759117Z FBarrier Max Size: 32 2026-02-21T14:07:01.1759251Z ******* 2026-02-21T14:07:01.1759347Z Agent 10 2026-02-21T14:07:01.1759439Z ******* 2026-02-21T14:07:01.1759554Z Name: gfx942 2026-02-21T14:07:01.1759703Z Uuid: GPU-29cccc52ed98c7d1 2026-02-21T14:07:01.1759774Z Marketing Name: AMD Instinct MI325X 2026-02-21T14:07:01.1759838Z Vendor Name: AMD 2026-02-21T14:07:01.1759901Z Feature: KERNEL_DISPATCH 2026-02-21T14:07:01.1759964Z Profile: BASE_PROFILE 2026-02-21T14:07:01.1760031Z Float Round Mode: NEAR 2026-02-21T14:07:01.1760140Z Max Queue Number: 128(0x80) 2026-02-21T14:07:01.1760203Z Queue Min Size: 64(0x40) 2026-02-21T14:07:01.1760268Z Queue Max Size: 131072(0x20000) 2026-02-21T14:07:01.1760328Z Queue Type: MULTI 2026-02-21T14:07:01.1760383Z Node: 9 2026-02-21T14:07:01.1760445Z Device Type: GPU 2026-02-21T14:07:01.1760486Z Cache Info: 2026-02-21T14:07:01.1760546Z L1: 32(0x20) KB 2026-02-21T14:07:01.1760602Z L2: 4096(0x1000) KB 2026-02-21T14:07:01.1760679Z L3: 262144(0x40000) KB 2026-02-21T14:07:01.1760738Z Chip ID: 29861(0x74a5) 2026-02-21T14:07:01.1760803Z ASIC Revision: 1(0x1) 
2026-02-21T14:07:01.1760875Z Cacheline Size: 128(0x80) 2026-02-21T14:07:01.1760939Z Max Clock Freq. (MHz): 2100 2026-02-21T14:07:01.1761035Z BDFID: 38144 2026-02-21T14:07:01.1761102Z Internal Node ID: 9 2026-02-21T14:07:01.1761164Z Compute Unit: 304 2026-02-21T14:07:01.1761225Z SIMDs per CU: 4 2026-02-21T14:07:01.1761289Z Shader Engines: 32 2026-02-21T14:07:01.1761358Z Shader Arrs. per Eng.: 1 2026-02-21T14:07:01.1761428Z WatchPts on Addr. Ranges:4 2026-02-21T14:07:01.1761495Z Coherent Host Access: FALSE 2026-02-21T14:07:01.1761542Z Memory Properties: 2026-02-21T14:07:01.1761595Z Features: KERNEL_DISPATCH 2026-02-21T14:07:01.1761662Z Fast F16 Operation: TRUE 2026-02-21T14:07:01.1761730Z Wavefront Size: 64(0x40) 2026-02-21T14:07:01.1761797Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1761845Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1761899Z x 1024(0x400) 2026-02-21T14:07:01.1761954Z y 1024(0x400) 2026-02-21T14:07:01.1762008Z z 1024(0x400) 2026-02-21T14:07:01.1762071Z Max Waves Per CU: 32(0x20) 2026-02-21T14:07:01.1762140Z Max Work-item Per CU: 2048(0x800) 2026-02-21T14:07:01.1762204Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1762248Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1762307Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1762361Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1762416Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1762484Z Max fbarriers/Workgrp: 32 2026-02-21T14:07:01.1762559Z Packet Processor uCode:: 185 2026-02-21T14:07:01.1762625Z SDMA engine uCode:: 24 2026-02-21T14:07:01.1762691Z IOMMU Support:: None 2026-02-21T14:07:01.1762734Z Pool Info: 2026-02-21T14:07:01.1762775Z Pool 1 2026-02-21T14:07:01.1762846Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2026-02-21T14:07:01.1762909Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1762977Z Allocatable: TRUE 2026-02-21T14:07:01.1763074Z Alloc Granule: 4KB 2026-02-21T14:07:01.1763152Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1763221Z Alloc 
Alignment: 4KB 2026-02-21T14:07:01.1763290Z Accessible by all: FALSE 2026-02-21T14:07:01.1763331Z Pool 2 2026-02-21T14:07:01.1763401Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2026-02-21T14:07:01.1763459Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1763523Z Allocatable: TRUE 2026-02-21T14:07:01.1763590Z Alloc Granule: 4KB 2026-02-21T14:07:01.1763663Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1763730Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1763801Z Accessible by all: FALSE 2026-02-21T14:07:01.1763841Z Pool 3 2026-02-21T14:07:01.1763936Z Segment: GLOBAL; FLAGS: FINE GRAINED 2026-02-21T14:07:01.1763993Z Size: 268419072(0xfffc000) KB 2026-02-21T14:07:01.1764059Z Allocatable: TRUE 2026-02-21T14:07:01.1764124Z Alloc Granule: 4KB 2026-02-21T14:07:01.1764195Z Alloc Recommended Granule:2048KB 2026-02-21T14:07:01.1764263Z Alloc Alignment: 4KB 2026-02-21T14:07:01.1764330Z Accessible by all: FALSE 2026-02-21T14:07:01.1764369Z Pool 4 2026-02-21T14:07:01.1764434Z Segment: GROUP 2026-02-21T14:07:01.1764495Z Size: 64(0x40) KB 2026-02-21T14:07:01.1764561Z Allocatable: FALSE 2026-02-21T14:07:01.1764628Z Alloc Granule: 0KB 2026-02-21T14:07:01.1764703Z Alloc Recommended Granule:0KB 2026-02-21T14:07:01.1764769Z Alloc Alignment: 0KB 2026-02-21T14:07:01.1764837Z Accessible by all: FALSE 2026-02-21T14:07:01.1764880Z ISA Info: 2026-02-21T14:07:01.1764920Z ISA 1 2026-02-21T14:07:01.1764994Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2026-02-21T14:07:01.1765067Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1765132Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1765203Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1765274Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1765338Z Fast f16: TRUE 2026-02-21T14:07:01.1765406Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1765454Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1765512Z x 1024(0x400) 2026-02-21T14:07:01.1765567Z y 1024(0x400) 2026-02-21T14:07:01.1765659Z z 1024(0x400) 
2026-02-21T14:07:01.1765728Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1765775Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1765832Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1765891Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1765982Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1766051Z FBarrier Max Size: 32 2026-02-21T14:07:01.1766095Z ISA 2 2026-02-21T14:07:01.1766179Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2026-02-21T14:07:01.1766253Z Machine Models: HSA_MACHINE_MODEL_LARGE 2026-02-21T14:07:01.1766319Z Profiles: HSA_PROFILE_BASE 2026-02-21T14:07:01.1766391Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1766460Z Default Rounding Mode: NEAR 2026-02-21T14:07:01.1766520Z Fast f16: TRUE 2026-02-21T14:07:01.1766591Z Workgroup Max Size: 1024(0x400) 2026-02-21T14:07:01.1766640Z Workgroup Max Size per Dimension: 2026-02-21T14:07:01.1766697Z x 1024(0x400) 2026-02-21T14:07:01.1766754Z y 1024(0x400) 2026-02-21T14:07:01.1766834Z z 1024(0x400) 2026-02-21T14:07:01.1766899Z Grid Max Size: 4294967295(0xffffffff) 2026-02-21T14:07:01.1766945Z Grid Max Size per Dimension: 2026-02-21T14:07:01.1767005Z x 4294967295(0xffffffff) 2026-02-21T14:07:01.1767061Z y 4294967295(0xffffffff) 2026-02-21T14:07:01.1767116Z z 4294967295(0xffffffff) 2026-02-21T14:07:01.1767188Z FBarrier Max Size: 32 2026-02-21T14:07:01.1767228Z *** Done *** 2026-02-21T14:07:01.2005272Z ##[group]Run set -x 2026-02-21T14:07:01.2005388Z set -x 2026-02-21T14:07:01.2005435Z apt-get update 2026-02-21T14:07:01.2005485Z apt-get install -y git 2026-02-21T14:07:01.2005680Z shell: bash -l {0} 2026-02-21T14:07:01.2005745Z env: 2026-02-21T14:07:01.2005796Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:07:01.2005839Z ##[endgroup] 2026-02-21T14:07:01.2925412Z + apt-get update 2026-02-21T14:07:01.3618786Z Get:1 https://repo.radeon.com/amdgpu/6.4.4/ubuntu noble InRelease [5465 B] 2026-02-21T14:07:01.3689730Z Get:2 https://repo.radeon.com/rocm/apt/6.4.4 noble InRelease [2605 B] 
2026-02-21T14:07:01.4576331Z Get:3 https://repo.radeon.com/amdgpu/6.4.4/ubuntu noble/main amd64 Packages [14.6 kB] 2026-02-21T14:07:01.5252010Z Get:4 https://repo.radeon.com/rocm/apt/6.4.4 noble/main amd64 Packages [60.5 kB] 2026-02-21T14:07:01.6052679Z Get:5 http://archive.ubuntu.com/ubuntu noble InRelease [256 kB] 2026-02-21T14:07:01.6109344Z Get:6 http://security.ubuntu.com/ubuntu noble-security InRelease [126 kB] 2026-02-21T14:07:02.0709014Z Get:7 http://security.ubuntu.com/ubuntu noble-security/restricted amd64 Packages [3196 kB] 2026-02-21T14:07:02.1055783Z Get:8 http://archive.ubuntu.com/ubuntu noble-updates InRelease [126 kB] 2026-02-21T14:07:02.2260173Z Get:9 http://archive.ubuntu.com/ubuntu noble-backports InRelease [126 kB] 2026-02-21T14:07:02.3470954Z Get:10 http://archive.ubuntu.com/ubuntu noble/multiverse amd64 Packages [331 kB] 2026-02-21T14:07:02.4290690Z Get:11 http://archive.ubuntu.com/ubuntu noble/universe amd64 Packages [19.3 MB] 2026-02-21T14:07:02.6166675Z Get:12 http://security.ubuntu.com/ubuntu noble-security/universe amd64 Packages [1207 kB] 2026-02-21T14:07:02.6400782Z Get:13 http://security.ubuntu.com/ubuntu noble-security/multiverse amd64 Packages [34.8 kB] 2026-02-21T14:07:02.6411806Z Get:14 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages [1857 kB] 2026-02-21T14:07:02.9463981Z Get:15 http://archive.ubuntu.com/ubuntu noble/main amd64 Packages [1808 kB] 2026-02-21T14:07:02.9647231Z Get:16 http://archive.ubuntu.com/ubuntu noble/restricted amd64 Packages [117 kB] 2026-02-21T14:07:02.9659242Z Get:17 http://archive.ubuntu.com/ubuntu noble-updates/multiverse amd64 Packages [38.1 kB] 2026-02-21T14:07:02.9660128Z Get:18 http://archive.ubuntu.com/ubuntu noble-updates/universe amd64 Packages [2016 kB] 2026-02-21T14:07:02.9848416Z Get:19 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages [2240 kB] 2026-02-21T14:07:03.0077555Z Get:20 http://archive.ubuntu.com/ubuntu noble-updates/restricted amd64 Packages [3381 
kB] 2026-02-21T14:07:03.0492882Z Get:21 http://archive.ubuntu.com/ubuntu noble-backports/universe amd64 Packages [34.6 kB] 2026-02-21T14:07:03.0493476Z Get:22 http://archive.ubuntu.com/ubuntu noble-backports/main amd64 Packages [49.5 kB] 2026-02-21T14:07:03.5166463Z Fetched 36.3 MB in 2s (16.5 MB/s) 2026-02-21T14:07:03.9148799Z Reading package lists... 2026-02-21T14:07:03.9250210Z W: https://repo.radeon.com/amdgpu/6.4.4/ubuntu/dists/noble/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details. 2026-02-21T14:07:03.9251345Z W: https://repo.radeon.com/rocm/apt/6.4.4/dists/noble/InRelease: Key is stored in legacy trusted.gpg keyring (/etc/apt/trusted.gpg), see the DEPRECATION section in apt-key(8) for details. 2026-02-21T14:07:03.9255365Z + apt-get install -y git 2026-02-21T14:07:04.3283850Z Reading package lists... 2026-02-21T14:07:04.4355235Z Building dependency tree... 2026-02-21T14:07:04.4355846Z Reading state information... 2026-02-21T14:07:04.5274022Z The following additional packages will be installed: 2026-02-21T14:07:04.5276458Z git-man less libcbor0.10 libcurl3t64-gnutls liberror-perl libfido2-1 2026-02-21T14:07:04.5277289Z libxmuu1 openssh-client xauth 2026-02-21T14:07:04.5280426Z Suggested packages: 2026-02-21T14:07:04.5280772Z gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-email git-gui 2026-02-21T14:07:04.5281728Z gitk gitweb git-cvs git-mediawiki git-svn keychain libpam-ssh monkeysphere 2026-02-21T14:07:04.5282116Z ssh-askpass 2026-02-21T14:07:04.5489856Z The following NEW packages will be installed: 2026-02-21T14:07:04.5493499Z git git-man less libcbor0.10 libcurl3t64-gnutls liberror-perl libfido2-1 2026-02-21T14:07:04.5494109Z libxmuu1 openssh-client xauth 2026-02-21T14:07:04.7498834Z 0 upgraded, 10 newly installed, 0 to remove and 101 not upgraded. 2026-02-21T14:07:04.7499318Z Need to get 6330 kB of archives. 
2026-02-21T14:07:04.7499740Z After this operation, 29.8 MB of additional disk space will be used. 2026-02-21T14:07:04.7500422Z Get:1 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 less amd64 590-2ubuntu2.1 [142 kB] 2026-02-21T14:07:05.1828841Z Get:2 http://archive.ubuntu.com/ubuntu noble/main amd64 libcbor0.10 amd64 0.10.2-1.2ubuntu2 [25.8 kB] 2026-02-21T14:07:05.1939260Z Get:3 http://archive.ubuntu.com/ubuntu noble/main amd64 libfido2-1 amd64 1.14.0-1build3 [83.5 kB] 2026-02-21T14:07:05.2354458Z Get:4 http://archive.ubuntu.com/ubuntu noble/main amd64 libxmuu1 amd64 2:1.1.3-3build2 [8958 B] 2026-02-21T14:07:05.2393359Z Get:5 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 openssh-client amd64 1:9.6p1-3ubuntu13.14 [906 kB] 2026-02-21T14:07:05.4875538Z Get:6 http://archive.ubuntu.com/ubuntu noble/main amd64 xauth amd64 1:1.1.2-1build1 [25.6 kB] 2026-02-21T14:07:05.4901383Z Get:7 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 libcurl3t64-gnutls amd64 8.5.0-2ubuntu10.6 [333 kB] 2026-02-21T14:07:05.5229422Z Get:8 http://archive.ubuntu.com/ubuntu noble/main amd64 liberror-perl all 0.17029-2 [25.6 kB] 2026-02-21T14:07:05.5248714Z Get:9 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git-man all 1:2.43.0-1ubuntu7.3 [1100 kB] 2026-02-21T14:07:05.6002818Z Get:10 http://archive.ubuntu.com/ubuntu noble-updates/main amd64 git amd64 1:2.43.0-1ubuntu7.3 [3680 kB] 2026-02-21T14:07:05.7868261Z debconf: delaying package configuration, since apt-utils is not installed 2026-02-21T14:07:05.8019882Z Fetched 6330 kB in 1s (5446 kB/s) 2026-02-21T14:07:05.8280806Z Selecting previously unselected package less. 2026-02-21T14:07:05.8309381Z (Reading database ... 2026-02-21T14:07:05.8315544Z (Reading database ... 5% 2026-02-21T14:07:05.8327080Z (Reading database ... 10% 2026-02-21T14:07:05.8333320Z (Reading database ... 15% 2026-02-21T14:07:05.8339206Z (Reading database ... 20% 2026-02-21T14:07:05.8344487Z (Reading database ... 
25% 2026-02-21T14:07:05.8353427Z (Reading database ... 30% 2026-02-21T14:07:05.8355372Z (Reading database ... 35% 2026-02-21T14:07:05.8355512Z (Reading database ... 40% 2026-02-21T14:07:05.8355649Z (Reading database ... 45% 2026-02-21T14:07:05.8355794Z (Reading database ... 50% 2026-02-21T14:07:05.8355932Z (Reading database ... 55% 2026-02-21T14:07:05.8356073Z (Reading database ... 60% 2026-02-21T14:07:05.8356275Z (Reading database ... 65% 2026-02-21T14:07:05.8356415Z (Reading database ... 70% 2026-02-21T14:07:05.8356559Z (Reading database ... 75% 2026-02-21T14:07:05.8356701Z (Reading database ... 80% 2026-02-21T14:07:05.8356836Z (Reading database ... 85% 2026-02-21T14:07:05.8356975Z (Reading database ... 90% 2026-02-21T14:07:05.8357116Z (Reading database ... 95% 2026-02-21T14:07:05.8357255Z (Reading database ... 100% 2026-02-21T14:07:05.8357483Z (Reading database ... 28634 files and directories currently installed.) 2026-02-21T14:07:05.8359599Z Preparing to unpack .../0-less_590-2ubuntu2.1_amd64.deb ... 2026-02-21T14:07:05.8371103Z Unpacking less (590-2ubuntu2.1) ... 2026-02-21T14:07:05.8466761Z Selecting previously unselected package libcbor0.10:amd64. 2026-02-21T14:07:05.8481321Z Preparing to unpack .../1-libcbor0.10_0.10.2-1.2ubuntu2_amd64.deb ... 2026-02-21T14:07:05.8484581Z Unpacking libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T14:07:05.8558144Z Selecting previously unselected package libfido2-1:amd64. 2026-02-21T14:07:05.8569668Z Preparing to unpack .../2-libfido2-1_1.14.0-1build3_amd64.deb ... 2026-02-21T14:07:05.8571660Z Unpacking libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T14:07:05.8649989Z Selecting previously unselected package libxmuu1:amd64. 2026-02-21T14:07:05.8663723Z Preparing to unpack .../3-libxmuu1_2%3a1.1.3-3build2_amd64.deb ... 2026-02-21T14:07:05.8665661Z Unpacking libxmuu1:amd64 (2:1.1.3-3build2) ... 2026-02-21T14:07:05.8740995Z Selecting previously unselected package openssh-client. 
2026-02-21T14:07:05.8755850Z Preparing to unpack .../4-openssh-client_1%3a9.6p1-3ubuntu13.14_amd64.deb ... 2026-02-21T14:07:05.8783666Z Unpacking openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T14:07:05.8934348Z Selecting previously unselected package xauth. 2026-02-21T14:07:05.8946534Z Preparing to unpack .../5-xauth_1%3a1.1.2-1build1_amd64.deb ... 2026-02-21T14:07:05.8947465Z Unpacking xauth (1:1.1.2-1build1) ... 2026-02-21T14:07:05.9026660Z Selecting previously unselected package libcurl3t64-gnutls:amd64. 2026-02-21T14:07:05.9040934Z Preparing to unpack .../6-libcurl3t64-gnutls_8.5.0-2ubuntu10.6_amd64.deb ... 2026-02-21T14:07:05.9043058Z Unpacking libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T14:07:05.9129051Z Selecting previously unselected package liberror-perl. 2026-02-21T14:07:05.9140098Z Preparing to unpack .../7-liberror-perl_0.17029-2_all.deb ... 2026-02-21T14:07:05.9140997Z Unpacking liberror-perl (0.17029-2) ... 2026-02-21T14:07:05.9210585Z Selecting previously unselected package git-man. 2026-02-21T14:07:05.9221990Z Preparing to unpack .../8-git-man_1%3a2.43.0-1ubuntu7.3_all.deb ... 2026-02-21T14:07:05.9222923Z Unpacking git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T14:07:05.9310474Z Selecting previously unselected package git. 2026-02-21T14:07:05.9321441Z Preparing to unpack .../9-git_1%3a2.43.0-1ubuntu7.3_amd64.deb ... 2026-02-21T14:07:05.9347551Z Unpacking git (1:2.43.0-1ubuntu7.3) ... 2026-02-21T14:07:05.9944105Z Setting up libcbor0.10:amd64 (0.10.2-1.2ubuntu2) ... 2026-02-21T14:07:05.9948764Z Setting up libcurl3t64-gnutls:amd64 (8.5.0-2ubuntu10.6) ... 2026-02-21T14:07:05.9952340Z Setting up less (590-2ubuntu2.1) ... 2026-02-21T14:07:05.9978309Z Setting up liberror-perl (0.17029-2) ... 2026-02-21T14:07:05.9981528Z Setting up git-man (1:2.43.0-1ubuntu7.3) ... 2026-02-21T14:07:05.9985417Z Setting up libfido2-1:amd64 (1.14.0-1build3) ... 2026-02-21T14:07:05.9989672Z Setting up libxmuu1:amd64 (2:1.1.3-3build2) ... 
2026-02-21T14:07:05.9994084Z Setting up openssh-client (1:9.6p1-3ubuntu13.14) ... 2026-02-21T14:07:06.0242274Z Setting up git (1:2.43.0-1ubuntu7.3) ... 2026-02-21T14:07:06.0269953Z Setting up xauth (1:1.1.2-1build1) ... 2026-02-21T14:07:06.0274848Z Processing triggers for libc-bin (2.39-0ubuntu8.6) ... 2026-02-21T14:07:06.0680368Z ##[group]Run actions/checkout@v6 2026-02-21T14:07:06.0680493Z with: 2026-02-21T14:07:06.0680599Z repository: pytorch/helion 2026-02-21T14:07:06.0680781Z token: *** 2026-02-21T14:07:06.0680866Z ssh-strict: true 2026-02-21T14:07:06.0680954Z ssh-user: git 2026-02-21T14:07:06.0681047Z persist-credentials: true 2026-02-21T14:07:06.0681151Z clean: true 2026-02-21T14:07:06.0681245Z sparse-checkout-cone-mode: true 2026-02-21T14:07:06.0681356Z fetch-depth: 1 2026-02-21T14:07:06.0681442Z fetch-tags: false 2026-02-21T14:07:06.0681530Z show-progress: true 2026-02-21T14:07:06.0681617Z lfs: false 2026-02-21T14:07:06.0681699Z submodules: false 2026-02-21T14:07:06.0681787Z set-safe-directory: true 2026-02-21T14:07:06.0682025Z env: 2026-02-21T14:07:06.0682107Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:07:06.0682215Z ##[endgroup] 2026-02-21T14:07:06.0706026Z ##[command]/usr/bin/docker exec bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b sh -c "cat /etc/*release | grep ^ID" 2026-02-21T14:07:06.3602257Z Syncing repository: pytorch/helion 2026-02-21T14:07:06.3602872Z ##[group]Getting Git version info 2026-02-21T14:07:06.3603008Z Working directory is '/__w/helion/helion' 2026-02-21T14:07:06.3603216Z [command]/usr/bin/git version 2026-02-21T14:07:06.3603331Z git version 2.43.0 2026-02-21T14:07:06.3607874Z ##[endgroup] 2026-02-21T14:07:06.3618281Z Temporarily overriding HOME='/__w/_temp/6286fc4b-2e7a-4823-91cd-66580a2a3985' before making global git config changes 2026-02-21T14:07:06.3618820Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T14:07:06.3620387Z [command]/usr/bin/git config --global 
--add safe.directory /__w/helion/helion 2026-02-21T14:07:06.3640725Z Deleting the contents of '/__w/helion/helion' 2026-02-21T14:07:06.3642966Z ##[group]Initializing the repository 2026-02-21T14:07:06.3644476Z [command]/usr/bin/git init /__w/helion/helion 2026-02-21T14:07:06.3671547Z hint: Using 'master' as the name for the initial branch. This default branch name 2026-02-21T14:07:06.3671793Z hint: is subject to change. To configure the initial branch name to use in all 2026-02-21T14:07:06.3672008Z hint: of your new repositories, which will suppress this warning, call: 2026-02-21T14:07:06.3672164Z hint: 2026-02-21T14:07:06.3672305Z hint: git config --global init.defaultBranch 2026-02-21T14:07:06.3672442Z hint: 2026-02-21T14:07:06.3672569Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2026-02-21T14:07:06.3672777Z hint: 'development'. The just-created branch can be renamed via this command: 2026-02-21T14:07:06.3672946Z hint: 2026-02-21T14:07:06.3673033Z hint: git branch -m 2026-02-21T14:07:06.3674566Z Initialized empty Git repository in /__w/helion/helion/.git/ 2026-02-21T14:07:06.3682479Z [command]/usr/bin/git remote add origin https://github.com/pytorch/helion 2026-02-21T14:07:06.3703688Z ##[endgroup] 2026-02-21T14:07:06.3703989Z ##[group]Disabling automatic garbage collection 2026-02-21T14:07:06.3705090Z [command]/usr/bin/git config --local gc.auto 0 2026-02-21T14:07:06.3718452Z ##[endgroup] 2026-02-21T14:07:06.3718668Z ##[group]Setting up auth 2026-02-21T14:07:06.3719042Z Removing SSH command configuration 2026-02-21T14:07:06.3721817Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T14:07:06.3741296Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2026-02-21T14:07:06.3896074Z Removing HTTP extra header 2026-02-21T14:07:06.3897908Z [command]/usr/bin/git config --local 
--name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T14:07:06.3909809Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T14:07:06.4051359Z Removing includeIf entries pointing to credentials config files 2026-02-21T14:07:06.4053367Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T14:07:06.4071923Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T14:07:06.4198341Z [command]/usr/bin/git config --file /__w/_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config http.https://github.com/.extraheader AUTHORIZATION: basic *** 2026-02-21T14:07:06.4215344Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T14:07:06.4228397Z [command]/usr/bin/git config --local includeIf.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T14:07:06.4242281Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T14:07:06.4254839Z [command]/usr/bin/git config --local includeIf.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T14:07:06.4266211Z ##[endgroup] 2026-02-21T14:07:06.4266401Z ##[group]Fetching the repository 2026-02-21T14:07:06.4270193Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +874a7d0cadab18218a84ad3579d329dc95c51820:refs/remotes/origin/main 2026-02-21T14:07:06.9923116Z From 
https://github.com/pytorch/helion 2026-02-21T14:07:06.9923457Z * [new ref] 874a7d0cadab18218a84ad3579d329dc95c51820 -> origin/main 2026-02-21T14:07:06.9942435Z [command]/usr/bin/git branch --list --remote origin/main 2026-02-21T14:07:06.9962513Z origin/main 2026-02-21T14:07:06.9968332Z [command]/usr/bin/git rev-parse refs/remotes/origin/main 2026-02-21T14:07:06.9984230Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T14:07:06.9987271Z ##[endgroup] 2026-02-21T14:07:06.9987523Z ##[group]Determining the checkout info 2026-02-21T14:07:06.9989715Z ##[endgroup] 2026-02-21T14:07:06.9993435Z [command]/usr/bin/git sparse-checkout disable 2026-02-21T14:07:07.0018570Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2026-02-21T14:07:07.0032211Z ##[group]Checking out the ref 2026-02-21T14:07:07.0033865Z [command]/usr/bin/git checkout --progress --force -B main refs/remotes/origin/main 2026-02-21T14:07:07.0200244Z Switched to a new branch 'main' 2026-02-21T14:07:07.0201772Z branch 'main' set up to track 'origin/main'. 
2026-02-21T14:07:07.0204766Z ##[endgroup] 2026-02-21T14:07:07.0250667Z [command]/usr/bin/git log -1 --format=%H 2026-02-21T14:07:07.0263274Z 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T14:07:07.0415773Z ##[group]Run actions/setup-python@v6 2026-02-21T14:07:07.0415914Z with: 2026-02-21T14:07:07.0416004Z python-version: 3.12 2026-02-21T14:07:07.0416098Z check-latest: false 2026-02-21T14:07:07.0416237Z token: *** 2026-02-21T14:07:07.0416326Z update-environment: true 2026-02-21T14:07:07.0416436Z allow-prereleases: false 2026-02-21T14:07:07.0416547Z freethreaded: false 2026-02-21T14:07:07.0416640Z env: 2026-02-21T14:07:07.0416728Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:07:07.0416836Z ##[endgroup] 2026-02-21T14:07:07.0419133Z ##[command]/usr/bin/docker exec bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b sh -c "cat /etc/*release | grep ^ID" 2026-02-21T14:07:07.2449229Z ##[group]Installed versions 2026-02-21T14:07:07.2455536Z Version 3.12 was not found in the local cache 2026-02-21T14:07:07.8065345Z Version 3.12 is available for downloading 2026-02-21T14:07:07.8066260Z Download from "https://github.com/actions/python-versions/releases/download/3.12.12-18393146713/python-3.12.12-linux-24.04-x64.tar.gz" 2026-02-21T14:07:08.3106359Z Extract downloaded archive 2026-02-21T14:07:08.3213421Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/01644168-dd03-44aa-8106-31d4a25fe101 -f /__w/_temp/9e1510d8-85f1-42ba-8f3c-480f33987865 2026-02-21T14:07:09.5758982Z Execute installation script 2026-02-21T14:07:09.5854704Z Check if Python hostedtoolcache folder exist... 2026-02-21T14:07:09.5856147Z Creating Python hostedtoolcache folder... 
2026-02-21T14:07:09.5861462Z Create Python 3.12.12 folder 2026-02-21T14:07:09.5867156Z Copy Python binaries to hostedtoolcache folder 2026-02-21T14:07:10.1101156Z Create additional symlinks (Required for the UsePythonVersion Azure Pipelines task and the setup-python GitHub Action) 2026-02-21T14:07:10.1122949Z Upgrading pip... 2026-02-21T14:07:11.1114418Z Looking in links: /tmp/tmp4xibc5t7 2026-02-21T14:07:11.1115561Z Requirement already satisfied: pip in /__w/_tool/Python/3.12.12/x64/lib/python3.12/site-packages (25.0.1) 2026-02-21T14:07:11.1156887Z ##[error]WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2026-02-21T14:07:11.5353233Z ##[error]WARNING: The directory '/github/home/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag. 
2026-02-21T14:07:11.6664688Z Collecting pip 2026-02-21T14:07:11.7026243Z Downloading pip-26.0.1-py3-none-any.whl.metadata (4.7 kB) 2026-02-21T14:07:11.7088716Z Downloading pip-26.0.1-py3-none-any.whl (1.8 MB) 2026-02-21T14:07:11.7318214Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 165.8 MB/s eta 0:00:00 2026-02-21T14:07:11.7318480Z 2026-02-21T14:07:11.7393347Z Installing collected packages: pip 2026-02-21T14:07:11.7393607Z Attempting uninstall: pip 2026-02-21T14:07:11.7402957Z Found existing installation: pip 25.0.1 2026-02-21T14:07:11.7555938Z Uninstalling pip-25.0.1: 2026-02-21T14:07:11.7584236Z Successfully uninstalled pip-25.0.1 2026-02-21T14:07:12.2089379Z Successfully installed pip-26.0.1 2026-02-21T14:07:12.2464448Z Create complete file 2026-02-21T14:07:12.2496238Z Successfully set up CPython (3.12.12) 2026-02-21T14:07:12.2496662Z ##[endgroup] 2026-02-21T14:07:12.2687357Z ##[group]Run astral-sh/setup-uv@v7 2026-02-21T14:07:12.2687500Z with: 2026-02-21T14:07:12.2687593Z activate-environment: false 2026-02-21T14:07:12.2687730Z working-directory: /home/runner/_work/helion/helion 2026-02-21T14:07:12.2687938Z github-token: *** 2026-02-21T14:07:12.2688028Z enable-cache: auto 2026-02-21T14:07:12.2688301Z cache-dependency-glob: **/*requirements*.txt **/*requirements*.in **/*constraints*.txt **/*constraints*.in **/pyproject.toml **/uv.lock **/*.py.lock 2026-02-21T14:07:12.2688566Z restore-cache: true 2026-02-21T14:07:12.2688657Z save-cache: true 2026-02-21T14:07:12.2688740Z prune-cache: true 2026-02-21T14:07:12.2688829Z cache-python: false 2026-02-21T14:07:12.2688927Z ignore-nothing-to-cache: false 2026-02-21T14:07:12.2689044Z ignore-empty-workdir: false 2026-02-21T14:07:12.2689155Z add-problem-matchers: true 2026-02-21T14:07:12.2689259Z resolution-strategy: highest 2026-02-21T14:07:12.2689367Z env: 2026-02-21T14:07:12.2689467Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:07:12.2689592Z pythonLocation: /__w/_tool/Python/3.12.12/x64 
2026-02-21T14:07:12.2689767Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:07:12.2689923Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:12.2690062Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:12.2690198Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:12.2690336Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:07:12.2690460Z ##[endgroup] 2026-02-21T14:07:12.2693917Z ##[command]/usr/bin/docker exec bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b sh -c "cat /etc/*release | grep ^ID" 2026-02-21T14:07:12.4651762Z (node:700) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T14:07:12.4652164Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T14:07:12.4701486Z Trying to find version for uv in: /__w/helion/helion/uv.toml 2026-02-21T14:07:12.4702717Z Could not find file: /__w/helion/helion/uv.toml 2026-02-21T14:07:12.4703160Z Trying to find version for uv in: /__w/helion/helion/pyproject.toml 2026-02-21T14:07:12.4708489Z Could not determine uv version from uv.toml or pyproject.toml. Falling back to latest. 2026-02-21T14:07:12.4709540Z Getting latest version from GitHub API... 2026-02-21T14:07:12.6921946Z manifest-file not provided, reading from local file. 2026-02-21T14:07:12.6948608Z manifest-file does not contain version 0.10.4, arch x86_64, platform unknown-linux-gnu. Falling back to GitHub releases. 2026-02-21T14:07:12.6949363Z Downloading uv from "https://github.com/astral-sh/uv/releases/download/0.10.4/uv-x86_64-unknown-linux-gnu.tar.gz" ... 
2026-02-21T14:07:12.9449548Z [command]/usr/bin/tar xz --warning=no-unknown-keyword --overwrite -C /__w/_temp/1cf3ea35-2468-49e5-a856-5e3c31ab64e5 -f /__w/_temp/768069fb-f62d-4189-aff9-2caf2ab656ee 2026-02-21T14:07:13.2052690Z Added /github/home/.local/bin to the path 2026-02-21T14:07:13.2052992Z Added /__w/_tool/uv/0.10.4/x86_64 to the path 2026-02-21T14:07:13.2053615Z Set UV_PYTHON_INSTALL_DIR to /github/home/.local/share/uv/python 2026-02-21T14:07:13.2053925Z Added /github/home/.local/share/uv/python to the path 2026-02-21T14:07:13.2061075Z Successfully installed uv version 0.10.4 2026-02-21T14:07:13.3373148Z ##[group]Run uv venv --python 3.12 2026-02-21T14:07:13.3373336Z uv venv --python 3.12 2026-02-21T14:07:13.3373611Z shell: bash -l {0} 2026-02-21T14:07:13.3373705Z env: 2026-02-21T14:07:13.3373798Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:07:13.3373934Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:13.3374101Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:07:13.3374264Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:13.3374408Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:13.3374554Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:13.3374698Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:07:13.3374864Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:07:13.3375011Z ##[endgroup] 2026-02-21T14:07:13.4894856Z Using CPython 3.12.12 interpreter at: /__w/_tool/Python/3.12.12/x64/bin/python3.12 2026-02-21T14:07:13.4895336Z Creating virtual environment at: .venv 2026-02-21T14:07:13.4898461Z Activate with: source .venv/bin/activate 2026-02-21T14:07:13.4952872Z ##[group]Run source .venv/bin/activate 2026-02-21T14:07:13.4953039Z source .venv/bin/activate 2026-02-21T14:07:13.4953277Z uv pip install -U "torch==2.9.*" --index-url https://download.pytorch.org/whl/rocm6.4 2026-02-21T14:07:13.4953552Z shell: bash -l {0} 
2026-02-21T14:07:13.4953649Z env: 2026-02-21T14:07:13.4953743Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:07:13.4953885Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:13.4954059Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:07:13.4954224Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:13.4954374Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:13.4954524Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:13.4954677Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:07:13.4954864Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:07:13.4955012Z ##[endgroup] 2026-02-21T14:07:13.8648867Z Resolved 11 packages in 302ms 2026-02-21T14:07:13.8701010Z Downloading sympy (6.0MiB) 2026-02-21T14:07:13.8727679Z Downloading pytorch-triton-rocm (261.8MiB) 2026-02-21T14:07:13.8766601Z Downloading networkx (2.0MiB) 2026-02-21T14:07:13.9044520Z Downloading torch (4.2GiB) 2026-02-21T14:07:14.1697553Z Downloaded networkx 2026-02-21T14:07:14.3157667Z Downloaded sympy 2026-02-21T14:07:16.0197037Z Downloaded pytorch-triton-rocm 2026-02-21T14:07:40.2725742Z Downloaded torch 2026-02-21T14:07:40.2729305Z Prepared 11 packages in 26.40s 2026-02-21T14:07:40.2923490Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T14:07:40.2926007Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T14:07:40.2926912Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T14:07:45.1773753Z Installed 11 packages in 4.90s 2026-02-21T14:07:45.1790931Z + filelock==3.20.0 2026-02-21T14:07:45.1791257Z + fsspec==2025.12.0 2026-02-21T14:07:45.1791643Z + jinja2==3.1.6 2026-02-21T14:07:45.1791811Z + markupsafe==3.0.2 2026-02-21T14:07:45.1791952Z + mpmath==1.3.0 2026-02-21T14:07:45.1792075Z + networkx==3.6.1 2026-02-21T14:07:45.1792226Z + pytorch-triton-rocm==3.5.1 2026-02-21T14:07:45.1792393Z + setuptools==70.2.0 2026-02-21T14:07:45.1792522Z + sympy==1.14.0 2026-02-21T14:07:45.1793863Z + torch==2.9.1+rocm6.4 2026-02-21T14:07:45.1794009Z + typing-extensions==4.15.0 2026-02-21T14:07:45.1968710Z ##[group]Run source .venv/bin/activate 2026-02-21T14:07:45.1968882Z source .venv/bin/activate 2026-02-21T14:07:45.1969048Z SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0" uv pip install -e .'[dev]' 2026-02-21T14:07:45.1969256Z python -c "import helion; print(helion.__name__)" 2026-02-21T14:07:45.1969789Z shell: bash -l {0} 2026-02-21T14:07:45.1969877Z env: 2026-02-21T14:07:45.1969964Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:07:45.1970093Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:45.1970251Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:07:45.1970406Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:45.1970543Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:45.1970679Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:07:45.1970816Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:07:45.1970990Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:07:45.1971130Z ##[endgroup] 2026-02-21T14:07:45.8207648Z Resolved 30 packages in 552ms 2026-02-21T14:07:45.8216617Z Building helion @ file:///__w/helion/helion 2026-02-21T14:07:45.8295552Z Downloading pygments (1.2MiB) 2026-02-21T14:07:45.8308765Z Downloading virtualenv (5.6MiB) 2026-02-21T14:07:45.8309628Z Downloading scikit-learn (8.5MiB) 2026-02-21T14:07:45.8337772Z 
Downloading scipy (33.4MiB) 2026-02-21T14:07:45.8338238Z Downloading numpy (15.8MiB) 2026-02-21T14:07:45.9211435Z Built helion @ file:///__w/helion/helion 2026-02-21T14:07:45.9423585Z Downloaded virtualenv 2026-02-21T14:07:45.9443815Z Downloaded pygments 2026-02-21T14:07:46.1130822Z Downloaded scikit-learn 2026-02-21T14:07:46.1209285Z Downloaded numpy 2026-02-21T14:07:46.2360083Z Downloaded scipy 2026-02-21T14:07:46.2360502Z Prepared 27 packages in 415ms 2026-02-21T14:07:46.2369869Z Uninstalled 1 package in 0.81ms 2026-02-21T14:07:46.2377562Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T14:07:46.2378216Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T14:07:46.2378861Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T14:07:46.3268010Z Installed 29 packages in 89ms 2026-02-21T14:07:46.3268397Z + cfgv==3.5.0 2026-02-21T14:07:46.3268528Z + distlib==0.4.0 2026-02-21T14:07:46.3268676Z + expecttest==0.3.0 2026-02-21T14:07:46.3268812Z + filecheck==1.0.3 2026-02-21T14:07:46.3268943Z - filelock==3.20.0 2026-02-21T14:07:46.3269072Z + filelock==3.24.3 2026-02-21T14:07:46.3269222Z + helion==0.0.0 (from file:///__w/helion/helion) 2026-02-21T14:07:46.3269436Z + hypothesis==6.151.9 2026-02-21T14:07:46.3269566Z + identify==2.6.16 2026-02-21T14:07:46.3269687Z + iniconfig==2.3.0 2026-02-21T14:07:46.3269806Z + joblib==1.5.3 2026-02-21T14:07:46.3269945Z + markdown-it-py==4.0.0 2026-02-21T14:07:46.3270109Z + mdurl==0.1.2 2026-02-21T14:07:46.3270227Z + nodeenv==1.10.0 2026-02-21T14:07:46.3270346Z + numpy==2.4.2 2026-02-21T14:07:46.3270462Z + packaging==26.0 2026-02-21T14:07:46.3270581Z + platformdirs==4.9.2 2026-02-21T14:07:46.3270717Z + pluggy==1.6.0 2026-02-21T14:07:46.3270842Z + pre-commit==4.5.1 2026-02-21T14:07:46.3271508Z + psutil==7.2.2 2026-02-21T14:07:46.3271629Z + pygments==2.19.2 
2026-02-21T14:07:46.3271751Z + pytest==9.0.2 2026-02-21T14:07:46.3271875Z + pytest-timeout==2.4.0 2026-02-21T14:07:46.3272006Z + pyyaml==6.0.3 2026-02-21T14:07:46.3272124Z + rich==14.3.3 2026-02-21T14:07:46.3272245Z + scikit-learn==1.8.0 2026-02-21T14:07:46.3272367Z + scipy==1.17.0 2026-02-21T14:07:46.3272499Z + sortedcontainers==2.4.0 2026-02-21T14:07:46.3272643Z + threadpoolctl==3.6.0 2026-02-21T14:07:46.3272772Z + virtualenv==20.38.0 2026-02-21T14:08:01.1470738Z helion 2026-02-21T14:08:02.5689034Z ##[group]Run set -x 2026-02-21T14:08:02.5689202Z set -x 2026-02-21T14:08:02.5689321Z source .venv/bin/activate 2026-02-21T14:08:02.5689471Z uv pip install pip 2026-02-21T14:08:02.5689627Z uv pip install quack-kernels --no-deps 2026-02-21T14:08:02.5689824Z mkdir -p benchmarks/ && pushd benchmarks/ 2026-02-21T14:08:02.5690052Z git clone https://github.com/pytorch-labs/tritonbench/ 2026-02-21T14:08:02.5690260Z pushd tritonbench/ 2026-02-21T14:08:02.5690410Z git submodule update --init --recursive 2026-02-21T14:08:02.5690590Z uv pip install -r requirements.txt 2026-02-21T14:08:02.5690751Z python install.py --liger 2026-02-21T14:08:02.5690899Z uv pip install -e . 
--no-deps 2026-02-21T14:08:02.5691040Z popd 2026-02-21T14:08:02.5691141Z popd 2026-02-21T14:08:02.5691378Z shell: bash -l {0} 2026-02-21T14:08:02.5691484Z env: 2026-02-21T14:08:02.5691609Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:08:02.5691765Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:08:02.5691963Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:08:02.5692153Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:08:02.5692329Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:08:02.5692498Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:08:02.5692677Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:08:02.5692872Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:08:02.5693049Z ##[endgroup] 2026-02-21T14:08:02.6158676Z + source .venv/bin/activate 2026-02-21T14:08:02.6158838Z ++ '[' -z '' ']' 2026-02-21T14:08:02.6158928Z ++ '[' -n x ']' 2026-02-21T14:08:02.6159028Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T14:08:02.6159207Z ++ '[' .venv/bin/activate = /__w/_temp/d3c904d1-bd83-4c08-ad64-4feadfa137f3.sh ']' 2026-02-21T14:08:02.6159415Z ++ deactivate nondestructive 2026-02-21T14:08:02.6159521Z ++ unset -f pydoc 2026-02-21T14:08:02.6159608Z ++ '[' -z '' ']' 2026-02-21T14:08:02.6159686Z ++ '[' -z '' ']' 2026-02-21T14:08:02.6159769Z ++ hash -r 2026-02-21T14:08:02.6159845Z ++ '[' -z '' ']' 2026-02-21T14:08:02.6159950Z ++ unset VIRTUAL_ENV 2026-02-21T14:08:02.6160044Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T14:08:02.6160159Z ++ '[' '!' 
nondestructive = nondestructive ']' 2026-02-21T14:08:02.6160304Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T14:08:02.6160420Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T14:08:02.6160526Z ++ '[' linux-gnu = msys ']' 2026-02-21T14:08:02.6160626Z ++ export VIRTUAL_ENV 2026-02-21T14:08:02.6160717Z ++ '[' -z '' ']' 2026-02-21T14:08:02.6160798Z ++ unset SCRIPT_PATH 2026-02-21T14:08:02.6161168Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T14:08:02.6161834Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T14:08:02.6162206Z ++ export PATH 2026-02-21T14:08:02.6162297Z ++ '[' xhelion '!=' x ']' 2026-02-21T14:08:02.6162402Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T14:08:02.6162733Z ++ export VIRTUAL_ENV_PROMPT 2026-02-21T14:08:02.6162831Z ++ '[' -z '' ']' 2026-02-21T14:08:02.6162907Z ++ '[' -z '' ']' 2026-02-21T14:08:02.6162991Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T14:08:02.6163082Z ++ PS1='(helion) ' 2026-02-21T14:08:02.6163168Z ++ export PS1 2026-02-21T14:08:02.6163248Z ++ alias pydoc 2026-02-21T14:08:02.6163329Z ++ true 2026-02-21T14:08:02.6163404Z ++ hash -r 2026-02-21T14:08:02.6163484Z + uv pip install pip 2026-02-21T14:08:02.6677101Z Resolved 1 package in 51ms 2026-02-21T14:08:02.6712727Z Downloading pip (1.7MiB) 2026-02-21T14:08:02.6968690Z Downloaded pip 2026-02-21T14:08:02.6969740Z Prepared 1 package in 29ms 2026-02-21T14:08:02.7143847Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T14:08:02.7144163Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 
2026-02-21T14:08:02.7144469Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T14:08:02.7265494Z Installed 1 package in 29ms 2026-02-21T14:08:02.7265621Z + pip==26.0.1 2026-02-21T14:08:02.7411911Z + uv pip install quack-kernels --no-deps 2026-02-21T14:08:02.7662050Z Resolved 1 package in 19ms 2026-02-21T14:08:02.7954205Z Prepared 1 package in 29ms 2026-02-21T14:08:02.8115638Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T14:08:02.8116026Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T14:08:02.8116340Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T14:08:02.8126829Z Installed 1 package in 17ms 2026-02-21T14:08:02.8127065Z + quack-kernels==0.2.10 2026-02-21T14:08:02.8242042Z + mkdir -p benchmarks/ 2026-02-21T14:08:02.8247950Z + pushd benchmarks/ 2026-02-21T14:08:02.8249216Z + git clone https://github.com/pytorch-labs/tritonbench/ 2026-02-21T14:08:02.8249458Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T14:08:02.8259344Z Cloning into 'tritonbench'... 
2026-02-21T14:08:03.5423724Z + pushd tritonbench/ 2026-02-21T14:08:03.5424155Z + git submodule update --init --recursive 2026-02-21T14:08:03.5424757Z /__w/helion/helion/benchmarks/tritonbench /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T14:08:03.5592731Z Submodule 'submodules/ThunderKittens' (https://github.com/HazyResearch/ThunderKittens.git) registered for path 'submodules/ThunderKittens' 2026-02-21T14:08:03.5593487Z Submodule 'submodules/aiter' (https://github.com/ROCm/aiter.git) registered for path 'submodules/aiter' 2026-02-21T14:08:03.5594079Z Submodule 'submodules/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/cutlass' 2026-02-21T14:08:03.5594777Z Submodule 'submodules/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/flash-attention' 2026-02-21T14:08:03.5595697Z Submodule 'submodules/generative-recommenders' (https://github.com/facebookresearch/generative-recommenders.git) registered for path 'submodules/generative-recommenders' 2026-02-21T14:08:03.5596551Z Submodule 'submodules/xformers' (https://github.com/facebookresearch/xformers.git) registered for path 'submodules/xformers' 2026-02-21T14:08:03.5620836Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/ThunderKittens'... 2026-02-21T14:08:04.8571239Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter'... 2026-02-21T14:08:15.9607572Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/cutlass'... 2026-02-21T14:08:18.4604293Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention'... 2026-02-21T14:08:19.1280387Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders'... 2026-02-21T14:08:19.5200528Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers'... 
2026-02-21T14:08:20.4805338Z Submodule path 'submodules/ThunderKittens': checked out '25f7568450b412a1984a4f619fb28373df06fa1b' 2026-02-21T14:08:20.7107295Z Submodule path 'submodules/aiter': checked out '1f5b378dcc9d9b0bcd9456c8c767b7424a5e8190' 2026-02-21T14:08:20.7126313Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/aiter/3rdparty/composable_kernel' 2026-02-21T14:08:20.7144028Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/aiter/3rdparty/composable_kernel'... 2026-02-21T14:08:23.5219178Z Submodule path 'submodules/aiter/3rdparty/composable_kernel': checked out 'e31a7a4f29b371c32ea9daf9211b6ae1fed2fa40' 2026-02-21T14:08:23.8686347Z Submodule path 'submodules/cutlass': checked out 'ad7b2f5e84fcfa124cb02b91d5bd26d238c0459e' 2026-02-21T14:08:23.9205103Z Submodule path 'submodules/flash-attention': checked out '43375aab2893018dfb7950db1cfa623c14946ad6' 2026-02-21T14:08:23.9224936Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/flash-attention/csrc/composable_kernel' 2026-02-21T14:08:23.9225601Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/flash-attention/csrc/cutlass' 2026-02-21T14:08:23.9241588Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/composable_kernel'... 2026-02-21T14:08:26.4254426Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/flash-attention/csrc/cutlass'... 
2026-02-21T14:08:29.3239020Z Submodule path 'submodules/flash-attention/csrc/composable_kernel': checked out 'e8709c24f403173ad21a2da907d1347957e324fb' 2026-02-21T14:08:29.6932460Z Submodule path 'submodules/flash-attention/csrc/cutlass': checked out 'b1d6e2c9b334dfa811e4183dfbd02419249e4b52' 2026-02-21T14:08:29.7185964Z Submodule path 'submodules/generative-recommenders': checked out '88512dbd71b053226bc4ef8ec1630e3db53e55e5' 2026-02-21T14:08:29.7196328Z Submodule 'generative_recommenders/ops/cpp/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass' 2026-02-21T14:08:29.7218015Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass'... 2026-02-21T14:08:32.5991136Z Submodule path 'submodules/generative-recommenders/generative_recommenders/ops/cpp/cutlass': checked out 'dc4817921edda44a549197ff3a9dcf5df0636e7b' 2026-02-21T14:08:32.6473162Z Submodule path 'submodules/xformers': checked out '8fc8ec5a4d6498ff81c0c418b89bbaf133ae3a44' 2026-02-21T14:08:32.6484578Z Submodule 'third_party/composable_kernel_tiled' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/composable_kernel_tiled' 2026-02-21T14:08:32.6485659Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/cutlass' 2026-02-21T14:08:32.6486648Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'submodules/xformers/third_party/flash-attention' 2026-02-21T14:08:32.6510514Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/composable_kernel_tiled'... 2026-02-21T14:08:36.5814693Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/cutlass'... 
2026-02-21T14:08:39.0848861Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention'... 2026-02-21T14:08:39.9262635Z Submodule path 'submodules/xformers/third_party/composable_kernel_tiled': checked out '4f54fa30583704f34da2ac50372d524cae6bad7d' 2026-02-21T14:08:40.2581596Z Submodule path 'submodules/xformers/third_party/cutlass': checked out 'e9627ce55b42fd2599f58cd4396da9380954def0' 2026-02-21T14:08:40.3032840Z Submodule path 'submodules/xformers/third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2026-02-21T14:08:40.3044052Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel' 2026-02-21T14:08:40.3045096Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'submodules/xformers/third_party/flash-attention/csrc/cutlass' 2026-02-21T14:08:40.3063359Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/composable_kernel'... 2026-02-21T14:08:42.9852990Z Cloning into '/__w/helion/helion/benchmarks/tritonbench/submodules/xformers/third_party/flash-attention/csrc/cutlass'... 
2026-02-21T14:08:45.6394717Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2026-02-21T14:08:45.9668367Z Submodule path 'submodules/xformers/third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2026-02-21T14:08:45.9705644Z + uv pip install -r requirements.txt 2026-02-21T14:08:45.9753681Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T14:08:46.1122391Z Resolved 30 packages in 136ms 2026-02-21T14:08:46.1225610Z Downloading pillow (6.7MiB) 2026-02-21T14:08:46.1225875Z Downloading hf-xet (3.2MiB) 2026-02-21T14:08:46.1226098Z Downloading matplotlib (8.3MiB) 2026-02-21T14:08:46.1227720Z Downloading tokenizers (3.0MiB) 2026-02-21T14:08:46.1227945Z Downloading kiwisolver (1.4MiB) 2026-02-21T14:08:46.1230290Z Downloading transformers (10.3MiB) 2026-02-21T14:08:46.1262834Z Downloading fonttools (4.7MiB) 2026-02-21T14:08:46.1803704Z Downloaded kiwisolver 2026-02-21T14:08:46.2217719Z Downloaded tokenizers 2026-02-21T14:08:46.2351054Z Downloaded hf-xet 2026-02-21T14:08:46.2897283Z Downloaded pillow 2026-02-21T14:08:46.3004527Z Downloaded fonttools 2026-02-21T14:08:46.3196068Z Downloaded matplotlib 2026-02-21T14:08:46.4432466Z Downloaded transformers 2026-02-21T14:08:46.4432846Z Prepared 23 packages in 330ms 2026-02-21T14:08:46.4619171Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T14:08:46.4620023Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T14:08:46.4620861Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T14:08:46.5402429Z Installed 23 packages in 96ms 2026-02-21T14:08:46.5402722Z + certifi==2026.1.4 2026-02-21T14:08:46.5402942Z + charset-normalizer==3.4.4 2026-02-21T14:08:46.5403179Z + contourpy==1.3.3 2026-02-21T14:08:46.5403357Z + cycler==0.12.1 2026-02-21T14:08:46.5403544Z + fonttools==4.61.1 2026-02-21T14:08:46.5403721Z + hf-xet==1.2.0 2026-02-21T14:08:46.5403919Z + huggingface-hub==0.36.2 2026-02-21T14:08:46.5404123Z + idna==3.11 2026-02-21T14:08:46.5404298Z + kiwisolver==1.4.9 2026-02-21T14:08:46.5404485Z + matplotlib==3.10.8 2026-02-21T14:08:46.5404682Z + nvidia-ml-py==13.590.48 2026-02-21T14:08:46.5404896Z + pillow==12.1.1 2026-02-21T14:08:46.5405072Z + pyparsing==3.3.2 2026-02-21T14:08:46.5405275Z + python-dateutil==2.9.0.post0 2026-02-21T14:08:46.5405487Z + regex==2026.2.19 2026-02-21T14:08:46.5405669Z + requests==2.32.5 2026-02-21T14:08:46.5405856Z + safetensors==0.7.0 2026-02-21T14:08:46.5406038Z + six==1.17.0 2026-02-21T14:08:46.5406208Z + tabulate==0.9.0 2026-02-21T14:08:46.5406428Z + tokenizers==0.21.4 2026-02-21T14:08:46.5406612Z + tqdm==4.67.3 2026-02-21T14:08:46.5406785Z + transformers==4.53.0 2026-02-21T14:08:46.5406986Z + urllib3==2.6.3 2026-02-21T14:08:46.5545226Z + python install.py --liger 2026-02-21T14:08:50.4519051Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T14:08:50.4541847Z Audited 6 packages in 2ms 2026-02-21T14:08:50.4896939Z INFO:__main__:[tritonbench] installing liger-kernels... 2026-02-21T14:08:50.4935302Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T14:08:50.5281947Z Resolved 1 package in 33ms 2026-02-21T14:08:50.5636864Z Prepared 1 package in 35ms 2026-02-21T14:08:50.5806190Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T14:08:50.5807050Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 
2026-02-21T14:08:50.5807881Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 2026-02-21T14:08:50.5841226Z Installed 1 package in 20ms 2026-02-21T14:08:50.5841776Z + liger-kernel-nightly==0.7.0.dev20260219183429 2026-02-21T14:08:50.5955563Z INFO:__main__:[tritonbench] installation complete! 2026-02-21T14:08:51.3415075Z + uv pip install -e . --no-deps 2026-02-21T14:08:51.3715014Z Using Python 3.12.12 environment at: /__w/helion/helion/.venv 2026-02-21T14:08:51.3741114Z Resolved 1 package in 1ms 2026-02-21T14:08:51.3745929Z Building tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T14:08:52.0084324Z Built tritonbench @ file:///__w/helion/helion/benchmarks/tritonbench 2026-02-21T14:08:52.0099626Z Prepared 1 package in 635ms 2026-02-21T14:08:52.0101956Z warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance. 2026-02-21T14:08:52.0102625Z If the cache and target directories are on different filesystems, hardlinking may not be supported. 2026-02-21T14:08:52.0103282Z If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning. 
2026-02-21T14:08:52.0106578Z Installed 1 package in 0.85ms 2026-02-21T14:08:52.0106946Z + tritonbench==0.0.1 (from file:///__w/helion/helion/benchmarks/tritonbench) 2026-02-21T14:08:52.0315958Z + popd 2026-02-21T14:08:52.0316113Z /__w/helion/helion/benchmarks /__w/helion/helion 2026-02-21T14:08:52.0316259Z + popd 2026-02-21T14:08:52.0316349Z /__w/helion/helion 2026-02-21T14:08:52.0362737Z ##[group]Run rm -rf /tmp/torchinductor_*/ || true 2026-02-21T14:08:52.0363027Z rm -rf /tmp/torchinductor_*/ || true 2026-02-21T14:08:52.0363147Z  2026-02-21T14:08:52.0363238Z source .venv/bin/activate 2026-02-21T14:08:52.0363342Z  2026-02-21T14:08:52.0363444Z TEST_REPORTS_DIR=$(pwd)/test/test-reports 2026-02-21T14:08:52.0363583Z mkdir -p "$TEST_REPORTS_DIR" 2026-02-21T14:08:52.0363699Z echo "$TEST_REPORTS_DIR" 2026-02-21T14:08:52.0363815Z  2026-02-21T14:08:52.0363898Z KERNEL_LIST="layer_norm,gemm" 2026-02-21T14:08:52.0364024Z for kernel in ${KERNEL_LIST//,/ }; do 2026-02-21T14:08:52.0364161Z  echo "==========================================" 2026-02-21T14:08:52.0364316Z  echo "Running benchmark for kernel: $kernel" 2026-02-21T14:08:52.0364461Z  echo "==========================================" 2026-02-21T14:08:52.0364580Z  2026-02-21T14:08:52.0364716Z  # Get available implementations and baseline for this kernel 2026-02-21T14:08:52.0364986Z  KERNEL_INFO=$(python benchmarks/run.py --list-impls-for-benchmark-ci --op $kernel | grep "^$kernel:") 2026-02-21T14:08:52.0365262Z  IMPLS=$(echo "$KERNEL_INFO" | sed -n 's/.*impls=\([^ ]*\).*/\1/p') 2026-02-21T14:08:52.0365465Z  BASELINE=$(echo "$KERNEL_INFO" | sed -n 's/.*baseline=\([^ ]*\).*/\1/p') 2026-02-21T14:08:52.0365625Z  2026-02-21T14:08:52.0365705Z  if [[ -z "$IMPLS" ]]; then 2026-02-21T14:08:52.0365874Z  echo "Warning: No implementations found for kernel $kernel, skipping..." 
2026-02-21T14:08:52.0366050Z  continue 2026-02-21T14:08:52.0366137Z  fi 2026-02-21T14:08:52.0366228Z  if [[ -z "$BASELINE" ]]; then 2026-02-21T14:08:52.0366383Z  echo "Warning: No baseline found for kernel $kernel, skipping..." 2026-02-21T14:08:52.0366542Z  continue 2026-02-21T14:08:52.0366628Z  fi 2026-02-21T14:08:52.0366722Z  echo "Using baseline: $BASELINE" 2026-02-21T14:08:52.0367035Z  echo "Available implementations for $kernel: $IMPLS" 2026-02-21T14:08:52.0367173Z  2026-02-21T14:08:52.0367274Z  # Do autotuning but do not record the results 2026-02-21T14:08:52.0367409Z  python benchmarks/run.py \ 2026-02-21T14:08:52.0367526Z  --op $kernel \ 2026-02-21T14:08:52.0367640Z  --metrics speedup,accuracy \ 2026-02-21T14:08:52.0367782Z  --latency-measure-mode triton_do_bench \ 2026-02-21T14:08:52.0367912Z  --cudagraph \ 2026-02-21T14:08:52.0368017Z  --only $IMPLS \ 2026-02-21T14:08:52.0368145Z  --only-match-mode prefix-with-baseline \ 2026-02-21T14:08:52.0368279Z  --baseline $BASELINE \ 2026-02-21T14:08:52.0368392Z  --atol 1e-2 \ 2026-02-21T14:08:52.0368492Z  --rtol 1e-2 \ 2026-02-21T14:08:52.0368612Z  --input-sample-mode equally-spaced-k \ 2026-02-21T14:08:52.0368749Z  --keep-going \ 2026-02-21T14:08:52.0368848Z   2026-02-21T14:08:52.0368926Z  2026-02-21T14:08:52.0369008Z  # Relax the GPU 2026-02-21T14:08:52.0369105Z  sleep 2m 2026-02-21T14:08:52.0369190Z  2026-02-21T14:08:52.0369286Z  # Run again with cache and record results 2026-02-21T14:08:52.0381990Z  HELION_PRINT_OUTPUT_CODE=1 HELION_ASSERT_CACHE_HIT=1 python benchmarks/run.py \ 2026-02-21T14:08:52.0382193Z  --op $kernel \ 2026-02-21T14:08:52.0382312Z  --metrics speedup,accuracy \ 2026-02-21T14:08:52.0382467Z  --latency-measure-mode triton_do_bench \ 2026-02-21T14:08:52.0382604Z  --cudagraph \ 2026-02-21T14:08:52.0382708Z  --only $IMPLS \ 2026-02-21T14:08:52.0382972Z  --only-match-mode prefix-with-baseline \ 2026-02-21T14:08:52.0383117Z  --baseline $BASELINE \ 2026-02-21T14:08:52.0383244Z  --atol 1e-2 \ 
2026-02-21T14:08:52.0383345Z  --rtol 1e-2 \ 2026-02-21T14:08:52.0383474Z  --input-sample-mode equally-spaced-k \ 2026-02-21T14:08:52.0383634Z  --output "$TEST_REPORTS_DIR/helionbench.json" \ 2026-02-21T14:08:52.0383778Z  --append-to-output \ 2026-02-21T14:08:52.0383892Z  --keep-going \ 2026-02-21T14:08:52.0383987Z   2026-02-21T14:08:52.0384070Z  2026-02-21T14:08:52.0384192Z  echo "✅ Completed benchmark for kernel: $kernel" 2026-02-21T14:08:52.0384328Z done 2026-02-21T14:08:52.0384408Z  2026-02-21T14:08:52.0384520Z if [[ ! -s "$TEST_REPORTS_DIR/helionbench.json" ]]; then 2026-02-21T14:08:52.0384686Z  echo "❌ helionbench.json is missing or empty" 2026-02-21T14:08:52.0384813Z  exit 1 2026-02-21T14:08:52.0384902Z fi 2026-02-21T14:08:52.0384999Z cat "$TEST_REPORTS_DIR/helionbench.json" 2026-02-21T14:08:52.0385249Z shell: bash -l {0} 2026-02-21T14:08:52.0385333Z env: 2026-02-21T14:08:52.0385418Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T14:08:52.0385545Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:08:52.0385698Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T14:08:52.0385852Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:08:52.0385985Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:08:52.0386122Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T14:08:52.0386260Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T14:08:52.0386414Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T14:08:52.0386558Z ##[endgroup] 2026-02-21T14:08:52.1473485Z /__w/helion/helion/test/test-reports 2026-02-21T14:08:52.1475453Z ========================================== 2026-02-21T14:08:52.1475907Z Running benchmark for kernel: layer_norm 2026-02-21T14:08:52.1476583Z ========================================== 2026-02-21T14:09:03.2356355Z Using baseline: torch_layer_norm 2026-02-21T14:09:03.2356904Z Available implementations for layer_norm: 
helion_layer_norm_tritonbench,liger_layer_norm,torch_compile_layer_norm 2026-02-21T14:09:13.6373160Z Using num_inputs=20 for layer_norm 2026-02-21T14:09:14.0854324Z Running layer_norm benchmark with Helion implementation... 2026-02-21T14:09:14.0854671Z 2026-02-21T14:09:14.2716433Z Equally-spaced-k mode: Selected 20 equally spaced inputs (total available: 30) 2026-02-21T14:09:14.2717115Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 2, 3, 5, 6, 8, 9, 11, 12, 14, 15, 17, 18, 20, 21, 23, 24, 26, 27, 29] 2026-02-21T14:09:14.2722596Z 2026-02-21T14:09:14.2729498Z 0%| | 0/20 [00:00(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T17:43:10.4546307Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:10.4547006Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:10.4547680Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:10.4548404Z #blocked3 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T17:43:10.4549034Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T17:43:10.4549693Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T17:43:10.4550430Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T17:43:10.4551448Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T17:43:10.4552283Z %cst = arith.constant dense<0.000000e+00> : tensor<32x16xf16, #blocked> 
2026-02-21T17:43:10.4552706Z %cst_0 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T17:43:10.4553104Z %cst_1 = arith.constant dense<1024> : tensor<32x1xi64, #blocked1> 2026-02-21T17:43:10.4553465Z %cst_2 = arith.constant dense<0> : tensor<32x1xi64, #blocked1> 2026-02-21T17:43:10.4553757Z %cst_3 = arith.constant dense<1024> : tensor<1x16xi64, #blocked> 2026-02-21T17:43:10.4554068Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8x32xf16, #blocked2> 2026-02-21T17:43:10.4554536Z %cst_5 = arith.constant dense<1024> : tensor<1x32xi64, #blocked2> 2026-02-21T17:43:10.4554819Z %cst_6 = arith.constant dense<0> : tensor<1x32xi64, #blocked2> 2026-02-21T17:43:10.4555102Z %cst_7 = arith.constant dense<4096> : tensor<8x1xi64, #blocked1> 2026-02-21T17:43:10.4555385Z %cst_8 = arith.constant dense<0> : tensor<8x1xi64, #blocked1> 2026-02-21T17:43:10.4555665Z %cst_9 = arith.constant dense<1024> : tensor<8x1xi64, #blocked1> 2026-02-21T17:43:10.4555921Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T17:43:10.4556125Z %c32_i32 = arith.constant 32 : i32 2026-02-21T17:43:10.4556320Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T17:43:10.4556544Z %c0_i32 = arith.constant 0 : i32 2026-02-21T17:43:10.4556744Z %c608_i32 = arith.constant 608 : i32 2026-02-21T17:43:10.4557012Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<8x16xf32, #blocked> 2026-02-21T17:43:10.4557284Z %c16_i32 = arith.constant 16 : i32 2026-02-21T17:43:10.4557467Z %c8_i32 = arith.constant 8 : i32 2026-02-21T17:43:10.4557657Z %c64_i32 = arith.constant 64 : i32 2026-02-21T17:43:10.4557843Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T17:43:10.4558054Z %0 = tt.get_program_id x : i32 2026-02-21T17:43:10.4558312Z %1 = tt.splat %arg0 : !tt.ptr -> tensor<8x32x!tt.ptr, #blocked2> 2026-02-21T17:43:10.4558645Z %2 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked3> 2026-02-21T17:43:10.4558982Z %3 = arith.extsi %2 : tensor<8xi32, #blocked3> to tensor<8xi64, #blocked3> 
2026-02-21T17:43:10.4559314Z %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked3> 2026-02-21T17:43:10.4559655Z %5 = arith.extsi %4 : tensor<32xi32, #blocked3> to tensor<32xi64, #blocked3> 2026-02-21T17:43:10.4559972Z %6 = tt.splat %arg1 : !tt.ptr -> tensor<32x16x!tt.ptr, #blocked> 2026-02-21T17:43:10.4560369Z %7 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked3> 2026-02-21T17:43:10.4560702Z %8 = arith.extsi %7 : tensor<16xi32, #blocked3> to tensor<16xi64, #blocked3> 2026-02-21T17:43:10.4561029Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<8x16x!tt.ptr, #blocked> 2026-02-21T17:43:10.4561355Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked3> 2026-02-21T17:43:10.4561679Z %11 = arith.extsi %10 : tensor<8xi32, #blocked3> to tensor<8xi64, #blocked3> 2026-02-21T17:43:10.4562016Z %12 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked3> 2026-02-21T17:43:10.4562350Z %13 = arith.extsi %12 : tensor<16xi32, #blocked3> to tensor<16xi64, #blocked3> 2026-02-21T17:43:10.4562666Z scf.for %arg3 = %0 to %c32768_i32 step %c608_i32 : i32 { 2026-02-21T17:43:10.4562913Z %14 = arith.divsi %arg3, %c4096_i32 : i32 2026-02-21T17:43:10.4563115Z %15 = arith.muli %14, %c8_i32 : i32 2026-02-21T17:43:10.4563330Z %16 = arith.subi %c64_i32, %15 : i32 2026-02-21T17:43:10.4563515Z %17 = arith.minsi %16, %c8_i32 : i32 2026-02-21T17:43:10.4563713Z %18 = arith.remsi %arg3, %c4096_i32 : i32 2026-02-21T17:43:10.4563906Z %19 = arith.remsi %18, %17 : i32 2026-02-21T17:43:10.4564050Z %20 = arith.addi %15, %19 : i32 2026-02-21T17:43:10.4564191Z %21 = arith.divsi %18, %17 : i32 2026-02-21T17:43:10.4564335Z %22 = arith.muli %20, %c16_i32 : i32 2026-02-21T17:43:10.4564484Z %23 = arith.muli %21, %c8_i32 : i32 2026-02-21T17:43:10.4564626Z %24 = arith.extsi %23 : i32 to i64 2026-02-21T17:43:10.4564795Z %25 = tt.splat %24 : i64 -> tensor<8xi64, #blocked3> 2026-02-21T17:43:10.4564988Z %26 = arith.addi %25, %3 : 
tensor<8xi64, #blocked3> 2026-02-21T17:43:10.4565283Z %27 = ttg.convert_layout %26 : tensor<8xi64, #blocked3> -> tensor<8xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:10.4565699Z %28 = tt.expand_dims %27 {axis = 1 : i32} : tensor<8xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<8x1xi64, #blocked4> 2026-02-21T17:43:10.4566093Z %29 = ttg.convert_layout %28 : tensor<8x1xi64, #blocked4> -> tensor<8x1xi64, #blocked1> 2026-02-21T17:43:10.4566347Z %30 = arith.muli %29, %cst_9 : tensor<8x1xi64, #blocked1> 2026-02-21T17:43:10.4566580Z %31 = tt.broadcast %30 : tensor<8x1xi64, #blocked1> -> tensor<8x32xi64, #blocked1> 2026-02-21T17:43:10.4566873Z %32 = ttg.convert_layout %31 : tensor<8x32xi64, #blocked1> -> tensor<8x32xi64, #blocked2> 2026-02-21T17:43:10.4567128Z %33 = arith.cmpi sge, %29, %cst_8 : tensor<8x1xi64, #blocked1> 2026-02-21T17:43:10.4567346Z %34 = arith.cmpi slt, %29, %cst_7 : tensor<8x1xi64, #blocked1> 2026-02-21T17:43:10.4567554Z %35 = arith.andi %33, %34 : tensor<8x1xi1, #blocked1> 2026-02-21T17:43:10.4567776Z %36 = tt.broadcast %35 : tensor<8x1xi1, #blocked1> -> tensor<8x32xi1, #blocked1> 2026-02-21T17:43:10.4568064Z %37 = ttg.convert_layout %36 : tensor<8x32xi1, #blocked1> -> tensor<8x32xi1, #blocked2> 2026-02-21T17:43:10.4568283Z %38 = arith.extsi %22 : i32 to i64 2026-02-21T17:43:10.4568459Z %39 = tt.splat %38 : i64 -> tensor<16xi64, #blocked3> 2026-02-21T17:43:10.4568641Z %40 = arith.addi %39, %8 : tensor<16xi64, #blocked3> 2026-02-21T17:43:10.4568949Z %41 = ttg.convert_layout %40 : tensor<16xi64, #blocked3> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:43:10.4569360Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi64, #blocked5> 2026-02-21T17:43:10.4569721Z %43 = ttg.convert_layout %42 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #blocked> 2026-02-21T17:43:10.4569970Z %44 = arith.muli %43, %cst_3 : tensor<1x16xi64, 
#blocked> 2026-02-21T17:43:10.4570206Z %45 = tt.broadcast %44 : tensor<1x16xi64, #blocked> -> tensor<32x16xi64, #blocked> 2026-02-21T17:43:10.4570486Z %46 = arith.cmpi sge, %43, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T17:43:10.4570700Z %47 = arith.cmpi slt, %43, %cst_3 : tensor<1x16xi64, #blocked> 2026-02-21T17:43:10.4570900Z %48 = arith.andi %46, %47 : tensor<1x16xi1, #blocked> 2026-02-21T17:43:10.4571128Z %49 = tt.broadcast %48 : tensor<1x16xi1, #blocked> -> tensor<32x16xi1, #blocked> 2026-02-21T17:43:10.4571471Z %50 = scf.for %arg4 = %c0_i32 to %c1024_i32 step %c32_i32 iter_args(%arg5 = %cst_10) -> (tensor<8x16xf32, #blocked>) : i32 { 2026-02-21T17:43:10.4571756Z %80 = arith.extsi %arg4 : i32 to i64 2026-02-21T17:43:10.4571935Z %81 = tt.splat %80 : i64 -> tensor<32xi64, #blocked3> 2026-02-21T17:43:10.4572128Z %82 = arith.addi %81, %5 : tensor<32xi64, #blocked3> 2026-02-21T17:43:10.4572423Z %83 = ttg.convert_layout %82 : tensor<32xi64, #blocked3> -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:43:10.4572838Z %84 = tt.expand_dims %83 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x32xi64, #blocked5> 2026-02-21T17:43:10.4573198Z %85 = ttg.convert_layout %84 : tensor<1x32xi64, #blocked5> -> tensor<1x32xi64, #blocked2> 2026-02-21T17:43:10.4573474Z %86 = tt.broadcast %85 : tensor<1x32xi64, #blocked2> -> tensor<8x32xi64, #blocked2> 2026-02-21T17:43:10.4573673Z %87 = arith.addi %32, %86 : tensor<8x32xi64, #blocked2> 2026-02-21T17:43:10.4573873Z %88 = tt.addptr %1, %87 : tensor<8x32x!tt.ptr, #blocked2>, tensor<8x32xi64, #blocked2> 2026-02-21T17:43:10.4574090Z %89 = arith.cmpi sge, %85, %cst_6 : tensor<1x32xi64, #blocked2> 2026-02-21T17:43:10.4574264Z %90 = arith.cmpi slt, %85, %cst_5 : tensor<1x32xi64, #blocked2> 2026-02-21T17:43:10.4574432Z %91 = arith.andi %89, %90 : tensor<1x32xi1, #blocked2> 2026-02-21T17:43:10.4574617Z %92 = tt.broadcast %91 : tensor<1x32xi1, #blocked2> -> tensor<8x32xi1, 
#blocked2> 2026-02-21T17:43:10.4574809Z %93 = arith.andi %37, %92 : tensor<8x32xi1, #blocked2> 2026-02-21T17:43:10.4574986Z %94 = tt.load %88, %93, %cst_4 : tensor<8x32x!tt.ptr, #blocked2> 2026-02-21T17:43:10.4575280Z %95 = ttg.convert_layout %82 : tensor<32xi64, #blocked3> -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:10.4575616Z %96 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<32x1xi64, #blocked4> 2026-02-21T17:43:10.4575904Z %97 = ttg.convert_layout %96 : tensor<32x1xi64, #blocked4> -> tensor<32x1xi64, #blocked1> 2026-02-21T17:43:10.4576142Z %98 = tt.broadcast %97 : tensor<32x1xi64, #blocked1> -> tensor<32x16xi64, #blocked1> 2026-02-21T17:43:10.4576379Z %99 = ttg.convert_layout %98 : tensor<32x16xi64, #blocked1> -> tensor<32x16xi64, #blocked> 2026-02-21T17:43:10.4576588Z %100 = arith.addi %99, %45 : tensor<32x16xi64, #blocked> 2026-02-21T17:43:10.4576797Z %101 = tt.addptr %6, %100 : tensor<32x16x!tt.ptr, #blocked>, tensor<32x16xi64, #blocked> 2026-02-21T17:43:10.4577010Z %102 = arith.cmpi sge, %97, %cst_2 : tensor<32x1xi64, #blocked1> 2026-02-21T17:43:10.4577195Z %103 = arith.cmpi slt, %97, %cst_1 : tensor<32x1xi64, #blocked1> 2026-02-21T17:43:10.4577369Z %104 = arith.andi %102, %103 : tensor<32x1xi1, #blocked1> 2026-02-21T17:43:10.4577561Z %105 = tt.broadcast %104 : tensor<32x1xi1, #blocked1> -> tensor<32x16xi1, #blocked1> 2026-02-21T17:43:10.4577805Z %106 = ttg.convert_layout %105 : tensor<32x16xi1, #blocked1> -> tensor<32x16xi1, #blocked> 2026-02-21T17:43:10.4578033Z %107 = arith.andi %106, %49 : tensor<32x16xi1, #blocked> 2026-02-21T17:43:10.4578208Z %108 = tt.load %101, %107, %cst : tensor<32x16x!tt.ptr, #blocked> 2026-02-21T17:43:10.4578481Z %109 = ttg.convert_layout %94 : tensor<8x32xf16, #blocked2> -> tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T17:43:10.4578859Z %110 = ttg.convert_layout %108 : tensor<32x16xf16, #blocked> -> 
tensor<32x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T17:43:10.4579159Z %111 = ttg.convert_layout %arg5 : tensor<8x16xf32, #blocked> -> tensor<8x16xf32, #blocked> 2026-02-21T17:43:10.4579567Z %112 = tt.dot %109, %110, %111, inputPrecision = tf32 : tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<32x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<8x16xf32, #blocked> 2026-02-21T17:43:10.4579913Z scf.yield %112 : tensor<8x16xf32, #blocked> 2026-02-21T17:43:10.4580042Z } 2026-02-21T17:43:10.4580177Z %51 = arith.truncf %50 : tensor<8x16xf32, #blocked> to tensor<8x16xf16, #blocked> 2026-02-21T17:43:10.4580355Z %52 = arith.extsi %23 : i32 to i64 2026-02-21T17:43:10.4580473Z %53 = arith.extsi %22 : i32 to i64 2026-02-21T17:43:10.4580611Z %54 = tt.splat %52 : i64 -> tensor<8xi64, #blocked3> 2026-02-21T17:43:10.4580765Z %55 = arith.addi %54, %11 : tensor<8xi64, #blocked3> 2026-02-21T17:43:10.4581000Z %56 = ttg.convert_layout %55 : tensor<8xi64, #blocked3> -> tensor<8xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:10.4581327Z %57 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<8x1xi64, #blocked4> 2026-02-21T17:43:10.4581611Z %58 = ttg.convert_layout %57 : tensor<8x1xi64, #blocked4> -> tensor<8x1xi64, #blocked1> 2026-02-21T17:43:10.4581815Z %59 = arith.muli %58, %cst_9 : tensor<8x1xi64, #blocked1> 2026-02-21T17:43:10.4582009Z %60 = tt.broadcast %59 : tensor<8x1xi64, #blocked1> -> tensor<8x16xi64, #blocked1> 2026-02-21T17:43:10.4582238Z %61 = ttg.convert_layout %60 : tensor<8x16xi64, #blocked1> -> tensor<8x16xi64, #blocked> 2026-02-21T17:43:10.4582438Z %62 = tt.splat %53 : i64 -> tensor<16xi64, #blocked3> 2026-02-21T17:43:10.4582590Z %63 = arith.addi %62, %13 : tensor<16xi64, #blocked3> 2026-02-21T17:43:10.4582832Z %64 = ttg.convert_layout %63 : tensor<16xi64, #blocked3> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 
2026-02-21T17:43:10.4583161Z %65 = tt.expand_dims %64 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi64, #blocked5> 2026-02-21T17:43:10.4583499Z %66 = ttg.convert_layout %65 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #blocked> 2026-02-21T17:43:10.4583729Z %67 = tt.broadcast %66 : tensor<1x16xi64, #blocked> -> tensor<8x16xi64, #blocked> 2026-02-21T17:43:10.4583920Z %68 = arith.addi %61, %67 : tensor<8x16xi64, #blocked> 2026-02-21T17:43:10.4584117Z %69 = tt.addptr %9, %68 : tensor<8x16x!tt.ptr, #blocked>, tensor<8x16xi64, #blocked> 2026-02-21T17:43:10.4584322Z %70 = arith.cmpi sge, %58, %cst_8 : tensor<8x1xi64, #blocked1> 2026-02-21T17:43:10.4584498Z %71 = arith.cmpi slt, %58, %cst_7 : tensor<8x1xi64, #blocked1> 2026-02-21T17:43:10.4584666Z %72 = arith.andi %70, %71 : tensor<8x1xi1, #blocked1> 2026-02-21T17:43:10.4584849Z %73 = tt.broadcast %72 : tensor<8x1xi1, #blocked1> -> tensor<8x16xi1, #blocked1> 2026-02-21T17:43:10.4585076Z %74 = ttg.convert_layout %73 : tensor<8x16xi1, #blocked1> -> tensor<8x16xi1, #blocked> 2026-02-21T17:43:10.4585281Z %75 = arith.cmpi sge, %66, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T17:43:10.4585454Z %76 = arith.cmpi slt, %66, %cst_3 : tensor<1x16xi64, #blocked> 2026-02-21T17:43:10.4585617Z %77 = arith.andi %75, %76 : tensor<1x16xi1, #blocked> 2026-02-21T17:43:10.4585796Z %78 = tt.broadcast %77 : tensor<1x16xi1, #blocked> -> tensor<8x16xi1, #blocked> 2026-02-21T17:43:10.4585982Z %79 = arith.andi %74, %78 : tensor<8x16xi1, #blocked> 2026-02-21T17:43:10.4586142Z tt.store %69, %51, %79 : tensor<8x16x!tt.ptr, #blocked> 2026-02-21T17:43:10.4586293Z } {tt.disallow_acc_multi_buffer} 2026-02-21T17:43:10.4586404Z tt.return 2026-02-21T17:43:10.4586492Z } 2026-02-21T17:43:10.4586577Z } 2026-02-21T17:43:10.4586628Z 2026-02-21T17:43:10.4586662Z {-# 2026-02-21T17:43:10.4586796Z external_resources: { 2026-02-21T17:43:10.4586899Z mlir_reproducer: { 2026-02-21T17:43:10.4589213Z pipeline: 
"builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T17:43:10.4591552Z disable_threading: false, 2026-02-21T17:43:10.4591667Z verify_each: true 2026-02-21T17:43:10.4591759Z } 2026-02-21T17:43:10.4591840Z } 2026-02-21T17:43:10.4591913Z #-} 2026-02-21T17:43:10.4592203Z /tmp/torchinductor_root/md/cmdvia7vybfsqtykhrh45koxwilgevbpvsgpz3gbiw2ihuwldhjx.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T17:43:10.4592904Z /tmp/torchinductor_root/md/cmdvia7vybfsqtykhrh45koxwilgevbpvsgpz3gbiw2ihuwldhjx.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the 
reproducer above with Triton project.` 2026-02-21T17:43:10.4593537Z [8s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T17:43:10.4594351Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 16, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T17:43:10.4596890Z Error: RuntimeError: PassManager::run failed 2026-02-21T17:43:10.4597063Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T17:43:10.7282071Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 
2026-02-21T17:43:10.7283225Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:10.7283875Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:10.7284447Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T17:43:10.7285013Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T17:43:10.7285614Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:10.7286377Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T17:43:10.7287034Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T17:43:10.7287936Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T17:43:10.7288598Z %c4096_i64 = arith.constant 4096 : i64 2026-02-21T17:43:10.7288845Z %c0_i64 = arith.constant 0 : i64 2026-02-21T17:43:10.7289076Z %c1024_i64 = arith.constant 1024 : i64 2026-02-21T17:43:10.7289412Z %cst = arith.constant dense<0.000000e+00> : tensor<1024x4xf16, #blocked> 2026-02-21T17:43:10.7289783Z %cst_0 = arith.constant dense<0> : tensor<1x4xi64, #blocked> 2026-02-21T17:43:10.7290135Z %cst_1 = arith.constant dense<1024> : tensor<1024x1xi64, #blocked1> 2026-02-21T17:43:10.7290495Z %cst_2 = arith.constant dense<0> : tensor<1024x1xi64, #blocked1> 2026-02-21T17:43:10.7290835Z %cst_3 = arith.constant dense<1024> : tensor<1x4xi64, #blocked> 2026-02-21T17:43:10.7291130Z %c1024_i32 = arith.constant 1024 : i32 
2026-02-21T17:43:10.7291357Z %c1216_i32 = arith.constant 1216 : i32 2026-02-21T17:43:10.7291656Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<1x4xf32, #blocked> 2026-02-21T17:43:10.7291963Z %c4_i32 = arith.constant 4 : i32 2026-02-21T17:43:10.7292193Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T17:43:10.7292430Z %c1048576_i32 = arith.constant 1048576 : i32 2026-02-21T17:43:10.7292676Z %0 = tt.get_program_id x : i32 2026-02-21T17:43:10.7292993Z %1 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked2> 2026-02-21T17:43:10.7293489Z %2 = ttg.convert_layout %1 : tensor<1024xi32, #blocked2> -> tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked3}>> 2026-02-21T17:43:10.7293993Z %3 = tt.expand_dims %2 {axis = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked3}>> -> tensor<1x1024xi32, #blocked3> 2026-02-21T17:43:10.7294460Z %4 = ttg.convert_layout %3 : tensor<1x1024xi32, #blocked3> -> tensor<1x1024xi32, #blocked4> 2026-02-21T17:43:10.7294803Z %5 = tt.splat %arg0 : !tt.ptr -> tensor<1x1024x!tt.ptr, #blocked4> 2026-02-21T17:43:10.7295088Z %6 = tt.splat %arg1 : !tt.ptr -> tensor<1024x4x!tt.ptr, #blocked> 2026-02-21T17:43:10.7295383Z %7 = arith.extsi %1 : tensor<1024xi32, #blocked2> to tensor<1024xi64, #blocked2> 2026-02-21T17:43:10.7295788Z %8 = ttg.convert_layout %7 : tensor<1024xi64, #blocked2> -> tensor<1024xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T17:43:10.7296266Z %9 = tt.expand_dims %8 {axis = 1 : i32} : tensor<1024xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<1024x1xi64, #blocked5> 2026-02-21T17:43:10.7296791Z %10 = ttg.convert_layout %9 : tensor<1024x1xi64, #blocked5> -> tensor<1024x1xi64, #blocked1> 2026-02-21T17:43:10.7297147Z %11 = tt.broadcast %10 : tensor<1024x1xi64, #blocked1> -> tensor<1024x4xi64, #blocked1> 2026-02-21T17:43:10.7297487Z %12 = ttg.convert_layout %11 : tensor<1024x4xi64, #blocked1> -> tensor<1024x4xi64, #blocked> 2026-02-21T17:43:10.7297821Z %13 = 
tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked2> 2026-02-21T17:43:10.7298117Z %14 = arith.extsi %13 : tensor<4xi32, #blocked2> to tensor<4xi64, #blocked2> 2026-02-21T17:43:10.7298398Z %15 = arith.cmpi sge, %10, %cst_2 : tensor<1024x1xi64, #blocked1> 2026-02-21T17:43:10.7298659Z %16 = arith.cmpi slt, %10, %cst_1 : tensor<1024x1xi64, #blocked1> 2026-02-21T17:43:10.7298903Z %17 = arith.andi %15, %16 : tensor<1024x1xi1, #blocked1> 2026-02-21T17:43:10.7299173Z %18 = tt.broadcast %17 : tensor<1024x1xi1, #blocked1> -> tensor<1024x4xi1, #blocked1> 2026-02-21T17:43:10.7299512Z %19 = ttg.convert_layout %18 : tensor<1024x4xi1, #blocked1> -> tensor<1024x4xi1, #blocked> 2026-02-21T17:43:10.7309188Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<1x4x!tt.ptr, #blocked> 2026-02-21T17:43:10.7309418Z scf.for %arg3 = %0 to %c1048576_i32 step %c1216_i32 : i32 { 2026-02-21T17:43:10.7309596Z %21 = arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T17:43:10.7309746Z %22 = arith.muli %21, %c4_i32 : i32 2026-02-21T17:43:10.7309886Z %23 = arith.subi %c4096_i32, %22 : i32 2026-02-21T17:43:10.7310028Z %24 = arith.minsi %23, %c4_i32 : i32 2026-02-21T17:43:10.7310168Z %25 = arith.remsi %arg3, %c1024_i32 : i32 2026-02-21T17:43:10.7310308Z %26 = arith.remsi %25, %24 : i32 2026-02-21T17:43:10.7310440Z %27 = arith.addi %22, %26 : i32 2026-02-21T17:43:10.7310571Z %28 = arith.divsi %25, %24 : i32 2026-02-21T17:43:10.7310702Z %29 = arith.muli %28, %c4_i32 : i32 2026-02-21T17:43:10.7310839Z %30 = arith.muli %27, %c1024_i32 : i32 2026-02-21T17:43:10.7311000Z %31 = tt.splat %30 : i32 -> tensor<1x1024xi32, #blocked4> 2026-02-21T17:43:10.7311183Z %32 = arith.addi %31, %4 : tensor<1x1024xi32, #blocked4> 2026-02-21T17:43:10.7311422Z %33 = tt.addptr %5, %32 : tensor<1x1024x!tt.ptr, #blocked4>, tensor<1x1024xi32, #blocked4> 2026-02-21T17:43:10.7311663Z %34 = tt.load %33 : tensor<1x1024x!tt.ptr, #blocked4> 2026-02-21T17:43:10.7311824Z %35 = arith.extsi %29 : i32 to i64 
2026-02-21T17:43:10.7311982Z %36 = tt.splat %35 : i64 -> tensor<4xi64, #blocked2> 2026-02-21T17:43:10.7312148Z %37 = arith.addi %36, %14 : tensor<4xi64, #blocked2> 2026-02-21T17:43:10.7312422Z %38 = ttg.convert_layout %37 : tensor<4xi64, #blocked2> -> tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked3}>> 2026-02-21T17:43:10.7312789Z %39 = tt.expand_dims %38 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked3}>> -> tensor<1x4xi64, #blocked3> 2026-02-21T17:43:10.7313141Z %40 = ttg.convert_layout %39 : tensor<1x4xi64, #blocked3> -> tensor<1x4xi64, #blocked> 2026-02-21T17:43:10.7313375Z %41 = arith.muli %40, %cst_3 : tensor<1x4xi64, #blocked> 2026-02-21T17:43:10.7313586Z %42 = tt.broadcast %41 : tensor<1x4xi64, #blocked> -> tensor<1024x4xi64, #blocked> 2026-02-21T17:43:10.7313822Z %43 = arith.addi %12, %42 : tensor<1024x4xi64, #blocked> 2026-02-21T17:43:10.7314019Z %44 = tt.addptr %6, %43 : tensor<1024x4x!tt.ptr, #blocked>, tensor<1024x4xi64, #blocked> 2026-02-21T17:43:10.7314234Z %45 = arith.cmpi sge, %40, %cst_0 : tensor<1x4xi64, #blocked> 2026-02-21T17:43:10.7314410Z %46 = arith.cmpi slt, %40, %cst_3 : tensor<1x4xi64, #blocked> 2026-02-21T17:43:10.7314568Z %47 = arith.andi %45, %46 : tensor<1x4xi1, #blocked> 2026-02-21T17:43:10.7314752Z %48 = tt.broadcast %47 : tensor<1x4xi1, #blocked> -> tensor<1024x4xi1, #blocked> 2026-02-21T17:43:10.7314940Z %49 = arith.andi %19, %48 : tensor<1024x4xi1, #blocked> 2026-02-21T17:43:10.7315143Z %50 = tt.load %44, %49, %cst : tensor<1024x4x!tt.ptr, #blocked> 2026-02-21T17:43:10.7315412Z %51 = ttg.convert_layout %34 : tensor<1x1024xf16, #blocked4> -> tensor<1x1024xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T17:43:10.7315764Z %52 = ttg.convert_layout %50 : tensor<1024x4xf16, #blocked> -> tensor<1024x4xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T17:43:10.7316067Z %53 = ttg.convert_layout %cst_4 : tensor<1x4xf32, #blocked> -> tensor<1x4xf32, #blocked> 2026-02-21T17:43:10.7316467Z 
%54 = tt.dot %51, %52, %53, inputPrecision = tf32 : tensor<1x1024xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<1024x4xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<1x4xf32, #blocked> 2026-02-21T17:43:10.7316848Z %55 = arith.truncf %54 : tensor<1x4xf32, #blocked> to tensor<1x4xf16, #blocked> 2026-02-21T17:43:10.7317022Z %56 = arith.extsi %27 : i32 to i64 2026-02-21T17:43:10.7317143Z %57 = arith.muli %56, %c1024_i64 : i64 2026-02-21T17:43:10.7317286Z %58 = tt.splat %57 : i64 -> tensor<1x4xi64, #blocked> 2026-02-21T17:43:10.7317493Z %59 = arith.addi %58, %40 : tensor<1x4xi64, #blocked> 2026-02-21T17:43:10.7317683Z %60 = tt.addptr %20, %59 : tensor<1x4x!tt.ptr, #blocked>, tensor<1x4xi64, #blocked> 2026-02-21T17:43:10.7317867Z %61 = arith.cmpi sge, %56, %c0_i64 : i64 2026-02-21T17:43:10.7318004Z %62 = arith.cmpi slt, %56, %c4096_i64 : i64 2026-02-21T17:43:10.7318132Z %63 = arith.andi %61, %62 : i1 2026-02-21T17:43:10.7318265Z %64 = tt.splat %63 : i1 -> tensor<1x4xi1, #blocked> 2026-02-21T17:43:10.7318418Z %65 = arith.andi %64, %47 : tensor<1x4xi1, #blocked> 2026-02-21T17:43:10.7318575Z tt.store %60, %55, %65 : tensor<1x4x!tt.ptr, #blocked> 2026-02-21T17:43:10.7318716Z } 2026-02-21T17:43:10.7318795Z tt.return 2026-02-21T17:43:10.7318895Z } 2026-02-21T17:43:10.7318973Z } 2026-02-21T17:43:10.7319025Z 2026-02-21T17:43:10.7319060Z {-# 2026-02-21T17:43:10.7319143Z external_resources: { 2026-02-21T17:43:10.7319250Z mlir_reproducer: { 2026-02-21T17:43:10.7321506Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T17:43:10.7323846Z disable_threading: false, 2026-02-21T17:43:10.7323957Z verify_each: true 2026-02-21T17:43:10.7324055Z } 2026-02-21T17:43:10.7324130Z } 2026-02-21T17:43:10.7324208Z #-} 2026-02-21T17:43:10.7324503Z /tmp/torchinductor_root/p7/cp773kwdgczcx3q73dhrwfuvywogv23iye4osw4jy3q45u5suopy.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T17:43:10.7325189Z /tmp/torchinductor_root/p7/cp773kwdgczcx3q73dhrwfuvywogv23iye4osw4jy3q45u5suopy.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T17:43:10.7325768Z [8s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T17:43:10.7326594Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 4, 1024], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T17:43:10.7327323Z Error: RuntimeError: PassManager::run failed 2026-02-21T17:43:10.7327496Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T17:43:10.8294440Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T17:43:10.8297320Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:10.8297655Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:10.8297979Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T17:43:10.8298271Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:10.8298576Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T17:43:10.8298880Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T17:43:10.8299186Z #blocked6 = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 
2026-02-21T17:43:10.8299523Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T17:43:10.8299980Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T17:43:10.8300352Z %cst = arith.constant dense<4096> : tensor<4x1xi64, #blocked> 2026-02-21T17:43:10.8300539Z %cst_0 = arith.constant dense<0> : tensor<4x1xi64, #blocked> 2026-02-21T17:43:10.8300715Z %cst_1 = arith.constant dense<1024> : tensor<4x1xi64, #blocked> 2026-02-21T17:43:10.8300914Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<32x256xf16, #blocked1> 2026-02-21T17:43:10.8301108Z %cst_3 = arith.constant dense<0> : tensor<1x256xi64, #blocked1> 2026-02-21T17:43:10.8301294Z %cst_4 = arith.constant dense<1024> : tensor<32x1xi64, #blocked> 2026-02-21T17:43:10.8301469Z %cst_5 = arith.constant dense<0> : tensor<32x1xi64, #blocked> 2026-02-21T17:43:10.8301696Z %cst_6 = arith.constant dense<1024> : tensor<1x256xi64, #blocked1> 2026-02-21T17:43:10.8301860Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T17:43:10.8302005Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T17:43:10.8302131Z %c0_i32 = arith.constant 0 : i32 2026-02-21T17:43:10.8302248Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T17:43:10.8302401Z %cst_7 = arith.constant dense<1024> : tensor<4x1xi32, #blocked> 2026-02-21T17:43:10.8302595Z %cst_8 = arith.constant dense<0.000000e+00> : tensor<4x256xf32, #blocked1> 2026-02-21T17:43:10.8302760Z %c4_i32 = arith.constant 4 : i32 2026-02-21T17:43:10.8302881Z %c256_i32 = arith.constant 256 : i32 2026-02-21T17:43:10.8303032Z %c32_i32 = arith.constant 32 : i32 2026-02-21T17:43:10.8303155Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T17:43:10.8303279Z %0 = tt.get_program_id x : i32 2026-02-21T17:43:10.8303439Z %1 = tt.make_range {end = 4 : i32, start 
= 0 : i32} : tensor<4xi32, #blocked2> 2026-02-21T17:43:10.8303652Z %2 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked2> 2026-02-21T17:43:10.8303858Z %3 = tt.splat %arg0 : !tt.ptr -> tensor<4x32x!tt.ptr, #blocked3> 2026-02-21T17:43:10.8304065Z %4 = tt.splat %arg1 : !tt.ptr -> tensor<32x256x!tt.ptr, #blocked1> 2026-02-21T17:43:10.8304266Z %5 = arith.extsi %2 : tensor<32xi32, #blocked2> to tensor<32xi64, #blocked2> 2026-02-21T17:43:10.8304478Z %6 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked2> 2026-02-21T17:43:10.8304689Z %7 = arith.extsi %6 : tensor<256xi32, #blocked2> to tensor<256xi64, #blocked2> 2026-02-21T17:43:10.8304897Z %8 = tt.splat %arg2 : !tt.ptr -> tensor<4x256x!tt.ptr, #blocked1> 2026-02-21T17:43:10.8305104Z %9 = arith.extsi %1 : tensor<4xi32, #blocked2> to tensor<4xi64, #blocked2> 2026-02-21T17:43:10.8305348Z %10 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked2> 2026-02-21T17:43:10.8305569Z %11 = arith.extsi %10 : tensor<256xi32, #blocked2> to tensor<256xi64, #blocked2> 2026-02-21T17:43:10.8305763Z scf.for %arg3 = %0 to %c4096_i32 step %c4864_i32 : i32 { 2026-02-21T17:43:10.8305920Z %12 = arith.divsi %arg3, %c32768_i32 : i32 2026-02-21T17:43:10.8306052Z %13 = arith.muli %12, %c32_i32 : i32 2026-02-21T17:43:10.8306172Z %14 = arith.subi %c4_i32, %13 : i32 2026-02-21T17:43:10.8306296Z %15 = arith.minsi %14, %c32_i32 : i32 2026-02-21T17:43:10.8306420Z %16 = arith.remsi %arg3, %c32768_i32 : i32 2026-02-21T17:43:10.8306547Z %17 = arith.remsi %16, %15 : i32 2026-02-21T17:43:10.8306662Z %18 = arith.addi %13, %17 : i32 2026-02-21T17:43:10.8306781Z %19 = arith.divsi %16, %15 : i32 2026-02-21T17:43:10.8306897Z %20 = arith.muli %18, %c256_i32 : i32 2026-02-21T17:43:10.8307021Z %21 = arith.muli %19, %c4_i32 : i32 2026-02-21T17:43:10.8307165Z %22 = tt.splat %21 : i32 -> tensor<4xi32, #blocked2> 2026-02-21T17:43:10.8307319Z %23 = arith.addi %22, %1 : tensor<4xi32, #blocked2> 
2026-02-21T17:43:10.8307556Z %24 = ttg.convert_layout %23 : tensor<4xi32, #blocked2> -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:10.8307889Z %25 = tt.expand_dims %24 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi32, #blocked4> 2026-02-21T17:43:10.8308257Z %26 = ttg.convert_layout %25 : tensor<4x1xi32, #blocked4> -> tensor<4x1xi32, #blocked> 2026-02-21T17:43:10.8308559Z %27 = arith.muli %26, %cst_7 : tensor<4x1xi32, #blocked> 2026-02-21T17:43:10.8308743Z %28 = tt.broadcast %27 : tensor<4x1xi32, #blocked> -> tensor<4x32xi32, #blocked> 2026-02-21T17:43:10.8308984Z %29 = ttg.convert_layout %28 : tensor<4x32xi32, #blocked> -> tensor<4x32xi32, #blocked3> 2026-02-21T17:43:10.8309167Z %30 = arith.extsi %20 : i32 to i64 2026-02-21T17:43:10.8309311Z %31 = tt.splat %30 : i64 -> tensor<256xi64, #blocked2> 2026-02-21T17:43:10.8309487Z %32 = arith.addi %31, %7 : tensor<256xi64, #blocked2> 2026-02-21T17:43:10.8309758Z %33 = ttg.convert_layout %32 : tensor<256xi64, #blocked2> -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:43:10.8310103Z %34 = tt.expand_dims %33 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi64, #blocked5> 2026-02-21T17:43:10.8310399Z %35 = ttg.convert_layout %34 : tensor<1x256xi64, #blocked5> -> tensor<1x256xi64, #blocked1> 2026-02-21T17:43:10.8310616Z %36 = arith.muli %35, %cst_6 : tensor<1x256xi64, #blocked1> 2026-02-21T17:43:10.8310816Z %37 = tt.broadcast %36 : tensor<1x256xi64, #blocked1> -> tensor<32x256xi64, #blocked1> 2026-02-21T17:43:10.8311043Z %38 = arith.cmpi sge, %35, %cst_3 : tensor<1x256xi64, #blocked1> 2026-02-21T17:43:10.8311224Z %39 = arith.cmpi slt, %35, %cst_6 : tensor<1x256xi64, #blocked1> 2026-02-21T17:43:10.8311394Z %40 = arith.andi %38, %39 : tensor<1x256xi1, #blocked1> 2026-02-21T17:43:10.8311586Z %41 = tt.broadcast %40 : tensor<1x256xi1, #blocked1> -> tensor<32x256xi1, #blocked1> 
2026-02-21T17:43:10.8311868Z %42 = scf.for %arg4 = %c0_i32 to %c1024_i32 step %c32_i32 iter_args(%arg5 = %cst_8) -> (tensor<4x256xf32, #blocked1>) : i32 { 2026-02-21T17:43:10.8312119Z %72 = tt.splat %arg4 : i32 -> tensor<32xi32, #blocked2> 2026-02-21T17:43:10.8312281Z %73 = arith.addi %72, %2 : tensor<32xi32, #blocked2> 2026-02-21T17:43:10.8312516Z %74 = ttg.convert_layout %73 : tensor<32xi32, #blocked2> -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:43:10.8312853Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x32xi32, #blocked5> 2026-02-21T17:43:10.8313187Z %76 = ttg.convert_layout %75 : tensor<1x32xi32, #blocked5> -> tensor<1x32xi32, #blocked3> 2026-02-21T17:43:10.8313428Z %77 = tt.broadcast %76 : tensor<1x32xi32, #blocked3> -> tensor<4x32xi32, #blocked3> 2026-02-21T17:43:10.8313630Z %78 = arith.addi %29, %77 : tensor<4x32xi32, #blocked3> 2026-02-21T17:43:10.8313829Z %79 = tt.addptr %3, %78 : tensor<4x32x!tt.ptr, #blocked3>, tensor<4x32xi32, #blocked3> 2026-02-21T17:43:10.8314044Z %80 = tt.load %79 : tensor<4x32x!tt.ptr, #blocked3> 2026-02-21T17:43:10.8314188Z %81 = arith.extsi %arg4 : i32 to i64 2026-02-21T17:43:10.8314332Z %82 = tt.splat %81 : i64 -> tensor<32xi64, #blocked2> 2026-02-21T17:43:10.8314490Z %83 = arith.addi %82, %5 : tensor<32xi64, #blocked2> 2026-02-21T17:43:10.8314722Z %84 = ttg.convert_layout %83 : tensor<32xi64, #blocked2> -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:10.8315054Z %85 = tt.expand_dims %84 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<32x1xi64, #blocked4> 2026-02-21T17:43:10.8315344Z %86 = ttg.convert_layout %85 : tensor<32x1xi64, #blocked4> -> tensor<32x1xi64, #blocked> 2026-02-21T17:43:10.8315586Z %87 = tt.broadcast %86 : tensor<32x1xi64, #blocked> -> tensor<32x256xi64, #blocked> 2026-02-21T17:43:10.8315833Z %88 = ttg.convert_layout %87 : tensor<32x256xi64, 
#blocked> -> tensor<32x256xi64, #blocked1> 2026-02-21T17:43:10.8316042Z %89 = arith.addi %88, %37 : tensor<32x256xi64, #blocked1> 2026-02-21T17:43:10.8316251Z %90 = tt.addptr %4, %89 : tensor<32x256x!tt.ptr, #blocked1>, tensor<32x256xi64, #blocked1> 2026-02-21T17:43:10.8316464Z %91 = arith.cmpi sge, %86, %cst_5 : tensor<32x1xi64, #blocked> 2026-02-21T17:43:10.8316642Z %92 = arith.cmpi slt, %86, %cst_4 : tensor<32x1xi64, #blocked> 2026-02-21T17:43:10.8316807Z %93 = arith.andi %91, %92 : tensor<32x1xi1, #blocked> 2026-02-21T17:43:10.8316996Z %94 = tt.broadcast %93 : tensor<32x1xi1, #blocked> -> tensor<32x256xi1, #blocked> 2026-02-21T17:43:10.8317239Z %95 = ttg.convert_layout %94 : tensor<32x256xi1, #blocked> -> tensor<32x256xi1, #blocked1> 2026-02-21T17:43:10.8317467Z %96 = arith.andi %95, %41 : tensor<32x256xi1, #blocked1> 2026-02-21T17:43:10.8317640Z %97 = tt.load %90, %96, %cst_2 : tensor<32x256x!tt.ptr, #blocked1> 2026-02-21T17:43:10.8317904Z %98 = ttg.convert_layout %80 : tensor<4x32xf16, #blocked3> -> tensor<4x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> 2026-02-21T17:43:10.8318256Z %99 = ttg.convert_layout %97 : tensor<32x256xf16, #blocked1> -> tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> 2026-02-21T17:43:10.8318567Z %100 = ttg.convert_layout %arg5 : tensor<4x256xf32, #blocked1> -> tensor<4x256xf32, #blocked6> 2026-02-21T17:43:10.8319015Z %101 = tt.dot %98, %99, %100, inputPrecision = tf32 : tensor<4x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<4x256xf32, #blocked6> 2026-02-21T17:43:10.8319428Z %102 = ttg.convert_layout %101 : tensor<4x256xf32, #blocked6> -> tensor<4x256xf32, #blocked1> 2026-02-21T17:43:10.8319641Z scf.yield %102 : tensor<4x256xf32, #blocked1> 2026-02-21T17:43:10.8319776Z } {tt.disallow_acc_multi_buffer} 2026-02-21T17:43:10.8319954Z %43 = arith.truncf %42 : tensor<4x256xf32, #blocked1> to tensor<4x256xf16, #blocked1> 
2026-02-21T17:43:10.8320135Z %44 = arith.extsi %21 : i32 to i64 2026-02-21T17:43:10.8320259Z %45 = arith.extsi %20 : i32 to i64 2026-02-21T17:43:10.8320393Z %46 = tt.splat %44 : i64 -> tensor<4xi64, #blocked2> 2026-02-21T17:43:10.8320547Z %47 = arith.addi %46, %9 : tensor<4xi64, #blocked2> 2026-02-21T17:43:10.8320781Z %48 = ttg.convert_layout %47 : tensor<4xi64, #blocked2> -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:10.8321147Z %49 = tt.expand_dims %48 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi64, #blocked4> 2026-02-21T17:43:10.8321433Z %50 = ttg.convert_layout %49 : tensor<4x1xi64, #blocked4> -> tensor<4x1xi64, #blocked> 2026-02-21T17:43:10.8321632Z %51 = arith.muli %50, %cst_1 : tensor<4x1xi64, #blocked> 2026-02-21T17:43:10.8321819Z %52 = tt.broadcast %51 : tensor<4x1xi64, #blocked> -> tensor<4x256xi64, #blocked> 2026-02-21T17:43:10.8322054Z %53 = ttg.convert_layout %52 : tensor<4x256xi64, #blocked> -> tensor<4x256xi64, #blocked1> 2026-02-21T17:43:10.8322252Z %54 = tt.splat %45 : i64 -> tensor<256xi64, #blocked2> 2026-02-21T17:43:10.8322408Z %55 = arith.addi %54, %11 : tensor<256xi64, #blocked2> 2026-02-21T17:43:10.8322645Z %56 = ttg.convert_layout %55 : tensor<256xi64, #blocked2> -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:43:10.8322980Z %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi64, #blocked5> 2026-02-21T17:43:10.8323275Z %58 = ttg.convert_layout %57 : tensor<1x256xi64, #blocked5> -> tensor<1x256xi64, #blocked1> 2026-02-21T17:43:10.8323514Z %59 = tt.broadcast %58 : tensor<1x256xi64, #blocked1> -> tensor<4x256xi64, #blocked1> 2026-02-21T17:43:10.8323712Z %60 = arith.addi %53, %59 : tensor<4x256xi64, #blocked1> 2026-02-21T17:43:10.8323936Z %61 = tt.addptr %8, %60 : tensor<4x256x!tt.ptr, #blocked1>, tensor<4x256xi64, #blocked1> 2026-02-21T17:43:10.8324153Z %62 = arith.cmpi 
sge, %50, %cst_0 : tensor<4x1xi64, #blocked> 2026-02-21T17:43:10.8324327Z %63 = arith.cmpi slt, %50, %cst : tensor<4x1xi64, #blocked> 2026-02-21T17:43:10.8324485Z %64 = arith.andi %62, %63 : tensor<4x1xi1, #blocked> 2026-02-21T17:43:10.8324668Z %65 = tt.broadcast %64 : tensor<4x1xi1, #blocked> -> tensor<4x256xi1, #blocked> 2026-02-21T17:43:10.8324894Z %66 = ttg.convert_layout %65 : tensor<4x256xi1, #blocked> -> tensor<4x256xi1, #blocked1> 2026-02-21T17:43:10.8325110Z %67 = arith.cmpi sge, %58, %cst_3 : tensor<1x256xi64, #blocked1> 2026-02-21T17:43:10.8325283Z %68 = arith.cmpi slt, %58, %cst_6 : tensor<1x256xi64, #blocked1> 2026-02-21T17:43:10.8325470Z %69 = arith.andi %67, %68 : tensor<1x256xi1, #blocked1> 2026-02-21T17:43:10.8325660Z %70 = tt.broadcast %69 : tensor<1x256xi1, #blocked1> -> tensor<4x256xi1, #blocked1> 2026-02-21T17:43:10.8325851Z %71 = arith.andi %66, %70 : tensor<4x256xi1, #blocked1> 2026-02-21T17:43:10.8326018Z tt.store %61, %43, %71 : tensor<4x256x!tt.ptr, #blocked1> 2026-02-21T17:43:10.8326168Z } {tt.disallow_acc_multi_buffer} 2026-02-21T17:43:10.8326285Z tt.return 2026-02-21T17:43:10.8326368Z } 2026-02-21T17:43:10.8326448Z } 2026-02-21T17:43:10.8326493Z 2026-02-21T17:43:10.8326527Z {-# 2026-02-21T17:43:10.8326615Z external_resources: { 2026-02-21T17:43:10.8326741Z mlir_reproducer: { 2026-02-21T17:43:10.8329027Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal 
test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=2 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=2}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T17:43:10.8331349Z disable_threading: false, 2026-02-21T17:43:10.8331464Z verify_each: true 2026-02-21T17:43:10.8331560Z } 2026-02-21T17:43:10.8331640Z } 2026-02-21T17:43:10.8331717Z #-} 2026-02-21T17:43:10.8332002Z /tmp/torchinductor_root/zd/czdjnffevd3wcqdsdxzckild6go5gxjom6nhdphslmxwlgmawgoq.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T17:43:10.8332698Z /tmp/torchinductor_root/zd/czdjnffevd3wcqdsdxzckild6go5gxjom6nhdphslmxwlgmawgoq.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T17:43:10.8333262Z [9s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T17:43:10.8334072Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 256, 32], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T17:43:10.8334816Z Error: RuntimeError: PassManager::run failed 2026-02-21T17:43:10.8334987Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T17:43:11.3733740Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T17:43:11.3735879Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:11.3736360Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:11.3736665Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T17:43:11.3736962Z #blocked3 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T17:43:11.3737265Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T17:43:11.3737568Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T17:43:11.3737949Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : 
i32} { 2026-02-21T17:43:11.3738410Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T17:43:11.3738805Z %cst = arith.constant dense<0.000000e+00> : tensor<64x4xf16, #blocked> 2026-02-21T17:43:11.3739001Z %cst_0 = arith.constant dense<0> : tensor<1x4xi64, #blocked> 2026-02-21T17:43:11.3739182Z %cst_1 = arith.constant dense<1024> : tensor<64x1xi64, #blocked1> 2026-02-21T17:43:11.3739366Z %cst_2 = arith.constant dense<0> : tensor<64x1xi64, #blocked1> 2026-02-21T17:43:11.3739547Z %cst_3 = arith.constant dense<1024> : tensor<1x4xi64, #blocked> 2026-02-21T17:43:11.3739734Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<2x64xf16, #blocked2> 2026-02-21T17:43:11.3739934Z %cst_5 = arith.constant dense<1024> : tensor<1x64xi64, #blocked2> 2026-02-21T17:43:11.3740110Z %cst_6 = arith.constant dense<0> : tensor<1x64xi64, #blocked2> 2026-02-21T17:43:11.3740369Z %cst_7 = arith.constant dense<4096> : tensor<2x1xi64, #blocked1> 2026-02-21T17:43:11.3740543Z %cst_8 = arith.constant dense<0> : tensor<2x1xi64, #blocked1> 2026-02-21T17:43:11.3740724Z %cst_9 = arith.constant dense<1024> : tensor<2x1xi64, #blocked1> 2026-02-21T17:43:11.3740878Z %c64_i32 = arith.constant 64 : i32 2026-02-21T17:43:11.3741001Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T17:43:11.3741128Z %c0_i32 = arith.constant 0 : i32 2026-02-21T17:43:11.3741244Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T17:43:11.3741408Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<2x4xf32, #blocked> 2026-02-21T17:43:11.3741566Z %c4_i32 = arith.constant 4 : i32 2026-02-21T17:43:11.3741682Z %c2_i32 = arith.constant 2 : i32 2026-02-21T17:43:11.3741798Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T17:43:11.3741926Z %c524288_i32 = arith.constant 524288 : i32 2026-02-21T17:43:11.3742059Z %0 = tt.get_program_id x : i32 2026-02-21T17:43:11.3742228Z %1 = 
tt.splat %arg0 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T17:43:11.3742439Z %2 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked3> 2026-02-21T17:43:11.3742639Z %3 = arith.extsi %2 : tensor<2xi32, #blocked3> to tensor<2xi64, #blocked3> 2026-02-21T17:43:11.3742845Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #blocked3> 2026-02-21T17:43:11.3743054Z %5 = arith.extsi %4 : tensor<64xi32, #blocked3> to tensor<64xi64, #blocked3> 2026-02-21T17:43:11.3743251Z %6 = tt.splat %arg1 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked> 2026-02-21T17:43:11.3743452Z %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked3> 2026-02-21T17:43:11.3743647Z %8 = arith.extsi %7 : tensor<4xi32, #blocked3> to tensor<4xi64, #blocked3> 2026-02-21T17:43:11.3743846Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<2x4x!tt.ptr, #blocked> 2026-02-21T17:43:11.3744047Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked3> 2026-02-21T17:43:11.3744281Z %11 = arith.extsi %10 : tensor<2xi32, #blocked3> to tensor<2xi64, #blocked3> 2026-02-21T17:43:11.3744484Z %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked3> 2026-02-21T17:43:11.3744703Z %13 = arith.extsi %12 : tensor<4xi32, #blocked3> to tensor<4xi64, #blocked3> 2026-02-21T17:43:11.3744893Z scf.for %arg3 = %0 to %c524288_i32 step %c2432_i32 : i32 { 2026-02-21T17:43:11.3745051Z %14 = arith.remsi %arg3, %c2048_i32 : i32 2026-02-21T17:43:11.3745178Z %15 = arith.divsi %arg3, %c2048_i32 : i32 2026-02-21T17:43:11.3745302Z %16 = arith.muli %14, %c2_i32 : i32 2026-02-21T17:43:11.3745420Z %17 = arith.muli %15, %c4_i32 : i32 2026-02-21T17:43:11.3745561Z %18 = arith.extsi %16 : i32 to i64 2026-02-21T17:43:11.3745696Z %19 = tt.splat %18 : i64 -> tensor<2xi64, #blocked3> 2026-02-21T17:43:11.3745851Z %20 = arith.addi %19, %3 : tensor<2xi64, #blocked3> 2026-02-21T17:43:11.3746084Z %21 = ttg.convert_layout %20 : tensor<2xi64, #blocked3> -> 
tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:11.3746419Z %22 = tt.expand_dims %21 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<2x1xi64, #blocked4> 2026-02-21T17:43:11.3746712Z %23 = ttg.convert_layout %22 : tensor<2x1xi64, #blocked4> -> tensor<2x1xi64, #blocked1> 2026-02-21T17:43:11.3746913Z %24 = arith.muli %23, %cst_9 : tensor<2x1xi64, #blocked1> 2026-02-21T17:43:11.3747110Z %25 = tt.broadcast %24 : tensor<2x1xi64, #blocked1> -> tensor<2x64xi64, #blocked1> 2026-02-21T17:43:11.3747342Z %26 = ttg.convert_layout %25 : tensor<2x64xi64, #blocked1> -> tensor<2x64xi64, #blocked2> 2026-02-21T17:43:11.3747554Z %27 = arith.cmpi sge, %23, %cst_8 : tensor<2x1xi64, #blocked1> 2026-02-21T17:43:11.3747729Z %28 = arith.cmpi slt, %23, %cst_7 : tensor<2x1xi64, #blocked1> 2026-02-21T17:43:11.3747931Z %29 = arith.andi %27, %28 : tensor<2x1xi1, #blocked1> 2026-02-21T17:43:11.3748117Z %30 = tt.broadcast %29 : tensor<2x1xi1, #blocked1> -> tensor<2x64xi1, #blocked1> 2026-02-21T17:43:11.3748422Z %31 = ttg.convert_layout %30 : tensor<2x64xi1, #blocked1> -> tensor<2x64xi1, #blocked2> 2026-02-21T17:43:11.3748606Z %32 = arith.extsi %17 : i32 to i64 2026-02-21T17:43:11.3748743Z %33 = tt.splat %32 : i64 -> tensor<4xi64, #blocked3> 2026-02-21T17:43:11.3748893Z %34 = arith.addi %33, %8 : tensor<4xi64, #blocked3> 2026-02-21T17:43:11.3749121Z %35 = ttg.convert_layout %34 : tensor<4xi64, #blocked3> -> tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:43:11.3749442Z %36 = tt.expand_dims %35 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4xi64, #blocked5> 2026-02-21T17:43:11.3749726Z %37 = ttg.convert_layout %36 : tensor<1x4xi64, #blocked5> -> tensor<1x4xi64, #blocked> 2026-02-21T17:43:11.3749925Z %38 = arith.muli %37, %cst_3 : tensor<1x4xi64, #blocked> 2026-02-21T17:43:11.3750107Z %39 = tt.broadcast %38 : tensor<1x4xi64, #blocked> -> tensor<64x4xi64, #blocked> 
2026-02-21T17:43:11.3750308Z %40 = arith.cmpi sge, %37, %cst_0 : tensor<1x4xi64, #blocked> 2026-02-21T17:43:11.3750475Z %41 = arith.cmpi slt, %37, %cst_3 : tensor<1x4xi64, #blocked> 2026-02-21T17:43:11.3750633Z %42 = arith.andi %40, %41 : tensor<1x4xi1, #blocked> 2026-02-21T17:43:11.3750809Z %43 = tt.broadcast %42 : tensor<1x4xi1, #blocked> -> tensor<64x4xi1, #blocked> 2026-02-21T17:43:11.3751091Z %44 = scf.for %arg4 = %c0_i32 to %c1024_i32 step %c64_i32 iter_args(%arg5 = %cst_10) -> (tensor<2x4xf32, #blocked>) : i32 { 2026-02-21T17:43:11.3751331Z %74 = arith.extsi %arg4 : i32 to i64 2026-02-21T17:43:11.3751473Z %75 = tt.splat %74 : i64 -> tensor<64xi64, #blocked3> 2026-02-21T17:43:11.3751630Z %76 = arith.addi %75, %5 : tensor<64xi64, #blocked3> 2026-02-21T17:43:11.3751874Z %77 = ttg.convert_layout %76 : tensor<64xi64, #blocked3> -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:43:11.3752256Z %78 = tt.expand_dims %77 {axis = 0 : i32} : tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x64xi64, #blocked5> 2026-02-21T17:43:11.3752556Z %79 = ttg.convert_layout %78 : tensor<1x64xi64, #blocked5> -> tensor<1x64xi64, #blocked2> 2026-02-21T17:43:11.3752795Z %80 = tt.broadcast %79 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T17:43:11.3752997Z %81 = arith.addi %26, %80 : tensor<2x64xi64, #blocked2> 2026-02-21T17:43:11.3753233Z %82 = tt.addptr %1, %81 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T17:43:11.3753445Z %83 = arith.cmpi sge, %79, %cst_6 : tensor<1x64xi64, #blocked2> 2026-02-21T17:43:11.3753647Z %84 = arith.cmpi slt, %79, %cst_5 : tensor<1x64xi64, #blocked2> 2026-02-21T17:43:11.3753813Z %85 = arith.andi %83, %84 : tensor<1x64xi1, #blocked2> 2026-02-21T17:43:11.3754009Z %86 = tt.broadcast %85 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T17:43:11.3754200Z %87 = arith.andi %31, %86 : tensor<2x64xi1, #blocked2> 2026-02-21T17:43:11.3754376Z %88 = 
tt.load %82, %87, %cst_4 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T17:43:11.3754639Z %89 = ttg.convert_layout %76 : tensor<64xi64, #blocked3> -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:11.3754973Z %90 = tt.expand_dims %89 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<64x1xi64, #blocked4> 2026-02-21T17:43:11.3755276Z %91 = ttg.convert_layout %90 : tensor<64x1xi64, #blocked4> -> tensor<64x1xi64, #blocked1> 2026-02-21T17:43:11.3755524Z %92 = tt.broadcast %91 : tensor<64x1xi64, #blocked1> -> tensor<64x4xi64, #blocked1> 2026-02-21T17:43:11.3755772Z %93 = ttg.convert_layout %92 : tensor<64x4xi64, #blocked1> -> tensor<64x4xi64, #blocked> 2026-02-21T17:43:11.3756009Z %94 = arith.addi %93, %39 : tensor<64x4xi64, #blocked> 2026-02-21T17:43:11.3756206Z %95 = tt.addptr %6, %94 : tensor<64x4x!tt.ptr, #blocked>, tensor<64x4xi64, #blocked> 2026-02-21T17:43:11.3756415Z %96 = arith.cmpi sge, %91, %cst_2 : tensor<64x1xi64, #blocked1> 2026-02-21T17:43:11.3756588Z %97 = arith.cmpi slt, %91, %cst_1 : tensor<64x1xi64, #blocked1> 2026-02-21T17:43:11.3756758Z %98 = arith.andi %96, %97 : tensor<64x1xi1, #blocked1> 2026-02-21T17:43:11.3756948Z %99 = tt.broadcast %98 : tensor<64x1xi1, #blocked1> -> tensor<64x4xi1, #blocked1> 2026-02-21T17:43:11.3757180Z %100 = ttg.convert_layout %99 : tensor<64x4xi1, #blocked1> -> tensor<64x4xi1, #blocked> 2026-02-21T17:43:11.3757379Z %101 = arith.andi %100, %43 : tensor<64x4xi1, #blocked> 2026-02-21T17:43:11.3757554Z %102 = tt.load %95, %101, %cst : tensor<64x4x!tt.ptr, #blocked> 2026-02-21T17:43:11.3757826Z %103 = ttg.convert_layout %88 : tensor<2x64xf16, #blocked2> -> tensor<2x64xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T17:43:11.3758170Z %104 = ttg.convert_layout %102 : tensor<64x4xf16, #blocked> -> tensor<64x4xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T17:43:11.3758475Z %105 = ttg.convert_layout %arg5 : tensor<2x4xf32, #blocked> -> 
tensor<2x4xf32, #blocked> 2026-02-21T17:43:11.3758875Z %106 = tt.dot %103, %104, %105, inputPrecision = tf32 : tensor<2x64xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<64x4xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<2x4xf32, #blocked> 2026-02-21T17:43:11.3759220Z scf.yield %106 : tensor<2x4xf32, #blocked> 2026-02-21T17:43:11.3759360Z } {tt.disallow_acc_multi_buffer} 2026-02-21T17:43:11.3759528Z %45 = arith.truncf %44 : tensor<2x4xf32, #blocked> to tensor<2x4xf16, #blocked> 2026-02-21T17:43:11.3759703Z %46 = arith.extsi %16 : i32 to i64 2026-02-21T17:43:11.3759828Z %47 = arith.extsi %17 : i32 to i64 2026-02-21T17:43:11.3759966Z %48 = tt.splat %46 : i64 -> tensor<2xi64, #blocked3> 2026-02-21T17:43:11.3760141Z %49 = arith.addi %48, %11 : tensor<2xi64, #blocked3> 2026-02-21T17:43:11.3760373Z %50 = ttg.convert_layout %49 : tensor<2xi64, #blocked3> -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:11.3760700Z %51 = tt.expand_dims %50 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<2x1xi64, #blocked4> 2026-02-21T17:43:11.3760982Z %52 = ttg.convert_layout %51 : tensor<2x1xi64, #blocked4> -> tensor<2x1xi64, #blocked1> 2026-02-21T17:43:11.3761188Z %53 = arith.muli %52, %cst_9 : tensor<2x1xi64, #blocked1> 2026-02-21T17:43:11.3761379Z %54 = tt.broadcast %53 : tensor<2x1xi64, #blocked1> -> tensor<2x4xi64, #blocked1> 2026-02-21T17:43:11.3761626Z %55 = ttg.convert_layout %54 : tensor<2x4xi64, #blocked1> -> tensor<2x4xi64, #blocked> 2026-02-21T17:43:11.3761823Z %56 = tt.splat %47 : i64 -> tensor<4xi64, #blocked3> 2026-02-21T17:43:11.3761970Z %57 = arith.addi %56, %13 : tensor<4xi64, #blocked3> 2026-02-21T17:43:11.3762203Z %58 = ttg.convert_layout %57 : tensor<4xi64, #blocked3> -> tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:43:11.3762523Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4xi64, 
#blocked5> 2026-02-21T17:43:11.3762805Z %60 = ttg.convert_layout %59 : tensor<1x4xi64, #blocked5> -> tensor<1x4xi64, #blocked> 2026-02-21T17:43:11.3763032Z %61 = tt.broadcast %60 : tensor<1x4xi64, #blocked> -> tensor<2x4xi64, #blocked> 2026-02-21T17:43:11.3763218Z %62 = arith.addi %55, %61 : tensor<2x4xi64, #blocked> 2026-02-21T17:43:11.3763412Z %63 = tt.addptr %9, %62 : tensor<2x4x!tt.ptr, #blocked>, tensor<2x4xi64, #blocked> 2026-02-21T17:43:11.3763614Z %64 = arith.cmpi sge, %52, %cst_8 : tensor<2x1xi64, #blocked1> 2026-02-21T17:43:11.3763835Z %65 = arith.cmpi slt, %52, %cst_7 : tensor<2x1xi64, #blocked1> 2026-02-21T17:43:11.3764000Z %66 = arith.andi %64, %65 : tensor<2x1xi1, #blocked1> 2026-02-21T17:43:11.3764182Z %67 = tt.broadcast %66 : tensor<2x1xi1, #blocked1> -> tensor<2x4xi1, #blocked1> 2026-02-21T17:43:11.3764404Z %68 = ttg.convert_layout %67 : tensor<2x4xi1, #blocked1> -> tensor<2x4xi1, #blocked> 2026-02-21T17:43:11.3764603Z %69 = arith.cmpi sge, %60, %cst_0 : tensor<1x4xi64, #blocked> 2026-02-21T17:43:11.3764775Z %70 = arith.cmpi slt, %60, %cst_3 : tensor<1x4xi64, #blocked> 2026-02-21T17:43:11.3764932Z %71 = arith.andi %69, %70 : tensor<1x4xi1, #blocked> 2026-02-21T17:43:11.3765113Z %72 = tt.broadcast %71 : tensor<1x4xi1, #blocked> -> tensor<2x4xi1, #blocked> 2026-02-21T17:43:11.3765296Z %73 = arith.andi %68, %72 : tensor<2x4xi1, #blocked> 2026-02-21T17:43:11.3765455Z tt.store %63, %45, %73 : tensor<2x4x!tt.ptr, #blocked> 2026-02-21T17:43:11.3765599Z } 2026-02-21T17:43:11.3765680Z tt.return 2026-02-21T17:43:11.3765773Z } 2026-02-21T17:43:11.3765852Z } 2026-02-21T17:43:11.3765907Z 2026-02-21T17:43:11.3765940Z {-# 2026-02-21T17:43:11.3766025Z external_resources: { 2026-02-21T17:43:11.3766131Z mlir_reproducer: { 2026-02-21T17:43:11.3768401Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, 
tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T17:43:11.3770747Z disable_threading: false, 2026-02-21T17:43:11.3770857Z verify_each: true 2026-02-21T17:43:11.3770956Z } 2026-02-21T17:43:11.3771032Z } 2026-02-21T17:43:11.3771127Z #-} 2026-02-21T17:43:11.3771413Z /tmp/torchinductor_root/iu/ciu6w5v64sg7ebzajkeash7zfy7537edqdbi6fz4xifpkyto4lse.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T17:43:11.3772107Z /tmp/torchinductor_root/iu/ciu6w5v64sg7ebzajkeash7zfy7537edqdbi6fz4xifpkyto4lse.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T17:43:11.3772667Z [9s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T17:43:11.3773458Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 4, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T17:43:11.3774184Z Error: RuntimeError: PassManager::run failed 2026-02-21T17:43:11.3774393Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T17:43:13.0973791Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T17:43:13.0975610Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T17:43:13.0976470Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T17:43:13.0977270Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T17:43:13.0978125Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T17:43:13.0978961Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T17:43:13.0979786Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T17:43:13.0980694Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} 
{ 2026-02-21T17:43:13.0981961Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T17:43:13.0982996Z %cst = arith.constant dense<0.000000e+00> : tensor<256x16xf16, #blocked> 2026-02-21T17:43:13.0983370Z %cst_0 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T17:43:13.0983689Z %cst_1 = arith.constant dense<1024> : tensor<256x1xi64, #blocked1> 2026-02-21T17:43:13.0984004Z %cst_2 = arith.constant dense<0> : tensor<256x1xi64, #blocked1> 2026-02-21T17:43:13.0984319Z %cst_3 = arith.constant dense<1024> : tensor<1x16xi64, #blocked> 2026-02-21T17:43:13.0984952Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T17:43:13.0985165Z %c256_i32 = arith.constant 256 : i32 2026-02-21T17:43:13.0985372Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T17:43:13.0985581Z %c0_i32 = arith.constant 0 : i32 2026-02-21T17:43:13.0985781Z %c304_i32 = arith.constant 304 : i32 2026-02-21T17:43:13.0986028Z %cst_4 = arith.constant dense<1024> : tensor<4x1xi32, #blocked1> 2026-02-21T17:43:13.0986359Z %cst_5 = arith.constant dense<0.000000e+00> : tensor<4x16xf32, #blocked> 2026-02-21T17:43:13.0986637Z %c4_i32 = arith.constant 4 : i32 2026-02-21T17:43:13.0986828Z %c16_i32 = arith.constant 16 : i32 2026-02-21T17:43:13.0987104Z %c2_i32 = arith.constant 2 : i32 2026-02-21T17:43:13.0987289Z %c64_i32 = arith.constant 64 : i32 2026-02-21T17:43:13.0987500Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T17:43:13.0987758Z %0 = tt.get_program_id x : i32 2026-02-21T17:43:13.0988036Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked2> 2026-02-21T17:43:13.0988477Z %2 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked2> 2026-02-21T17:43:13.0988827Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked2> 2026-02-21T17:43:13.0989189Z %4 = tt.splat %arg0 : 
!tt.ptr -> tensor<4x256x!tt.ptr, #blocked3> 2026-02-21T17:43:13.0989535Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<256x16x!tt.ptr, #blocked> 2026-02-21T17:43:13.0989888Z %6 = arith.extsi %3 : tensor<256xi32, #blocked2> to tensor<256xi64, #blocked2> 2026-02-21T17:43:13.0990242Z %7 = arith.extsi %1 : tensor<16xi32, #blocked2> to tensor<16xi64, #blocked2> 2026-02-21T17:43:13.0990579Z %8 = tt.splat %arg2 : !tt.ptr -> tensor<4x16x!tt.ptr, #blocked> 2026-02-21T17:43:13.0990901Z scf.for %arg3 = %0 to %c65536_i32 step %c304_i32 : i32 { 2026-02-21T17:43:13.0991335Z %9 = arith.divsi %arg3, %c2048_i32 : i32 2026-02-21T17:43:13.0991554Z %10 = arith.muli %9, %c2_i32 : i32 2026-02-21T17:43:13.0991768Z %11 = arith.subi %c64_i32, %10 : i32 2026-02-21T17:43:13.0991972Z %12 = arith.minsi %11, %c2_i32 : i32 2026-02-21T17:43:13.0992190Z %13 = arith.remsi %arg3, %c2048_i32 : i32 2026-02-21T17:43:13.0992402Z %14 = arith.remsi %13, %12 : i32 2026-02-21T17:43:13.0992597Z %15 = arith.addi %10, %14 : i32 2026-02-21T17:43:13.0992790Z %16 = arith.divsi %13, %12 : i32 2026-02-21T17:43:13.0992995Z %17 = arith.muli %15, %c16_i32 : i32 2026-02-21T17:43:13.0993225Z %18 = tt.splat %17 : i32 -> tensor<16xi32, #blocked2> 2026-02-21T17:43:13.0993490Z %19 = arith.addi %18, %1 : tensor<16xi32, #blocked2> 2026-02-21T17:43:13.0993693Z %20 = arith.muli %16, %c4_i32 : i32 2026-02-21T17:43:13.0993861Z %21 = tt.splat %20 : i32 -> tensor<4xi32, #blocked2> 2026-02-21T17:43:13.0994067Z %22 = arith.addi %21, %2 : tensor<4xi32, #blocked2> 2026-02-21T17:43:13.0994371Z %23 = ttg.convert_layout %22 : tensor<4xi32, #blocked2> -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:13.0994814Z %24 = tt.expand_dims %23 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi32, #blocked4> 2026-02-21T17:43:13.0995188Z %25 = ttg.convert_layout %24 : tensor<4x1xi32, #blocked4> -> tensor<4x1xi32, #blocked1> 2026-02-21T17:43:13.0995449Z %26 = arith.muli %25, %cst_4 : 
tensor<4x1xi32, #blocked1> 2026-02-21T17:43:13.0995701Z %27 = tt.broadcast %26 : tensor<4x1xi32, #blocked1> -> tensor<4x256xi32, #blocked1> 2026-02-21T17:43:13.0996008Z %28 = ttg.convert_layout %27 : tensor<4x256xi32, #blocked1> -> tensor<4x256xi32, #blocked3> 2026-02-21T17:43:13.0996259Z %29 = arith.extsi %17 : i32 to i64 2026-02-21T17:43:13.0996431Z %30 = tt.splat %29 : i64 -> tensor<16xi64, #blocked2> 2026-02-21T17:43:13.0996631Z %31 = arith.addi %30, %7 : tensor<16xi64, #blocked2> 2026-02-21T17:43:13.0996932Z %32 = ttg.convert_layout %31 : tensor<16xi64, #blocked2> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:43:13.0997384Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi64, #blocked5> 2026-02-21T17:43:13.0997764Z %34 = ttg.convert_layout %33 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #blocked> 2026-02-21T17:43:13.0998025Z %35 = arith.muli %34, %cst_3 : tensor<1x16xi64, #blocked> 2026-02-21T17:43:13.0998273Z %36 = tt.broadcast %35 : tensor<1x16xi64, #blocked> -> tensor<256x16xi64, #blocked> 2026-02-21T17:43:13.0998543Z %37 = arith.cmpi sge, %34, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T17:43:13.0998792Z %38 = arith.cmpi slt, %34, %cst_3 : tensor<1x16xi64, #blocked> 2026-02-21T17:43:13.0999002Z %39 = arith.andi %37, %38 : tensor<1x16xi1, #blocked> 2026-02-21T17:43:13.0999230Z %40 = tt.broadcast %39 : tensor<1x16xi1, #blocked> -> tensor<256x16xi1, #blocked> 2026-02-21T17:43:13.0999590Z %41 = scf.for %arg4 = %c0_i32 to %c1024_i32 step %c256_i32 iter_args(%arg5 = %cst_5) -> (tensor<4x16xf32, #blocked>) : i32 { 2026-02-21T17:43:13.0999907Z %55 = tt.splat %arg4 : i32 -> tensor<256xi32, #blocked2> 2026-02-21T17:43:13.1000119Z %56 = arith.addi %55, %3 : tensor<256xi32, #blocked2> 2026-02-21T17:43:13.1000432Z %57 = ttg.convert_layout %56 : tensor<256xi32, #blocked2> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:43:13.1000863Z %58 
= tt.expand_dims %57 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T17:43:13.1001259Z %59 = ttg.convert_layout %58 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T17:43:13.1001570Z %60 = tt.broadcast %59 : tensor<1x256xi32, #blocked3> -> tensor<4x256xi32, #blocked3> 2026-02-21T17:43:13.1001869Z %61 = arith.addi %28, %60 : tensor<4x256xi32, #blocked3> 2026-02-21T17:43:13.1002130Z %62 = tt.addptr %4, %61 : tensor<4x256x!tt.ptr, #blocked3>, tensor<4x256xi32, #blocked3> 2026-02-21T17:43:13.1002393Z %63 = tt.load %62 : tensor<4x256x!tt.ptr, #blocked3> 2026-02-21T17:43:13.1002584Z %64 = arith.extsi %arg4 : i32 to i64 2026-02-21T17:43:13.1002759Z %65 = tt.splat %64 : i64 -> tensor<256xi64, #blocked2> 2026-02-21T17:43:13.1002961Z %66 = arith.addi %65, %6 : tensor<256xi64, #blocked2> 2026-02-21T17:43:13.1003303Z %67 = ttg.convert_layout %66 : tensor<256xi64, #blocked2> -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:13.1003650Z %68 = tt.expand_dims %67 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<256x1xi64, #blocked4> 2026-02-21T17:43:13.1003962Z %69 = ttg.convert_layout %68 : tensor<256x1xi64, #blocked4> -> tensor<256x1xi64, #blocked1> 2026-02-21T17:43:13.1004214Z %70 = tt.broadcast %69 : tensor<256x1xi64, #blocked1> -> tensor<256x16xi64, #blocked1> 2026-02-21T17:43:13.1004469Z %71 = ttg.convert_layout %70 : tensor<256x16xi64, #blocked1> -> tensor<256x16xi64, #blocked> 2026-02-21T17:43:13.1004688Z %72 = arith.addi %71, %36 : tensor<256x16xi64, #blocked> 2026-02-21T17:43:13.1004894Z %73 = tt.addptr %5, %72 : tensor<256x16x!tt.ptr, #blocked>, tensor<256x16xi64, #blocked> 2026-02-21T17:43:13.1005117Z %74 = arith.cmpi sge, %69, %cst_2 : tensor<256x1xi64, #blocked1> 2026-02-21T17:43:13.1005301Z %75 = arith.cmpi slt, %69, %cst_1 : tensor<256x1xi64, #blocked1> 2026-02-21T17:43:13.1005478Z %76 = arith.andi 
%74, %75 : tensor<256x1xi1, #blocked1> 2026-02-21T17:43:13.1005677Z %77 = tt.broadcast %76 : tensor<256x1xi1, #blocked1> -> tensor<256x16xi1, #blocked1> 2026-02-21T17:43:13.1005926Z %78 = ttg.convert_layout %77 : tensor<256x16xi1, #blocked1> -> tensor<256x16xi1, #blocked> 2026-02-21T17:43:13.1006144Z %79 = arith.andi %78, %40 : tensor<256x16xi1, #blocked> 2026-02-21T17:43:13.1006337Z %80 = tt.load %73, %79, %cst : tensor<256x16x!tt.ptr, #blocked> 2026-02-21T17:43:13.1006616Z %81 = ttg.convert_layout %63 : tensor<4x256xf16, #blocked3> -> tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T17:43:13.1006976Z %82 = ttg.convert_layout %80 : tensor<256x16xf16, #blocked> -> tensor<256x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T17:43:13.1007285Z %83 = ttg.convert_layout %arg5 : tensor<4x16xf32, #blocked> -> tensor<4x16xf32, #blocked> 2026-02-21T17:43:13.1007702Z %84 = tt.dot %81, %82, %83, inputPrecision = tf32 : tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<256x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<4x16xf32, #blocked> 2026-02-21T17:43:13.1008073Z scf.yield %84 : tensor<4x16xf32, #blocked> 2026-02-21T17:43:13.1008213Z } {tt.disallow_acc_multi_buffer} 2026-02-21T17:43:13.1008388Z %42 = arith.truncf %41 : tensor<4x16xf32, #blocked> to tensor<4x16xf16, #blocked> 2026-02-21T17:43:13.1008670Z %43 = ttg.convert_layout %22 : tensor<4xi32, #blocked2> -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:43:13.1009008Z %44 = tt.expand_dims %43 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi32, #blocked4> 2026-02-21T17:43:13.1009300Z %45 = ttg.convert_layout %44 : tensor<4x1xi32, #blocked4> -> tensor<4x1xi32, #blocked1> 2026-02-21T17:43:13.1009508Z %46 = arith.muli %45, %cst_4 : tensor<4x1xi32, #blocked1> 2026-02-21T17:43:13.1009778Z %47 = ttg.convert_layout %19 : tensor<16xi32, #blocked2> -> tensor<16xi32, #ttg.slice<{dim = 0, 
parent = #blocked5}>> 2026-02-21T17:43:13.1010123Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi32, #blocked5> 2026-02-21T17:43:13.1010453Z %49 = ttg.convert_layout %48 : tensor<1x16xi32, #blocked5> -> tensor<1x16xi32, #blocked> 2026-02-21T17:43:13.1010696Z %50 = tt.broadcast %46 : tensor<4x1xi32, #blocked1> -> tensor<4x16xi32, #blocked1> 2026-02-21T17:43:13.1010930Z %51 = ttg.convert_layout %50 : tensor<4x16xi32, #blocked1> -> tensor<4x16xi32, #blocked> 2026-02-21T17:43:13.1011167Z %52 = tt.broadcast %49 : tensor<1x16xi32, #blocked> -> tensor<4x16xi32, #blocked> 2026-02-21T17:43:13.1011371Z %53 = arith.addi %51, %52 : tensor<4x16xi32, #blocked> 2026-02-21T17:43:13.1011571Z %54 = tt.addptr %8, %53 : tensor<4x16x!tt.ptr, #blocked>, tensor<4x16xi32, #blocked> 2026-02-21T17:43:13.1011780Z tt.store %54, %42 : tensor<4x16x!tt.ptr, #blocked> 2026-02-21T17:43:13.1011935Z } {tt.disallow_acc_multi_buffer, tt.flatten} 2026-02-21T17:43:13.1012070Z tt.return 2026-02-21T17:43:13.1012156Z } 2026-02-21T17:43:13.1012245Z } 2026-02-21T17:43:13.1012293Z 2026-02-21T17:43:13.1012332Z {-# 2026-02-21T17:43:13.1012420Z external_resources: { 2026-02-21T17:43:13.1012529Z mlir_reproducer: { 2026-02-21T17:43:13.1014823Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 
local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T17:43:13.1017146Z disable_threading: false, 2026-02-21T17:43:13.1017258Z verify_each: true 2026-02-21T17:43:13.1017356Z } 2026-02-21T17:43:13.1017431Z } 2026-02-21T17:43:13.1017509Z #-} 2026-02-21T17:43:13.1017795Z /tmp/torchinductor_root/z4/cz4xup7xtzpgym5qm7pv2k52mtwvhtef3vnmh3gye4rsqjdc5q6z.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T17:43:13.1018511Z /tmp/torchinductor_root/z4/cz4xup7xtzpgym5qm7pv2k52mtwvhtef3vnmh3gye4rsqjdc5q6z.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T17:43:13.1019065Z [11s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T17:43:13.1019855Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 256], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T17:43:13.1020580Z Error: RuntimeError: PassManager::run failed 2026-02-21T17:43:13.1020755Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T17:43:14.8065855Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 14.6 configs/s 2026-02-21T17:43:14.8073018Z [13s] Adaptive compile timeout: 30s (90% percentile=4.8s, bounds=[30.0s, 60s]) 2026-02-21T17:43:14.9349291Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 8343.1 configs/s 2026-02-21T17:43:15.1956327Z [13s] Initial random population of 100, 5 starting points: 2026-02-21T17:43:15.1956809Z error=10 2026-02-21T17:43:15.1957029Z ok=90 2026-02-21T17:43:15.1957228Z min=0.0382 2026-02-21T17:43:15.1957439Z mid=0.7065 2026-02-21T17:43:15.1957638Z max=29.6642 2026-02-21T17:43:15.1957875Z best={'block_sizes': [64, 64, 256], 2026-02-21T17:43:15.1958243Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T17:43:15.1958608Z 'l2_groupings': [64], 2026-02-21T17:43:15.1958884Z 'load_eviction_policies': ['', ''], 2026-02-21T17:43:15.1959231Z 'loop_orders': [[1, 0]], 2026-02-21T17:43:15.1959512Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:43:15.1959803Z 'num_sm_multiplier': 8, 2026-02-21T17:43:15.1960069Z 'num_stages': 2, 2026-02-21T17:43:15.1960314Z 'num_warps': 8, 2026-02-21T17:43:15.1960577Z 'pid_type': 'persistent_blocked', 2026-02-21T17:43:15.1961095Z 'range_flattens': [False, True], 2026-02-21T17:43:15.1961410Z 'range_multi_buffers': [True, True], 
2026-02-21T17:43:15.1961724Z 'range_num_stages': [0, 0], 2026-02-21T17:43:15.1962019Z 'range_unroll_factors': [0, 0], 2026-02-21T17:43:15.1962323Z 'range_warp_specializes': [], 2026-02-21T17:43:15.1962609Z 'waves_per_eu': 4} 2026-02-21T17:43:15.1982840Z [13s] Fitting surrogate: 100 points, 100 targets 2026-02-21T17:43:16.2124571Z [14s] Generation 1 starting: 88 neighbors, 5 active search path(s) 2026-02-21T17:43:20.7667968Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 22.1 configs/s 2026-02-21T17:43:25.4615105Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 16.3 configs/s 2026-02-21T17:43:27.1414007Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 548.9 2026-02-21T17:43:27.1414684Z configs/s 2026-02-21T17:43:27.5015728Z [25s] Generation 1 complete: 2026-02-21T17:43:27.5016098Z error=18 2026-02-21T17:43:27.5016306Z ok=76 2026-02-21T17:43:27.5016512Z min=0.0317 2026-02-21T17:43:27.5016714Z mid=0.0678 2026-02-21T17:43:27.5016916Z max=0.8567 2026-02-21T17:43:27.5017153Z best={'block_sizes': [128, 64, 64], 2026-02-21T17:43:27.5017535Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T17:43:27.5017904Z 'l2_groupings': [32], 2026-02-21T17:43:27.5018177Z 'load_eviction_policies': ['', ''], 2026-02-21T17:43:27.5018487Z 'loop_orders': [[0, 1]], 2026-02-21T17:43:27.5018763Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:43:27.5019033Z 'num_stages': 1, 2026-02-21T17:43:27.5019267Z 'num_warps': 8, 2026-02-21T17:43:27.5019505Z 'pid_type': 'flat', 2026-02-21T17:43:27.5020297Z 'range_flattens': [None, True], 2026-02-21T17:43:27.5020627Z 'range_multi_buffers': [None, False], 2026-02-21T17:43:27.5020942Z 'range_num_stages': [0, 0], 2026-02-21T17:43:27.5021477Z 'range_unroll_factors': [0, 0], 2026-02-21T17:43:27.5021778Z 'range_warp_specializes': [], 2026-02-21T17:43:27.5022058Z 'waves_per_eu': 2} 2026-02-21T17:43:27.5030227Z [25s] Fitting surrogate: 194 points, 194 targets 2026-02-21T17:43:28.4706877Z [26s] Generation 
2 starting: 93 neighbors, 5 active search path(s) 2026-02-21T17:43:35.0113873Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 2.8 configs/s 2026-02-21T17:43:40.8319626Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 17.1 configs/s 2026-02-21T17:43:45.1028891Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 241.6 2026-02-21T17:43:45.1029515Z configs/s 2026-02-21T17:43:45.6112724Z [43s] Generation 2 complete: 2026-02-21T17:43:45.6113082Z error=2 2026-02-21T17:43:45.6113211Z ok=97 2026-02-21T17:43:45.6113340Z min=0.0274 2026-02-21T17:43:45.6113469Z mid=0.0406 2026-02-21T17:43:45.6113591Z max=0.3966 2026-02-21T17:43:45.6113751Z best={'block_sizes': [128, 64, 64], 2026-02-21T17:43:45.6113968Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T17:43:45.6114205Z 'l2_groupings': [64], 2026-02-21T17:43:45.6114377Z 'load_eviction_policies': ['', ''], 2026-02-21T17:43:45.6114555Z 'loop_orders': [[0, 1]], 2026-02-21T17:43:45.6114720Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:43:45.6114890Z 'num_sm_multiplier': 8, 2026-02-21T17:43:45.6115046Z 'num_stages': 2, 2026-02-21T17:43:45.6115195Z 'num_warps': 8, 2026-02-21T17:43:45.6115339Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:43:45.6115515Z 'range_flattens': [False, None], 2026-02-21T17:43:45.6115686Z 'range_multi_buffers': [True, True], 2026-02-21T17:43:45.6115861Z 'range_num_stages': [0, 0], 2026-02-21T17:43:45.6116018Z 'range_unroll_factors': [0, 0], 2026-02-21T17:43:45.6116180Z 'range_warp_specializes': [], 2026-02-21T17:43:45.6116336Z 'waves_per_eu': 4} 2026-02-21T17:43:45.6419360Z [43s] Fitting surrogate: 293 points, 293 targets 2026-02-21T17:43:46.6000382Z [44s] Generation 3 starting: 94 neighbors, 5 active search path(s) 2026-02-21T17:43:51.4041089Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 9.4 configs/s 2026-02-21T17:43:56.4384073Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 19.2 configs/s 
2026-02-21T17:44:01.1504612Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 220.2 2026-02-21T17:44:01.1505235Z configs/s 2026-02-21T17:44:01.7089527Z [59s] Generation 3 complete: 2026-02-21T17:44:01.7089731Z error=13 2026-02-21T17:44:01.7089854Z ok=86 2026-02-21T17:44:01.7089981Z min=0.0274 2026-02-21T17:44:01.7090105Z mid=0.0354 2026-02-21T17:44:01.7090232Z max=0.8929 2026-02-21T17:44:01.7090370Z best={'block_sizes': [128, 64, 64], 2026-02-21T17:44:01.7090610Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T17:44:01.7090828Z 'l2_groupings': [64], 2026-02-21T17:44:01.7090993Z 'load_eviction_policies': ['', ''], 2026-02-21T17:44:01.7091190Z 'loop_orders': [[0, 1]], 2026-02-21T17:44:01.7091360Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:44:01.7091546Z 'num_sm_multiplier': 8, 2026-02-21T17:44:01.7091696Z 'num_stages': 2, 2026-02-21T17:44:01.7091840Z 'num_warps': 8, 2026-02-21T17:44:01.7091985Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:44:01.7092177Z 'range_flattens': [False, None], 2026-02-21T17:44:01.7092357Z 'range_multi_buffers': [True, True], 2026-02-21T17:44:01.7092538Z 'range_num_stages': [0, 0], 2026-02-21T17:44:01.7092705Z 'range_unroll_factors': [0, 0], 2026-02-21T17:44:01.7092897Z 'range_warp_specializes': [], 2026-02-21T17:44:01.7093059Z 'waves_per_eu': 4} 2026-02-21T17:44:01.7983079Z [60s] Fitting surrogate: 392 points, 392 targets 2026-02-21T17:44:02.6645778Z [60s] Generation 4 starting: 79 neighbors, 5 active search path(s) 2026-02-21T17:44:08.1371307Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 80/80 17.8 configs/s 2026-02-21T17:44:13.1971770Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 80/80 15.9 configs/s 2026-02-21T17:44:17.4143556Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 232.9 2026-02-21T17:44:17.4143869Z configs/s 2026-02-21T17:44:18.0326110Z [76s] Generation 4 complete: 2026-02-21T17:44:18.0326322Z error=6 2026-02-21T17:44:18.0326409Z ok=78 
2026-02-21T17:44:18.0326487Z min=0.0260 2026-02-21T17:44:18.0326566Z mid=0.0313 2026-02-21T17:44:18.0326642Z max=0.1538 2026-02-21T17:44:18.0326736Z best={'block_sizes': [128, 128, 64], 2026-02-21T17:44:18.0326887Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T17:44:18.0327024Z 'l2_groupings': [64], 2026-02-21T17:44:18.0327126Z 'load_eviction_policies': ['', ''], 2026-02-21T17:44:18.0327280Z 'loop_orders': [[0, 1]], 2026-02-21T17:44:18.0327386Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:44:18.0327491Z 'num_sm_multiplier': 8, 2026-02-21T17:44:18.0327592Z 'num_stages': 2, 2026-02-21T17:44:18.0327718Z 'num_warps': 16, 2026-02-21T17:44:18.0327816Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:44:18.0327959Z 'range_flattens': [False, None], 2026-02-21T17:44:18.0328071Z 'range_multi_buffers': [True, True], 2026-02-21T17:44:18.0328187Z 'range_num_stages': [0, 0], 2026-02-21T17:44:18.0328296Z 'range_unroll_factors': [0, 0], 2026-02-21T17:44:18.0328405Z 'range_warp_specializes': [], 2026-02-21T17:44:18.0328509Z 'waves_per_eu': 2} 2026-02-21T17:44:18.1770277Z [76s] Fitting surrogate: 476 points, 476 targets 2026-02-21T17:44:19.1090966Z [77s] Generation 5 starting: 83 neighbors, 5 active search path(s) 2026-02-21T17:44:25.6514252Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 8.1 configs/s 2026-02-21T17:44:30.9674777Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 17.8 configs/s 2026-02-21T17:44:34.6460863Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 266.4 2026-02-21T17:44:34.6461367Z configs/s 2026-02-21T17:44:35.2316666Z [93s] Generation 5 complete: 2026-02-21T17:44:35.2317219Z error=9 2026-02-21T17:44:35.2317376Z ok=80 2026-02-21T17:44:35.2317525Z min=0.0243 2026-02-21T17:44:35.2317683Z mid=0.0340 2026-02-21T17:44:35.2317828Z max=0.4563 2026-02-21T17:44:35.2318002Z best={'block_sizes': [128, 128, 128], 2026-02-21T17:44:35.2318290Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 
2026-02-21T17:44:35.2318558Z 'l2_groupings': [32], 2026-02-21T17:44:35.2318764Z 'load_eviction_policies': ['', ''], 2026-02-21T17:44:35.2318992Z 'loop_orders': [[0, 1]], 2026-02-21T17:44:35.2319198Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:44:35.2319393Z 'num_stages': 2, 2026-02-21T17:44:35.2319566Z 'num_warps': 8, 2026-02-21T17:44:35.2319736Z 'pid_type': 'flat', 2026-02-21T17:44:35.2319941Z 'range_flattens': [None, True], 2026-02-21T17:44:35.2320155Z 'range_multi_buffers': [None, None], 2026-02-21T17:44:35.2320373Z 'range_num_stages': [0, 0], 2026-02-21T17:44:35.2320576Z 'range_unroll_factors': [0, 0], 2026-02-21T17:44:35.2320783Z 'range_warp_specializes': [], 2026-02-21T17:44:35.2320986Z 'waves_per_eu': 2} 2026-02-21T17:44:35.3565858Z [93s] Fitting surrogate: 565 points, 565 targets 2026-02-21T17:44:36.1186944Z [94s] Generation 6 starting: 61 neighbors, 4 active search path(s) 2026-02-21T17:44:40.2756590Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61/61 28.3 configs/s 2026-02-21T17:44:43.4119214Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 61/61 19.5 configs/s 2026-02-21T17:44:46.2047874Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 406.7 2026-02-21T17:44:46.2048162Z configs/s 2026-02-21T17:44:46.6749265Z [104s] Generation 6 complete: 2026-02-21T17:44:46.6749573Z error=12 2026-02-21T17:44:46.6749654Z ok=53 2026-02-21T17:44:46.6749732Z min=0.0282 2026-02-21T17:44:46.6749812Z mid=0.0371 2026-02-21T17:44:46.6749889Z max=0.5439 2026-02-21T17:44:46.6750149Z best={'block_sizes': [128, 128, 128], 2026-02-21T17:44:46.6750307Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T17:44:46.6750443Z 'l2_groupings': [64], 2026-02-21T17:44:46.6750551Z 'load_eviction_policies': ['', ''], 2026-02-21T17:44:46.6750664Z 'loop_orders': [[0, 1]], 2026-02-21T17:44:46.6750767Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:44:46.6750865Z 'num_stages': 2, 2026-02-21T17:44:46.6750953Z 'num_warps': 8, 2026-02-21T17:44:46.6751040Z 
'pid_type': 'flat', 2026-02-21T17:44:46.6751140Z 'range_flattens': [None, True], 2026-02-21T17:44:46.6751253Z 'range_multi_buffers': [None, None], 2026-02-21T17:44:46.6751366Z 'range_num_stages': [0, 0], 2026-02-21T17:44:46.6751467Z 'range_unroll_factors': [0, 0], 2026-02-21T17:44:46.6751580Z 'range_warp_specializes': [], 2026-02-21T17:44:46.6751687Z 'waves_per_eu': 2} 2026-02-21T17:44:46.7438079Z [104s] Fitting surrogate: 630 points, 630 targets 2026-02-21T17:44:47.4071655Z [105s] Generation 7 starting: 54 neighbors, 3 active search path(s) 2026-02-21T17:44:51.1794255Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55/55 17.7 configs/s 2026-02-21T17:44:53.8966872Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 55/55 21.5 configs/s 2026-02-21T17:44:56.0695679Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 441.0 2026-02-21T17:44:56.0696309Z configs/s 2026-02-21T17:44:56.5392069Z [114s] Generation 7 complete: 2026-02-21T17:44:56.5392421Z error=14 2026-02-21T17:44:56.5392589Z ok=43 2026-02-21T17:44:56.5392739Z min=0.0280 2026-02-21T17:44:56.5392890Z mid=0.0342 2026-02-21T17:44:56.5393033Z max=0.0562 2026-02-21T17:44:56.5393203Z best={'block_sizes': [128, 128, 128], 2026-02-21T17:44:56.5393530Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T17:44:56.5393801Z 'l2_groupings': [64], 2026-02-21T17:44:56.5394005Z 'load_eviction_policies': ['', ''], 2026-02-21T17:44:56.5394252Z 'loop_orders': [[0, 1]], 2026-02-21T17:44:56.5394454Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:44:56.5395086Z 'num_stages': 2, 2026-02-21T17:44:56.5395257Z 'num_warps': 8, 2026-02-21T17:44:56.5395428Z 'pid_type': 'flat', 2026-02-21T17:44:56.5395624Z 'range_flattens': [None, True], 2026-02-21T17:44:56.5395843Z 'range_multi_buffers': [None, None], 2026-02-21T17:44:56.5396070Z 'range_num_stages': [0, 0], 2026-02-21T17:44:56.5396271Z 'range_unroll_factors': [0, 0], 2026-02-21T17:44:56.5396488Z 'range_warp_specializes': [], 
2026-02-21T17:44:56.5396689Z 'waves_per_eu': 2} 2026-02-21T17:44:56.6034862Z [114s] Fitting surrogate: 687 points, 687 targets 2026-02-21T17:44:57.0502988Z [115s] Generation 8 starting: 32 neighbors, 2 active search path(s) 2026-02-21T17:44:59.9923360Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 22.9 configs/s 2026-02-21T17:45:01.9224220Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 33/33 18.0 configs/s 2026-02-21T17:45:03.4337491Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 626.8 2026-02-21T17:45:03.4338104Z configs/s 2026-02-21T17:45:03.8430104Z [122s] Generation 8 complete: 2026-02-21T17:45:03.8430310Z error=3 2026-02-21T17:45:03.8430387Z ok=32 2026-02-21T17:45:03.8430464Z min=0.0243 2026-02-21T17:45:03.8430542Z mid=0.0313 2026-02-21T17:45:03.8430625Z max=0.5410 2026-02-21T17:45:03.8430710Z best={'block_sizes': [128, 128, 128], 2026-02-21T17:45:03.8430855Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T17:45:03.8430991Z 'l2_groupings': [64], 2026-02-21T17:45:03.8431095Z 'load_eviction_policies': ['', ''], 2026-02-21T17:45:03.8431210Z 'loop_orders': [[0, 1]], 2026-02-21T17:45:03.8431317Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:45:03.8431882Z 'num_stages': 2, 2026-02-21T17:45:03.8431968Z 'num_warps': 8, 2026-02-21T17:45:03.8432055Z 'pid_type': 'flat', 2026-02-21T17:45:03.8432151Z 'range_flattens': [None, True], 2026-02-21T17:45:03.8432449Z 'range_multi_buffers': [None, None], 2026-02-21T17:45:03.8432563Z 'range_num_stages': [0, 0], 2026-02-21T17:45:03.8432676Z 'range_unroll_factors': [0, 0], 2026-02-21T17:45:03.8432784Z 'range_warp_specializes': [], 2026-02-21T17:45:03.8432888Z 'waves_per_eu': 2} 2026-02-21T17:45:03.8839198Z [122s] Fitting surrogate: 722 points, 722 targets 2026-02-21T17:45:04.3948810Z [122s] Generation 9 starting: 38 neighbors, 2 active search path(s) 2026-02-21T17:45:07.3070694Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 20.2 configs/s 
2026-02-21T17:45:09.6401562Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 38/38 16.9 configs/s 2026-02-21T17:45:11.8823258Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 429.1 2026-02-21T17:45:11.8823930Z configs/s 2026-02-21T17:45:12.3423279Z [130s] Generation 9 complete: 2026-02-21T17:45:12.3423653Z error=2 2026-02-21T17:45:12.3423870Z ok=39 2026-02-21T17:45:12.3424118Z min=0.0261 2026-02-21T17:45:12.3424331Z mid=0.0317 2026-02-21T17:45:12.3424896Z max=0.3320 2026-02-21T17:45:12.3425124Z best={'block_sizes': [128, 128, 128], 2026-02-21T17:45:12.3425503Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T17:45:12.3425875Z 'l2_groupings': [64], 2026-02-21T17:45:12.3426161Z 'load_eviction_policies': ['', ''], 2026-02-21T17:45:12.3426472Z 'loop_orders': [[0, 1]], 2026-02-21T17:45:12.3426756Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:45:12.3427035Z 'num_sm_multiplier': 8, 2026-02-21T17:45:12.3427299Z 'num_stages': 2, 2026-02-21T17:45:12.3427525Z 'num_warps': 16, 2026-02-21T17:45:12.3427783Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:45:12.3428068Z 'range_flattens': [False, True], 2026-02-21T17:45:12.3428391Z 'range_multi_buffers': [True, None], 2026-02-21T17:45:12.3428604Z 'range_num_stages': [0, 0], 2026-02-21T17:45:12.3428797Z 'range_unroll_factors': [0, 0], 2026-02-21T17:45:12.3428996Z 'range_warp_specializes': [], 2026-02-21T17:45:12.3429179Z 'waves_per_eu': 2} 2026-02-21T17:45:12.4085169Z [130s] Fitting surrogate: 763 points, 763 targets 2026-02-21T17:45:12.9134247Z [131s] Generation 10 starting: 37 neighbors, 2 active search path(s) 2026-02-21T17:45:17.7049747Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 5.7 configs/s 2026-02-21T17:45:19.5417523Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 37/37 22.0 configs/s 2026-02-21T17:45:21.0559354Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 622.6 2026-02-21T17:45:21.0559620Z configs/s 
2026-02-21T17:45:21.4371061Z [139s] Generation 10 complete: 2026-02-21T17:45:21.4371291Z error=10 2026-02-21T17:45:21.4371373Z ok=29 2026-02-21T17:45:21.4371789Z min=0.0262 2026-02-21T17:45:21.4371867Z mid=0.0302 2026-02-21T17:45:21.4371945Z max=0.0470 2026-02-21T17:45:21.4372033Z best={'block_sizes': [128, 128, 128], 2026-02-21T17:45:21.4381166Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T17:45:21.4381314Z 'l2_groupings': [64], 2026-02-21T17:45:21.4381437Z 'load_eviction_policies': ['', ''], 2026-02-21T17:45:21.4381553Z 'loop_orders': [[0, 1]], 2026-02-21T17:45:21.4381657Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:45:21.4381761Z 'num_sm_multiplier': 16, 2026-02-21T17:45:21.4381860Z 'num_stages': 2, 2026-02-21T17:45:21.4381946Z 'num_warps': 16, 2026-02-21T17:45:21.4382047Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:45:21.4382174Z 'range_flattens': [False, True], 2026-02-21T17:45:21.4382286Z 'range_multi_buffers': [True, None], 2026-02-21T17:45:21.4382404Z 'range_num_stages': [0, 0], 2026-02-21T17:45:21.4382507Z 'range_unroll_factors': [0, 0], 2026-02-21T17:45:21.4382617Z 'range_warp_specializes': [], 2026-02-21T17:45:21.4382721Z 'waves_per_eu': 2} 2026-02-21T17:45:21.4770860Z [139s] Fitting surrogate: 802 points, 802 targets 2026-02-21T17:45:21.9890029Z [140s] Generation 11 starting: 38 neighbors, 2 active search path(s) 2026-02-21T17:45:25.8425288Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 10.6 configs/s 2026-02-21T17:45:27.8257473Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 38/38 20.2 configs/s 2026-02-21T17:45:29.8691939Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 471.4 2026-02-21T17:45:29.8692576Z configs/s 2026-02-21T17:45:30.2897818Z [148s] Generation 11 complete: 2026-02-21T17:45:30.2898165Z error=8 2026-02-21T17:45:30.2898391Z ok=32 2026-02-21T17:45:30.2898602Z min=0.0235 2026-02-21T17:45:30.2898814Z mid=0.0246 2026-02-21T17:45:30.2899018Z max=0.0360 
2026-02-21T17:45:30.2899248Z best={'block_sizes': [128, 128, 128], 2026-02-21T17:45:30.2899626Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T17:45:30.2900024Z 'l2_groupings': [64], 2026-02-21T17:45:30.2900303Z 'load_eviction_policies': ['', ''], 2026-02-21T17:45:30.2900610Z 'loop_orders': [[0, 1]], 2026-02-21T17:45:30.2900907Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:45:30.2901193Z 'num_sm_multiplier': 16, 2026-02-21T17:45:30.2901733Z 'num_stages': 2, 2026-02-21T17:45:30.2901967Z 'num_warps': 8, 2026-02-21T17:45:30.2902221Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:45:30.2902551Z 'range_flattens': [False, True], 2026-02-21T17:45:30.2902860Z 'range_multi_buffers': [False, None], 2026-02-21T17:45:30.2903169Z 'range_num_stages': [0, 0], 2026-02-21T17:45:30.2903452Z 'range_unroll_factors': [0, 0], 2026-02-21T17:45:30.2903747Z 'range_warp_specializes': [], 2026-02-21T17:45:30.2904027Z 'waves_per_eu': 2} 2026-02-21T17:45:30.3540602Z [148s] Fitting surrogate: 842 points, 842 targets 2026-02-21T17:45:30.6826905Z [148s] Generation 12 starting: 19 neighbors, 1 active search path(s) 2026-02-21T17:45:33.0076549Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 8.3 configs/s 2026-02-21T17:45:33.9393581Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 24.7 configs/s 2026-02-21T17:45:34.7809769Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1071.3 2026-02-21T17:45:34.7810421Z configs/s 2026-02-21T17:45:35.1077162Z [153s] Generation 12 complete: 2026-02-21T17:45:35.1077685Z error=6 2026-02-21T17:45:35.1077939Z ok=14 2026-02-21T17:45:35.1078146Z min=0.0260 2026-02-21T17:45:35.1078353Z mid=0.0278 2026-02-21T17:45:35.1078559Z max=0.0380 2026-02-21T17:45:35.1078786Z best={'block_sizes': [128, 128, 128], 2026-02-21T17:45:35.1079168Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T17:45:35.1079534Z 'l2_groupings': [64], 2026-02-21T17:45:35.1079822Z 'load_eviction_policies': ['', ''], 
2026-02-21T17:45:35.1080127Z 'loop_orders': [[0, 1]], 2026-02-21T17:45:35.1080702Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:45:35.1080995Z 'num_sm_multiplier': 16, 2026-02-21T17:45:35.1081257Z 'num_stages': 2, 2026-02-21T17:45:35.1081483Z 'num_warps': 16, 2026-02-21T17:45:35.1081983Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:45:35.1082317Z 'range_flattens': [False, None], 2026-02-21T17:45:35.1082639Z 'range_multi_buffers': [False, None], 2026-02-21T17:45:35.1082949Z 'range_num_stages': [0, 0], 2026-02-21T17:45:35.1083224Z 'range_unroll_factors': [0, 0], 2026-02-21T17:45:35.1083524Z 'range_warp_specializes': [], 2026-02-21T17:45:35.1083779Z 'waves_per_eu': 2} 2026-02-21T17:45:35.1302976Z [153s] Fitting surrogate: 862 points, 862 targets 2026-02-21T17:45:35.9070260Z [154s] Generation 13 starting: 17 neighbors, 1 active search path(s) 2026-02-21T17:45:37.8485898Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 19.8 configs/s 2026-02-21T17:45:38.7782784Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 22.1 configs/s 2026-02-21T17:45:39.4351731Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1364.7 2026-02-21T17:45:39.4352345Z configs/s 2026-02-21T17:45:39.7482964Z [157s] Generation 13 complete: 2026-02-21T17:45:39.7483480Z error=4 2026-02-21T17:45:39.7483780Z ok=14 2026-02-21T17:45:39.7483984Z min=0.0262 2026-02-21T17:45:39.7484196Z mid=0.0279 2026-02-21T17:45:39.7484393Z max=0.0421 2026-02-21T17:45:39.7484619Z best={'block_sizes': [128, 128, 128], 2026-02-21T17:45:39.7484991Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T17:45:39.7485349Z 'l2_groupings': [64], 2026-02-21T17:45:39.7485630Z 'load_eviction_policies': ['', ''], 2026-02-21T17:45:39.7485936Z 'loop_orders': [[0, 1]], 2026-02-21T17:45:39.7486208Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:45:39.7486500Z 'num_sm_multiplier': 16, 2026-02-21T17:45:39.7486764Z 'num_stages': 2, 2026-02-21T17:45:39.7486984Z 'num_warps': 16, 
2026-02-21T17:45:39.7487249Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:45:39.7487567Z 'range_flattens': [False, None], 2026-02-21T17:45:39.7487868Z 'range_multi_buffers': [False, None], 2026-02-21T17:45:39.7488171Z 'range_num_stages': [0, 0], 2026-02-21T17:45:39.7488464Z 'range_unroll_factors': [0, 0], 2026-02-21T17:45:39.7489022Z 'range_warp_specializes': [], 2026-02-21T17:45:39.7489298Z 'waves_per_eu': 2} 2026-02-21T17:45:39.7670403Z [157s] Fitting surrogate: 880 points, 880 targets 2026-02-21T17:45:39.9051811Z [158s] Autotuning complete in 158.1s after searching 830 configs. 2026-02-21T17:45:39.9052545Z One can hardcode the best config and skip autotuning with: 2026-02-21T17:45:39.9054617Z @helion.kernel(config=helion.Config(block_sizes=[128, 128, 128], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=2, num_warps=16, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T17:45:39.9056443Z 2026-02-21T17:45:39.9056928Z [158s] Code of selected kernel: /tmp/torchinductor_root/3t/c3tr6vr5owlau6zv6gj7scr7ea6rsywdgwqp3ntu6gt3vhtegnrt.py 2026-02-21T17:46:01.7241679Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T17:46:01.7241888Z (M, N, K) 2026-02-21T17:46:01.7241985Z ------------------ 2026-02-21T17:46:01.7242079Z (4096, 1024, 1024) 2026-02-21T17:46:01.7242132Z 2026-02-21T17:46:01.7254790Z 12%|█▎ | 1/8 [05:02<35:17, 302.44s/it]WARNING:tritonbench.utils.triton_op:Running input ID 2: 2026-02-21T17:46:01.7254989Z (M, N, K) 2026-02-21T17:46:01.7255076Z ------------------ 2026-02-21T17:46:01.7255162Z (4096, 2048, 2048) 2026-02-21T17:46:01.7257004Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 
2026-02-21T17:46:36.5091448Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T17:47:12.3457893Z Autotune Choices Stats: 2026-02-21T17:47:12.3460374Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_62", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.06435800343751907, "best_triton_pos": 0} 2026-02-21T17:47:12.3468370Z AUTOTUNE mm(4096x2048, 2048x2048) 2026-02-21T17:47:12.3468742Z strides: [2048, 1], [1, 2048] 2026-02-21T17:47:12.3469004Z dtypes: torch.float16, torch.float16 2026-02-21T17:47:12.3469958Z triton_mm_62 0.0644 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T17:47:12.3472106Z triton_mm_70 0.0698 ms 92.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:47:12.3474380Z triton_mm_66 0.0702 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:47:12.3476092Z triton_mm_63 0.0727 ms 88.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:47:12.3477204Z triton_mm_64 0.0764 ms 84.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:47:12.3478212Z triton_mm_59 0.0801 ms 80.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T17:47:12.3479227Z triton_mm_57 0.0805 ms 80.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T17:47:12.3480363Z triton_mm_52 0.0819 ms 78.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:47:12.3481370Z triton_mm_68 0.0838 ms 76.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T17:47:12.3482393Z triton_mm_61 0.0842 ms 76.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=1, num_stages=2, num_warps=8 2026-02-21T17:47:12.3483168Z SingleProcess AUTOTUNE benchmarking takes 0.7336 seconds and 0.2876 seconds precompiling for 36 choices 2026-02-21T17:47:12.5042685Z INFO:tritonbench.utils.triton_op:Took 1374.97ms to get benchmark function for pt2_triton_matmul 2026-02-21T17:47:45.2681272Z WARNING:__main__:Input tensor metadata: 2026-02-21T17:47:45.2681695Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T17:47:45.2681964Z 'dtype': 'torch.float16', 2026-02-21T17:47:45.2682238Z 'shape': (4096, 2048), 2026-02-21T17:47:45.2682497Z 'stride': (2048, 1)}, 2026-02-21T17:47:45.2682733Z { 'device': 'cuda:0', 2026-02-21T17:47:45.2682977Z 'dtype': 
'torch.float16', 2026-02-21T17:47:45.2683225Z 'shape': (2048, 2048), 2026-02-21T17:47:45.2684063Z 'stride': (1, 2048)}, 2026-02-21T17:47:45.2684298Z None), 2026-02-21T17:47:45.2684482Z 'kwargs': {}} 2026-02-21T17:47:45.2703932Z INFO:tritonbench.utils.triton_op:Took 2.80ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T17:47:45.3456343Z [0s] Autotune random seed: 2169133161 2026-02-21T17:47:45.4099711Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T17:47:56.8989141Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.4 configs/s 2026-02-21T17:48:02.1767929Z Initial population exploring neighbors 27% ━━━╸ 27/100 5.1 configs/s 2026-02-21T17:48:02.1770334Z WARNING:tritonbench.utils.triton_op:Completed input ID 2: 2026-02-21T17:48:02.1770759Z (M, N, K) 2026-02-21T17:48:02.1770980Z ------------------ 2026-02-21T17:48:02.1771222Z (4096, 2048, 2048) 2026-02-21T17:48:02.1771381Z 2026-02-21T17:48:02.1778516Z 25%|██▌ | 2/8 [07:02<19:32, 195.39s/it]WARNING:tritonbench.utils.triton_op:Running input ID 3: 2026-02-21T17:48:02.1779062Z (M, N, K) 2026-02-21T17:48:02.1779277Z ------------------ 2026-02-21T17:48:02.1779495Z (2048, 4096, 2048) 2026-02-21T17:48:02.1780226Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T17:48:43.5257698Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T17:49:22.8964836Z Autotune Choices Stats: 2026-02-21T17:49:22.8966491Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_98", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.06695900112390518, "best_triton_pos": 0} 2026-02-21T17:49:22.8972867Z AUTOTUNE mm(2048x2048, 
2048x4096) 2026-02-21T17:49:22.8973095Z strides: [2048, 1], [1, 2048] 2026-02-21T17:49:22.8973255Z dtypes: torch.float16, torch.float16 2026-02-21T17:49:22.8975938Z triton_mm_98 0.0670 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T17:49:22.8977047Z triton_mm_106 0.0706 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:49:22.8977645Z triton_mm_102 0.0706 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:49:22.8978238Z triton_mm_99 0.0790 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:49:22.8978837Z triton_mm_100 0.0791 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:49:22.8979440Z triton_mm_93 0.0866 ms 77.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T17:49:22.8980043Z triton_mm_95 0.0891 ms 75.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T17:49:22.8980646Z triton_mm_88 
0.0916 ms 73.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:49:22.8981489Z triton_mm_107 0.0948 ms 70.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:49:22.8982086Z triton_mm_97 0.0953 ms 70.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=1, num_stages=2, num_warps=8 2026-02-21T17:49:22.8982538Z SingleProcess AUTOTUNE benchmarking takes 0.7627 seconds and 0.2447 seconds precompiling for 36 choices 2026-02-21T17:49:23.0488041Z INFO:tritonbench.utils.triton_op:Took 1348.26ms to get benchmark function for pt2_triton_matmul 2026-02-21T17:49:56.1356345Z WARNING:__main__:Input tensor metadata: 2026-02-21T17:49:56.1356853Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T17:49:56.1357215Z 'dtype': 'torch.float16', 2026-02-21T17:49:56.1357542Z 'shape': (2048, 2048), 2026-02-21T17:49:56.1357842Z 'stride': (2048, 1)}, 2026-02-21T17:49:56.1358161Z { 'device': 'cuda:0', 2026-02-21T17:49:56.1358460Z 'dtype': 'torch.float16', 2026-02-21T17:49:56.1358771Z 'shape': (2048, 4096), 2026-02-21T17:49:56.1359057Z 'stride': (1, 2048)}, 2026-02-21T17:49:56.1359343Z None), 2026-02-21T17:49:56.1359575Z 'kwargs': {}} 2026-02-21T17:49:56.1362875Z INFO:tritonbench.utils.triton_op:Took 1.20ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T17:49:56.2097998Z [0s] Autotune random seed: 2169133161 2026-02-21T17:49:56.2740863Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T17:50:07.6340522Z Initial population precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━ 100/100 2.5 configs/s 2026-02-21T17:50:18.2624640Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T17:50:18.2625499Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T17:50:18.2625833Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T17:50:18.2626120Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T17:50:18.2626413Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T17:50:18.2626709Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T17:50:18.2627046Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T17:50:18.2627504Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T17:50:18.2627852Z %c8192_i32 = arith.constant 8192 : i32 2026-02-21T17:50:18.2627988Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T17:50:18.2628108Z %c0_i32 = arith.constant 0 : i32 2026-02-21T17:50:18.2628324Z %cst = arith.constant dense<4096> : tensor<8x1xi32, #blocked> 2026-02-21T17:50:18.2628506Z %cst_0 = arith.constant dense<2048> : tensor<1x32xi32, #blocked1> 2026-02-21T17:50:18.2628682Z %cst_1 = arith.constant dense<2048> : tensor<8x1xi32, #blocked> 2026-02-21T17:50:18.2628865Z %cst_2 = arith.constant dense<0.000000e+00> : 
tensor<8x32xf32, #blocked1> 2026-02-21T17:50:18.2629110Z %c8_i32 = arith.constant 8 : i32 2026-02-21T17:50:18.2629229Z %c32_i32 = arith.constant 32 : i32 2026-02-21T17:50:18.2629347Z %c128_i32 = arith.constant 128 : i32 2026-02-21T17:50:18.2629602Z %0 = tt.get_program_id x : i32 2026-02-21T17:50:18.2629719Z %1 = arith.divsi %0, %c8192_i32 : i32 2026-02-21T17:50:18.2629837Z %2 = arith.muli %1, %c32_i32 : i32 2026-02-21T17:50:18.2629953Z %3 = arith.subi %c128_i32, %2 : i32 2026-02-21T17:50:18.2630065Z %4 = arith.minsi %3, %c32_i32 : i32 2026-02-21T17:50:18.2630177Z %5 = arith.remsi %0, %c8192_i32 : i32 2026-02-21T17:50:18.2630291Z %6 = arith.remsi %5, %4 : i32 2026-02-21T17:50:18.2630402Z %7 = arith.addi %2, %6 : i32 2026-02-21T17:50:18.2630508Z %8 = arith.divsi %5, %4 : i32 2026-02-21T17:50:18.2630613Z %9 = arith.muli %7, %c32_i32 : i32 2026-02-21T17:50:18.2630770Z %10 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked2> 2026-02-21T17:50:18.2630957Z %11 = tt.splat %9 : i32 -> tensor<32xi32, #blocked2> 2026-02-21T17:50:18.2631117Z %12 = arith.addi %11, %10 : tensor<32xi32, #blocked2> 2026-02-21T17:50:18.2631247Z %13 = arith.muli %8, %c8_i32 : i32 2026-02-21T17:50:18.2631404Z %14 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked2> 2026-02-21T17:50:18.2631582Z %15 = tt.splat %13 : i32 -> tensor<8xi32, #blocked2> 2026-02-21T17:50:18.2631723Z %16 = arith.addi %15, %14 : tensor<8xi32, #blocked2> 2026-02-21T17:50:18.2631954Z %17 = ttg.convert_layout %16 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T17:50:18.2632278Z %18 = tt.expand_dims %17 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<8x1xi32, #blocked3> 2026-02-21T17:50:18.2632561Z %19 = ttg.convert_layout %18 : tensor<8x1xi32, #blocked3> -> tensor<8x1xi32, #blocked> 2026-02-21T17:50:18.2632759Z %20 = arith.muli %19, %cst_1 : tensor<8x1xi32, #blocked> 2026-02-21T17:50:18.2632939Z 
%21 = tt.broadcast %20 : tensor<8x1xi32, #blocked> -> tensor<8x32xi32, #blocked> 2026-02-21T17:50:18.2633165Z %22 = ttg.convert_layout %21 : tensor<8x32xi32, #blocked> -> tensor<8x32xi32, #blocked1> 2026-02-21T17:50:18.2633383Z %23 = tt.splat %arg0 : !tt.ptr -> tensor<8x32x!tt.ptr, #blocked1> 2026-02-21T17:50:18.2633673Z %24 = ttg.convert_layout %12 : tensor<32xi32, #blocked2> -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked4}>> 2026-02-21T17:50:18.2633995Z %25 = tt.expand_dims %24 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked4}>> -> tensor<1x32xi32, #blocked4> 2026-02-21T17:50:18.2634277Z %26 = ttg.convert_layout %25 : tensor<1x32xi32, #blocked4> -> tensor<1x32xi32, #blocked1> 2026-02-21T17:50:18.2634476Z %27 = arith.muli %26, %cst_0 : tensor<1x32xi32, #blocked1> 2026-02-21T17:50:18.2634661Z %28 = tt.broadcast %27 : tensor<1x32xi32, #blocked1> -> tensor<32x32xi32, #blocked1> 2026-02-21T17:50:18.2634880Z %29 = tt.splat %arg1 : !tt.ptr -> tensor<32x32x!tt.ptr, #blocked1> 2026-02-21T17:50:18.2635139Z %30 = scf.for %arg3 = %c0_i32 to %c2048_i32 step %c32_i32 iter_args(%arg4 = %cst_2) -> (tensor<8x32xf32, #blocked1>) : i32 { 2026-02-21T17:50:18.2635376Z %45 = tt.splat %arg3 : i32 -> tensor<32xi32, #blocked2> 2026-02-21T17:50:18.2635530Z %46 = arith.addi %45, %10 : tensor<32xi32, #blocked2> 2026-02-21T17:50:18.2635760Z %47 = ttg.convert_layout %46 : tensor<32xi32, #blocked2> -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked4}>> 2026-02-21T17:50:18.2636085Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked4}>> -> tensor<1x32xi32, #blocked4> 2026-02-21T17:50:18.2636369Z %49 = ttg.convert_layout %48 : tensor<1x32xi32, #blocked4> -> tensor<1x32xi32, #blocked1> 2026-02-21T17:50:18.2636594Z %50 = tt.broadcast %49 : tensor<1x32xi32, #blocked1> -> tensor<8x32xi32, #blocked1> 2026-02-21T17:50:18.2636782Z %51 = arith.addi %22, %50 : tensor<8x32xi32, #blocked1> 2026-02-21T17:50:18.2636999Z 
%52 = tt.addptr %23, %51 : tensor<8x32x!tt.ptr, #blocked1>, tensor<8x32xi32, #blocked1> 2026-02-21T17:50:18.2637234Z %53 = tt.load %52 : tensor<8x32x!tt.ptr, #blocked1> 2026-02-21T17:50:18.2637469Z %54 = ttg.convert_layout %46 : tensor<32xi32, #blocked2> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T17:50:18.2637790Z %55 = tt.expand_dims %54 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<32x1xi32, #blocked3> 2026-02-21T17:50:18.2638071Z %56 = ttg.convert_layout %55 : tensor<32x1xi32, #blocked3> -> tensor<32x1xi32, #blocked> 2026-02-21T17:50:18.2638294Z %57 = tt.broadcast %56 : tensor<32x1xi32, #blocked> -> tensor<32x32xi32, #blocked> 2026-02-21T17:50:18.2638520Z %58 = ttg.convert_layout %57 : tensor<32x32xi32, #blocked> -> tensor<32x32xi32, #blocked1> 2026-02-21T17:50:18.2638721Z %59 = arith.addi %58, %28 : tensor<32x32xi32, #blocked1> 2026-02-21T17:50:18.2638915Z %60 = tt.addptr %29, %59 : tensor<32x32x!tt.ptr, #blocked1>, tensor<32x32xi32, #blocked1> 2026-02-21T17:50:18.2639115Z %61 = tt.load %60 : tensor<32x32x!tt.ptr, #blocked1> 2026-02-21T17:50:18.2639369Z %62 = ttg.convert_layout %53 : tensor<8x32xf16, #blocked1> -> tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked1}>> 2026-02-21T17:50:18.2639708Z %63 = ttg.convert_layout %61 : tensor<32x32xf16, #blocked1> -> tensor<32x32xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked1}>> 2026-02-21T17:50:18.2640004Z %64 = ttg.convert_layout %arg4 : tensor<8x32xf32, #blocked1> -> tensor<8x32xf32, #blocked1> 2026-02-21T17:50:18.2640405Z %65 = tt.dot %62, %63, %64, inputPrecision = tf32 : tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked1}>> * tensor<32x32xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked1}>> -> tensor<8x32xf32, #blocked1> 2026-02-21T17:50:18.2640749Z scf.yield %65 : tensor<8x32xf32, #blocked1> 2026-02-21T17:50:18.2640886Z } {tt.disallow_acc_multi_buffer, tt.flatten} 2026-02-21T17:50:18.2641064Z %31 = arith.truncf %30 : 
tensor<8x32xf32, #blocked1> to tensor<8x32xf16, #blocked1> 2026-02-21T17:50:18.2641334Z %32 = ttg.convert_layout %16 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T17:50:18.2641668Z %33 = tt.expand_dims %32 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<8x1xi32, #blocked3> 2026-02-21T17:50:18.2641944Z %34 = ttg.convert_layout %33 : tensor<8x1xi32, #blocked3> -> tensor<8x1xi32, #blocked> 2026-02-21T17:50:18.2642135Z %35 = arith.muli %34, %cst : tensor<8x1xi32, #blocked> 2026-02-21T17:50:18.2642366Z %36 = ttg.convert_layout %12 : tensor<32xi32, #blocked2> -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked4}>> 2026-02-21T17:50:18.2642691Z %37 = tt.expand_dims %36 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked4}>> -> tensor<1x32xi32, #blocked4> 2026-02-21T17:50:18.2642976Z %38 = ttg.convert_layout %37 : tensor<1x32xi32, #blocked4> -> tensor<1x32xi32, #blocked1> 2026-02-21T17:50:18.2643200Z %39 = tt.broadcast %35 : tensor<8x1xi32, #blocked> -> tensor<8x32xi32, #blocked> 2026-02-21T17:50:18.2643420Z %40 = ttg.convert_layout %39 : tensor<8x32xi32, #blocked> -> tensor<8x32xi32, #blocked1> 2026-02-21T17:50:18.2643643Z %41 = tt.broadcast %38 : tensor<1x32xi32, #blocked1> -> tensor<8x32xi32, #blocked1> 2026-02-21T17:50:18.2643833Z %42 = arith.addi %40, %41 : tensor<8x32xi32, #blocked1> 2026-02-21T17:50:18.2644008Z %43 = tt.splat %arg2 : !tt.ptr -> tensor<8x32x!tt.ptr, #blocked1> 2026-02-21T17:50:18.2644224Z %44 = tt.addptr %43, %42 : tensor<8x32x!tt.ptr, #blocked1>, tensor<8x32xi32, #blocked1> 2026-02-21T17:50:18.2644428Z tt.store %44, %31 : tensor<8x32x!tt.ptr, #blocked1> 2026-02-21T17:50:18.2644561Z tt.return 2026-02-21T17:50:18.2644643Z } 2026-02-21T17:50:18.2644722Z } 2026-02-21T17:50:18.2644765Z 2026-02-21T17:50:18.2644819Z {-# 2026-02-21T17:50:18.2644899Z external_resources: { 2026-02-21T17:50:18.2645016Z mlir_reproducer: { 
2026-02-21T17:50:18.2650887Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T17:50:18.2653202Z disable_threading: false, 2026-02-21T17:50:18.2653310Z verify_each: true 2026-02-21T17:50:18.2653400Z } 2026-02-21T17:50:18.2653476Z } 2026-02-21T17:50:18.2653545Z #-} 2026-02-21T17:50:18.2653826Z /tmp/torchinductor_root/3y/c3ykzwgw6a2ag23mrcgsjzedvjuqiczdmdtbmwumjywiaxup4m55.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T17:50:18.2654514Z /tmp/torchinductor_root/3y/c3ykzwgw6a2ag23mrcgsjzedvjuqiczdmdtbmwumjywiaxup4m55.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer 
generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T17:50:18.2655063Z [21s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T17:50:18.2655808Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 32, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T17:50:18.2656460Z Error: RuntimeError: PassManager::run failed 2026-02-21T17:50:18.2656627Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T17:50:18.2991679Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 
2026-02-21T17:50:18.2994451Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T17:50:18.2994835Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T17:50:18.2995125Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T17:50:18.2995415Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T17:50:18.2995706Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T17:50:18.2996000Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T17:50:18.2996368Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T17:50:18.2996877Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T17:50:18.2997251Z %cst = arith.constant dense<0.000000e+00> : tensor<512x16xf16, #blocked> 2026-02-21T17:50:18.2997446Z %cst_0 = arith.constant dense<4096> : tensor<1x16xi64, #blocked> 2026-02-21T17:50:18.2997623Z %cst_1 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T17:50:18.2997800Z %cst_2 = arith.constant dense<2048> : tensor<512x1xi64, #blocked1> 2026-02-21T17:50:18.2997980Z %cst_3 = arith.constant dense<0> : tensor<512x1xi64, #blocked1> 2026-02-21T17:50:18.2998156Z %cst_4 = arith.constant dense<2048> : tensor<1x16xi64, #blocked> 2026-02-21T17:50:18.2998307Z %c512_i32 = arith.constant 512 : i32 2026-02-21T17:50:18.2998430Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T17:50:18.2998551Z %c0_i32 = 
arith.constant 0 : i32 2026-02-21T17:50:18.2998664Z %c1_i32 = arith.constant 1 : i32 2026-02-21T17:50:18.2998801Z %cst_5 = arith.constant dense<4096> : tensor<8x1xi32, #blocked1> 2026-02-21T17:50:18.2998974Z %cst_6 = arith.constant dense<2048> : tensor<8x1xi32, #blocked1> 2026-02-21T17:50:18.2999155Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<8x16xf32, #blocked> 2026-02-21T17:50:18.2999315Z %c16_i32 = arith.constant 16 : i32 2026-02-21T17:50:18.2999431Z %c8_i32 = arith.constant 8 : i32 2026-02-21T17:50:18.2999542Z %c256_i32 = arith.constant 256 : i32 2026-02-21T17:50:18.2999663Z %c65536_i32 = arith.constant 65536 : i32 2026-02-21T17:50:18.2999780Z %c14_i32 = arith.constant 14 : i32 2026-02-21T17:50:18.2999895Z %0 = tt.get_program_id x : i32 2026-02-21T17:50:18.3000009Z %1 = arith.muli %0, %c14_i32 : i32 2026-02-21T17:50:18.3000122Z %2 = arith.addi %1, %c14_i32 : i32 2026-02-21T17:50:18.3000233Z %3 = arith.minsi %2, %c65536_i32 : i32 2026-02-21T17:50:18.3000408Z %4 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked2> 2026-02-21T17:50:18.3000639Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked2> 2026-02-21T17:50:18.3000843Z %6 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked2> 2026-02-21T17:50:18.3001050Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr, #blocked3> 2026-02-21T17:50:18.3001245Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<512x16x!tt.ptr, #blocked> 2026-02-21T17:50:18.3001443Z %9 = arith.extsi %6 : tensor<512xi32, #blocked2> to tensor<512xi64, #blocked2> 2026-02-21T17:50:18.3001650Z %10 = arith.extsi %5 : tensor<16xi32, #blocked2> to tensor<16xi64, #blocked2> 2026-02-21T17:50:18.3001846Z %11 = tt.splat %arg2 : !tt.ptr -> tensor<8x16x!tt.ptr, #blocked> 2026-02-21T17:50:18.3002016Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T17:50:18.3002149Z %12 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T17:50:18.3002277Z %13 = arith.divsi %arg3, %c256_i32 : 
i32 2026-02-21T17:50:18.3002398Z %14 = arith.muli %12, %c8_i32 : i32 2026-02-21T17:50:18.3002532Z %15 = tt.splat %14 : i32 -> tensor<8xi32, #blocked2> 2026-02-21T17:50:18.3002682Z %16 = arith.addi %15, %4 : tensor<8xi32, #blocked2> 2026-02-21T17:50:18.3002812Z %17 = arith.muli %13, %c16_i32 : i32 2026-02-21T17:50:18.3002946Z %18 = tt.splat %17 : i32 -> tensor<16xi32, #blocked2> 2026-02-21T17:50:18.3003094Z %19 = arith.addi %18, %5 : tensor<16xi32, #blocked2> 2026-02-21T17:50:18.3003325Z %20 = ttg.convert_layout %16 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:50:18.3003652Z %21 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<8x1xi32, #blocked4> 2026-02-21T17:50:18.3003951Z %22 = ttg.convert_layout %21 : tensor<8x1xi32, #blocked4> -> tensor<8x1xi32, #blocked1> 2026-02-21T17:50:18.3004181Z %23 = arith.muli %22, %cst_6 : tensor<8x1xi32, #blocked1> 2026-02-21T17:50:18.3004370Z %24 = tt.broadcast %23 : tensor<8x1xi32, #blocked1> -> tensor<8x512xi32, #blocked1> 2026-02-21T17:50:18.3004605Z %25 = ttg.convert_layout %24 : tensor<8x512xi32, #blocked1> -> tensor<8x512xi32, #blocked3> 2026-02-21T17:50:18.3004790Z %26 = arith.extsi %17 : i32 to i64 2026-02-21T17:50:18.3004923Z %27 = tt.splat %26 : i64 -> tensor<16xi64, #blocked2> 2026-02-21T17:50:18.3005074Z %28 = arith.addi %27, %10 : tensor<16xi64, #blocked2> 2026-02-21T17:50:18.3005305Z %29 = ttg.convert_layout %28 : tensor<16xi64, #blocked2> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:50:18.3005645Z %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi64, #blocked5> 2026-02-21T17:50:18.3005933Z %31 = ttg.convert_layout %30 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #blocked> 2026-02-21T17:50:18.3006132Z %32 = arith.muli %31, %cst_4 : tensor<1x16xi64, #blocked> 2026-02-21T17:50:18.3006322Z %33 = tt.broadcast 
%32 : tensor<1x16xi64, #blocked> -> tensor<512x16xi64, #blocked> 2026-02-21T17:50:18.3006518Z %34 = arith.cmpi sge, %31, %cst_1 : tensor<1x16xi64, #blocked> 2026-02-21T17:50:18.3006692Z %35 = arith.cmpi slt, %31, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T17:50:18.3006850Z %36 = arith.andi %34, %35 : tensor<1x16xi1, #blocked> 2026-02-21T17:50:18.3007030Z %37 = tt.broadcast %36 : tensor<1x16xi1, #blocked> -> tensor<512x16xi1, #blocked> 2026-02-21T17:50:18.3007305Z %38 = scf.for %arg4 = %c0_i32 to %c2048_i32 step %c512_i32 iter_args(%arg5 = %cst_7) -> (tensor<8x16xf32, #blocked>) : i32 { 2026-02-21T17:50:18.3007547Z %52 = tt.splat %arg4 : i32 -> tensor<512xi32, #blocked2> 2026-02-21T17:50:18.3007708Z %53 = arith.addi %52, %6 : tensor<512xi32, #blocked2> 2026-02-21T17:50:18.3007949Z %54 = ttg.convert_layout %53 : tensor<512xi32, #blocked2> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:50:18.3008302Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x512xi32, #blocked5> 2026-02-21T17:50:18.3008599Z %56 = ttg.convert_layout %55 : tensor<1x512xi32, #blocked5> -> tensor<1x512xi32, #blocked3> 2026-02-21T17:50:18.3008837Z %57 = tt.broadcast %56 : tensor<1x512xi32, #blocked3> -> tensor<8x512xi32, #blocked3> 2026-02-21T17:50:18.3009031Z %58 = arith.addi %25, %57 : tensor<8x512xi32, #blocked3> 2026-02-21T17:50:18.3009227Z %59 = tt.addptr %7, %58 : tensor<8x512x!tt.ptr, #blocked3>, tensor<8x512xi32, #blocked3> 2026-02-21T17:50:18.3009430Z %60 = tt.load %59 : tensor<8x512x!tt.ptr, #blocked3> 2026-02-21T17:50:18.3009574Z %61 = arith.extsi %arg4 : i32 to i64 2026-02-21T17:50:18.3009711Z %62 = tt.splat %61 : i64 -> tensor<512xi64, #blocked2> 2026-02-21T17:50:18.3009867Z %63 = arith.addi %62, %9 : tensor<512xi64, #blocked2> 2026-02-21T17:50:18.3010100Z %64 = ttg.convert_layout %63 : tensor<512xi64, #blocked2> -> tensor<512xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 
2026-02-21T17:50:18.3010438Z %65 = tt.expand_dims %64 {axis = 1 : i32} : tensor<512xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<512x1xi64, #blocked4> 2026-02-21T17:50:18.3010734Z %66 = ttg.convert_layout %65 : tensor<512x1xi64, #blocked4> -> tensor<512x1xi64, #blocked1> 2026-02-21T17:50:18.3010987Z %67 = tt.broadcast %66 : tensor<512x1xi64, #blocked1> -> tensor<512x16xi64, #blocked1> 2026-02-21T17:50:18.3011224Z %68 = ttg.convert_layout %67 : tensor<512x16xi64, #blocked1> -> tensor<512x16xi64, #blocked> 2026-02-21T17:50:18.3011446Z %69 = arith.addi %68, %33 : tensor<512x16xi64, #blocked> 2026-02-21T17:50:18.3011644Z %70 = tt.addptr %8, %69 : tensor<512x16x!tt.ptr, #blocked>, tensor<512x16xi64, #blocked> 2026-02-21T17:50:18.3011888Z %71 = arith.cmpi sge, %66, %cst_3 : tensor<512x1xi64, #blocked1> 2026-02-21T17:50:18.3012064Z %72 = arith.cmpi slt, %66, %cst_2 : tensor<512x1xi64, #blocked1> 2026-02-21T17:50:18.3012233Z %73 = arith.andi %71, %72 : tensor<512x1xi1, #blocked1> 2026-02-21T17:50:18.3012420Z %74 = tt.broadcast %73 : tensor<512x1xi1, #blocked1> -> tensor<512x16xi1, #blocked1> 2026-02-21T17:50:18.3012656Z %75 = ttg.convert_layout %74 : tensor<512x16xi1, #blocked1> -> tensor<512x16xi1, #blocked> 2026-02-21T17:50:18.3012855Z %76 = arith.andi %75, %37 : tensor<512x16xi1, #blocked> 2026-02-21T17:50:18.3013017Z %77 = tt.load %70, %76, %cst : tensor<512x16x!tt.ptr, #blocked> 2026-02-21T17:50:18.3013280Z %78 = ttg.convert_layout %60 : tensor<8x512xf16, #blocked3> -> tensor<8x512xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T17:50:18.3013624Z %79 = ttg.convert_layout %77 : tensor<512x16xf16, #blocked> -> tensor<512x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T17:50:18.3013919Z %80 = ttg.convert_layout %arg5 : tensor<8x16xf32, #blocked> -> tensor<8x16xf32, #blocked> 2026-02-21T17:50:18.3014320Z %81 = tt.dot %78, %79, %80, inputPrecision = tf32 : tensor<8x512xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * 
tensor<512x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<8x16xf32, #blocked> 2026-02-21T17:50:18.3014658Z scf.yield %81 : tensor<8x16xf32, #blocked> 2026-02-21T17:50:18.3014789Z } {tt.disallow_acc_multi_buffer} 2026-02-21T17:50:18.3014952Z %39 = arith.truncf %38 : tensor<8x16xf32, #blocked> to tensor<8x16xf16, #blocked> 2026-02-21T17:50:18.3015220Z %40 = ttg.convert_layout %16 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:50:18.3015543Z %41 = tt.expand_dims %40 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<8x1xi32, #blocked4> 2026-02-21T17:50:18.3015822Z %42 = ttg.convert_layout %41 : tensor<8x1xi32, #blocked4> -> tensor<8x1xi32, #blocked1> 2026-02-21T17:50:18.3016043Z %43 = arith.muli %42, %cst_5 : tensor<8x1xi32, #blocked1> 2026-02-21T17:50:18.3016275Z %44 = ttg.convert_layout %19 : tensor<16xi32, #blocked2> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:50:18.3016594Z %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi32, #blocked5> 2026-02-21T17:50:18.3016876Z %46 = ttg.convert_layout %45 : tensor<1x16xi32, #blocked5> -> tensor<1x16xi32, #blocked> 2026-02-21T17:50:18.3017101Z %47 = tt.broadcast %43 : tensor<8x1xi32, #blocked1> -> tensor<8x16xi32, #blocked1> 2026-02-21T17:50:18.3017330Z %48 = ttg.convert_layout %47 : tensor<8x16xi32, #blocked1> -> tensor<8x16xi32, #blocked> 2026-02-21T17:50:18.3017560Z %49 = tt.broadcast %46 : tensor<1x16xi32, #blocked> -> tensor<8x16xi32, #blocked> 2026-02-21T17:50:18.3017755Z %50 = arith.addi %48, %49 : tensor<8x16xi32, #blocked> 2026-02-21T17:50:18.3017947Z %51 = tt.addptr %11, %50 : tensor<8x16x!tt.ptr, #blocked>, tensor<8x16xi32, #blocked> 2026-02-21T17:50:18.3018146Z tt.store %51, %39 : tensor<8x16x!tt.ptr, #blocked> 2026-02-21T17:50:18.3018277Z } 2026-02-21T17:50:18.3018354Z tt.return 2026-02-21T17:50:18.3018434Z } 
2026-02-21T17:50:18.3018505Z } 2026-02-21T17:50:18.3018549Z 2026-02-21T17:50:18.3018579Z {-# 2026-02-21T17:50:18.3018659Z external_resources: { 2026-02-21T17:50:18.3018761Z mlir_reproducer: { 2026-02-21T17:50:18.3026290Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T17:50:18.3028689Z disable_threading: false, 2026-02-21T17:50:18.3028801Z verify_each: true 2026-02-21T17:50:18.3028895Z } 2026-02-21T17:50:18.3028973Z } 2026-02-21T17:50:18.3029042Z #-} 2026-02-21T17:50:18.3029328Z /tmp/torchinductor_root/e2/ce2qh5rzlltxdowg3p4z53szx5keozy57i6yjndjjmhhagab2zgx.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T17:50:18.3030012Z 
/tmp/torchinductor_root/e2/ce2qh5rzlltxdowg3p4z53szx5keozy57i6yjndjjmhhagab2zgx.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T17:50:18.3030568Z [22s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T17:50:18.3031374Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 16, 512], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T17:50:18.3049080Z Error: RuntimeError: PassManager::run failed 2026-02-21T17:50:18.3049253Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T17:50:21.8566856Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 
2026-02-21T17:50:21.8568566Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T17:50:21.8569464Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T17:50:21.8570269Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T17:50:21.8571078Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T17:50:21.8571894Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T17:50:21.8572716Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T17:50:21.8573629Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T17:50:21.8574895Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T17:50:21.8576428Z %c256_i32 = arith.constant 256 : i32 2026-02-21T17:50:21.8579779Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T17:50:21.8579948Z %c0_i32 = arith.constant 0 : i32 2026-02-21T17:50:21.8580070Z %c608_i32 = arith.constant 608 : i32 2026-02-21T17:50:21.8580219Z %cst = arith.constant dense<4096> : tensor<128x1xi32, #blocked> 2026-02-21T17:50:21.8580405Z %cst_0 = arith.constant dense<2048> : tensor<1x2xi32, #blocked1> 2026-02-21T17:50:21.8580583Z %cst_1 = arith.constant dense<2048> : tensor<128x1xi32, #blocked> 2026-02-21T17:50:21.8580771Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<128x2xf32, #blocked1> 2026-02-21T17:50:21.8580935Z %c2_i32 = arith.constant 2 : i32 2026-02-21T17:50:21.8581047Z %c128_i32 = 
arith.constant 128 : i32 2026-02-21T17:50:21.8581169Z %c16_i32 = arith.constant 16 : i32 2026-02-21T17:50:21.8581305Z %c32768_i32 = arith.constant 32768 : i32 2026-02-21T17:50:21.8581426Z %0 = tt.get_program_id x : i32 2026-02-21T17:50:21.8581589Z %1 = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32, #blocked2> 2026-02-21T17:50:21.8581793Z %2 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked2> 2026-02-21T17:50:21.8581993Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked2> 2026-02-21T17:50:21.8582209Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<128x256x!tt.ptr, #blocked3> 2026-02-21T17:50:21.8582406Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T17:50:21.8582596Z %6 = tt.splat %arg2 : !tt.ptr -> tensor<128x2x!tt.ptr, #blocked1> 2026-02-21T17:50:21.8582777Z scf.for %arg3 = %0 to %c32768_i32 step %c608_i32 : i32 { 2026-02-21T17:50:21.8582928Z %7 = arith.divsi %arg3, %c32768_i32 : i32 2026-02-21T17:50:21.8583049Z %8 = arith.muli %7, %c16_i32 : i32 2026-02-21T17:50:21.8583166Z %9 = arith.subi %c16_i32, %8 : i32 2026-02-21T17:50:21.8583282Z %10 = arith.minsi %9, %c16_i32 : i32 2026-02-21T17:50:21.8583407Z %11 = arith.remsi %arg3, %c32768_i32 : i32 2026-02-21T17:50:21.8583528Z %12 = arith.remsi %11, %10 : i32 2026-02-21T17:50:21.8583715Z %13 = arith.addi %8, %12 : i32 2026-02-21T17:50:21.8583849Z %14 = arith.divsi %11, %10 : i32 2026-02-21T17:50:21.8583963Z %15 = arith.muli %13, %c128_i32 : i32 2026-02-21T17:50:21.8584100Z %16 = tt.splat %15 : i32 -> tensor<128xi32, #blocked2> 2026-02-21T17:50:21.8584255Z %17 = arith.addi %16, %1 : tensor<128xi32, #blocked2> 2026-02-21T17:50:21.8584391Z %18 = arith.muli %14, %c2_i32 : i32 2026-02-21T17:50:21.8584520Z %19 = tt.splat %18 : i32 -> tensor<2xi32, #blocked2> 2026-02-21T17:50:21.8584665Z %20 = arith.addi %19, %2 : tensor<2xi32, #blocked2> 2026-02-21T17:50:21.8584898Z %21 = ttg.convert_layout %17 : tensor<128xi32, #blocked2> -> 
tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:50:21.8585244Z %22 = tt.expand_dims %21 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<128x1xi32, #blocked4> 2026-02-21T17:50:21.8585543Z %23 = ttg.convert_layout %22 : tensor<128x1xi32, #blocked4> -> tensor<128x1xi32, #blocked> 2026-02-21T17:50:21.8585747Z %24 = arith.muli %23, %cst_1 : tensor<128x1xi32, #blocked> 2026-02-21T17:50:21.8585941Z %25 = tt.broadcast %24 : tensor<128x1xi32, #blocked> -> tensor<128x256xi32, #blocked> 2026-02-21T17:50:21.8586182Z %26 = ttg.convert_layout %25 : tensor<128x256xi32, #blocked> -> tensor<128x256xi32, #blocked3> 2026-02-21T17:50:21.8586466Z %27 = ttg.convert_layout %20 : tensor<2xi32, #blocked2> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:50:21.8586784Z %28 = tt.expand_dims %27 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x2xi32, #blocked5> 2026-02-21T17:50:21.8587087Z %29 = ttg.convert_layout %28 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #blocked1> 2026-02-21T17:50:21.8587320Z %30 = arith.muli %29, %cst_0 : tensor<1x2xi32, #blocked1> 2026-02-21T17:50:21.8587506Z %31 = tt.broadcast %30 : tensor<1x2xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T17:50:21.8587784Z %32 = scf.for %arg4 = %c0_i32 to %c2048_i32 step %c256_i32 iter_args(%arg5 = %cst_2) -> (tensor<128x2xf32, #blocked1>) : i32 { 2026-02-21T17:50:21.8588025Z %46 = tt.splat %arg4 : i32 -> tensor<256xi32, #blocked2> 2026-02-21T17:50:21.8588286Z %47 = arith.addi %46, %3 : tensor<256xi32, #blocked2> 2026-02-21T17:50:21.8588524Z %48 = ttg.convert_layout %47 : tensor<256xi32, #blocked2> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:50:21.8588855Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T17:50:21.8589153Z %50 = ttg.convert_layout %49 : 
tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T17:50:21.8589397Z %51 = tt.broadcast %50 : tensor<1x256xi32, #blocked3> -> tensor<128x256xi32, #blocked3> 2026-02-21T17:50:21.8589594Z %52 = arith.addi %26, %51 : tensor<128x256xi32, #blocked3> 2026-02-21T17:50:21.8589804Z %53 = tt.addptr %4, %52 : tensor<128x256x!tt.ptr, #blocked3>, tensor<128x256xi32, #blocked3> 2026-02-21T17:50:21.8590013Z %54 = tt.load %53 : tensor<128x256x!tt.ptr, #blocked3> 2026-02-21T17:50:21.8590259Z %55 = ttg.convert_layout %47 : tensor<256xi32, #blocked2> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:50:21.8590587Z %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<256x1xi32, #blocked4> 2026-02-21T17:50:21.8590876Z %57 = ttg.convert_layout %56 : tensor<256x1xi32, #blocked4> -> tensor<256x1xi32, #blocked> 2026-02-21T17:50:21.8591108Z %58 = tt.broadcast %57 : tensor<256x1xi32, #blocked> -> tensor<256x2xi32, #blocked> 2026-02-21T17:50:21.8591337Z %59 = ttg.convert_layout %58 : tensor<256x2xi32, #blocked> -> tensor<256x2xi32, #blocked1> 2026-02-21T17:50:21.8591537Z %60 = arith.addi %59, %31 : tensor<256x2xi32, #blocked1> 2026-02-21T17:50:21.8591757Z %61 = tt.addptr %5, %60 : tensor<256x2x!tt.ptr, #blocked1>, tensor<256x2xi32, #blocked1> 2026-02-21T17:50:21.8591956Z %62 = tt.load %61 : tensor<256x2x!tt.ptr, #blocked1> 2026-02-21T17:50:21.8592214Z %63 = ttg.convert_layout %54 : tensor<128x256xf16, #blocked3> -> tensor<128x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked1}>> 2026-02-21T17:50:21.8592559Z %64 = ttg.convert_layout %62 : tensor<256x2xf16, #blocked1> -> tensor<256x2xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked1}>> 2026-02-21T17:50:21.8592858Z %65 = ttg.convert_layout %arg5 : tensor<128x2xf32, #blocked1> -> tensor<128x2xf32, #blocked1> 2026-02-21T17:50:21.8593274Z %66 = tt.dot %63, %64, %65, inputPrecision = tf32 : tensor<128x256xf16, #ttg.dot_op<{opIdx = 0, parent = 
#blocked1}>> * tensor<256x2xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked1}>> -> tensor<128x2xf32, #blocked1> 2026-02-21T17:50:21.8593624Z scf.yield %66 : tensor<128x2xf32, #blocked1> 2026-02-21T17:50:21.8593764Z } {tt.disallow_acc_multi_buffer} 2026-02-21T17:50:21.8593932Z %33 = arith.truncf %32 : tensor<128x2xf32, #blocked1> to tensor<128x2xf16, #blocked1> 2026-02-21T17:50:21.8594208Z %34 = ttg.convert_layout %17 : tensor<128xi32, #blocked2> -> tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:50:21.8594540Z %35 = tt.expand_dims %34 {axis = 1 : i32} : tensor<128xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<128x1xi32, #blocked4> 2026-02-21T17:50:21.8594826Z %36 = ttg.convert_layout %35 : tensor<128x1xi32, #blocked4> -> tensor<128x1xi32, #blocked> 2026-02-21T17:50:21.8595050Z %37 = arith.muli %36, %cst : tensor<128x1xi32, #blocked> 2026-02-21T17:50:21.8595279Z %38 = ttg.convert_layout %20 : tensor<2xi32, #blocked2> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:50:21.8595639Z %39 = tt.expand_dims %38 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x2xi32, #blocked5> 2026-02-21T17:50:21.8595923Z %40 = ttg.convert_layout %39 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #blocked1> 2026-02-21T17:50:21.8596146Z %41 = tt.broadcast %37 : tensor<128x1xi32, #blocked> -> tensor<128x2xi32, #blocked> 2026-02-21T17:50:21.8596371Z %42 = ttg.convert_layout %41 : tensor<128x2xi32, #blocked> -> tensor<128x2xi32, #blocked1> 2026-02-21T17:50:21.8596596Z %43 = tt.broadcast %40 : tensor<1x2xi32, #blocked1> -> tensor<128x2xi32, #blocked1> 2026-02-21T17:50:21.8596784Z %44 = arith.addi %42, %43 : tensor<128x2xi32, #blocked1> 2026-02-21T17:50:21.8596979Z %45 = tt.addptr %6, %44 : tensor<128x2x!tt.ptr, #blocked1>, tensor<128x2xi32, #blocked1> 2026-02-21T17:50:21.8597183Z tt.store %45, %33 : tensor<128x2x!tt.ptr, #blocked1> 2026-02-21T17:50:21.8597328Z } {tt.disallow_acc_multi_buffer} 
2026-02-21T17:50:21.8597438Z tt.return 2026-02-21T17:50:21.8597519Z } 2026-02-21T17:50:21.8597598Z } 2026-02-21T17:50:21.8597646Z 2026-02-21T17:50:21.8597676Z {-# 2026-02-21T17:50:21.8597755Z external_resources: { 2026-02-21T17:50:21.8597858Z mlir_reproducer: { 2026-02-21T17:50:21.8600103Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T17:50:21.8602437Z disable_threading: false, 2026-02-21T17:50:21.8602544Z verify_each: true 2026-02-21T17:50:21.8602634Z } 2026-02-21T17:50:21.8602708Z } 2026-02-21T17:50:21.8602777Z #-} 2026-02-21T17:50:21.8603060Z /tmp/torchinductor_root/bt/cbtna2zzptmsqkkg2z4undqx5cozvvcy3e4yj27qsamfqakdsa4g.py:14:0: error: Failures have been detected while 
processing an MLIR pass pipeline 2026-02-21T17:50:21.8603743Z /tmp/torchinductor_root/bt/cbtna2zzptmsqkkg2z4undqx5cozvvcy3e4yj27qsamfqakdsa4g.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T17:50:21.8604298Z [25s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T17:50:21.8605093Z Config: @helion.kernel(config=helion.Config(block_sizes=[128, 2, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=2, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T17:50:21.8605879Z Error: RuntimeError: PassManager::run failed 2026-02-21T17:50:21.8606046Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T17:50:22.0632704Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 7.0 configs/s 2026-02-21T17:50:22.0641170Z [25s] Adaptive compile timeout: 30s (90% percentile=8.4s, bounds=[30.0s, 30s]) 2026-02-21T17:50:22.1552247Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 945/945 - configs/s 2026-02-21T17:50:22.9871768Z [26s] Initial random population of 100, 5 starting points: 2026-02-21T17:50:22.9872185Z error=18 2026-02-21T17:50:22.9872385Z ok=82 2026-02-21T17:50:22.9872575Z min=0.2141 2026-02-21T17:50:22.9872779Z mid=2.3471 2026-02-21T17:50:22.9872968Z max=290.7404 2026-02-21T17:50:22.9873204Z best={'block_sizes': [64, 128, 32], 2026-02-21T17:50:22.9873572Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T17:50:22.9873909Z 'l2_groupings': [4], 2026-02-21T17:50:22.9874172Z 'load_eviction_policies': ['', ''], 2026-02-21T17:50:22.9874483Z 'loop_orders': [[1, 0]], 2026-02-21T17:50:22.9874802Z 'matrix_instr_nonkdim': 32, 2026-02-21T17:50:22.9875077Z 'num_sm_multiplier': 8, 2026-02-21T17:50:22.9875332Z 'num_stages': 4, 2026-02-21T17:50:22.9875554Z 'num_warps': 2, 2026-02-21T17:50:22.9875802Z 'pid_type': 'persistent_blocked', 2026-02-21T17:50:22.9876106Z 'range_flattens': [False, None], 2026-02-21T17:50:22.9876401Z 'range_multi_buffers': [None, False], 2026-02-21T17:50:22.9876705Z 'range_num_stages': [0, 0], 2026-02-21T17:50:22.9876976Z 'range_unroll_factors': [0, 0], 2026-02-21T17:50:22.9877264Z 'range_warp_specializes': [], 2026-02-21T17:50:22.9877530Z 'waves_per_eu': 3} 2026-02-21T17:50:22.9898340Z [26s] Fitting surrogate: 100 points, 100 targets 2026-02-21T17:50:23.9840299Z [27s] Generation 1 starting: 90 neighbors, 5 active search path(s) 2026-02-21T17:50:34.5884793Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 16.8 configs/s 2026-02-21T17:50:39.8131081Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 18.3 configs/s 2026-02-21T17:50:40.4933601Z Generation 1: verifying top 
configs 100% ━━━━━━━━━━━━━━ 1000/1000 976.5 2026-02-21T17:50:40.4934534Z configs/s 2026-02-21T17:50:41.0735607Z [44s] Generation 1 complete: 2026-02-21T17:50:41.0735976Z error=18 2026-02-21T17:50:41.0736224Z ok=77 2026-02-21T17:50:41.0736425Z min=0.0884 2026-02-21T17:50:41.0736639Z mid=0.3063 2026-02-21T17:50:41.0736841Z max=32.8225 2026-02-21T17:50:41.0737079Z best={'block_sizes': [128, 128, 32], 2026-02-21T17:50:41.0737454Z 'indexing': ['pointer', 'block_ptr', 'pointer'], 2026-02-21T17:50:41.0737808Z 'l2_groupings': [4], 2026-02-21T17:50:41.0738080Z 'load_eviction_policies': ['', ''], 2026-02-21T17:50:41.0738399Z 'loop_orders': [[1, 0]], 2026-02-21T17:50:41.0738700Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:50:41.0738982Z 'num_sm_multiplier': 8, 2026-02-21T17:50:41.0739251Z 'num_stages': 3, 2026-02-21T17:50:41.0739482Z 'num_warps': 16, 2026-02-21T17:50:41.0739756Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:50:41.0740101Z 'range_flattens': [None, True], 2026-02-21T17:50:41.0740403Z 'range_multi_buffers': [False, False], 2026-02-21T17:50:41.0740725Z 'range_num_stages': [0, 0], 2026-02-21T17:50:41.0741005Z 'range_unroll_factors': [0, 0], 2026-02-21T17:50:41.0741303Z 'range_warp_specializes': [], 2026-02-21T17:50:41.0741581Z 'waves_per_eu': 1} 2026-02-21T17:50:41.0814454Z [44s] Fitting surrogate: 195 points, 195 targets 2026-02-21T17:50:42.1163436Z [45s] Generation 2 starting: 93 neighbors, 5 active search path(s) 2026-02-21T17:50:51.9612459Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 16.2 configs/s 2026-02-21T17:50:57.3875654Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 18.4 configs/s 2026-02-21T17:51:00.5434122Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 285.6 2026-02-21T17:51:00.5435053Z configs/s 2026-02-21T17:51:01.1698917Z [64s] Generation 2 complete: 2026-02-21T17:51:01.1699317Z error=15 2026-02-21T17:51:01.1699533Z ok=84 2026-02-21T17:51:01.1699743Z min=0.0806 
2026-02-21T17:51:01.1699952Z mid=0.1682 2026-02-21T17:51:01.1700154Z max=27.4566 2026-02-21T17:51:01.1700388Z best={'block_sizes': [128, 128, 32], 2026-02-21T17:51:01.1700789Z 'indexing': ['pointer', 'block_ptr', 'pointer'], 2026-02-21T17:51:01.1701148Z 'l2_groupings': [4], 2026-02-21T17:51:01.1701421Z 'load_eviction_policies': ['', ''], 2026-02-21T17:51:01.1701728Z 'loop_orders': [[1, 0]], 2026-02-21T17:51:01.1702009Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:51:01.1702288Z 'num_sm_multiplier': 8, 2026-02-21T17:51:01.1702552Z 'num_stages': 3, 2026-02-21T17:51:01.1702784Z 'num_warps': 8, 2026-02-21T17:51:01.1703055Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:51:01.1703381Z 'range_flattens': [None, True], 2026-02-21T17:51:01.1703682Z 'range_multi_buffers': [False, False], 2026-02-21T17:51:01.1704005Z 'range_num_stages': [0, 0], 2026-02-21T17:51:01.1704287Z 'range_unroll_factors': [0, 0], 2026-02-21T17:51:01.1704905Z 'range_warp_specializes': [], 2026-02-21T17:51:01.1705182Z 'waves_per_eu': 1} 2026-02-21T17:51:01.2085427Z [64s] Fitting surrogate: 294 points, 294 targets 2026-02-21T17:51:02.6828483Z [66s] Generation 3 starting: 86 neighbors, 5 active search path(s) 2026-02-21T17:51:12.1376095Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 10.8 configs/s 2026-02-21T17:51:16.8025082Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 19.7 configs/s 2026-02-21T17:51:19.7387950Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 307.1 2026-02-21T17:51:19.7388691Z configs/s 2026-02-21T17:51:20.4206135Z [84s] Generation 3 complete: 2026-02-21T17:51:20.4206510Z error=14 2026-02-21T17:51:20.4206723Z ok=78 2026-02-21T17:51:20.4206924Z min=0.0714 2026-02-21T17:51:20.4207131Z mid=0.1342 2026-02-21T17:51:20.4207359Z max=2.5875 2026-02-21T17:51:20.4207596Z best={'block_sizes': [128, 128, 32], 2026-02-21T17:51:20.4207989Z 'indexing': ['pointer', 'block_ptr', 'pointer'], 2026-02-21T17:51:20.4208347Z 'l2_groupings': [4], 
2026-02-21T17:51:20.4208620Z 'load_eviction_policies': ['', ''], 2026-02-21T17:51:20.4208923Z 'loop_orders': [[1, 0]], 2026-02-21T17:51:20.4209196Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:51:20.4209474Z 'num_sm_multiplier': 8, 2026-02-21T17:51:20.4209738Z 'num_stages': 3, 2026-02-21T17:51:20.4209963Z 'num_warps': 4, 2026-02-21T17:51:20.4210224Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:51:20.4210549Z 'range_flattens': [None, True], 2026-02-21T17:51:20.4210852Z 'range_multi_buffers': [True, False], 2026-02-21T17:51:20.4211161Z 'range_num_stages': [0, 0], 2026-02-21T17:51:20.4211772Z 'range_unroll_factors': [0, 0], 2026-02-21T17:51:20.4212068Z 'range_warp_specializes': [], 2026-02-21T17:51:20.4212344Z 'waves_per_eu': 1} 2026-02-21T17:51:20.4597245Z [84s] Fitting surrogate: 386 points, 386 targets 2026-02-21T17:51:21.4046280Z [85s] Generation 4 starting: 84 neighbors, 5 active search path(s) 2026-02-21T17:51:30.5298366Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 14.8 configs/s 2026-02-21T17:51:34.9499168Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 20.2 configs/s 2026-02-21T17:51:38.9321234Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 263.4 2026-02-21T17:51:38.9321874Z configs/s 2026-02-21T17:51:39.6450124Z [103s] Generation 4 complete: 2026-02-21T17:51:39.6450463Z error=15 2026-02-21T17:51:39.6450677Z ok=75 2026-02-21T17:51:39.6450884Z min=0.0736 2026-02-21T17:51:39.6451120Z mid=0.1213 2026-02-21T17:51:39.6451346Z max=2.5178 2026-02-21T17:51:39.6451578Z best={'block_sizes': [128, 128, 32], 2026-02-21T17:51:39.6451955Z 'indexing': ['pointer', 'block_ptr', 'pointer'], 2026-02-21T17:51:39.6452320Z 'l2_groupings': [4], 2026-02-21T17:51:39.6458142Z 'load_eviction_policies': ['', ''], 2026-02-21T17:51:39.6458623Z 'loop_orders': [[1, 0]], 2026-02-21T17:51:39.6458915Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:51:39.6459204Z 'num_sm_multiplier': 8, 2026-02-21T17:51:39.6459464Z 'num_stages': 3, 
2026-02-21T17:51:39.6459699Z 'num_warps': 4, 2026-02-21T17:51:39.6459960Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:51:39.6460296Z 'range_flattens': [None, True], 2026-02-21T17:51:39.6460600Z 'range_multi_buffers': [True, False], 2026-02-21T17:51:39.6460949Z 'range_num_stages': [0, 0], 2026-02-21T17:51:39.6461175Z 'range_unroll_factors': [0, 0], 2026-02-21T17:51:39.6461387Z 'range_warp_specializes': [], 2026-02-21T17:51:39.6461671Z 'waves_per_eu': 1} 2026-02-21T17:51:39.6488342Z [103s] Fitting surrogate: 476 points, 476 targets 2026-02-21T17:51:40.5842649Z [104s] Generation 5 starting: 81 neighbors, 5 active search path(s) 2026-02-21T17:51:49.6987176Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82/82 4.9 configs/s 2026-02-21T17:51:54.2540975Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 82/82 18.4 configs/s 2026-02-21T17:51:59.1241937Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 193.9 2026-02-21T17:51:59.1242492Z configs/s 2026-02-21T17:51:59.9067441Z [123s] Generation 5 complete: 2026-02-21T17:51:59.9067651Z error=8 2026-02-21T17:51:59.9067791Z ok=79 2026-02-21T17:51:59.9067918Z min=0.0726 2026-02-21T17:51:59.9068042Z mid=0.1098 2026-02-21T17:51:59.9068239Z max=0.7911 2026-02-21T17:51:59.9068380Z best={'block_sizes': [128, 128, 32], 2026-02-21T17:51:59.9068600Z 'indexing': ['pointer', 'block_ptr', 'pointer'], 2026-02-21T17:51:59.9068813Z 'l2_groupings': [4], 2026-02-21T17:51:59.9068996Z 'load_eviction_policies': ['', ''], 2026-02-21T17:51:59.9069182Z 'loop_orders': [[1, 0]], 2026-02-21T17:51:59.9069345Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:51:59.9069511Z 'num_sm_multiplier': 8, 2026-02-21T17:51:59.9069673Z 'num_stages': 3, 2026-02-21T17:51:59.9069824Z 'num_warps': 4, 2026-02-21T17:51:59.9069991Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:51:59.9070183Z 'range_flattens': [None, True], 2026-02-21T17:51:59.9070359Z 'range_multi_buffers': [True, False], 2026-02-21T17:51:59.9070539Z 
'range_num_stages': [0, 0], 2026-02-21T17:51:59.9070705Z 'range_unroll_factors': [0, 0], 2026-02-21T17:51:59.9070877Z 'range_warp_specializes': [], 2026-02-21T17:51:59.9071040Z 'waves_per_eu': 1} 2026-02-21T17:51:59.9105744Z [123s] Fitting surrogate: 563 points, 563 targets 2026-02-21T17:52:00.9279668Z [124s] Generation 6 starting: 86 neighbors, 5 active search path(s) 2026-02-21T17:52:09.1509715Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 3.8 configs/s 2026-02-21T17:52:13.6236661Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 19.7 configs/s 2026-02-21T17:52:17.2309432Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 257.0 2026-02-21T17:52:17.2310089Z configs/s 2026-02-21T17:52:17.9212737Z [141s] Generation 6 complete: 2026-02-21T17:52:17.9213105Z error=16 2026-02-21T17:52:17.9213311Z ok=76 2026-02-21T17:52:17.9213515Z min=0.0648 2026-02-21T17:52:17.9213718Z mid=0.1117 2026-02-21T17:52:17.9213917Z max=4.3792 2026-02-21T17:52:17.9214144Z best={'block_sizes': [128, 128, 64], 2026-02-21T17:52:17.9214525Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T17:52:17.9214895Z 'l2_groupings': [2], 2026-02-21T17:52:17.9215168Z 'load_eviction_policies': ['', ''], 2026-02-21T17:52:17.9215479Z 'loop_orders': [[0, 1]], 2026-02-21T17:52:17.9215755Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:52:17.9216041Z 'num_sm_multiplier': 128, 2026-02-21T17:52:17.9216329Z 'num_stages': 2, 2026-02-21T17:52:17.9216564Z 'num_warps': 4, 2026-02-21T17:52:17.9216828Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:52:17.9217157Z 'range_flattens': [False, None], 2026-02-21T17:52:17.9217802Z 'range_multi_buffers': [False, True], 2026-02-21T17:52:17.9218123Z 'range_num_stages': [0, 0], 2026-02-21T17:52:17.9218403Z 'range_unroll_factors': [0, 0], 2026-02-21T17:52:17.9218700Z 'range_warp_specializes': [], 2026-02-21T17:52:17.9218977Z 'waves_per_eu': 2} 2026-02-21T17:52:17.9723181Z [141s] Fitting surrogate: 655 points, 655 
targets 2026-02-21T17:52:18.9480284Z [142s] Generation 7 starting: 88 neighbors, 5 active search path(s) 2026-02-21T17:52:24.7722118Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 23.7 configs/s 2026-02-21T17:52:28.9686581Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 21.5 configs/s 2026-02-21T17:52:32.8474099Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 239.8 2026-02-21T17:52:32.8474742Z configs/s 2026-02-21T17:52:33.5299678Z [157s] Generation 7 complete: 2026-02-21T17:52:33.5300067Z error=20 2026-02-21T17:52:33.5300301Z ok=73 2026-02-21T17:52:33.5300517Z min=0.0644 2026-02-21T17:52:33.5301034Z mid=0.1093 2026-02-21T17:52:33.5301241Z max=1.5096 2026-02-21T17:52:33.5301475Z best={'block_sizes': [128, 128, 64], 2026-02-21T17:52:33.5301857Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T17:52:33.5302224Z 'l2_groupings': [2], 2026-02-21T17:52:33.5302495Z 'load_eviction_policies': ['', ''], 2026-02-21T17:52:33.5302803Z 'loop_orders': [[0, 1]], 2026-02-21T17:52:33.5303084Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:52:33.5303371Z 'num_sm_multiplier': 128, 2026-02-21T17:52:33.5303635Z 'num_stages': 2, 2026-02-21T17:52:33.5303872Z 'num_warps': 4, 2026-02-21T17:52:33.5304137Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:52:33.5304464Z 'range_flattens': [False, None], 2026-02-21T17:52:33.5304774Z 'range_multi_buffers': [False, True], 2026-02-21T17:52:33.5305085Z 'range_num_stages': [0, 0], 2026-02-21T17:52:33.5305384Z 'range_unroll_factors': [0, 0], 2026-02-21T17:52:33.5305685Z 'range_warp_specializes': [], 2026-02-21T17:52:33.5305972Z 'waves_per_eu': 2} 2026-02-21T17:52:33.5348936Z [157s] Fitting surrogate: 748 points, 748 targets 2026-02-21T17:52:34.8065846Z [158s] Generation 8 starting: 71 neighbors, 4 active search path(s) 2026-02-21T17:52:39.0525381Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72/72 65.9 configs/s 2026-02-21T17:52:42.9799705Z Generation 8: exploring 
neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 72/72 18.5 configs/s 2026-02-21T17:52:47.0583527Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 229.3 2026-02-21T17:52:47.0584147Z configs/s 2026-02-21T17:52:47.7376509Z [171s] Generation 8 complete: 2026-02-21T17:52:47.7376859Z error=8 2026-02-21T17:52:47.7377464Z ok=67 2026-02-21T17:52:47.7377673Z min=0.0652 2026-02-21T17:52:47.7377888Z mid=0.0989 2026-02-21T17:52:47.7378098Z max=2.6229 2026-02-21T17:52:47.7378332Z best={'block_sizes': [128, 128, 64], 2026-02-21T17:52:47.7378872Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T17:52:47.7379264Z 'l2_groupings': [2], 2026-02-21T17:52:47.7379538Z 'load_eviction_policies': ['', ''], 2026-02-21T17:52:47.7379848Z 'loop_orders': [[0, 1]], 2026-02-21T17:52:47.7380129Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:52:47.7380410Z 'num_sm_multiplier': 128, 2026-02-21T17:52:47.7380667Z 'num_stages': 2, 2026-02-21T17:52:47.7380895Z 'num_warps': 4, 2026-02-21T17:52:47.7381151Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:52:47.7381471Z 'range_flattens': [False, None], 2026-02-21T17:52:47.7381767Z 'range_multi_buffers': [False, True], 2026-02-21T17:52:47.7382064Z 'range_num_stages': [0, 0], 2026-02-21T17:52:47.7382337Z 'range_unroll_factors': [0, 0], 2026-02-21T17:52:47.7382628Z 'range_warp_specializes': [], 2026-02-21T17:52:47.7382896Z 'waves_per_eu': 2} 2026-02-21T17:52:47.7433746Z [171s] Fitting surrogate: 823 points, 823 targets 2026-02-21T17:52:48.3785053Z [172s] Generation 9 starting: 54 neighbors, 3 active search path(s) 2026-02-21T17:52:52.5107346Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54/54 5.4 configs/s 2026-02-21T17:52:55.3315157Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 54/54 19.5 configs/s 2026-02-21T17:52:58.1040406Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 326.9 2026-02-21T17:52:58.1041025Z configs/s 2026-02-21T17:52:58.6994564Z [182s] Generation 9 complete: 
2026-02-21T17:52:58.6994933Z error=10 2026-02-21T17:52:58.6995145Z ok=48 2026-02-21T17:52:58.6995354Z min=0.0652 2026-02-21T17:52:58.6995564Z mid=0.0991 2026-02-21T17:52:58.6995766Z max=5.0325 2026-02-21T17:52:58.6995997Z best={'block_sizes': [128, 128, 64], 2026-02-21T17:52:58.6996407Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T17:52:58.6996784Z 'l2_groupings': [2], 2026-02-21T17:52:58.6997063Z 'load_eviction_policies': ['', ''], 2026-02-21T17:52:58.6997392Z 'loop_orders': [[0, 1]], 2026-02-21T17:52:58.6997689Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:52:58.6998248Z 'num_sm_multiplier': 128, 2026-02-21T17:52:58.6998519Z 'num_stages': 2, 2026-02-21T17:52:58.6998756Z 'num_warps': 4, 2026-02-21T17:52:58.6999051Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:52:58.6999364Z 'range_flattens': [False, None], 2026-02-21T17:52:58.6999654Z 'range_multi_buffers': [False, True], 2026-02-21T17:52:58.6999941Z 'range_num_stages': [0, 0], 2026-02-21T17:52:58.7000205Z 'range_unroll_factors': [0, 0], 2026-02-21T17:52:58.7000489Z 'range_warp_specializes': [], 2026-02-21T17:52:58.7000747Z 'waves_per_eu': 2} 2026-02-21T17:52:58.7367967Z [182s] Fitting surrogate: 881 points, 881 targets 2026-02-21T17:52:59.2058637Z [182s] Generation 10 starting: 36 neighbors, 2 active search path(s) 2026-02-21T17:53:01.9223266Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 34.5 configs/s 2026-02-21T17:53:03.4669478Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 36/36 23.1 configs/s 2026-02-21T17:53:05.4351556Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 447.7 2026-02-21T17:53:05.4352183Z configs/s 2026-02-21T17:53:05.9689384Z [189s] Generation 10 complete: 2026-02-21T17:53:05.9689903Z error=12 2026-02-21T17:53:05.9690215Z ok=27 2026-02-21T17:53:05.9690423Z min=0.0623 2026-02-21T17:53:05.9690659Z mid=0.0836 2026-02-21T17:53:05.9690862Z max=0.7321 2026-02-21T17:53:05.9691087Z best={'block_sizes': [128, 128, 64], 
2026-02-21T17:53:05.9691469Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T17:53:05.9691846Z 'l2_groupings': [2], 2026-02-21T17:53:05.9692116Z 'load_eviction_policies': ['', ''], 2026-02-21T17:53:05.9692428Z 'loop_orders': [[0, 1]], 2026-02-21T17:53:05.9693038Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:53:05.9693324Z 'num_sm_multiplier': 128, 2026-02-21T17:53:05.9693592Z 'num_stages': 2, 2026-02-21T17:53:05.9693960Z 'num_warps': 4, 2026-02-21T17:53:05.9694237Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:53:05.9694583Z 'range_flattens': [False, None], 2026-02-21T17:53:05.9694891Z 'range_multi_buffers': [False, True], 2026-02-21T17:53:05.9695208Z 'range_num_stages': [0, 0], 2026-02-21T17:53:05.9695490Z 'range_unroll_factors': [0, 0], 2026-02-21T17:53:05.9695785Z 'range_warp_specializes': [], 2026-02-21T17:53:05.9696065Z 'waves_per_eu': 2} 2026-02-21T17:53:05.9939311Z [189s] Fitting surrogate: 920 points, 920 targets 2026-02-21T17:53:06.2889487Z [190s] Generation 11 starting: 18 neighbors, 1 active search path(s) 2026-02-21T17:53:07.5940471Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 16.3 configs/s 2026-02-21T17:53:08.5120405Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 19.6 configs/s 2026-02-21T17:53:09.8663786Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 619.5 2026-02-21T17:53:09.8664426Z configs/s 2026-02-21T17:53:10.4103368Z [194s] Generation 11 complete: 2026-02-21T17:53:10.4103838Z error=3 2026-02-21T17:53:10.4104026Z ok=17 2026-02-21T17:53:10.4104248Z min=0.0616 2026-02-21T17:53:10.4104442Z mid=0.0718 2026-02-21T17:53:10.4104619Z max=0.4870 2026-02-21T17:53:10.4104826Z best={'block_sizes': [128, 128, 64], 2026-02-21T17:53:10.4105166Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T17:53:10.4105505Z 'l2_groupings': [2], 2026-02-21T17:53:10.4105740Z 'load_eviction_policies': ['', ''], 2026-02-21T17:53:10.4106025Z 'loop_orders': [[0, 1]], 
2026-02-21T17:53:10.4106279Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:53:10.4106538Z 'num_sm_multiplier': 128, 2026-02-21T17:53:10.4106778Z 'num_stages': 2, 2026-02-21T17:53:10.4106996Z 'num_warps': 4, 2026-02-21T17:53:10.4107226Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:53:10.4107487Z 'range_flattens': [False, None], 2026-02-21T17:53:10.4107738Z 'range_multi_buffers': [False, True], 2026-02-21T17:53:10.4107995Z 'range_num_stages': [0, 0], 2026-02-21T17:53:10.4108305Z 'range_unroll_factors': [0, 0], 2026-02-21T17:53:10.4108551Z 'range_warp_specializes': [], 2026-02-21T17:53:10.4108780Z 'waves_per_eu': 2} 2026-02-21T17:53:10.4316842Z [194s] Fitting surrogate: 940 points, 940 targets 2026-02-21T17:53:10.7030884Z [194s] Generation 12 starting: 17 neighbors, 1 active search path(s) 2026-02-21T17:53:11.8654934Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 21.8 configs/s 2026-02-21T17:53:12.6968469Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 22.3 configs/s 2026-02-21T17:53:14.0266608Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 634.0 2026-02-21T17:53:14.0267071Z configs/s 2026-02-21T17:53:14.5613250Z [198s] Generation 12 complete: 2026-02-21T17:53:14.5613386Z error=4 2026-02-21T17:53:14.5613473Z ok=15 2026-02-21T17:53:14.5613551Z min=0.0581 2026-02-21T17:53:14.5613831Z mid=0.0726 2026-02-21T17:53:14.5613919Z max=0.1068 2026-02-21T17:53:14.5614024Z best={'block_sizes': [128, 128, 64], 2026-02-21T17:53:14.5614183Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T17:53:14.5614315Z 'l2_groupings': [2], 2026-02-21T17:53:14.5614421Z 'load_eviction_policies': ['', ''], 2026-02-21T17:53:14.5614534Z 'loop_orders': [[1, 0]], 2026-02-21T17:53:14.5614639Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:53:14.5614747Z 'num_sm_multiplier': 2, 2026-02-21T17:53:14.5614849Z 'num_stages': 2, 2026-02-21T17:53:14.5614936Z 'num_warps': 4, 2026-02-21T17:53:14.5615036Z 'pid_type': 'persistent_interleaved', 
2026-02-21T17:53:14.5615155Z 'range_flattens': [False, True], 2026-02-21T17:53:14.5615272Z 'range_multi_buffers': [True, False], 2026-02-21T17:53:14.5615390Z 'range_num_stages': [0, 0], 2026-02-21T17:53:14.5615493Z 'range_unroll_factors': [0, 0], 2026-02-21T17:53:14.5615605Z 'range_warp_specializes': [], 2026-02-21T17:53:14.5615705Z 'waves_per_eu': 2} 2026-02-21T17:53:14.5819486Z [198s] Fitting surrogate: 959 points, 959 targets 2026-02-21T17:53:14.8998553Z [198s] Generation 13 starting: 20 neighbors, 1 active search path(s) 2026-02-21T17:53:17.8332784Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 3.4 configs/s 2026-02-21T17:53:18.8950140Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 22.1 configs/s 2026-02-21T17:53:20.1496368Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 660.3 2026-02-21T17:53:20.1496980Z configs/s 2026-02-21T17:53:20.6267549Z [204s] Generation 13 complete: 2026-02-21T17:53:20.6267919Z error=4 2026-02-21T17:53:20.6268131Z ok=17 2026-02-21T17:53:20.6268428Z min=0.0618 2026-02-21T17:53:20.6268666Z mid=0.0776 2026-02-21T17:53:20.6268867Z max=1.8826 2026-02-21T17:53:20.6269096Z best={'block_sizes': [128, 128, 64], 2026-02-21T17:53:20.6269477Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T17:53:20.6269846Z 'l2_groupings': [2], 2026-02-21T17:53:20.6270537Z 'load_eviction_policies': ['', ''], 2026-02-21T17:53:20.6271001Z 'loop_orders': [[1, 0]], 2026-02-21T17:53:20.6271277Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:53:20.6271561Z 'num_sm_multiplier': 2, 2026-02-21T17:53:20.6271825Z 'num_stages': 2, 2026-02-21T17:53:20.6272056Z 'num_warps': 4, 2026-02-21T17:53:20.6272311Z 'pid_type': 'persistent_interleaved', 2026-02-21T17:53:20.6272643Z 'range_flattens': [False, True], 2026-02-21T17:53:20.6272942Z 'range_multi_buffers': [True, None], 2026-02-21T17:53:20.6273252Z 'range_num_stages': [0, 0], 2026-02-21T17:53:20.6273528Z 'range_unroll_factors': [0, 0], 2026-02-21T17:53:20.6273830Z 
'range_warp_specializes': [], 2026-02-21T17:53:20.6274114Z 'waves_per_eu': 2} 2026-02-21T17:53:20.6452423Z [204s] Fitting surrogate: 980 points, 980 targets 2026-02-21T17:53:20.9537125Z [204s] Generation 14 starting: 19 neighbors, 1 active search path(s) 2026-02-21T17:53:22.4100637Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 16.3 configs/s 2026-02-21T17:53:23.5800227Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 18.6 configs/s 2026-02-21T17:53:25.1090259Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 555.6 2026-02-21T17:53:25.1090854Z configs/s 2026-02-21T17:53:25.5809044Z [209s] Generation 14 complete: 2026-02-21T17:53:25.5809422Z error=1 2026-02-21T17:53:25.5809632Z ok=19 2026-02-21T17:53:25.5809843Z min=0.0620 2026-02-21T17:53:25.5810055Z mid=0.0776 2026-02-21T17:53:25.5810256Z max=0.1234 2026-02-21T17:53:25.5810483Z best={'block_sizes': [128, 128, 64], 2026-02-21T17:53:25.5810856Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T17:53:25.5811213Z 'l2_groupings': [2], 2026-02-21T17:53:25.5811838Z 'load_eviction_policies': ['', ''], 2026-02-21T17:53:25.5812144Z 'loop_orders': [[1, 0]], 2026-02-21T17:53:25.5812429Z 'matrix_instr_nonkdim': 16, 2026-02-21T17:53:25.5815592Z 'num_sm_multiplier': 2, 2026-02-21T17:53:25.5815989Z 'num_stages': 2, 2026-02-21T17:53:25.5816248Z 'num_warps': 4, 2026-02-21T17:53:25.5816456Z 'pid_type': 'persistent_blocked', 2026-02-21T17:53:25.5816693Z 'range_flattens': [False, True], 2026-02-21T17:53:25.5816911Z 'range_multi_buffers': [True, None], 2026-02-21T17:53:25.5817138Z 'range_num_stages': [0, 0], 2026-02-21T17:53:25.5817341Z 'range_unroll_factors': [0, 0], 2026-02-21T17:53:25.5817555Z 'range_warp_specializes': [], 2026-02-21T17:53:25.5817750Z 'waves_per_eu': 2} 2026-02-21T17:53:25.6017177Z [209s] Fitting surrogate: 1000 points, 1000 targets 2026-02-21T17:53:25.7346228Z [209s] Autotuning complete in 209.5s after searching 961 configs. 
2026-02-21T17:53:25.7346443Z One can hardcode the best config and skip autotuning with: 2026-02-21T17:53:25.7347393Z @helion.kernel(config=helion.Config(block_sizes=[128, 128, 64], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=2, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, True], range_multi_buffers=[True, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T17:53:25.7348066Z 2026-02-21T17:53:25.7348310Z [209s] Code of selected kernel: /tmp/torchinductor_root/zt/cztp7kv3rw4vcgk7fo4w5k2rebn5k44kj7djvv625alyap66kg2f.py 2026-02-21T17:53:52.9031040Z WARNING:tritonbench.utils.triton_op:Completed input ID 3: 2026-02-21T17:53:52.9031531Z (M, N, K) 2026-02-21T17:53:52.9031760Z ------------------ 2026-02-21T17:53:52.9032002Z (2048, 4096, 2048) 2026-02-21T17:53:52.9032158Z 2026-02-21T17:53:52.9041307Z 38%|███▊ | 3/8 [12:53<22:11, 266.32s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5: 2026-02-21T17:53:52.9041921Z (M, N, K) 2026-02-21T17:53:52.9042142Z ------------------ 2026-02-21T17:53:52.9042386Z (1024, 8192, 1024) 2026-02-21T17:53:52.9042838Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T17:54:29.8825467Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T17:55:14.8043836Z Autotune Choices Stats: 2026-02-21T17:55:14.8045487Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_138", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.04087899997830391, "best_triton_pos": 0} 2026-02-21T17:55:14.8052409Z AUTOTUNE 
mm(1024x1024, 1024x8192) 2026-02-21T17:55:14.8052795Z strides: [1024, 1], [1, 1024] 2026-02-21T17:55:14.8053118Z dtypes: torch.float16, torch.float16 2026-02-21T17:55:14.8054152Z triton_mm_138 0.0409 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:55:14.8055335Z triton_mm_142 0.0411 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:55:14.8056492Z triton_mm_134 0.0424 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T17:55:14.8057622Z triton_mm_135 0.0471 ms 86.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:55:14.8059337Z triton_mm_136 0.0481 ms 85.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:55:14.8060472Z triton_mm_129 0.0510 ms 80.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T17:55:14.8061596Z triton_mm_139 0.0538 ms 75.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 
2026-02-21T17:55:14.8062718Z triton_mm_143 0.0544 ms 75.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:55:14.8063955Z triton_mm_124 0.0558 ms 73.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:55:14.8064948Z triton_mm_120 0.0561 ms 72.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:55:14.8065583Z SingleProcess AUTOTUNE benchmarking takes 0.5397 seconds and 0.2428 seconds precompiling for 36 choices 2026-02-21T17:55:14.9592790Z INFO:tritonbench.utils.triton_op:Took 1565.81ms to get benchmark function for pt2_triton_matmul 2026-02-21T17:55:49.3394384Z WARNING:__main__:Input tensor metadata: 2026-02-21T17:55:49.3394846Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T17:55:49.3395186Z 'dtype': 'torch.float16', 2026-02-21T17:55:49.3395545Z 'shape': (1024, 1024), 2026-02-21T17:55:49.3395844Z 'stride': (1024, 1)}, 2026-02-21T17:55:49.3396143Z { 'device': 'cuda:0', 2026-02-21T17:55:49.3396455Z 'dtype': 'torch.float16', 2026-02-21T17:55:49.3397282Z 'shape': (1024, 8192), 2026-02-21T17:55:49.3397576Z 'stride': (1, 1024)}, 2026-02-21T17:55:49.3397861Z None), 2026-02-21T17:55:49.3398093Z 'kwargs': {}} 2026-02-21T17:55:49.3420200Z INFO:tritonbench.utils.triton_op:Took 3.06ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T17:55:49.4101552Z [0s] Autotune random seed: 2169133161 2026-02-21T17:55:49.4731832Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T17:55:55.3970276Z Initial 
population precompiling 100% ━━━━━━━━━━━━━━━━━━━━ 100/100 14.4 configs/s 2026-02-21T17:56:02.7911891Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T17:56:02.7914053Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T17:56:02.7915012Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T17:56:02.7915825Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T17:56:02.7916623Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T17:56:02.7917435Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T17:56:02.7918261Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T17:56:02.7919815Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T17:56:02.7921072Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T17:56:02.7922028Z %cst = arith.constant dense<0.000000e+00> : tensor<256x16xf16, #blocked> 2026-02-21T17:56:02.7922375Z %cst_0 = arith.constant dense<8192> : tensor<1x16xi64, #blocked> 2026-02-21T17:56:02.7922678Z %cst_1 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T17:56:02.7922993Z %cst_2 = arith.constant dense<1024> : tensor<256x1xi64, 
#blocked1> 2026-02-21T17:56:02.7923299Z %cst_3 = arith.constant dense<0> : tensor<256x1xi64, #blocked1> 2026-02-21T17:56:02.7923588Z %cst_4 = arith.constant dense<1024> : tensor<1x16xi64, #blocked> 2026-02-21T17:56:02.7923835Z %c256_i32 = arith.constant 256 : i32 2026-02-21T17:56:02.7924035Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T17:56:02.7924239Z %c0_i32 = arith.constant 0 : i32 2026-02-21T17:56:02.7924514Z %c304_i32 = arith.constant 304 : i32 2026-02-21T17:56:02.7924748Z %cst_5 = arith.constant dense<8192> : tensor<4x1xi32, #blocked1> 2026-02-21T17:56:02.7925028Z %cst_6 = arith.constant dense<1024> : tensor<4x1xi32, #blocked1> 2026-02-21T17:56:02.7925324Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<4x16xf32, #blocked> 2026-02-21T17:56:02.7925584Z %c4_i32 = arith.constant 4 : i32 2026-02-21T17:56:02.7925770Z %c16_i32 = arith.constant 16 : i32 2026-02-21T17:56:02.7925953Z %c2_i32 = arith.constant 2 : i32 2026-02-21T17:56:02.7926139Z %c512_i32 = arith.constant 512 : i32 2026-02-21T17:56:02.7926332Z %c131072_i32 = arith.constant 131072 : i32 2026-02-21T17:56:02.7926530Z %0 = tt.get_program_id x : i32 2026-02-21T17:56:02.7926790Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked2> 2026-02-21T17:56:02.7927127Z %2 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked2> 2026-02-21T17:56:02.7927454Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked2> 2026-02-21T17:56:02.7927870Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<4x256x!tt.ptr, #blocked3> 2026-02-21T17:56:02.7928191Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<256x16x!tt.ptr, #blocked> 2026-02-21T17:56:02.7928517Z %6 = arith.extsi %3 : tensor<256xi32, #blocked2> to tensor<256xi64, #blocked2> 2026-02-21T17:56:02.7928846Z %7 = arith.extsi %1 : tensor<16xi32, #blocked2> to tensor<16xi64, #blocked2> 2026-02-21T17:56:02.7929171Z %8 = tt.splat %arg2 : !tt.ptr -> tensor<4x16x!tt.ptr, #blocked> 2026-02-21T17:56:02.7929475Z scf.for 
%arg3 = %0 to %c131072_i32 step %c304_i32 : i32 { 2026-02-21T17:56:02.7929711Z %9 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T17:56:02.7929863Z %10 = arith.muli %9, %c2_i32 : i32 2026-02-21T17:56:02.7930008Z %11 = arith.subi %c512_i32, %10 : i32 2026-02-21T17:56:02.7930157Z %12 = arith.minsi %11, %c2_i32 : i32 2026-02-21T17:56:02.7930305Z %13 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T17:56:02.7930453Z %14 = arith.remsi %13, %12 : i32 2026-02-21T17:56:02.7930582Z %15 = arith.addi %10, %14 : i32 2026-02-21T17:56:02.7930723Z %16 = arith.divsi %13, %12 : i32 2026-02-21T17:56:02.7930864Z %17 = arith.muli %15, %c16_i32 : i32 2026-02-21T17:56:02.7931026Z %18 = tt.splat %17 : i32 -> tensor<16xi32, #blocked2> 2026-02-21T17:56:02.7931208Z %19 = arith.addi %18, %1 : tensor<16xi32, #blocked2> 2026-02-21T17:56:02.7931370Z %20 = arith.muli %16, %c4_i32 : i32 2026-02-21T17:56:02.7931530Z %21 = tt.splat %20 : i32 -> tensor<4xi32, #blocked2> 2026-02-21T17:56:02.7931703Z %22 = arith.addi %21, %2 : tensor<4xi32, #blocked2> 2026-02-21T17:56:02.7932009Z %23 = ttg.convert_layout %22 : tensor<4xi32, #blocked2> -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:56:02.7932435Z %24 = tt.expand_dims %23 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi32, #blocked4> 2026-02-21T17:56:02.7932787Z %25 = ttg.convert_layout %24 : tensor<4x1xi32, #blocked4> -> tensor<4x1xi32, #blocked1> 2026-02-21T17:56:02.7933036Z %26 = arith.muli %25, %cst_6 : tensor<4x1xi32, #blocked1> 2026-02-21T17:56:02.7933272Z %27 = tt.broadcast %26 : tensor<4x1xi32, #blocked1> -> tensor<4x256xi32, #blocked1> 2026-02-21T17:56:02.7933553Z %28 = ttg.convert_layout %27 : tensor<4x256xi32, #blocked1> -> tensor<4x256xi32, #blocked3> 2026-02-21T17:56:02.7933784Z %29 = arith.extsi %17 : i32 to i64 2026-02-21T17:56:02.7933942Z %30 = tt.splat %29 : i64 -> tensor<16xi64, #blocked2> 2026-02-21T17:56:02.7934119Z %31 = arith.addi %30, %7 : tensor<16xi64, 
#blocked2> 2026-02-21T17:56:02.7934404Z %32 = ttg.convert_layout %31 : tensor<16xi64, #blocked2> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:56:02.7934827Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi64, #blocked5> 2026-02-21T17:56:02.7935179Z %34 = ttg.convert_layout %33 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #blocked> 2026-02-21T17:56:02.7935429Z %35 = arith.muli %34, %cst_4 : tensor<1x16xi64, #blocked> 2026-02-21T17:56:02.7935653Z %36 = tt.broadcast %35 : tensor<1x16xi64, #blocked> -> tensor<256x16xi64, #blocked> 2026-02-21T17:56:02.7935896Z %37 = arith.cmpi sge, %34, %cst_1 : tensor<1x16xi64, #blocked> 2026-02-21T17:56:02.7936106Z %38 = arith.cmpi slt, %34, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T17:56:02.7936304Z %39 = arith.andi %37, %38 : tensor<1x16xi1, #blocked> 2026-02-21T17:56:02.7936521Z %40 = tt.broadcast %39 : tensor<1x16xi1, #blocked> -> tensor<256x16xi1, #blocked> 2026-02-21T17:56:02.7936862Z %41 = scf.for %arg4 = %c0_i32 to %c1024_i32 step %c256_i32 iter_args(%arg5 = %cst_7) -> (tensor<4x16xf32, #blocked>) : i32 { 2026-02-21T17:56:02.7937168Z %55 = tt.splat %arg4 : i32 -> tensor<256xi32, #blocked2> 2026-02-21T17:56:02.7937385Z %56 = arith.addi %55, %3 : tensor<256xi32, #blocked2> 2026-02-21T17:56:02.7937680Z %57 = ttg.convert_layout %56 : tensor<256xi32, #blocked2> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:56:02.7938096Z %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T17:56:02.7938459Z %59 = ttg.convert_layout %58 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T17:56:02.7938769Z %60 = tt.broadcast %59 : tensor<1x256xi32, #blocked3> -> tensor<4x256xi32, #blocked3> 2026-02-21T17:56:02.7939007Z %61 = arith.addi %28, %60 : tensor<4x256xi32, #blocked3> 
2026-02-21T17:56:02.7939202Z %62 = tt.addptr %4, %61 : tensor<4x256x!tt.ptr, #blocked3>, tensor<4x256xi32, #blocked3> 2026-02-21T17:56:02.7939405Z %63 = tt.load %62 : tensor<4x256x!tt.ptr, #blocked3> 2026-02-21T17:56:02.7939551Z %64 = arith.extsi %arg4 : i32 to i64 2026-02-21T17:56:02.7939685Z %65 = tt.splat %64 : i64 -> tensor<256xi64, #blocked2> 2026-02-21T17:56:02.7939836Z %66 = arith.addi %65, %6 : tensor<256xi64, #blocked2> 2026-02-21T17:56:02.7940068Z %67 = ttg.convert_layout %66 : tensor<256xi64, #blocked2> -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:56:02.7940399Z %68 = tt.expand_dims %67 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<256x1xi64, #blocked4> 2026-02-21T17:56:02.7940689Z %69 = ttg.convert_layout %68 : tensor<256x1xi64, #blocked4> -> tensor<256x1xi64, #blocked1> 2026-02-21T17:56:02.7940942Z %70 = tt.broadcast %69 : tensor<256x1xi64, #blocked1> -> tensor<256x16xi64, #blocked1> 2026-02-21T17:56:02.7941197Z %71 = ttg.convert_layout %70 : tensor<256x16xi64, #blocked1> -> tensor<256x16xi64, #blocked> 2026-02-21T17:56:02.7941399Z %72 = arith.addi %71, %36 : tensor<256x16xi64, #blocked> 2026-02-21T17:56:02.7941596Z %73 = tt.addptr %5, %72 : tensor<256x16x!tt.ptr, #blocked>, tensor<256x16xi64, #blocked> 2026-02-21T17:56:02.7941803Z %74 = arith.cmpi sge, %69, %cst_3 : tensor<256x1xi64, #blocked1> 2026-02-21T17:56:02.7941977Z %75 = arith.cmpi slt, %69, %cst_2 : tensor<256x1xi64, #blocked1> 2026-02-21T17:56:02.7942142Z %76 = arith.andi %74, %75 : tensor<256x1xi1, #blocked1> 2026-02-21T17:56:02.7942328Z %77 = tt.broadcast %76 : tensor<256x1xi1, #blocked1> -> tensor<256x16xi1, #blocked1> 2026-02-21T17:56:02.7942562Z %78 = ttg.convert_layout %77 : tensor<256x16xi1, #blocked1> -> tensor<256x16xi1, #blocked> 2026-02-21T17:56:02.7942759Z %79 = arith.andi %78, %40 : tensor<256x16xi1, #blocked> 2026-02-21T17:56:02.7942922Z %80 = tt.load %73, %79, %cst : tensor<256x16x!tt.ptr, #blocked> 
2026-02-21T17:56:02.7943200Z %81 = ttg.convert_layout %63 : tensor<4x256xf16, #blocked3> -> tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T17:56:02.7943542Z %82 = ttg.convert_layout %80 : tensor<256x16xf16, #blocked> -> tensor<256x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T17:56:02.7943837Z %83 = ttg.convert_layout %arg5 : tensor<4x16xf32, #blocked> -> tensor<4x16xf32, #blocked> 2026-02-21T17:56:02.7944235Z %84 = tt.dot %81, %82, %83, inputPrecision = tf32 : tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<256x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<4x16xf32, #blocked> 2026-02-21T17:56:02.7944574Z scf.yield %84 : tensor<4x16xf32, #blocked> 2026-02-21T17:56:02.7944704Z } {tt.disallow_acc_multi_buffer} 2026-02-21T17:56:02.7944867Z %42 = arith.truncf %41 : tensor<4x16xf32, #blocked> to tensor<4x16xf16, #blocked> 2026-02-21T17:56:02.7945134Z %43 = ttg.convert_layout %22 : tensor<4xi32, #blocked2> -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T17:56:02.7945452Z %44 = tt.expand_dims %43 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi32, #blocked4> 2026-02-21T17:56:02.7945753Z %45 = ttg.convert_layout %44 : tensor<4x1xi32, #blocked4> -> tensor<4x1xi32, #blocked1> 2026-02-21T17:56:02.7945951Z %46 = arith.muli %45, %cst_5 : tensor<4x1xi32, #blocked1> 2026-02-21T17:56:02.7946185Z %47 = ttg.convert_layout %19 : tensor<16xi32, #blocked2> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T17:56:02.7946510Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi32, #blocked5> 2026-02-21T17:56:02.7946789Z %49 = ttg.convert_layout %48 : tensor<1x16xi32, #blocked5> -> tensor<1x16xi32, #blocked> 2026-02-21T17:56:02.7947018Z %50 = tt.broadcast %46 : tensor<4x1xi32, #blocked1> -> tensor<4x16xi32, #blocked1> 2026-02-21T17:56:02.7947243Z %51 = 
ttg.convert_layout %50 : tensor<4x16xi32, #blocked1> -> tensor<4x16xi32, #blocked> 2026-02-21T17:56:02.7947463Z %52 = tt.broadcast %49 : tensor<1x16xi32, #blocked> -> tensor<4x16xi32, #blocked> 2026-02-21T17:56:02.7947647Z %53 = arith.addi %51, %52 : tensor<4x16xi32, #blocked> 2026-02-21T17:56:02.7947833Z %54 = tt.addptr %8, %53 : tensor<4x16x!tt.ptr, #blocked>, tensor<4x16xi32, #blocked> 2026-02-21T17:56:02.7948029Z tt.store %54, %42 : tensor<4x16x!tt.ptr, #blocked> 2026-02-21T17:56:02.7948244Z } {tt.disallow_acc_multi_buffer, tt.flatten} 2026-02-21T17:56:02.7948366Z tt.return 2026-02-21T17:56:02.7948448Z } 2026-02-21T17:56:02.7948523Z } 2026-02-21T17:56:02.7948566Z 2026-02-21T17:56:02.7948599Z {-# 2026-02-21T17:56:02.7948682Z external_resources: { 2026-02-21T17:56:02.7948815Z mlir_reproducer: { 2026-02-21T17:56:02.7951093Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, 
tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T17:56:02.7953409Z disable_threading: false, 2026-02-21T17:56:02.7953516Z verify_each: true 2026-02-21T17:56:02.7953608Z } 2026-02-21T17:56:02.7953683Z } 2026-02-21T17:56:02.7953752Z #-} 2026-02-21T17:56:02.7954036Z /tmp/torchinductor_root/ih/cihioyifcfjhbnozeaabqskrwhdrvb6fr5i6ozy7afvb7zd7biia.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T17:56:02.7954718Z /tmp/torchinductor_root/ih/cihioyifcfjhbnozeaabqskrwhdrvb6fr5i6ozy7afvb7zd7biia.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T17:56:02.7955272Z [13s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T17:56:02.7956066Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 256], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T17:56:02.7956808Z Error: RuntimeError: PassManager::run failed 2026-02-21T17:56:02.7956973Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T17:56:04.5588717Z Initial population exploring neighbors 95% ━━━━━━━━━━━━ 95/100 10.5 configs/s 2026-02-21T17:56:04.5591362Z WARNING:tritonbench.utils.triton_op:Completed input ID 5: 2026-02-21T17:56:04.5599569Z (M, N, K) 2026-02-21T17:56:04.5599655Z ------------------ 2026-02-21T17:56:04.5599771Z (1024, 8192, 1024) 2026-02-21T17:56:04.5599826Z 2026-02-21T17:56:04.5600448Z 50%|█████ | 4/8 [15:05<14:12, 213.16s/it]WARNING:tritonbench.utils.triton_op:Running input ID 6: 2026-02-21T17:56:04.5600648Z (M, N, K) 2026-02-21T17:56:04.5600729Z ------------------ 2026-02-21T17:56:04.5600810Z (8192, 2048, 2048) 2026-02-21T17:56:04.5602432Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T17:56:40.7464609Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T17:57:19.3339091Z Autotune Choices Stats: 2026-02-21T17:57:19.3347823Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_179", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.10759799927473068, "best_triton_pos": 0} 2026-02-21T17:57:19.3350227Z AUTOTUNE mm(8192x2048, 2048x2048) 2026-02-21T17:57:19.3350542Z strides: [2048, 1], [1, 2048] 2026-02-21T17:57:19.3350899Z dtypes: torch.float16, torch.float16 2026-02-21T17:57:19.3351916Z triton_mm_179 0.1076 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:57:19.3353347Z triton_mm_170 0.1269 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, 
num_warps=4 2026-02-21T17:57:19.3354754Z triton_mm_178 0.1336 ms 80.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:57:19.3356308Z triton_mm_173 0.1353 ms 79.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T17:57:19.3357706Z triton_mm_176 0.1404 ms 76.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T17:57:19.3359099Z triton_mm_174 0.1414 ms 76.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:57:19.3360488Z triton_mm_171 0.1417 ms 75.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:57:19.3361648Z triton_mm_177 0.1461 ms 73.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:57:19.3362660Z triton_mm_172 0.1480 ms 72.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:57:19.3363603Z triton_mm_169 0.1578 ms 68.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=1, num_stages=2, num_warps=8 2026-02-21T17:57:19.3364327Z SingleProcess AUTOTUNE benchmarking takes 1.0407 seconds and 0.2611 seconds precompiling for 36 choices 2026-02-21T17:57:19.5546448Z INFO:tritonbench.utils.triton_op:Took 1701.91ms to get benchmark function for pt2_triton_matmul 2026-02-21T17:57:53.5189247Z WARNING:__main__:Input tensor metadata: 2026-02-21T17:57:53.5189710Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T17:57:53.5190056Z 'dtype': 'torch.float16', 2026-02-21T17:57:53.5190374Z 'shape': (8192, 2048), 2026-02-21T17:57:53.5190681Z 'stride': (2048, 1)}, 2026-02-21T17:57:53.5190980Z { 'device': 'cuda:0', 2026-02-21T17:57:53.5191272Z 'dtype': 'torch.float16', 2026-02-21T17:57:53.5191568Z 'shape': (2048, 2048), 2026-02-21T17:57:53.5191852Z 'stride': (1, 2048)}, 2026-02-21T17:57:53.5192128Z None), 2026-02-21T17:57:53.5192360Z 'kwargs': {}} 2026-02-21T17:57:53.5197614Z INFO:tritonbench.utils.triton_op:Took 1.28ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T17:57:53.5850522Z [0s] Autotune random seed: 2169133161 2026-02-21T17:57:53.6464079Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T17:58:01.9818214Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 1.3 configs/s 2026-02-21T17:58:11.4394689Z Initial population exploring neighbors 27% ━━━╸ 27/100 2.8 configs/s 2026-02-21T17:58:11.4397003Z WARNING:tritonbench.utils.triton_op:Completed input ID 6: 2026-02-21T17:58:11.4397410Z (M, N, K) 2026-02-21T17:58:11.4397627Z ------------------ 2026-02-21T17:58:11.4397869Z (8192, 2048, 2048) 2026-02-21T17:58:11.4398021Z 2026-02-21T17:58:11.4404110Z 62%|██████▎ | 5/8 [17:12<09:06, 182.04s/it]WARNING:tritonbench.utils.triton_op:Running input ID 8: 2026-02-21T17:58:11.4404486Z (M, N, K) 2026-02-21T17:58:11.4404650Z ------------------- 2026-02-21T17:58:11.4404820Z (12288, 1024, 
1024) 2026-02-21T17:58:11.4406983Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T17:58:55.4937139Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T17:59:31.8792806Z Autotune Choices Stats: 2026-02-21T17:59:31.8795118Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_215", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.0549979992210865, "best_triton_pos": 0} 2026-02-21T17:59:31.8801604Z AUTOTUNE mm(12288x1024, 1024x1024) 2026-02-21T17:59:31.8801986Z strides: [1024, 1], [1, 1024] 2026-02-21T17:59:31.8802313Z dtypes: torch.float16, torch.float16 2026-02-21T17:59:31.8803370Z triton_mm_215 0.0550 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:59:31.8804756Z triton_mm_207 0.0598 ms 92.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:59:31.8806137Z triton_mm_204 0.0611 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:59:31.8807379Z triton_mm_206 0.0611 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T17:59:31.8808614Z triton_mm_208 0.0620 ms 88.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T17:59:31.8809853Z triton_mm_203 0.0622 ms 88.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T17:59:31.8811093Z triton_mm_201 0.0624 ms 88.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T17:59:31.8812337Z triton_mm_205 0.0629 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=1, num_stages=2, num_warps=8 2026-02-21T17:59:31.8813580Z triton_mm_209 0.0630 ms 87.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T17:59:31.8814684Z triton_mm_212 0.0651 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T17:59:31.8815375Z SingleProcess AUTOTUNE benchmarking takes 0.5968 seconds and 0.2578 seconds precompiling for 36 choices 2026-02-21T17:59:32.0947522Z INFO:tritonbench.utils.triton_op:Took 1245.17ms to get benchmark function for pt2_triton_matmul 2026-02-21T18:00:05.0256459Z WARNING:__main__:Input tensor metadata: 2026-02-21T18:00:05.0256985Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T18:00:05.0257318Z 'dtype': 'torch.float16', 2026-02-21T18:00:05.0257646Z 'shape': (12288, 
1024), 2026-02-21T18:00:05.0257948Z 'stride': (1024, 1)}, 2026-02-21T18:00:05.0258249Z { 'device': 'cuda:0', 2026-02-21T18:00:05.0258448Z 'dtype': 'torch.float16', 2026-02-21T18:00:05.0258563Z 'shape': (1024, 1024), 2026-02-21T18:00:05.0258675Z 'stride': (1, 1024)}, 2026-02-21T18:00:05.0258782Z None), 2026-02-21T18:00:05.0258872Z 'kwargs': {}} 2026-02-21T18:00:05.0265394Z INFO:tritonbench.utils.triton_op:Took 1.33ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T18:00:05.0919415Z [0s] Autotune random seed: 2169133161 2026-02-21T18:00:05.1496830Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T18:00:11.7753584Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━ 100/100 19.4 configs/s 2026-02-21T18:00:14.9723837Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed. 
2026-02-21T18:00:14.9725607Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:14.9726498Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:14.9727806Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [8, 8], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:14.9728575Z #blocked3 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T18:00:14.9729233Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:14.9729893Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:14.9730640Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:00:14.9731675Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:00:14.9732536Z %cst = arith.constant dense<0.000000e+00> : tensor<16x16xf16, #blocked> 2026-02-21T18:00:14.9732968Z %cst_0 = arith.constant dense<1024> : tensor<1x16xi64, #blocked> 2026-02-21T18:00:14.9733359Z %cst_1 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T18:00:14.9733757Z %cst_2 = arith.constant dense<12288> : tensor<16x1xi64, #blocked1> 2026-02-21T18:00:14.9734155Z %cst_3 = arith.constant dense<0> : tensor<16x1xi64, #blocked1> 2026-02-21T18:00:14.9734541Z %cst_4 = arith.constant dense<1024> : tensor<16x1xi64, #blocked1> 2026-02-21T18:00:14.9734881Z %c4096_i32 = arith.constant 4096 : i32 2026-02-21T18:00:14.9735145Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T18:00:14.9735401Z %c0_i32 = 
arith.constant 0 : i32 2026-02-21T18:00:14.9735776Z %c1_i32 = arith.constant 1 : i32 2026-02-21T18:00:14.9736088Z %cst_5 = arith.constant dense<1024> : tensor<16x1xi32, #blocked1> 2026-02-21T18:00:14.9736587Z %cst_6 = arith.constant dense<1024> : tensor<1x8xi32, #blocked2> 2026-02-21T18:00:14.9737002Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<16x8xf32, #blocked2> 2026-02-21T18:00:14.9737316Z %c8_i32 = arith.constant 8 : i32 2026-02-21T18:00:14.9737486Z %c16_i32 = arith.constant 16 : i32 2026-02-21T18:00:14.9737661Z %c32_i32 = arith.constant 32 : i32 2026-02-21T18:00:14.9737838Z %c768_i32 = arith.constant 768 : i32 2026-02-21T18:00:14.9738023Z %c98304_i32 = arith.constant 98304 : i32 2026-02-21T18:00:14.9738203Z %c6_i32 = arith.constant 6 : i32 2026-02-21T18:00:14.9738383Z %0 = tt.get_program_id x : i32 2026-02-21T18:00:14.9738557Z %1 = arith.muli %0, %c6_i32 : i32 2026-02-21T18:00:14.9738728Z %2 = arith.addi %1, %c6_i32 : i32 2026-02-21T18:00:14.9738913Z %3 = arith.minsi %2, %c98304_i32 : i32 2026-02-21T18:00:14.9739161Z %4 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked3> 2026-02-21T18:00:14.9739479Z %5 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked3> 2026-02-21T18:00:14.9739881Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<16x16x!tt.ptr, #blocked> 2026-02-21T18:00:14.9740194Z %7 = arith.extsi %4 : tensor<16xi32, #blocked3> to tensor<16xi64, #blocked3> 2026-02-21T18:00:14.9740502Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T18:00:14.9740795Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T18:00:14.9741053Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T18:00:14.9741264Z %10 = arith.divsi %arg3, %c4096_i32 : i32 2026-02-21T18:00:14.9741450Z %11 = arith.muli %10, %c32_i32 : i32 2026-02-21T18:00:14.9741630Z %12 = arith.subi %c768_i32, %11 : i32 2026-02-21T18:00:14.9741809Z %13 = arith.minsi %12, %c32_i32 : i32 2026-02-21T18:00:14.9741991Z 
%14 = arith.remsi %arg3, %c4096_i32 : i32 2026-02-21T18:00:14.9742175Z %15 = arith.remsi %14, %13 : i32 2026-02-21T18:00:14.9742352Z %16 = arith.addi %11, %15 : i32 2026-02-21T18:00:14.9742524Z %17 = arith.divsi %14, %13 : i32 2026-02-21T18:00:14.9742733Z %18 = arith.muli %16, %c16_i32 : i32 2026-02-21T18:00:14.9742940Z %19 = tt.splat %18 : i32 -> tensor<16xi32, #blocked3> 2026-02-21T18:00:14.9743173Z %20 = arith.addi %19, %4 : tensor<16xi32, #blocked3> 2026-02-21T18:00:14.9743374Z %21 = arith.muli %17, %c8_i32 : i32 2026-02-21T18:00:14.9743579Z %22 = tt.splat %21 : i32 -> tensor<8xi32, #blocked3> 2026-02-21T18:00:14.9743797Z %23 = arith.addi %22, %5 : tensor<8xi32, #blocked3> 2026-02-21T18:00:14.9744001Z %24 = arith.extsi %18 : i32 to i64 2026-02-21T18:00:14.9744204Z %25 = tt.splat %24 : i64 -> tensor<16xi64, #blocked3> 2026-02-21T18:00:14.9744436Z %26 = arith.addi %25, %7 : tensor<16xi64, #blocked3> 2026-02-21T18:00:14.9744800Z %27 = ttg.convert_layout %26 : tensor<16xi64, #blocked3> -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:14.9745319Z %28 = tt.expand_dims %27 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<16x1xi64, #blocked4> 2026-02-21T18:00:14.9745773Z %29 = ttg.convert_layout %28 : tensor<16x1xi64, #blocked4> -> tensor<16x1xi64, #blocked1> 2026-02-21T18:00:14.9746087Z %30 = arith.muli %29, %cst_4 : tensor<16x1xi64, #blocked1> 2026-02-21T18:00:14.9746381Z %31 = tt.broadcast %30 : tensor<16x1xi64, #blocked1> -> tensor<16x16xi64, #blocked1> 2026-02-21T18:00:14.9746743Z %32 = ttg.convert_layout %31 : tensor<16x16xi64, #blocked1> -> tensor<16x16xi64, #blocked> 2026-02-21T18:00:14.9747060Z %33 = arith.cmpi sge, %29, %cst_3 : tensor<16x1xi64, #blocked1> 2026-02-21T18:00:14.9747274Z %34 = arith.cmpi slt, %29, %cst_2 : tensor<16x1xi64, #blocked1> 2026-02-21T18:00:14.9747490Z %35 = arith.andi %33, %34 : tensor<16x1xi1, #blocked1> 2026-02-21T18:00:14.9747733Z %36 = tt.broadcast %35 : 
tensor<16x1xi1, #blocked1> -> tensor<16x16xi1, #blocked1> 2026-02-21T18:00:14.9748002Z %37 = ttg.convert_layout %36 : tensor<16x16xi1, #blocked1> -> tensor<16x16xi1, #blocked> 2026-02-21T18:00:14.9748415Z %38 = ttg.convert_layout %23 : tensor<8xi32, #blocked3> -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:14.9748799Z %39 = tt.expand_dims %38 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x8xi32, #blocked5> 2026-02-21T18:00:14.9749126Z %40 = ttg.convert_layout %39 : tensor<1x8xi32, #blocked5> -> tensor<1x8xi32, #blocked2> 2026-02-21T18:00:14.9749364Z %41 = arith.muli %40, %cst_6 : tensor<1x8xi32, #blocked2> 2026-02-21T18:00:14.9749581Z %42 = tt.broadcast %41 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T18:00:14.9749909Z %43 = scf.for %arg4 = %c0_i32 to %c1024_i32 step %c16_i32 iter_args(%arg5 = %cst_7) -> (tensor<16x8xf32, #blocked2>) : i32 { 2026-02-21T18:00:14.9750196Z %57 = tt.splat %arg4 : i32 -> tensor<16xi32, #blocked3> 2026-02-21T18:00:14.9750411Z %58 = arith.addi %57, %4 : tensor<16xi32, #blocked3> 2026-02-21T18:00:14.9750581Z %59 = arith.extsi %arg4 : i32 to i64 2026-02-21T18:00:14.9750746Z %60 = tt.splat %59 : i64 -> tensor<16xi64, #blocked3> 2026-02-21T18:00:14.9750918Z %61 = arith.addi %60, %7 : tensor<16xi64, #blocked3> 2026-02-21T18:00:14.9751195Z %62 = ttg.convert_layout %61 : tensor<16xi64, #blocked3> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:14.9751583Z %63 = tt.expand_dims %62 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi64, #blocked5> 2026-02-21T18:00:14.9751926Z %64 = ttg.convert_layout %63 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #blocked> 2026-02-21T18:00:14.9752196Z %65 = tt.broadcast %64 : tensor<1x16xi64, #blocked> -> tensor<16x16xi64, #blocked> 2026-02-21T18:00:14.9752427Z %66 = arith.addi %32, %65 : tensor<16x16xi64, #blocked> 
2026-02-21T18:00:14.9752660Z %67 = tt.addptr %6, %66 : tensor<16x16x!tt.ptr, #blocked>, tensor<16x16xi64, #blocked> 2026-02-21T18:00:14.9752931Z %68 = arith.cmpi sge, %64, %cst_1 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:14.9753135Z %69 = arith.cmpi slt, %64, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:14.9753323Z %70 = arith.andi %68, %69 : tensor<1x16xi1, #blocked> 2026-02-21T18:00:14.9753537Z %71 = tt.broadcast %70 : tensor<1x16xi1, #blocked> -> tensor<16x16xi1, #blocked> 2026-02-21T18:00:14.9753764Z %72 = arith.andi %37, %71 : tensor<16x16xi1, #blocked> 2026-02-21T18:00:14.9753949Z %73 = tt.load %67, %72, %cst : tensor<16x16x!tt.ptr, #blocked> 2026-02-21T18:00:14.9754246Z %74 = ttg.convert_layout %58 : tensor<16xi32, #blocked3> -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:14.9754638Z %75 = tt.expand_dims %74 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<16x1xi32, #blocked4> 2026-02-21T18:00:14.9754998Z %76 = ttg.convert_layout %75 : tensor<16x1xi32, #blocked4> -> tensor<16x1xi32, #blocked1> 2026-02-21T18:00:14.9755275Z %77 = tt.broadcast %76 : tensor<16x1xi32, #blocked1> -> tensor<16x8xi32, #blocked1> 2026-02-21T18:00:14.9755551Z %78 = ttg.convert_layout %77 : tensor<16x8xi32, #blocked1> -> tensor<16x8xi32, #blocked2> 2026-02-21T18:00:14.9755786Z %79 = arith.addi %78, %42 : tensor<16x8xi32, #blocked2> 2026-02-21T18:00:14.9756021Z %80 = tt.addptr %8, %79 : tensor<16x8x!tt.ptr, #blocked2>, tensor<16x8xi32, #blocked2> 2026-02-21T18:00:14.9756252Z %81 = tt.load %80 : tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T18:00:14.9756551Z %82 = ttg.convert_layout %73 : tensor<16x16xf16, #blocked> -> tensor<16x16xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked2}>> 2026-02-21T18:00:14.9757007Z %83 = ttg.convert_layout %81 : tensor<16x8xf16, #blocked2> -> tensor<16x8xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked2}>> 2026-02-21T18:00:14.9757310Z %84 = ttg.convert_layout %arg5 : tensor<16x8xf32, 
#blocked2> -> tensor<16x8xf32, #blocked2> 2026-02-21T18:00:14.9757714Z %85 = tt.dot %82, %83, %84, inputPrecision = tf32 : tensor<16x16xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked2}>> * tensor<16x8xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked2}>> -> tensor<16x8xf32, #blocked2> 2026-02-21T18:00:14.9758053Z scf.yield %85 : tensor<16x8xf32, #blocked2> 2026-02-21T18:00:14.9758180Z } 2026-02-21T18:00:14.9758313Z %44 = arith.truncf %43 : tensor<16x8xf32, #blocked2> to tensor<16x8xf16, #blocked2> 2026-02-21T18:00:14.9758586Z %45 = ttg.convert_layout %20 : tensor<16xi32, #blocked3> -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:14.9758913Z %46 = tt.expand_dims %45 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<16x1xi32, #blocked4> 2026-02-21T18:00:14.9759220Z %47 = ttg.convert_layout %46 : tensor<16x1xi32, #blocked4> -> tensor<16x1xi32, #blocked1> 2026-02-21T18:00:14.9759421Z %48 = arith.muli %47, %cst_5 : tensor<16x1xi32, #blocked1> 2026-02-21T18:00:14.9759652Z %49 = ttg.convert_layout %23 : tensor<8xi32, #blocked3> -> tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:14.9759968Z %50 = tt.expand_dims %49 {axis = 0 : i32} : tensor<8xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x8xi32, #blocked5> 2026-02-21T18:00:14.9760249Z %51 = ttg.convert_layout %50 : tensor<1x8xi32, #blocked5> -> tensor<1x8xi32, #blocked2> 2026-02-21T18:00:14.9760474Z %52 = tt.broadcast %48 : tensor<16x1xi32, #blocked1> -> tensor<16x8xi32, #blocked1> 2026-02-21T18:00:14.9760699Z %53 = ttg.convert_layout %52 : tensor<16x8xi32, #blocked1> -> tensor<16x8xi32, #blocked2> 2026-02-21T18:00:14.9760925Z %54 = tt.broadcast %51 : tensor<1x8xi32, #blocked2> -> tensor<16x8xi32, #blocked2> 2026-02-21T18:00:14.9761113Z %55 = arith.addi %53, %54 : tensor<16x8xi32, #blocked2> 2026-02-21T18:00:14.9761309Z %56 = tt.addptr %9, %55 : tensor<16x8x!tt.ptr, #blocked2>, tensor<16x8xi32, #blocked2> 
2026-02-21T18:00:14.9761529Z tt.store %56, %44 : tensor<16x8x!tt.ptr, #blocked2> 2026-02-21T18:00:14.9761661Z } 2026-02-21T18:00:14.9761738Z tt.return 2026-02-21T18:00:14.9761820Z } 2026-02-21T18:00:14.9761898Z } 2026-02-21T18:00:14.9761944Z 2026-02-21T18:00:14.9761974Z {-# 2026-02-21T18:00:14.9762055Z external_resources: { 2026-02-21T18:00:14.9762159Z mlir_reproducer: { 2026-02-21T18:00:14.9764430Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:00:14.9766753Z disable_threading: false, 2026-02-21T18:00:14.9766915Z verify_each: true 2026-02-21T18:00:14.9767019Z } 2026-02-21T18:00:14.9767099Z } 2026-02-21T18:00:14.9767172Z #-} 2026-02-21T18:00:14.9767458Z 
/tmp/torchinductor_root/mg/cmgeqelwwpdohj5dzppurhyprgljuxhck7d6ip74bdoa5esajqmo.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:00:14.9768182Z /tmp/torchinductor_root/mg/cmgeqelwwpdohj5dzppurhyprgljuxhck7d6ip74bdoa5esajqmo.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:00:14.9768735Z [9s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:00:14.9769535Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 8, 16], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=64, num_stages=3, num_warps=1, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T18:00:14.9770248Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:00:14.9770415Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:00:15.3932192Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed. 
2026-02-21T18:00:15.3935075Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:15.3935836Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:15.3936522Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:15.3937287Z #blocked3 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T18:00:15.3937861Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:15.3938456Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:15.3939119Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:00:15.3940027Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:00:15.3940779Z %cst = arith.constant dense<0.000000e+00> : tensor<32x16xf16, #blocked> 2026-02-21T18:00:15.3941164Z %cst_0 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T18:00:15.3941529Z %cst_1 = arith.constant dense<1024> : tensor<32x1xi64, #blocked1> 2026-02-21T18:00:15.3941884Z %cst_2 = arith.constant dense<0> : tensor<32x1xi64, #blocked1> 2026-02-21T18:00:15.3942229Z %cst_3 = arith.constant dense<1024> : tensor<1x16xi64, #blocked> 2026-02-21T18:00:15.3942597Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<8x32xf16, #blocked2> 2026-02-21T18:00:15.3942970Z %cst_5 = arith.constant dense<1024> : tensor<1x32xi64, #blocked2> 2026-02-21T18:00:15.3943313Z %cst_6 = arith.constant dense<0> : tensor<1x32xi64, 
#blocked2> 2026-02-21T18:00:15.3943654Z %cst_7 = arith.constant dense<12288> : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:15.3944039Z %cst_8 = arith.constant dense<0> : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:15.3944379Z %cst_9 = arith.constant dense<1024> : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:15.3944733Z %c12288_i32 = arith.constant 12288 : i32 2026-02-21T18:00:15.3944981Z %c32_i32 = arith.constant 32 : i32 2026-02-21T18:00:15.3945217Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T18:00:15.3945454Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:00:15.3945678Z %c608_i32 = arith.constant 608 : i32 2026-02-21T18:00:15.3945991Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<8x16xf32, #blocked> 2026-02-21T18:00:15.3946313Z %c16_i32 = arith.constant 16 : i32 2026-02-21T18:00:15.3946539Z %c8_i32 = arith.constant 8 : i32 2026-02-21T18:00:15.3946759Z %c64_i32 = arith.constant 64 : i32 2026-02-21T18:00:15.3946986Z %c98304_i32 = arith.constant 98304 : i32 2026-02-21T18:00:15.3947303Z %0 = tt.get_program_id x : i32 2026-02-21T18:00:15.3947592Z %1 = tt.splat %arg0 : !tt.ptr -> tensor<8x32x!tt.ptr, #blocked2> 2026-02-21T18:00:15.3947884Z %2 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked3> 2026-02-21T18:00:15.3948320Z %3 = arith.extsi %2 : tensor<8xi32, #blocked3> to tensor<8xi64, #blocked3> 2026-02-21T18:00:15.3948669Z %4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked3> 2026-02-21T18:00:15.3948968Z %5 = arith.extsi %4 : tensor<32xi32, #blocked3> to tensor<32xi64, #blocked3> 2026-02-21T18:00:15.3949277Z %6 = tt.splat %arg1 : !tt.ptr -> tensor<32x16x!tt.ptr, #blocked> 2026-02-21T18:00:15.3949566Z %7 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked3> 2026-02-21T18:00:15.3949853Z %8 = arith.extsi %7 : tensor<16xi32, #blocked3> to tensor<16xi64, #blocked3> 2026-02-21T18:00:15.3950133Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<8x16x!tt.ptr, #blocked> 2026-02-21T18:00:15.3950419Z 
%10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked3> 2026-02-21T18:00:15.3950706Z %11 = arith.extsi %10 : tensor<8xi32, #blocked3> to tensor<8xi64, #blocked3> 2026-02-21T18:00:15.3950997Z %12 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked3> 2026-02-21T18:00:15.3951296Z %13 = arith.extsi %12 : tensor<16xi32, #blocked3> to tensor<16xi64, #blocked3> 2026-02-21T18:00:15.3951597Z scf.for %arg3 = %0 to %c98304_i32 step %c608_i32 : i32 { 2026-02-21T18:00:15.3951811Z %14 = arith.divsi %arg3, %c12288_i32 : i32 2026-02-21T18:00:15.3951987Z %15 = arith.muli %14, %c8_i32 : i32 2026-02-21T18:00:15.3952157Z %16 = arith.subi %c64_i32, %15 : i32 2026-02-21T18:00:15.3952322Z %17 = arith.minsi %16, %c8_i32 : i32 2026-02-21T18:00:15.3952494Z %18 = arith.remsi %arg3, %c12288_i32 : i32 2026-02-21T18:00:15.3952662Z %19 = arith.remsi %18, %17 : i32 2026-02-21T18:00:15.3952823Z %20 = arith.addi %15, %19 : i32 2026-02-21T18:00:15.3952977Z %21 = arith.divsi %18, %17 : i32 2026-02-21T18:00:15.3953138Z %22 = arith.muli %20, %c16_i32 : i32 2026-02-21T18:00:15.3953301Z %23 = arith.muli %21, %c8_i32 : i32 2026-02-21T18:00:15.3953465Z %24 = arith.extsi %23 : i32 to i64 2026-02-21T18:00:15.3953662Z %25 = tt.splat %24 : i64 -> tensor<8xi64, #blocked3> 2026-02-21T18:00:15.3953879Z %26 = arith.addi %25, %3 : tensor<8xi64, #blocked3> 2026-02-21T18:00:15.3954212Z %27 = ttg.convert_layout %26 : tensor<8xi64, #blocked3> -> tensor<8xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:15.3954680Z %28 = tt.expand_dims %27 {axis = 1 : i32} : tensor<8xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<8x1xi64, #blocked4> 2026-02-21T18:00:15.3955090Z %29 = ttg.convert_layout %28 : tensor<8x1xi64, #blocked4> -> tensor<8x1xi64, #blocked1> 2026-02-21T18:00:15.3955377Z %30 = arith.muli %29, %cst_9 : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:15.3955648Z %31 = tt.broadcast %30 : tensor<8x1xi64, #blocked1> -> tensor<8x32xi64, #blocked1> 
2026-02-21T18:00:15.3956008Z %32 = ttg.convert_layout %31 : tensor<8x32xi64, #blocked1> -> tensor<8x32xi64, #blocked2> 2026-02-21T18:00:15.3956330Z %33 = arith.cmpi sge, %29, %cst_8 : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:15.3956575Z %34 = arith.cmpi slt, %29, %cst_7 : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:15.3956807Z %35 = arith.andi %33, %34 : tensor<8x1xi1, #blocked1> 2026-02-21T18:00:15.3957064Z %36 = tt.broadcast %35 : tensor<8x1xi1, #blocked1> -> tensor<8x32xi1, #blocked1> 2026-02-21T18:00:15.3957385Z %37 = ttg.convert_layout %36 : tensor<8x32xi1, #blocked1> -> tensor<8x32xi1, #blocked2> 2026-02-21T18:00:15.3957639Z %38 = arith.extsi %22 : i32 to i64 2026-02-21T18:00:15.3957827Z %39 = tt.splat %38 : i64 -> tensor<16xi64, #blocked3> 2026-02-21T18:00:15.3958045Z %40 = arith.addi %39, %8 : tensor<16xi64, #blocked3> 2026-02-21T18:00:15.3958304Z %41 = ttg.convert_layout %40 : tensor<16xi64, #blocked3> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:15.3958671Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi64, #blocked5> 2026-02-21T18:00:15.3959010Z %43 = ttg.convert_layout %42 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #blocked> 2026-02-21T18:00:15.3959234Z %44 = arith.muli %43, %cst_3 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:15.3959440Z %45 = tt.broadcast %44 : tensor<1x16xi64, #blocked> -> tensor<32x16xi64, #blocked> 2026-02-21T18:00:15.3959676Z %46 = arith.cmpi sge, %43, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:15.3959861Z %47 = arith.cmpi slt, %43, %cst_3 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:15.3960038Z %48 = arith.andi %46, %47 : tensor<1x16xi1, #blocked> 2026-02-21T18:00:15.3960235Z %49 = tt.broadcast %48 : tensor<1x16xi1, #blocked> -> tensor<32x16xi1, #blocked> 2026-02-21T18:00:15.3960541Z %50 = scf.for %arg4 = %c0_i32 to %c1024_i32 step %c32_i32 iter_args(%arg5 = %cst_10) -> (tensor<8x16xf32, #blocked>) : i32 { 
2026-02-21T18:00:15.3960794Z %80 = arith.extsi %arg4 : i32 to i64 2026-02-21T18:00:15.3960948Z %81 = tt.splat %80 : i64 -> tensor<32xi64, #blocked3> 2026-02-21T18:00:15.3961121Z %82 = arith.addi %81, %5 : tensor<32xi64, #blocked3> 2026-02-21T18:00:15.3961400Z %83 = ttg.convert_layout %82 : tensor<32xi64, #blocked3> -> tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:15.3961768Z %84 = tt.expand_dims %83 {axis = 0 : i32} : tensor<32xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x32xi64, #blocked5> 2026-02-21T18:00:15.3962095Z %85 = ttg.convert_layout %84 : tensor<1x32xi64, #blocked5> -> tensor<1x32xi64, #blocked2> 2026-02-21T18:00:15.3962356Z %86 = tt.broadcast %85 : tensor<1x32xi64, #blocked2> -> tensor<8x32xi64, #blocked2> 2026-02-21T18:00:15.3962574Z %87 = arith.addi %32, %86 : tensor<8x32xi64, #blocked2> 2026-02-21T18:00:15.3962794Z %88 = tt.addptr %1, %87 : tensor<8x32x!tt.ptr, #blocked2>, tensor<8x32xi64, #blocked2> 2026-02-21T18:00:15.3963030Z %89 = arith.cmpi sge, %85, %cst_6 : tensor<1x32xi64, #blocked2> 2026-02-21T18:00:15.3963227Z %90 = arith.cmpi slt, %85, %cst_5 : tensor<1x32xi64, #blocked2> 2026-02-21T18:00:15.3963411Z %91 = arith.andi %89, %90 : tensor<1x32xi1, #blocked2> 2026-02-21T18:00:15.3963619Z %92 = tt.broadcast %91 : tensor<1x32xi1, #blocked2> -> tensor<8x32xi1, #blocked2> 2026-02-21T18:00:15.3963827Z %93 = arith.andi %37, %92 : tensor<8x32xi1, #blocked2> 2026-02-21T18:00:15.3964014Z %94 = tt.load %88, %93, %cst_4 : tensor<8x32x!tt.ptr, #blocked2> 2026-02-21T18:00:15.3964296Z %95 = ttg.convert_layout %82 : tensor<32xi64, #blocked3> -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:15.3964666Z %96 = tt.expand_dims %95 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<32x1xi64, #blocked4> 2026-02-21T18:00:15.3965019Z %97 = ttg.convert_layout %96 : tensor<32x1xi64, #blocked4> -> tensor<32x1xi64, #blocked1> 2026-02-21T18:00:15.3965298Z %98 = 
tt.broadcast %97 : tensor<32x1xi64, #blocked1> -> tensor<32x16xi64, #blocked1> 2026-02-21T18:00:15.3965559Z %99 = ttg.convert_layout %98 : tensor<32x16xi64, #blocked1> -> tensor<32x16xi64, #blocked> 2026-02-21T18:00:15.3965784Z %100 = arith.addi %99, %45 : tensor<32x16xi64, #blocked> 2026-02-21T18:00:15.3966010Z %101 = tt.addptr %6, %100 : tensor<32x16x!tt.ptr, #blocked>, tensor<32x16xi64, #blocked> 2026-02-21T18:00:15.3966248Z %102 = arith.cmpi sge, %97, %cst_2 : tensor<32x1xi64, #blocked1> 2026-02-21T18:00:15.3966442Z %103 = arith.cmpi slt, %97, %cst_1 : tensor<32x1xi64, #blocked1> 2026-02-21T18:00:15.3966634Z %104 = arith.andi %102, %103 : tensor<32x1xi1, #blocked1> 2026-02-21T18:00:15.3966847Z %105 = tt.broadcast %104 : tensor<32x1xi1, #blocked1> -> tensor<32x16xi1, #blocked1> 2026-02-21T18:00:15.3967111Z %106 = ttg.convert_layout %105 : tensor<32x16xi1, #blocked1> -> tensor<32x16xi1, #blocked> 2026-02-21T18:00:15.3967339Z %107 = arith.andi %106, %49 : tensor<32x16xi1, #blocked> 2026-02-21T18:00:15.3967547Z %108 = tt.load %101, %107, %cst : tensor<32x16x!tt.ptr, #blocked> 2026-02-21T18:00:15.3967846Z %109 = ttg.convert_layout %94 : tensor<8x32xf16, #blocked2> -> tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T18:00:15.3968209Z %110 = ttg.convert_layout %108 : tensor<32x16xf16, #blocked> -> tensor<32x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T18:00:15.3968502Z %111 = ttg.convert_layout %arg5 : tensor<8x16xf32, #blocked> -> tensor<8x16xf32, #blocked> 2026-02-21T18:00:15.3968912Z %112 = tt.dot %109, %110, %111, inputPrecision = tf32 : tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<32x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<8x16xf32, #blocked> 2026-02-21T18:00:15.3969256Z scf.yield %112 : tensor<8x16xf32, #blocked> 2026-02-21T18:00:15.3969378Z } 2026-02-21T18:00:15.3969511Z %51 = arith.truncf %50 : tensor<8x16xf32, #blocked> to tensor<8x16xf16, #blocked> 
2026-02-21T18:00:15.3969714Z %52 = arith.extsi %23 : i32 to i64 2026-02-21T18:00:15.3969830Z %53 = arith.extsi %22 : i32 to i64 2026-02-21T18:00:15.3969960Z %54 = tt.splat %52 : i64 -> tensor<8xi64, #blocked3> 2026-02-21T18:00:15.3970107Z %55 = arith.addi %54, %11 : tensor<8xi64, #blocked3> 2026-02-21T18:00:15.3970330Z %56 = ttg.convert_layout %55 : tensor<8xi64, #blocked3> -> tensor<8xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:15.3970649Z %57 = tt.expand_dims %56 {axis = 1 : i32} : tensor<8xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<8x1xi64, #blocked4> 2026-02-21T18:00:15.3970929Z %58 = ttg.convert_layout %57 : tensor<8x1xi64, #blocked4> -> tensor<8x1xi64, #blocked1> 2026-02-21T18:00:15.3971128Z %59 = arith.muli %58, %cst_9 : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:15.3971315Z %60 = tt.broadcast %59 : tensor<8x1xi64, #blocked1> -> tensor<8x16xi64, #blocked1> 2026-02-21T18:00:15.3971539Z %61 = ttg.convert_layout %60 : tensor<8x16xi64, #blocked1> -> tensor<8x16xi64, #blocked> 2026-02-21T18:00:15.3971732Z %62 = tt.splat %53 : i64 -> tensor<16xi64, #blocked3> 2026-02-21T18:00:15.3971876Z %63 = arith.addi %62, %13 : tensor<16xi64, #blocked3> 2026-02-21T18:00:15.3972105Z %64 = ttg.convert_layout %63 : tensor<16xi64, #blocked3> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:15.3972428Z %65 = tt.expand_dims %64 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi64, #blocked5> 2026-02-21T18:00:15.3972708Z %66 = ttg.convert_layout %65 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #blocked> 2026-02-21T18:00:15.3972931Z %67 = tt.broadcast %66 : tensor<1x16xi64, #blocked> -> tensor<8x16xi64, #blocked> 2026-02-21T18:00:15.3980166Z %68 = arith.addi %61, %67 : tensor<8x16xi64, #blocked> 2026-02-21T18:00:15.3980381Z %69 = tt.addptr %9, %68 : tensor<8x16x!tt.ptr, #blocked>, tensor<8x16xi64, #blocked> 2026-02-21T18:00:15.3980589Z %70 = arith.cmpi sge, %58, %cst_8 : 
tensor<8x1xi64, #blocked1> 2026-02-21T18:00:15.3980757Z %71 = arith.cmpi slt, %58, %cst_7 : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:15.3980916Z %72 = arith.andi %70, %71 : tensor<8x1xi1, #blocked1> 2026-02-21T18:00:15.3981094Z %73 = tt.broadcast %72 : tensor<8x1xi1, #blocked1> -> tensor<8x16xi1, #blocked1> 2026-02-21T18:00:15.3981313Z %74 = ttg.convert_layout %73 : tensor<8x16xi1, #blocked1> -> tensor<8x16xi1, #blocked> 2026-02-21T18:00:15.3981509Z %75 = arith.cmpi sge, %66, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:15.3981672Z %76 = arith.cmpi slt, %66, %cst_3 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:15.3981831Z %77 = arith.andi %75, %76 : tensor<1x16xi1, #blocked> 2026-02-21T18:00:15.3982003Z %78 = tt.broadcast %77 : tensor<1x16xi1, #blocked> -> tensor<8x16xi1, #blocked> 2026-02-21T18:00:15.3982186Z %79 = arith.andi %74, %78 : tensor<8x16xi1, #blocked> 2026-02-21T18:00:15.3982360Z tt.store %69, %51, %79 : tensor<8x16x!tt.ptr, #blocked> 2026-02-21T18:00:15.3982506Z } {tt.disallow_acc_multi_buffer} 2026-02-21T18:00:15.3982612Z tt.return 2026-02-21T18:00:15.3982691Z } 2026-02-21T18:00:15.3982763Z } 2026-02-21T18:00:15.3982808Z 2026-02-21T18:00:15.3982838Z {-# 2026-02-21T18:00:15.3982921Z external_resources: { 2026-02-21T18:00:15.3983019Z mlir_reproducer: { 2026-02-21T18:00:15.3985278Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, 
tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=4 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=4}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:00:15.3987581Z disable_threading: false, 2026-02-21T18:00:15.3987686Z verify_each: true 2026-02-21T18:00:15.3987776Z } 2026-02-21T18:00:15.3987850Z } 2026-02-21T18:00:15.3987920Z #-} 2026-02-21T18:00:15.3988233Z /tmp/torchinductor_root/cm/ccmggqrlegkg452qwkeob6qnspmi6lhehlxfvsxolheog6pkaljq.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:00:15.3988932Z /tmp/torchinductor_root/cm/ccmggqrlegkg452qwkeob6qnspmi6lhehlxfvsxolheog6pkaljq.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:00:15.3989496Z [10s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T18:00:15.3990311Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 16, 32], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=2, num_stages=4, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[False, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T18:00:15.3991048Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:00:15.3991214Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:00:15.6695981Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T18:00:15.6699060Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:15.6699438Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:15.6699767Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T18:00:15.6700232Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:15.6700564Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:15.6700891Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:15.6701267Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} 
{ 2026-02-21T18:00:15.6701777Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:00:15.6702154Z %c12288_i64 = arith.constant 12288 : i64 2026-02-21T18:00:15.6702298Z %c0_i64 = arith.constant 0 : i64 2026-02-21T18:00:15.6702435Z %c1024_i64 = arith.constant 1024 : i64 2026-02-21T18:00:15.6702650Z %cst = arith.constant dense<0.000000e+00> : tensor<1024x4xf16, #blocked> 2026-02-21T18:00:15.6702868Z %cst_0 = arith.constant dense<0> : tensor<1x4xi64, #blocked> 2026-02-21T18:00:15.6703066Z %cst_1 = arith.constant dense<1024> : tensor<1024x1xi64, #blocked1> 2026-02-21T18:00:15.6703271Z %cst_2 = arith.constant dense<0> : tensor<1024x1xi64, #blocked1> 2026-02-21T18:00:15.6703464Z %cst_3 = arith.constant dense<1024> : tensor<1x4xi64, #blocked> 2026-02-21T18:00:15.6703631Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T18:00:15.6703761Z %c1216_i32 = arith.constant 1216 : i32 2026-02-21T18:00:15.6703931Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<1x4xf32, #blocked> 2026-02-21T18:00:15.6704108Z %c4_i32 = arith.constant 4 : i32 2026-02-21T18:00:15.6704237Z %c12288_i32 = arith.constant 12288 : i32 2026-02-21T18:00:15.6704378Z %c3145728_i32 = arith.constant 3145728 : i32 2026-02-21T18:00:15.6704519Z %0 = tt.get_program_id x : i32 2026-02-21T18:00:15.6704720Z %1 = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked2> 2026-02-21T18:00:15.6705033Z %2 = ttg.convert_layout %1 : tensor<1024xi32, #blocked2> -> tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked3}>> 2026-02-21T18:00:15.6705415Z %3 = tt.expand_dims %2 {axis = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked3}>> -> tensor<1x1024xi32, #blocked3> 2026-02-21T18:00:15.6705754Z %4 = ttg.convert_layout %3 : tensor<1x1024xi32, #blocked3> -> tensor<1x1024xi32, #blocked4> 2026-02-21T18:00:15.6706009Z %5 = tt.splat 
%arg0 : !tt.ptr -> tensor<1x1024x!tt.ptr, #blocked4> 2026-02-21T18:00:15.6706226Z %6 = tt.splat %arg1 : !tt.ptr -> tensor<1024x4x!tt.ptr, #blocked> 2026-02-21T18:00:15.6706482Z %7 = arith.extsi %1 : tensor<1024xi32, #blocked2> to tensor<1024xi64, #blocked2> 2026-02-21T18:00:15.6706803Z %8 = ttg.convert_layout %7 : tensor<1024xi64, #blocked2> -> tensor<1024xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> 2026-02-21T18:00:15.6707175Z %9 = tt.expand_dims %8 {axis = 1 : i32} : tensor<1024xi64, #ttg.slice<{dim = 1, parent = #blocked5}>> -> tensor<1024x1xi64, #blocked5> 2026-02-21T18:00:15.6707507Z %10 = ttg.convert_layout %9 : tensor<1024x1xi64, #blocked5> -> tensor<1024x1xi64, #blocked1> 2026-02-21T18:00:15.6707775Z %11 = tt.broadcast %10 : tensor<1024x1xi64, #blocked1> -> tensor<1024x4xi64, #blocked1> 2026-02-21T18:00:15.6708034Z %12 = ttg.convert_layout %11 : tensor<1024x4xi64, #blocked1> -> tensor<1024x4xi64, #blocked> 2026-02-21T18:00:15.6708337Z %13 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked2> 2026-02-21T18:00:15.6708541Z %14 = arith.extsi %13 : tensor<4xi32, #blocked2> to tensor<4xi64, #blocked2> 2026-02-21T18:00:15.6708733Z %15 = arith.cmpi sge, %10, %cst_2 : tensor<1024x1xi64, #blocked1> 2026-02-21T18:00:15.6708909Z %16 = arith.cmpi slt, %10, %cst_1 : tensor<1024x1xi64, #blocked1> 2026-02-21T18:00:15.6709094Z %17 = arith.andi %15, %16 : tensor<1024x1xi1, #blocked1> 2026-02-21T18:00:15.6709282Z %18 = tt.broadcast %17 : tensor<1024x1xi1, #blocked1> -> tensor<1024x4xi1, #blocked1> 2026-02-21T18:00:15.6709515Z %19 = ttg.convert_layout %18 : tensor<1024x4xi1, #blocked1> -> tensor<1024x4xi1, #blocked> 2026-02-21T18:00:15.6709729Z %20 = tt.splat %arg2 : !tt.ptr -> tensor<1x4x!tt.ptr, #blocked> 2026-02-21T18:00:15.6709913Z scf.for %arg3 = %0 to %c3145728_i32 step %c1216_i32 : i32 { 2026-02-21T18:00:15.6710061Z %21 = arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T18:00:15.6710184Z %22 = arith.muli %21, %c4_i32 : i32 
2026-02-21T18:00:15.6710305Z %23 = arith.subi %c12288_i32, %22 : i32 2026-02-21T18:00:15.6710422Z %24 = arith.minsi %23, %c4_i32 : i32 2026-02-21T18:00:15.6710545Z %25 = arith.remsi %arg3, %c1024_i32 : i32 2026-02-21T18:00:15.6710662Z %26 = arith.remsi %25, %24 : i32 2026-02-21T18:00:15.6710780Z %27 = arith.addi %22, %26 : i32 2026-02-21T18:00:15.6710890Z %28 = arith.divsi %25, %24 : i32 2026-02-21T18:00:15.6711020Z %29 = arith.muli %28, %c4_i32 : i32 2026-02-21T18:00:15.6711138Z %30 = arith.muli %27, %c1024_i32 : i32 2026-02-21T18:00:15.6711274Z %31 = tt.splat %30 : i32 -> tensor<1x1024xi32, #blocked4> 2026-02-21T18:00:15.6711432Z %32 = arith.addi %31, %4 : tensor<1x1024xi32, #blocked4> 2026-02-21T18:00:15.6711627Z %33 = tt.addptr %5, %32 : tensor<1x1024x!tt.ptr, #blocked4>, tensor<1x1024xi32, #blocked4> 2026-02-21T18:00:15.6711831Z %34 = tt.load %33 : tensor<1x1024x!tt.ptr, #blocked4> 2026-02-21T18:00:15.6711967Z %35 = arith.extsi %29 : i32 to i64 2026-02-21T18:00:15.6712098Z %36 = tt.splat %35 : i64 -> tensor<4xi64, #blocked2> 2026-02-21T18:00:15.6712249Z %37 = arith.addi %36, %14 : tensor<4xi64, #blocked2> 2026-02-21T18:00:15.6712478Z %38 = ttg.convert_layout %37 : tensor<4xi64, #blocked2> -> tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked3}>> 2026-02-21T18:00:15.6712804Z %39 = tt.expand_dims %38 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked3}>> -> tensor<1x4xi64, #blocked3> 2026-02-21T18:00:15.6713085Z %40 = ttg.convert_layout %39 : tensor<1x4xi64, #blocked3> -> tensor<1x4xi64, #blocked> 2026-02-21T18:00:15.6713280Z %41 = arith.muli %40, %cst_3 : tensor<1x4xi64, #blocked> 2026-02-21T18:00:15.6713462Z %42 = tt.broadcast %41 : tensor<1x4xi64, #blocked> -> tensor<1024x4xi64, #blocked> 2026-02-21T18:00:15.6713646Z %43 = arith.addi %12, %42 : tensor<1024x4xi64, #blocked> 2026-02-21T18:00:15.6713844Z %44 = tt.addptr %6, %43 : tensor<1024x4x!tt.ptr, #blocked>, tensor<1024x4xi64, #blocked> 2026-02-21T18:00:15.6714069Z %45 = arith.cmpi 
sge, %40, %cst_0 : tensor<1x4xi64, #blocked> 2026-02-21T18:00:15.6714235Z %46 = arith.cmpi slt, %40, %cst_3 : tensor<1x4xi64, #blocked> 2026-02-21T18:00:15.6714409Z %47 = arith.andi %45, %46 : tensor<1x4xi1, #blocked> 2026-02-21T18:00:15.6714588Z %48 = tt.broadcast %47 : tensor<1x4xi1, #blocked> -> tensor<1024x4xi1, #blocked> 2026-02-21T18:00:15.6714778Z %49 = arith.andi %19, %48 : tensor<1024x4xi1, #blocked> 2026-02-21T18:00:15.6714940Z %50 = tt.load %44, %49, %cst : tensor<1024x4x!tt.ptr, #blocked> 2026-02-21T18:00:15.6715206Z %51 = ttg.convert_layout %34 : tensor<1x1024xf16, #blocked4> -> tensor<1x1024xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T18:00:15.6715549Z %52 = ttg.convert_layout %50 : tensor<1024x4xf16, #blocked> -> tensor<1024x4xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T18:00:15.6715842Z %53 = ttg.convert_layout %cst_4 : tensor<1x4xf32, #blocked> -> tensor<1x4xf32, #blocked> 2026-02-21T18:00:15.6716238Z %54 = tt.dot %51, %52, %53, inputPrecision = tf32 : tensor<1x1024xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<1024x4xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<1x4xf32, #blocked> 2026-02-21T18:00:15.6716626Z %55 = arith.truncf %54 : tensor<1x4xf32, #blocked> to tensor<1x4xf16, #blocked> 2026-02-21T18:00:15.6716797Z %56 = arith.extsi %27 : i32 to i64 2026-02-21T18:00:15.6716913Z %57 = arith.muli %56, %c1024_i64 : i64 2026-02-21T18:00:15.6717046Z %58 = tt.splat %57 : i64 -> tensor<1x4xi64, #blocked> 2026-02-21T18:00:15.6717197Z %59 = arith.addi %58, %40 : tensor<1x4xi64, #blocked> 2026-02-21T18:00:15.6717379Z %60 = tt.addptr %20, %59 : tensor<1x4x!tt.ptr, #blocked>, tensor<1x4xi64, #blocked> 2026-02-21T18:00:15.6717558Z %61 = arith.cmpi sge, %56, %c0_i64 : i64 2026-02-21T18:00:15.6717685Z %62 = arith.cmpi slt, %56, %c12288_i64 : i64 2026-02-21T18:00:15.6717811Z %63 = arith.andi %61, %62 : i1 2026-02-21T18:00:15.6717939Z %64 = tt.splat %63 : i1 -> tensor<1x4xi1, #blocked> 
2026-02-21T18:00:15.6718086Z %65 = arith.andi %64, %47 : tensor<1x4xi1, #blocked> 2026-02-21T18:00:15.6718244Z tt.store %60, %55, %65 : tensor<1x4x!tt.ptr, #blocked> 2026-02-21T18:00:15.6718379Z } 2026-02-21T18:00:15.6718475Z tt.return 2026-02-21T18:00:15.6718563Z } 2026-02-21T18:00:15.6718638Z } 2026-02-21T18:00:15.6718681Z 2026-02-21T18:00:15.6718715Z {-# 2026-02-21T18:00:15.6718793Z external_resources: { 2026-02-21T18:00:15.6718894Z mlir_reproducer: { 2026-02-21T18:00:15.6721147Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:00:15.6723452Z disable_threading: false, 2026-02-21T18:00:15.6723573Z verify_each: true 2026-02-21T18:00:15.6723664Z } 
2026-02-21T18:00:15.6723738Z } 2026-02-21T18:00:15.6723808Z #-} 2026-02-21T18:00:15.6724105Z /tmp/torchinductor_root/ah/cahouxzyud5h3zh3v72qygou5hfzmrw2etug3badxcb2vx3lcnfj.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:00:15.6724797Z /tmp/torchinductor_root/ah/cahouxzyud5h3zh3v72qygou5hfzmrw2etug3badxcb2vx3lcnfj.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:00:15.6725361Z [10s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:00:15.6726159Z Config: @helion.kernel(config=helion.Config(block_sizes=[1, 4, 1024], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[False, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T18:00:15.6726899Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:00:15.6727069Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:00:15.7059604Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 
2026-02-21T18:00:15.7061222Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:15.7062064Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:15.7062864Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T18:00:15.7063695Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:15.7064510Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:15.7065432Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:15.7066345Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:00:15.7067605Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:00:15.7068696Z %cst = arith.constant dense<12288> : tensor<16x1xi64, #blocked> 2026-02-21T18:00:15.7069190Z %cst_0 = arith.constant dense<0> : tensor<16x1xi64, #blocked> 2026-02-21T18:00:15.7069668Z %cst_1 = arith.constant dense<1024> : tensor<16x1xi64, #blocked> 2026-02-21T18:00:15.7069895Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<256x4xf16, #blocked1> 2026-02-21T18:00:15.7070084Z %cst_3 = arith.constant dense<0> : tensor<1x4xi64, #blocked1> 2026-02-21T18:00:15.7070257Z %cst_4 = arith.constant dense<1024> : tensor<256x1xi64, #blocked> 2026-02-21T18:00:15.7070429Z %cst_5 = arith.constant dense<0> : tensor<256x1xi64, #blocked> 2026-02-21T18:00:15.7070602Z %cst_6 = arith.constant dense<1024> : tensor<1x4xi64, #blocked1> 
2026-02-21T18:00:15.7070754Z %c6144_i32 = arith.constant 6144 : i32 2026-02-21T18:00:15.7070874Z %c256_i32 = arith.constant 256 : i32 2026-02-21T18:00:15.7070993Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T18:00:15.7071111Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:00:15.7071248Z %c4864_i32 = arith.constant 4864 : i32 2026-02-21T18:00:15.7071388Z %cst_7 = arith.constant dense<1024> : tensor<16x1xi32, #blocked> 2026-02-21T18:00:15.7071602Z %cst_8 = arith.constant dense<0.000000e+00> : tensor<16x4xf32, #blocked1> 2026-02-21T18:00:15.7071766Z %c16_i32 = arith.constant 16 : i32 2026-02-21T18:00:15.7071881Z %c4_i32 = arith.constant 4 : i32 2026-02-21T18:00:15.7071992Z %c8_i32 = arith.constant 8 : i32 2026-02-21T18:00:15.7072108Z %c196608_i32 = arith.constant 196608 : i32 2026-02-21T18:00:15.7072230Z %0 = tt.get_program_id x : i32 2026-02-21T18:00:15.7072384Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked2> 2026-02-21T18:00:15.7072593Z %2 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked2> 2026-02-21T18:00:15.7072803Z %3 = tt.splat %arg0 : !tt.ptr -> tensor<16x256x!tt.ptr, #blocked3> 2026-02-21T18:00:15.7072997Z %4 = tt.splat %arg1 : !tt.ptr -> tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T18:00:15.7073200Z %5 = arith.extsi %2 : tensor<256xi32, #blocked2> to tensor<256xi64, #blocked2> 2026-02-21T18:00:15.7073401Z %6 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked2> 2026-02-21T18:00:15.7073619Z %7 = arith.extsi %6 : tensor<4xi32, #blocked2> to tensor<4xi64, #blocked2> 2026-02-21T18:00:15.7073813Z %8 = tt.splat %arg2 : !tt.ptr -> tensor<16x4x!tt.ptr, #blocked1> 2026-02-21T18:00:15.7074011Z %9 = arith.extsi %1 : tensor<16xi32, #blocked2> to tensor<16xi64, #blocked2> 2026-02-21T18:00:15.7074211Z %10 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked2> 2026-02-21T18:00:15.7074410Z %11 = arith.extsi %10 : tensor<4xi32, #blocked2> to tensor<4xi64, #blocked2> 
2026-02-21T18:00:15.7074595Z scf.for %arg3 = %0 to %c196608_i32 step %c4864_i32 : i32 { 2026-02-21T18:00:15.7074744Z %12 = arith.divsi %arg3, %c6144_i32 : i32 2026-02-21T18:00:15.7074866Z %13 = arith.muli %12, %c8_i32 : i32 2026-02-21T18:00:15.7074984Z %14 = arith.subi %c256_i32, %13 : i32 2026-02-21T18:00:15.7075104Z %15 = arith.minsi %14, %c8_i32 : i32 2026-02-21T18:00:15.7075224Z %16 = arith.remsi %arg3, %c6144_i32 : i32 2026-02-21T18:00:15.7075346Z %17 = arith.remsi %16, %15 : i32 2026-02-21T18:00:15.7075481Z %18 = arith.addi %13, %17 : i32 2026-02-21T18:00:15.7075590Z %19 = arith.divsi %16, %15 : i32 2026-02-21T18:00:15.7075701Z %20 = arith.muli %18, %c4_i32 : i32 2026-02-21T18:00:15.7075816Z %21 = arith.muli %19, %c16_i32 : i32 2026-02-21T18:00:15.7075951Z %22 = tt.splat %21 : i32 -> tensor<16xi32, #blocked2> 2026-02-21T18:00:15.7076101Z %23 = arith.addi %22, %1 : tensor<16xi32, #blocked2> 2026-02-21T18:00:15.7076334Z %24 = ttg.convert_layout %23 : tensor<16xi32, #blocked2> -> tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:15.7076665Z %25 = tt.expand_dims %24 {axis = 1 : i32} : tensor<16xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<16x1xi32, #blocked4> 2026-02-21T18:00:15.7076949Z %26 = ttg.convert_layout %25 : tensor<16x1xi32, #blocked4> -> tensor<16x1xi32, #blocked> 2026-02-21T18:00:15.7077151Z %27 = arith.muli %26, %cst_7 : tensor<16x1xi32, #blocked> 2026-02-21T18:00:15.7077341Z %28 = tt.broadcast %27 : tensor<16x1xi32, #blocked> -> tensor<16x256xi32, #blocked> 2026-02-21T18:00:15.7077575Z %29 = ttg.convert_layout %28 : tensor<16x256xi32, #blocked> -> tensor<16x256xi32, #blocked3> 2026-02-21T18:00:15.7077761Z %30 = arith.extsi %20 : i32 to i64 2026-02-21T18:00:15.7077890Z %31 = tt.splat %30 : i64 -> tensor<4xi64, #blocked2> 2026-02-21T18:00:15.7078038Z %32 = arith.addi %31, %7 : tensor<4xi64, #blocked2> 2026-02-21T18:00:15.7078260Z %33 = ttg.convert_layout %32 : tensor<4xi64, #blocked2> -> tensor<4xi64, 
#ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:15.7078578Z %34 = tt.expand_dims %33 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4xi64, #blocked5> 2026-02-21T18:00:15.7078894Z %35 = ttg.convert_layout %34 : tensor<1x4xi64, #blocked5> -> tensor<1x4xi64, #blocked1> 2026-02-21T18:00:15.7079110Z %36 = arith.muli %35, %cst_6 : tensor<1x4xi64, #blocked1> 2026-02-21T18:00:15.7079297Z %37 = tt.broadcast %36 : tensor<1x4xi64, #blocked1> -> tensor<256x4xi64, #blocked1> 2026-02-21T18:00:15.7079492Z %38 = arith.cmpi sge, %35, %cst_3 : tensor<1x4xi64, #blocked1> 2026-02-21T18:00:15.7079660Z %39 = arith.cmpi slt, %35, %cst_6 : tensor<1x4xi64, #blocked1> 2026-02-21T18:00:15.7079821Z %40 = arith.andi %38, %39 : tensor<1x4xi1, #blocked1> 2026-02-21T18:00:15.7080004Z %41 = tt.broadcast %40 : tensor<1x4xi1, #blocked1> -> tensor<256x4xi1, #blocked1> 2026-02-21T18:00:15.7080278Z %42 = scf.for %arg4 = %c0_i32 to %c1024_i32 step %c256_i32 iter_args(%arg5 = %cst_8) -> (tensor<16x4xf32, #blocked1>) : i32 { 2026-02-21T18:00:15.7080521Z %72 = tt.splat %arg4 : i32 -> tensor<256xi32, #blocked2> 2026-02-21T18:00:15.7080679Z %73 = arith.addi %72, %2 : tensor<256xi32, #blocked2> 2026-02-21T18:00:15.7080921Z %74 = ttg.convert_layout %73 : tensor<256xi32, #blocked2> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:15.7081278Z %75 = tt.expand_dims %74 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T18:00:15.7081574Z %76 = ttg.convert_layout %75 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T18:00:15.7081811Z %77 = tt.broadcast %76 : tensor<1x256xi32, #blocked3> -> tensor<16x256xi32, #blocked3> 2026-02-21T18:00:15.7082007Z %78 = arith.addi %29, %77 : tensor<16x256xi32, #blocked3> 2026-02-21T18:00:15.7082207Z %79 = tt.addptr %3, %78 : tensor<16x256x!tt.ptr, #blocked3>, tensor<16x256xi32, #blocked3> 
2026-02-21T18:00:15.7082412Z %80 = tt.load %79 : tensor<16x256x!tt.ptr, #blocked3> 2026-02-21T18:00:15.7082557Z %81 = arith.extsi %arg4 : i32 to i64 2026-02-21T18:00:15.7082691Z %82 = tt.splat %81 : i64 -> tensor<256xi64, #blocked2> 2026-02-21T18:00:15.7082843Z %83 = arith.addi %82, %5 : tensor<256xi64, #blocked2> 2026-02-21T18:00:15.7083077Z %84 = ttg.convert_layout %83 : tensor<256xi64, #blocked2> -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:15.7083427Z %85 = tt.expand_dims %84 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<256x1xi64, #blocked4> 2026-02-21T18:00:15.7083716Z %86 = ttg.convert_layout %85 : tensor<256x1xi64, #blocked4> -> tensor<256x1xi64, #blocked> 2026-02-21T18:00:15.7083946Z %87 = tt.broadcast %86 : tensor<256x1xi64, #blocked> -> tensor<256x4xi64, #blocked> 2026-02-21T18:00:15.7084176Z %88 = ttg.convert_layout %87 : tensor<256x4xi64, #blocked> -> tensor<256x4xi64, #blocked1> 2026-02-21T18:00:15.7084375Z %89 = arith.addi %88, %37 : tensor<256x4xi64, #blocked1> 2026-02-21T18:00:15.7084574Z %90 = tt.addptr %4, %89 : tensor<256x4x!tt.ptr, #blocked1>, tensor<256x4xi64, #blocked1> 2026-02-21T18:00:15.7084783Z %91 = arith.cmpi sge, %86, %cst_5 : tensor<256x1xi64, #blocked> 2026-02-21T18:00:15.7084951Z %92 = arith.cmpi slt, %86, %cst_4 : tensor<256x1xi64, #blocked> 2026-02-21T18:00:15.7085116Z %93 = arith.andi %91, %92 : tensor<256x1xi1, #blocked> 2026-02-21T18:00:15.7085301Z %94 = tt.broadcast %93 : tensor<256x1xi1, #blocked> -> tensor<256x4xi1, #blocked> 2026-02-21T18:00:15.7085527Z %95 = ttg.convert_layout %94 : tensor<256x4xi1, #blocked> -> tensor<256x4xi1, #blocked1> 2026-02-21T18:00:15.7085721Z %96 = arith.andi %95, %41 : tensor<256x4xi1, #blocked1> 2026-02-21T18:00:15.7085904Z %97 = tt.load %90, %96, %cst_2 : tensor<256x4x!tt.ptr, #blocked1> 2026-02-21T18:00:15.7086171Z %98 = ttg.convert_layout %80 : tensor<16x256xf16, #blocked3> -> tensor<16x256xf16, #ttg.dot_op<{opIdx = 
0, parent = #blocked1}>> 2026-02-21T18:00:15.7086537Z %99 = ttg.convert_layout %97 : tensor<256x4xf16, #blocked1> -> tensor<256x4xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked1}>> 2026-02-21T18:00:15.7086854Z %100 = ttg.convert_layout %arg5 : tensor<16x4xf32, #blocked1> -> tensor<16x4xf32, #blocked1> 2026-02-21T18:00:15.7087266Z %101 = tt.dot %98, %99, %100, inputPrecision = tf32 : tensor<16x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked1}>> * tensor<256x4xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked1}>> -> tensor<16x4xf32, #blocked1> 2026-02-21T18:00:15.7087623Z scf.yield %101 : tensor<16x4xf32, #blocked1> 2026-02-21T18:00:15.7087759Z } {tt.disallow_acc_multi_buffer} 2026-02-21T18:00:15.7087926Z %43 = arith.truncf %42 : tensor<16x4xf32, #blocked1> to tensor<16x4xf16, #blocked1> 2026-02-21T18:00:15.7088100Z %44 = arith.extsi %21 : i32 to i64 2026-02-21T18:00:15.7088218Z %45 = arith.extsi %20 : i32 to i64 2026-02-21T18:00:15.7088351Z %46 = tt.splat %44 : i64 -> tensor<16xi64, #blocked2> 2026-02-21T18:00:15.7088500Z %47 = arith.addi %46, %9 : tensor<16xi64, #blocked2> 2026-02-21T18:00:15.7088746Z %48 = ttg.convert_layout %47 : tensor<16xi64, #blocked2> -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:15.7089073Z %49 = tt.expand_dims %48 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<16x1xi64, #blocked4> 2026-02-21T18:00:15.7089356Z %50 = ttg.convert_layout %49 : tensor<16x1xi64, #blocked4> -> tensor<16x1xi64, #blocked> 2026-02-21T18:00:15.7089552Z %51 = arith.muli %50, %cst_1 : tensor<16x1xi64, #blocked> 2026-02-21T18:00:15.7089735Z %52 = tt.broadcast %51 : tensor<16x1xi64, #blocked> -> tensor<16x4xi64, #blocked> 2026-02-21T18:00:15.7089957Z %53 = ttg.convert_layout %52 : tensor<16x4xi64, #blocked> -> tensor<16x4xi64, #blocked1> 2026-02-21T18:00:15.7090148Z %54 = tt.splat %45 : i64 -> tensor<4xi64, #blocked2> 2026-02-21T18:00:15.7090292Z %55 = arith.addi %54, %11 : tensor<4xi64, #blocked2> 
2026-02-21T18:00:15.7090518Z %56 = ttg.convert_layout %55 : tensor<4xi64, #blocked2> -> tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:15.7090837Z %57 = tt.expand_dims %56 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4xi64, #blocked5> 2026-02-21T18:00:15.7091129Z %58 = ttg.convert_layout %57 : tensor<1x4xi64, #blocked5> -> tensor<1x4xi64, #blocked1> 2026-02-21T18:00:15.7091355Z %59 = tt.broadcast %58 : tensor<1x4xi64, #blocked1> -> tensor<16x4xi64, #blocked1> 2026-02-21T18:00:15.7091541Z %60 = arith.addi %53, %59 : tensor<16x4xi64, #blocked1> 2026-02-21T18:00:15.7091737Z %61 = tt.addptr %8, %60 : tensor<16x4x!tt.ptr, #blocked1>, tensor<16x4xi64, #blocked1> 2026-02-21T18:00:15.7091942Z %62 = arith.cmpi sge, %50, %cst_0 : tensor<16x1xi64, #blocked> 2026-02-21T18:00:15.7092106Z %63 = arith.cmpi slt, %50, %cst : tensor<16x1xi64, #blocked> 2026-02-21T18:00:15.7092263Z %64 = arith.andi %62, %63 : tensor<16x1xi1, #blocked> 2026-02-21T18:00:15.7092440Z %65 = tt.broadcast %64 : tensor<16x1xi1, #blocked> -> tensor<16x4xi1, #blocked> 2026-02-21T18:00:15.7092663Z %66 = ttg.convert_layout %65 : tensor<16x4xi1, #blocked> -> tensor<16x4xi1, #blocked1> 2026-02-21T18:00:15.7092867Z %67 = arith.cmpi sge, %58, %cst_3 : tensor<1x4xi64, #blocked1> 2026-02-21T18:00:15.7093032Z %68 = arith.cmpi slt, %58, %cst_6 : tensor<1x4xi64, #blocked1> 2026-02-21T18:00:15.7093190Z %69 = arith.andi %67, %68 : tensor<1x4xi1, #blocked1> 2026-02-21T18:00:15.7093367Z %70 = tt.broadcast %69 : tensor<1x4xi1, #blocked1> -> tensor<16x4xi1, #blocked1> 2026-02-21T18:00:15.7093549Z %71 = arith.andi %66, %70 : tensor<16x4xi1, #blocked1> 2026-02-21T18:00:15.7093706Z tt.store %61, %43, %71 : tensor<16x4x!tt.ptr, #blocked1> 2026-02-21T18:00:15.7093853Z } {tt.disallow_acc_multi_buffer} 2026-02-21T18:00:15.7093981Z tt.return 2026-02-21T18:00:15.7094060Z } 2026-02-21T18:00:15.7094136Z } 2026-02-21T18:00:15.7094178Z 2026-02-21T18:00:15.7094208Z {-# 
2026-02-21T18:00:15.7094290Z external_resources: { 2026-02-21T18:00:15.7094405Z mlir_reproducer: { 2026-02-21T18:00:15.7096655Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=2 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=2}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:00:15.7098946Z disable_threading: false, 2026-02-21T18:00:15.7099052Z verify_each: true 2026-02-21T18:00:15.7099140Z } 2026-02-21T18:00:15.7099213Z } 2026-02-21T18:00:15.7099282Z #-} 2026-02-21T18:00:15.7099562Z /tmp/torchinductor_root/3w/c3wjcgnvjom46gfzrcd2e32v2yghbmlboh7bbyzarsyscvjuesl6.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:00:15.7100255Z /tmp/torchinductor_root/3w/c3wjcgnvjom46gfzrcd2e32v2yghbmlboh7bbyzarsyscvjuesl6.py:14:0: note: 
Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:00:15.7100825Z [10s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:00:15.7101622Z Config: @helion.kernel(config=helion.Config(block_sizes=[16, 4, 256], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=2, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, None], range_multi_buffers=[False, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T18:00:15.7102349Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:00:15.7102519Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:00:16.5655343Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa<To>(Val) && "cast() argument of incompatible type!"' failed. 
2026-02-21T18:00:16.5658929Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [16, 4], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:16.5659935Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:16.5660865Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:16.5661746Z #blocked3 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T18:00:16.5662633Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:16.5663735Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:16.5664393Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:00:16.5665463Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:00:16.5666306Z %cst = arith.constant dense<0.000000e+00> : tensor<64x4xf16, #blocked> 2026-02-21T18:00:16.5666681Z %cst_0 = arith.constant dense<0> : tensor<1x4xi64, #blocked> 2026-02-21T18:00:16.5675914Z %cst_1 = arith.constant dense<1024> : tensor<64x1xi64, #blocked1> 2026-02-21T18:00:16.5676324Z %cst_2 = arith.constant dense<0> : tensor<64x1xi64, #blocked1> 2026-02-21T18:00:16.5676588Z %cst_3 = arith.constant dense<1024> : tensor<1x4xi64, #blocked> 2026-02-21T18:00:16.5676859Z %cst_4 = arith.constant dense<0.000000e+00> : tensor<2x64xf16, #blocked2> 2026-02-21T18:00:16.5677247Z %cst_5 = arith.constant dense<1024> : tensor<1x64xi64, #blocked2> 2026-02-21T18:00:16.5677505Z %cst_6 = arith.constant dense<0> : tensor<1x64xi64, 
#blocked2> 2026-02-21T18:00:16.5677759Z %cst_7 = arith.constant dense<12288> : tensor<2x1xi64, #blocked1> 2026-02-21T18:00:16.5677985Z %cst_8 = arith.constant dense<0> : tensor<2x1xi64, #blocked1> 2026-02-21T18:00:16.5678182Z %cst_9 = arith.constant dense<1024> : tensor<2x1xi64, #blocked1> 2026-02-21T18:00:16.5678357Z %c64_i32 = arith.constant 64 : i32 2026-02-21T18:00:16.5678495Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T18:00:16.5678630Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:00:16.5678762Z %c2432_i32 = arith.constant 2432 : i32 2026-02-21T18:00:16.5678937Z %cst_10 = arith.constant dense<0.000000e+00> : tensor<2x4xf32, #blocked> 2026-02-21T18:00:16.5679122Z %c4_i32 = arith.constant 4 : i32 2026-02-21T18:00:16.5679247Z %c2_i32 = arith.constant 2 : i32 2026-02-21T18:00:16.5679378Z %c6144_i32 = arith.constant 6144 : i32 2026-02-21T18:00:16.5679518Z %c1572864_i32 = arith.constant 1572864 : i32 2026-02-21T18:00:16.5679704Z %0 = tt.get_program_id x : i32 2026-02-21T18:00:16.5679885Z %1 = tt.splat %arg0 : !tt.ptr -> tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T18:00:16.5680112Z %2 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked3> 2026-02-21T18:00:16.5680342Z %3 = arith.extsi %2 : tensor<2xi32, #blocked3> to tensor<2xi64, #blocked3> 2026-02-21T18:00:16.5680570Z %4 = tt.make_range {end = 64 : i32, start = 0 : i32} : tensor<64xi32, #blocked3> 2026-02-21T18:00:16.5680802Z %5 = arith.extsi %4 : tensor<64xi32, #blocked3> to tensor<64xi64, #blocked3> 2026-02-21T18:00:16.5681024Z %6 = tt.splat %arg1 : !tt.ptr -> tensor<64x4x!tt.ptr, #blocked> 2026-02-21T18:00:16.5681248Z %7 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked3> 2026-02-21T18:00:16.5681473Z %8 = arith.extsi %7 : tensor<4xi32, #blocked3> to tensor<4xi64, #blocked3> 2026-02-21T18:00:16.5681690Z %9 = tt.splat %arg2 : !tt.ptr -> tensor<2x4x!tt.ptr, #blocked> 2026-02-21T18:00:16.5681918Z %10 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, 
#blocked3> 2026-02-21T18:00:16.5682143Z %11 = arith.extsi %10 : tensor<2xi32, #blocked3> to tensor<2xi64, #blocked3> 2026-02-21T18:00:16.5682368Z %12 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked3> 2026-02-21T18:00:16.5682593Z %13 = arith.extsi %12 : tensor<4xi32, #blocked3> to tensor<4xi64, #blocked3> 2026-02-21T18:00:16.5682800Z scf.for %arg3 = %0 to %c1572864_i32 step %c2432_i32 : i32 { 2026-02-21T18:00:16.5682978Z %14 = arith.remsi %arg3, %c6144_i32 : i32 2026-02-21T18:00:16.5683121Z %15 = arith.divsi %arg3, %c6144_i32 : i32 2026-02-21T18:00:16.5683294Z %16 = arith.muli %14, %c2_i32 : i32 2026-02-21T18:00:16.5683423Z %17 = arith.muli %15, %c4_i32 : i32 2026-02-21T18:00:16.5683608Z %18 = arith.extsi %16 : i32 to i64 2026-02-21T18:00:16.5683765Z %19 = tt.splat %18 : i64 -> tensor<2xi64, #blocked3> 2026-02-21T18:00:16.5683938Z %20 = arith.addi %19, %3 : tensor<2xi64, #blocked3> 2026-02-21T18:00:16.5684204Z %21 = ttg.convert_layout %20 : tensor<2xi64, #blocked3> -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:16.5684575Z %22 = tt.expand_dims %21 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<2x1xi64, #blocked4> 2026-02-21T18:00:16.5684901Z %23 = ttg.convert_layout %22 : tensor<2x1xi64, #blocked4> -> tensor<2x1xi64, #blocked1> 2026-02-21T18:00:16.5685154Z %24 = arith.muli %23, %cst_9 : tensor<2x1xi64, #blocked1> 2026-02-21T18:00:16.5685365Z %25 = tt.broadcast %24 : tensor<2x1xi64, #blocked1> -> tensor<2x64xi64, #blocked1> 2026-02-21T18:00:16.5685628Z %26 = ttg.convert_layout %25 : tensor<2x64xi64, #blocked1> -> tensor<2x64xi64, #blocked2> 2026-02-21T18:00:16.5685866Z %27 = arith.cmpi sge, %23, %cst_8 : tensor<2x1xi64, #blocked1> 2026-02-21T18:00:16.5686075Z %28 = arith.cmpi slt, %23, %cst_7 : tensor<2x1xi64, #blocked1> 2026-02-21T18:00:16.5686264Z %29 = arith.andi %27, %28 : tensor<2x1xi1, #blocked1> 2026-02-21T18:00:16.5686466Z %30 = tt.broadcast %29 : tensor<2x1xi1, 
#blocked1> -> tensor<2x64xi1, #blocked1> 2026-02-21T18:00:16.5686721Z %31 = ttg.convert_layout %30 : tensor<2x64xi1, #blocked1> -> tensor<2x64xi1, #blocked2> 2026-02-21T18:00:16.5686923Z %32 = arith.extsi %17 : i32 to i64 2026-02-21T18:00:16.5687072Z %33 = tt.splat %32 : i64 -> tensor<4xi64, #blocked3> 2026-02-21T18:00:16.5687240Z %34 = arith.addi %33, %8 : tensor<4xi64, #blocked3> 2026-02-21T18:00:16.5687495Z %35 = ttg.convert_layout %34 : tensor<4xi64, #blocked3> -> tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:16.5687862Z %36 = tt.expand_dims %35 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4xi64, #blocked5> 2026-02-21T18:00:16.5688144Z %37 = ttg.convert_layout %36 : tensor<1x4xi64, #blocked5> -> tensor<1x4xi64, #blocked> 2026-02-21T18:00:16.5688367Z %38 = arith.muli %37, %cst_3 : tensor<1x4xi64, #blocked> 2026-02-21T18:00:16.5688548Z %39 = tt.broadcast %38 : tensor<1x4xi64, #blocked> -> tensor<64x4xi64, #blocked> 2026-02-21T18:00:16.5688739Z %40 = arith.cmpi sge, %37, %cst_0 : tensor<1x4xi64, #blocked> 2026-02-21T18:00:16.5688905Z %41 = arith.cmpi slt, %37, %cst_3 : tensor<1x4xi64, #blocked> 2026-02-21T18:00:16.5689059Z %42 = arith.andi %40, %41 : tensor<1x4xi1, #blocked> 2026-02-21T18:00:16.5689235Z %43 = tt.broadcast %42 : tensor<1x4xi1, #blocked> -> tensor<64x4xi1, #blocked> 2026-02-21T18:00:16.5689501Z %44 = scf.for %arg4 = %c0_i32 to %c1024_i32 step %c64_i32 iter_args(%arg5 = %cst_10) -> (tensor<2x4xf32, #blocked>) : i32 { 2026-02-21T18:00:16.5689727Z %74 = arith.extsi %arg4 : i32 to i64 2026-02-21T18:00:16.5689867Z %75 = tt.splat %74 : i64 -> tensor<64xi64, #blocked3> 2026-02-21T18:00:16.5690016Z %76 = arith.addi %75, %5 : tensor<64xi64, #blocked3> 2026-02-21T18:00:16.5690251Z %77 = ttg.convert_layout %76 : tensor<64xi64, #blocked3> -> tensor<64xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:16.5690578Z %78 = tt.expand_dims %77 {axis = 0 : i32} : tensor<64xi64, 
#ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x64xi64, #blocked5> 2026-02-21T18:00:16.5690868Z %79 = ttg.convert_layout %78 : tensor<1x64xi64, #blocked5> -> tensor<1x64xi64, #blocked2> 2026-02-21T18:00:16.5691101Z %80 = tt.broadcast %79 : tensor<1x64xi64, #blocked2> -> tensor<2x64xi64, #blocked2> 2026-02-21T18:00:16.5691291Z %81 = arith.addi %26, %80 : tensor<2x64xi64, #blocked2> 2026-02-21T18:00:16.5691519Z %82 = tt.addptr %1, %81 : tensor<2x64x!tt.ptr, #blocked2>, tensor<2x64xi64, #blocked2> 2026-02-21T18:00:16.5691725Z %83 = arith.cmpi sge, %79, %cst_6 : tensor<1x64xi64, #blocked2> 2026-02-21T18:00:16.5691934Z %84 = arith.cmpi slt, %79, %cst_5 : tensor<1x64xi64, #blocked2> 2026-02-21T18:00:16.5692099Z %85 = arith.andi %83, %84 : tensor<1x64xi1, #blocked2> 2026-02-21T18:00:16.5692279Z %86 = tt.broadcast %85 : tensor<1x64xi1, #blocked2> -> tensor<2x64xi1, #blocked2> 2026-02-21T18:00:16.5692465Z %87 = arith.andi %31, %86 : tensor<2x64xi1, #blocked2> 2026-02-21T18:00:16.5692627Z %88 = tt.load %82, %87, %cst_4 : tensor<2x64x!tt.ptr, #blocked2> 2026-02-21T18:00:16.5692877Z %89 = ttg.convert_layout %76 : tensor<64xi64, #blocked3> -> tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:16.5693207Z %90 = tt.expand_dims %89 {axis = 1 : i32} : tensor<64xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<64x1xi64, #blocked4> 2026-02-21T18:00:16.5693500Z %91 = ttg.convert_layout %90 : tensor<64x1xi64, #blocked4> -> tensor<64x1xi64, #blocked1> 2026-02-21T18:00:16.5693735Z %92 = tt.broadcast %91 : tensor<64x1xi64, #blocked1> -> tensor<64x4xi64, #blocked1> 2026-02-21T18:00:16.5693979Z %93 = ttg.convert_layout %92 : tensor<64x4xi64, #blocked1> -> tensor<64x4xi64, #blocked> 2026-02-21T18:00:16.5694176Z %94 = arith.addi %93, %39 : tensor<64x4xi64, #blocked> 2026-02-21T18:00:16.5694363Z %95 = tt.addptr %6, %94 : tensor<64x4x!tt.ptr, #blocked>, tensor<64x4xi64, #blocked> 2026-02-21T18:00:16.5694567Z %96 = arith.cmpi sge, %91, %cst_2 : 
tensor<64x1xi64, #blocked1> 2026-02-21T18:00:16.5694737Z %97 = arith.cmpi slt, %91, %cst_1 : tensor<64x1xi64, #blocked1> 2026-02-21T18:00:16.5694897Z %98 = arith.andi %96, %97 : tensor<64x1xi1, #blocked1> 2026-02-21T18:00:16.5695078Z %99 = tt.broadcast %98 : tensor<64x1xi1, #blocked1> -> tensor<64x4xi1, #blocked1> 2026-02-21T18:00:16.5695305Z %100 = ttg.convert_layout %99 : tensor<64x4xi1, #blocked1> -> tensor<64x4xi1, #blocked> 2026-02-21T18:00:16.5695506Z %101 = arith.andi %100, %43 : tensor<64x4xi1, #blocked> 2026-02-21T18:00:16.5695677Z %102 = tt.load %95, %101, %cst : tensor<64x4x!tt.ptr, #blocked> 2026-02-21T18:00:16.5696081Z %103 = ttg.convert_layout %88 : tensor<2x64xf16, #blocked2> -> tensor<2x64xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T18:00:16.5696421Z %104 = ttg.convert_layout %102 : tensor<64x4xf16, #blocked> -> tensor<64x4xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T18:00:16.5696708Z %105 = ttg.convert_layout %arg5 : tensor<2x4xf32, #blocked> -> tensor<2x4xf32, #blocked> 2026-02-21T18:00:16.5697109Z %106 = tt.dot %103, %104, %105, inputPrecision = tf32 : tensor<2x64xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<64x4xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<2x4xf32, #blocked> 2026-02-21T18:00:16.5697451Z scf.yield %106 : tensor<2x4xf32, #blocked> 2026-02-21T18:00:16.5697583Z } {tt.disallow_acc_multi_buffer} 2026-02-21T18:00:16.5697750Z %45 = arith.truncf %44 : tensor<2x4xf32, #blocked> to tensor<2x4xf16, #blocked> 2026-02-21T18:00:16.5697919Z %46 = arith.extsi %16 : i32 to i64 2026-02-21T18:00:16.5698037Z %47 = arith.extsi %17 : i32 to i64 2026-02-21T18:00:16.5698168Z %48 = tt.splat %46 : i64 -> tensor<2xi64, #blocked3> 2026-02-21T18:00:16.5698316Z %49 = arith.addi %48, %11 : tensor<2xi64, #blocked3> 2026-02-21T18:00:16.5698545Z %50 = ttg.convert_layout %49 : tensor<2xi64, #blocked3> -> tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:00:16.5698866Z %51 = 
tt.expand_dims %50 {axis = 1 : i32} : tensor<2xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<2x1xi64, #blocked4> 2026-02-21T18:00:16.5699150Z %52 = ttg.convert_layout %51 : tensor<2x1xi64, #blocked4> -> tensor<2x1xi64, #blocked1> 2026-02-21T18:00:16.5699364Z %53 = arith.muli %52, %cst_9 : tensor<2x1xi64, #blocked1> 2026-02-21T18:00:16.5699547Z %54 = tt.broadcast %53 : tensor<2x1xi64, #blocked1> -> tensor<2x4xi64, #blocked1> 2026-02-21T18:00:16.5699788Z %55 = ttg.convert_layout %54 : tensor<2x4xi64, #blocked1> -> tensor<2x4xi64, #blocked> 2026-02-21T18:00:16.5699975Z %56 = tt.splat %47 : i64 -> tensor<4xi64, #blocked3> 2026-02-21T18:00:16.5700118Z %57 = arith.addi %56, %13 : tensor<4xi64, #blocked3> 2026-02-21T18:00:16.5700339Z %58 = ttg.convert_layout %57 : tensor<4xi64, #blocked3> -> tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:00:16.5700656Z %59 = tt.expand_dims %58 {axis = 0 : i32} : tensor<4xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x4xi64, #blocked5> 2026-02-21T18:00:16.5700932Z %60 = ttg.convert_layout %59 : tensor<1x4xi64, #blocked5> -> tensor<1x4xi64, #blocked> 2026-02-21T18:00:16.5701149Z %61 = tt.broadcast %60 : tensor<1x4xi64, #blocked> -> tensor<2x4xi64, #blocked> 2026-02-21T18:00:16.5701335Z %62 = arith.addi %55, %61 : tensor<2x4xi64, #blocked> 2026-02-21T18:00:16.5701520Z %63 = tt.addptr %9, %62 : tensor<2x4x!tt.ptr, #blocked>, tensor<2x4xi64, #blocked> 2026-02-21T18:00:16.5701741Z %64 = arith.cmpi sge, %52, %cst_8 : tensor<2x1xi64, #blocked1> 2026-02-21T18:00:16.5701911Z %65 = arith.cmpi slt, %52, %cst_7 : tensor<2x1xi64, #blocked1> 2026-02-21T18:00:16.5702067Z %66 = arith.andi %64, %65 : tensor<2x1xi1, #blocked1> 2026-02-21T18:00:16.5702246Z %67 = tt.broadcast %66 : tensor<2x1xi1, #blocked1> -> tensor<2x4xi1, #blocked1> 2026-02-21T18:00:16.5702461Z %68 = ttg.convert_layout %67 : tensor<2x4xi1, #blocked1> -> tensor<2x4xi1, #blocked> 2026-02-21T18:00:16.5702659Z %69 = arith.cmpi sge, %60, %cst_0 : 
tensor<1x4xi64, #blocked> 2026-02-21T18:00:16.5702824Z %70 = arith.cmpi slt, %60, %cst_3 : tensor<1x4xi64, #blocked> 2026-02-21T18:00:16.5702981Z %71 = arith.andi %69, %70 : tensor<1x4xi1, #blocked> 2026-02-21T18:00:16.5703154Z %72 = tt.broadcast %71 : tensor<1x4xi1, #blocked> -> tensor<2x4xi1, #blocked> 2026-02-21T18:00:16.5703330Z %73 = arith.andi %68, %72 : tensor<2x4xi1, #blocked> 2026-02-21T18:00:16.5703485Z tt.store %63, %45, %73 : tensor<2x4x!tt.ptr, #blocked> 2026-02-21T18:00:16.5703640Z } 2026-02-21T18:00:16.5703719Z tt.return 2026-02-21T18:00:16.5703799Z } 2026-02-21T18:00:16.5703874Z } 2026-02-21T18:00:16.5703918Z 2026-02-21T18:00:16.5703948Z {-# 2026-02-21T18:00:16.5704031Z external_resources: { 2026-02-21T18:00:16.5704133Z mlir_reproducer: { 2026-02-21T18:00:16.5706396Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, 
tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:00:16.5708756Z disable_threading: false, 2026-02-21T18:00:16.5708865Z verify_each: true 2026-02-21T18:00:16.5708952Z } 2026-02-21T18:00:16.5709027Z } 2026-02-21T18:00:16.5709096Z #-} 2026-02-21T18:00:16.5709402Z /tmp/torchinductor_root/fn/cfnk56zodguasb3n3iz26drwbiwqq4lkdqogg76sxee7nqfobkxw.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:00:16.5710096Z /tmp/torchinductor_root/fn/cfnk56zodguasb3n3iz26drwbiwqq4lkdqogg76sxee7nqfobkxw.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:00:16.5710643Z [11s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:00:16.5711445Z Config: @helion.kernel(config=helion.Config(block_sizes=[2, 4, 64], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=8, num_stages=3, num_warps=1, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T18:00:16.5712166Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:00:16.5712332Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:00:19.6250003Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa<To>(Val) && "cast() argument of incompatible type!"' failed. 
2026-02-21T18:00:19.6252473Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:19.6253837Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:00:19.6255101Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T18:00:19.6256323Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:19.6257498Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:00:19.6258332Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:00:19.6259433Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:00:19.6260161Z %cst = arith.constant dense<0.000000e+00> : tensor<16x16xf16, #blocked> 2026-02-21T18:00:19.6260542Z %cst_0 = arith.constant dense<1024> : tensor<16x1xi64, #blocked1> 2026-02-21T18:00:19.6260900Z %cst_1 = arith.constant dense<0> : tensor<16x1xi64, #blocked1> 2026-02-21T18:00:19.6261285Z %cst_2 = arith.constant dense<0.000000e+00> : tensor<8x16xf16, #blocked> 2026-02-21T18:00:19.6261837Z %cst_3 = arith.constant dense<1024> : tensor<1x16xi64, #blocked> 2026-02-21T18:00:19.6262201Z %cst_4 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T18:00:19.6262537Z %cst_5 = arith.constant dense<12288> : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:19.6262873Z %cst_6 = arith.constant dense<0> : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:19.6263205Z %cst_7 = arith.constant dense<1024> : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:19.6263501Z %c1024_i32 = 
arith.constant 1024 : i32 2026-02-21T18:00:19.6263734Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:00:19.6263950Z %c1_i32 = arith.constant 1 : i32 2026-02-21T18:00:19.6264222Z %cst_8 = arith.constant dense<1024> : tensor<8x1xi32, #blocked1> 2026-02-21T18:00:19.6264643Z %cst_9 = arith.constant dense<0.000000e+00> : tensor<8x16xf32, #blocked> 2026-02-21T18:00:19.6264946Z %c8_i32 = arith.constant 8 : i32 2026-02-21T18:00:19.6265214Z %c16_i32 = arith.constant 16 : i32 2026-02-21T18:00:19.6265457Z %c1536_i32 = arith.constant 1536 : i32 2026-02-21T18:00:19.6265695Z %c98304_i32 = arith.constant 98304 : i32 2026-02-21T18:00:19.6265924Z %c81_i32 = arith.constant 81 : i32 2026-02-21T18:00:19.6266149Z %0 = tt.get_program_id x : i32 2026-02-21T18:00:19.6266362Z %1 = arith.muli %0, %c81_i32 : i32 2026-02-21T18:00:19.6266575Z %2 = arith.addi %1, %c81_i32 : i32 2026-02-21T18:00:19.6266788Z %3 = arith.minsi %2, %c98304_i32 : i32 2026-02-21T18:00:19.6267116Z %4 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked2> 2026-02-21T18:00:19.6267507Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked2> 2026-02-21T18:00:19.6267806Z %6 = tt.splat %arg0 : !tt.ptr -> tensor<8x16x!tt.ptr, #blocked> 2026-02-21T18:00:19.6268082Z %7 = arith.extsi %4 : tensor<8xi32, #blocked2> to tensor<8xi64, #blocked2> 2026-02-21T18:00:19.6268462Z %8 = arith.extsi %5 : tensor<16xi32, #blocked2> to tensor<16xi64, #blocked2> 2026-02-21T18:00:19.6268780Z %9 = tt.splat %arg1 : !tt.ptr -> tensor<16x16x!tt.ptr, #blocked> 2026-02-21T18:00:19.6269044Z %10 = tt.splat %arg2 : !tt.ptr -> tensor<8x16x!tt.ptr, #blocked> 2026-02-21T18:00:19.6269279Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T18:00:19.6269469Z %11 = arith.divsi %arg3, %c1024_i32 : i32 2026-02-21T18:00:19.6269640Z %12 = arith.muli %11, %c16_i32 : i32 2026-02-21T18:00:19.6269801Z %13 = arith.subi %c1536_i32, %12 : i32 2026-02-21T18:00:19.6269965Z %14 = arith.minsi %13, %c16_i32 : i32 
2026-02-21T18:00:19.6270132Z %15 = arith.remsi %arg3, %c1024_i32 : i32 2026-02-21T18:00:19.6270297Z %16 = arith.remsi %15, %14 : i32 2026-02-21T18:00:19.6270457Z %17 = arith.addi %12, %16 : i32 2026-02-21T18:00:19.6270607Z %18 = arith.divsi %15, %14 : i32 2026-02-21T18:00:19.6270764Z %19 = arith.muli %17, %c8_i32 : i32 2026-02-21T18:00:19.6270953Z %20 = tt.splat %19 : i32 -> tensor<8xi32, #blocked2> 2026-02-21T18:00:19.6271191Z %21 = arith.addi %20, %4 : tensor<8xi32, #blocked2> 2026-02-21T18:00:19.6271372Z %22 = arith.muli %18, %c16_i32 : i32 2026-02-21T18:00:19.6271554Z %23 = tt.splat %22 : i32 -> tensor<16xi32, #blocked2> 2026-02-21T18:00:19.6271760Z %24 = arith.addi %23, %5 : tensor<16xi32, #blocked2> 2026-02-21T18:00:19.6271943Z %25 = arith.extsi %19 : i32 to i64 2026-02-21T18:00:19.6272124Z %26 = tt.splat %25 : i64 -> tensor<8xi64, #blocked2> 2026-02-21T18:00:19.6272325Z %27 = arith.addi %26, %7 : tensor<8xi64, #blocked2> 2026-02-21T18:00:19.6272650Z %28 = ttg.convert_layout %27 : tensor<8xi64, #blocked2> -> tensor<8xi64, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:00:19.6273110Z %29 = tt.expand_dims %28 {axis = 1 : i32} : tensor<8xi64, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<8x1xi64, #blocked3> 2026-02-21T18:00:19.6273512Z %30 = ttg.convert_layout %29 : tensor<8x1xi64, #blocked3> -> tensor<8x1xi64, #blocked1> 2026-02-21T18:00:19.6273799Z %31 = arith.muli %30, %cst_7 : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:19.6274058Z %32 = tt.broadcast %31 : tensor<8x1xi64, #blocked1> -> tensor<8x16xi64, #blocked1> 2026-02-21T18:00:19.6274382Z %33 = ttg.convert_layout %32 : tensor<8x16xi64, #blocked1> -> tensor<8x16xi64, #blocked> 2026-02-21T18:00:19.6274669Z %34 = arith.cmpi sge, %30, %cst_6 : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:19.6274906Z %35 = arith.cmpi slt, %30, %cst_5 : tensor<8x1xi64, #blocked1> 2026-02-21T18:00:19.6275129Z %36 = arith.andi %34, %35 : tensor<8x1xi1, #blocked1> 2026-02-21T18:00:19.6275379Z %37 = tt.broadcast %36 
: tensor<8x1xi1, #blocked1> -> tensor<8x16xi1, #blocked1> 2026-02-21T18:00:19.6275718Z %38 = ttg.convert_layout %37 : tensor<8x16xi1, #blocked1> -> tensor<8x16xi1, #blocked> 2026-02-21T18:00:19.6275962Z %39 = arith.extsi %22 : i32 to i64 2026-02-21T18:00:19.6276172Z %40 = tt.splat %39 : i64 -> tensor<16xi64, #blocked2> 2026-02-21T18:00:19.6276382Z %41 = arith.addi %40, %8 : tensor<16xi64, #blocked2> 2026-02-21T18:00:19.6276702Z %42 = ttg.convert_layout %41 : tensor<16xi64, #blocked2> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked4}>> 2026-02-21T18:00:19.6277160Z %43 = tt.expand_dims %42 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked4}>> -> tensor<1x16xi64, #blocked4> 2026-02-21T18:00:19.6277541Z %44 = ttg.convert_layout %43 : tensor<1x16xi64, #blocked4> -> tensor<1x16xi64, #blocked> 2026-02-21T18:00:19.6277760Z %45 = arith.muli %44, %cst_3 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:19.6277961Z %46 = tt.broadcast %45 : tensor<1x16xi64, #blocked> -> tensor<16x16xi64, #blocked> 2026-02-21T18:00:19.6278179Z %47 = arith.cmpi sge, %44, %cst_4 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:19.6278360Z %48 = arith.cmpi slt, %44, %cst_3 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:19.6278531Z %49 = arith.andi %47, %48 : tensor<1x16xi1, #blocked> 2026-02-21T18:00:19.6278745Z %50 = tt.broadcast %49 : tensor<1x16xi1, #blocked> -> tensor<16x16xi1, #blocked> 2026-02-21T18:00:19.6279040Z %51 = scf.for %arg4 = %c0_i32 to %c1024_i32 step %c16_i32 iter_args(%arg5 = %cst_9) -> (tensor<8x16xf32, #blocked>) : i32 { 2026-02-21T18:00:19.6279285Z %65 = arith.extsi %arg4 : i32 to i64 2026-02-21T18:00:19.6279434Z %66 = tt.splat %65 : i64 -> tensor<16xi64, #blocked2> 2026-02-21T18:00:19.6279601Z %67 = arith.addi %66, %8 : tensor<16xi64, #blocked2> 2026-02-21T18:00:19.6279855Z %68 = ttg.convert_layout %67 : tensor<16xi64, #blocked2> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked4}>> 2026-02-21T18:00:19.6280214Z %69 = tt.expand_dims %68 {axis = 
0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked4}>> -> tensor<1x16xi64, #blocked4> 2026-02-21T18:00:19.6280533Z %70 = ttg.convert_layout %69 : tensor<1x16xi64, #blocked4> -> tensor<1x16xi64, #blocked> 2026-02-21T18:00:19.6280783Z %71 = tt.broadcast %70 : tensor<1x16xi64, #blocked> -> tensor<8x16xi64, #blocked> 2026-02-21T18:00:19.6281010Z %72 = arith.addi %33, %71 : tensor<8x16xi64, #blocked> 2026-02-21T18:00:19.6281225Z %73 = tt.addptr %6, %72 : tensor<8x16x!tt.ptr, #blocked>, tensor<8x16xi64, #blocked> 2026-02-21T18:00:19.6281443Z %74 = arith.cmpi sge, %70, %cst_4 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:19.6281630Z %75 = arith.cmpi slt, %70, %cst_3 : tensor<1x16xi64, #blocked> 2026-02-21T18:00:19.6281805Z %76 = arith.andi %74, %75 : tensor<1x16xi1, #blocked> 2026-02-21T18:00:19.6282002Z %77 = tt.broadcast %76 : tensor<1x16xi1, #blocked> -> tensor<8x16xi1, #blocked> 2026-02-21T18:00:19.6282204Z %78 = arith.andi %38, %77 : tensor<8x16xi1, #blocked> 2026-02-21T18:00:19.6282379Z %79 = tt.load %73, %78, %cst_2 : tensor<8x16x!tt.ptr, #blocked> 2026-02-21T18:00:19.6282659Z %80 = ttg.convert_layout %67 : tensor<16xi64, #blocked2> -> tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:00:19.6283017Z %81 = tt.expand_dims %80 {axis = 1 : i32} : tensor<16xi64, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<16x1xi64, #blocked3> 2026-02-21T18:00:19.6283334Z %82 = ttg.convert_layout %81 : tensor<16x1xi64, #blocked3> -> tensor<16x1xi64, #blocked1> 2026-02-21T18:00:19.6283588Z %83 = tt.broadcast %82 : tensor<16x1xi64, #blocked1> -> tensor<16x16xi64, #blocked1> 2026-02-21T18:00:19.6283839Z %84 = ttg.convert_layout %83 : tensor<16x16xi64, #blocked1> -> tensor<16x16xi64, #blocked> 2026-02-21T18:00:19.6284060Z %85 = arith.addi %84, %46 : tensor<16x16xi64, #blocked> 2026-02-21T18:00:19.6284272Z %86 = tt.addptr %9, %85 : tensor<16x16x!tt.ptr, #blocked>, tensor<16x16xi64, #blocked> 2026-02-21T18:00:19.6284521Z %87 = arith.cmpi sge, %82, 
%cst_1 : tensor<16x1xi64, #blocked1> 2026-02-21T18:00:19.6284746Z %88 = arith.cmpi slt, %82, %cst_0 : tensor<16x1xi64, #blocked1> 2026-02-21T18:00:19.6284923Z %89 = arith.andi %87, %88 : tensor<16x1xi1, #blocked1> 2026-02-21T18:00:19.6285126Z %90 = tt.broadcast %89 : tensor<16x1xi1, #blocked1> -> tensor<16x16xi1, #blocked1> 2026-02-21T18:00:19.6285374Z %91 = ttg.convert_layout %90 : tensor<16x16xi1, #blocked1> -> tensor<16x16xi1, #blocked> 2026-02-21T18:00:19.6285589Z %92 = arith.andi %91, %50 : tensor<16x16xi1, #blocked> 2026-02-21T18:00:19.6285768Z %93 = tt.load %86, %92, %cst : tensor<16x16x!tt.ptr, #blocked> 2026-02-21T18:00:19.6286048Z %94 = ttg.convert_layout %79 : tensor<8x16xf16, #blocked> -> tensor<8x16xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T18:00:19.6286422Z %95 = ttg.convert_layout %93 : tensor<16x16xf16, #blocked> -> tensor<16x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T18:00:19.6286743Z %96 = ttg.convert_layout %arg5 : tensor<8x16xf32, #blocked> -> tensor<8x16xf32, #blocked> 2026-02-21T18:00:19.6287198Z %97 = tt.dot %94, %95, %96, inputPrecision = tf32 : tensor<8x16xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<16x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<8x16xf32, #blocked> 2026-02-21T18:00:19.6287562Z scf.yield %97 : tensor<8x16xf32, #blocked> 2026-02-21T18:00:19.6287682Z } 2026-02-21T18:00:19.6287816Z %52 = arith.truncf %51 : tensor<8x16xf32, #blocked> to tensor<8x16xf16, #blocked> 2026-02-21T18:00:19.6288080Z %53 = ttg.convert_layout %21 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:00:19.6288402Z %54 = tt.expand_dims %53 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<8x1xi32, #blocked3> 2026-02-21T18:00:19.6288686Z %55 = ttg.convert_layout %54 : tensor<8x1xi32, #blocked3> -> tensor<8x1xi32, #blocked1> 2026-02-21T18:00:19.6288884Z %56 = arith.muli %55, %cst_8 : tensor<8x1xi32, 
#blocked1> 2026-02-21T18:00:19.6289125Z %57 = ttg.convert_layout %24 : tensor<16xi32, #blocked2> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked4}>> 2026-02-21T18:00:19.6289461Z %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked4}>> -> tensor<1x16xi32, #blocked4> 2026-02-21T18:00:19.6289743Z %59 = ttg.convert_layout %58 : tensor<1x16xi32, #blocked4> -> tensor<1x16xi32, #blocked> 2026-02-21T18:00:19.6289970Z %60 = tt.broadcast %56 : tensor<8x1xi32, #blocked1> -> tensor<8x16xi32, #blocked1> 2026-02-21T18:00:19.6290193Z %61 = ttg.convert_layout %60 : tensor<8x16xi32, #blocked1> -> tensor<8x16xi32, #blocked> 2026-02-21T18:00:19.6290414Z %62 = tt.broadcast %59 : tensor<1x16xi32, #blocked> -> tensor<8x16xi32, #blocked> 2026-02-21T18:00:19.6290600Z %63 = arith.addi %61, %62 : tensor<8x16xi32, #blocked> 2026-02-21T18:00:19.6290789Z %64 = tt.addptr %10, %63 : tensor<8x16x!tt.ptr, #blocked>, tensor<8x16xi32, #blocked> 2026-02-21T18:00:19.6290985Z tt.store %64, %52 : tensor<8x16x!tt.ptr, #blocked> 2026-02-21T18:00:19.6291119Z } {tt.flatten} 2026-02-21T18:00:19.6291207Z tt.return 2026-02-21T18:00:19.6291287Z } 2026-02-21T18:00:19.6291364Z } 2026-02-21T18:00:19.6291408Z 2026-02-21T18:00:19.6291441Z {-# 2026-02-21T18:00:19.6291524Z external_resources: { 2026-02-21T18:00:19.6291623Z mlir_reproducer: { 2026-02-21T18:00:19.6293876Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 
region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=2 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=2}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:00:19.6296203Z disable_threading: false, 2026-02-21T18:00:19.6296310Z verify_each: true 2026-02-21T18:00:19.6296401Z } 2026-02-21T18:00:19.6296475Z } 2026-02-21T18:00:19.6296545Z #-} 2026-02-21T18:00:19.6296841Z /tmp/torchinductor_root/q5/cq5gtvi5yumh5xk7aiunsgkhbjcdigw6repsafemw6kovk4ulk6q.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:00:19.6297522Z /tmp/torchinductor_root/q5/cq5gtvi5yumh5xk7aiunsgkhbjcdigw6repsafemw6kovk4ulk6q.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:00:19.6298069Z [14s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 
2026-02-21T18:00:19.6298854Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 16, 16], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=4, num_stages=2, num_warps=1, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[True, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T18:00:19.6299582Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:00:19.6299750Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:00:22.5372227Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 9.3 configs/s 2026-02-21T18:00:22.5379612Z [17s] Adaptive compile timeout: 30s (90% percentile=5.1s, bounds=[30.0s, 30s]) 2026-02-21T18:00:22.6047837Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 - configs/s 2026-02-21T18:00:22.8851112Z [17s] Initial random population of 100, 5 starting points: 2026-02-21T18:00:22.8851672Z error=14 2026-02-21T18:00:22.8851875Z ok=86 2026-02-21T18:00:22.8852064Z min=0.1040 2026-02-21T18:00:22.8852257Z mid=1.5114 2026-02-21T18:00:22.8852460Z max=88.6379 2026-02-21T18:00:22.8852684Z best={'block_sizes': [128, 64, 32], 2026-02-21T18:00:22.8853021Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T18:00:22.8853368Z 'l2_groupings': [32], 2026-02-21T18:00:22.8853632Z 'load_eviction_policies': ['', ''], 2026-02-21T18:00:22.8853934Z 'loop_orders': [[0, 1]], 2026-02-21T18:00:22.8854192Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:00:22.8854449Z 'num_stages': 1, 2026-02-21T18:00:22.8854664Z 'num_warps': 16, 2026-02-21T18:00:22.8854885Z 'pid_type': 'flat', 2026-02-21T18:00:22.8855132Z 'range_flattens': [None, None], 2026-02-21T18:00:22.8855414Z 'range_multi_buffers': [None, False], 2026-02-21T18:00:22.8855710Z 'range_num_stages': [0, 0], 
2026-02-21T18:00:22.8855969Z 'range_unroll_factors': [0, 0], 2026-02-21T18:00:22.8856277Z 'range_warp_specializes': [], 2026-02-21T18:00:22.8856501Z 'waves_per_eu': 2} 2026-02-21T18:00:22.8872879Z [17s] Fitting surrogate: 100 points, 100 targets 2026-02-21T18:00:23.7791607Z [18s] Generation 1 starting: 82 neighbors, 5 active search path(s) 2026-02-21T18:00:29.1373952Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 34.0 configs/s 2026-02-21T18:00:33.5977472Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 18.8 configs/s 2026-02-21T18:00:36.0033035Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 363.1 2026-02-21T18:00:36.0033645Z configs/s 2026-02-21T18:00:36.6104815Z [31s] Generation 1 complete: 2026-02-21T18:00:36.6105214Z error=18 2026-02-21T18:00:36.6105400Z ok=69 2026-02-21T18:00:36.6105579Z min=0.0727 2026-02-21T18:00:36.6105759Z mid=0.2350 2026-02-21T18:00:36.6105935Z max=3.0542 2026-02-21T18:00:36.6106142Z best={'block_sizes': [64, 64, 64], 2026-02-21T18:00:36.6106473Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:00:36.6106799Z 'l2_groupings': [16], 2026-02-21T18:00:36.6107045Z 'load_eviction_policies': ['', ''], 2026-02-21T18:00:36.6107366Z 'loop_orders': [[1, 0]], 2026-02-21T18:00:36.6107609Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:00:36.6107854Z 'num_sm_multiplier': 16, 2026-02-21T18:00:36.6108086Z 'num_stages': 2, 2026-02-21T18:00:36.6108516Z 'num_warps': 4, 2026-02-21T18:00:36.6109295Z 'pid_type': 'persistent_blocked', 2026-02-21T18:00:36.6109607Z 'range_flattens': [False, None], 2026-02-21T18:00:36.6109877Z 'range_multi_buffers': [True, None], 2026-02-21T18:00:36.6110153Z 'range_num_stages': [0, 0], 2026-02-21T18:00:36.6110407Z 'range_unroll_factors': [0, 0], 2026-02-21T18:00:36.6110667Z 'range_warp_specializes': [], 2026-02-21T18:00:36.6110911Z 'waves_per_eu': 3} 2026-02-21T18:00:36.6375565Z [31s] Fitting surrogate: 187 points, 187 targets 2026-02-21T18:00:37.6135059Z [32s] 
Generation 2 starting: 93 neighbors, 5 active search path(s) 2026-02-21T18:00:50.1274248Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 96/96 0.8 configs/s 2026-02-21T18:00:55.7570280Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 96/96 17.5 configs/s 2026-02-21T18:01:00.5293815Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 198.0 2026-02-21T18:01:00.5294489Z configs/s 2026-02-21T18:01:01.2968002Z [56s] Generation 2 complete: 2026-02-21T18:01:01.2971013Z error=5 2026-02-21T18:01:01.2971282Z ok=93 2026-02-21T18:01:01.2971571Z min=0.0683 2026-02-21T18:01:01.2971921Z mid=0.1179 2026-02-21T18:01:01.2972202Z max=4.5736 2026-02-21T18:01:01.2972463Z best={'block_sizes': [128, 64, 64], 2026-02-21T18:01:01.2972813Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T18:01:01.2973137Z 'l2_groupings': [8], 2026-02-21T18:01:01.2973375Z 'load_eviction_policies': ['', ''], 2026-02-21T18:01:01.2973653Z 'loop_orders': [[0, 1]], 2026-02-21T18:01:01.2973898Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:01:01.2974136Z 'num_stages': 1, 2026-02-21T18:01:01.2974344Z 'num_warps': 4, 2026-02-21T18:01:01.2974564Z 'pid_type': 'flat', 2026-02-21T18:01:01.2974797Z 'range_flattens': [None, False], 2026-02-21T18:01:01.2975066Z 'range_multi_buffers': [None, None], 2026-02-21T18:01:01.2975341Z 'range_num_stages': [0, 0], 2026-02-21T18:01:01.2975590Z 'range_unroll_factors': [0, 0], 2026-02-21T18:01:01.2975862Z 'range_warp_specializes': [], 2026-02-21T18:01:01.2976108Z 'waves_per_eu': 1} 2026-02-21T18:01:01.3086832Z [56s] Fitting surrogate: 285 points, 285 targets 2026-02-21T18:01:02.3483840Z [57s] Generation 3 starting: 86 neighbors, 5 active search path(s) 2026-02-21T18:01:10.5602359Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 8.1 configs/s 2026-02-21T18:01:15.7797414Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 17.1 configs/s 2026-02-21T18:01:22.9856026Z Generation 3: verifying top configs 100% 
━━━━━━━━━━━━━━ 1000/1000 134.4 2026-02-21T18:01:22.9856593Z configs/s 2026-02-21T18:01:23.8311424Z [78s] Generation 3 complete: 2026-02-21T18:01:23.8311798Z error=2 2026-02-21T18:01:23.8312009Z ok=89 2026-02-21T18:01:23.8312215Z min=0.0572 2026-02-21T18:01:23.8312434Z mid=0.0882 2026-02-21T18:01:23.8313060Z max=1.7392 2026-02-21T18:01:23.8313316Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:01:23.8313732Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T18:01:23.8314104Z 'l2_groupings': [8], 2026-02-21T18:01:23.8314380Z 'load_eviction_policies': ['', ''], 2026-02-21T18:01:23.8314688Z 'loop_orders': [[0, 1]], 2026-02-21T18:01:23.8314978Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:01:23.8315246Z 'num_stages': 1, 2026-02-21T18:01:23.8315488Z 'num_warps': 4, 2026-02-21T18:01:23.8315719Z 'pid_type': 'flat', 2026-02-21T18:01:23.8315989Z 'range_flattens': [None, False], 2026-02-21T18:01:23.8316300Z 'range_multi_buffers': [None, None], 2026-02-21T18:01:23.8316639Z 'range_num_stages': [0, 0], 2026-02-21T18:01:23.8316929Z 'range_unroll_factors': [0, 0], 2026-02-21T18:01:23.8317238Z 'range_warp_specializes': [], 2026-02-21T18:01:23.8317525Z 'waves_per_eu': 1} 2026-02-21T18:01:23.9607888Z [78s] Fitting surrogate: 376 points, 376 targets 2026-02-21T18:01:24.7090950Z [79s] Generation 4 starting: 58 neighbors, 4 active search path(s) 2026-02-21T18:01:29.5246209Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58/58 15.8 configs/s 2026-02-21T18:01:32.7186099Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 58/58 19.1 configs/s 2026-02-21T18:01:38.1224653Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 203.1 2026-02-21T18:01:38.1225045Z configs/s 2026-02-21T18:01:38.7733468Z [93s] Generation 4 complete: 2026-02-21T18:01:38.7733873Z error=7 2026-02-21T18:01:38.7734081Z ok=55 2026-02-21T18:01:38.7734288Z min=0.0542 2026-02-21T18:01:38.7734506Z mid=0.0697 2026-02-21T18:01:38.7734713Z max=0.7169 2026-02-21T18:01:38.7734944Z 
best={'block_sizes': [128, 128, 64], 2026-02-21T18:01:38.7735356Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T18:01:38.7735731Z 'l2_groupings': [64], 2026-02-21T18:01:38.7736004Z 'load_eviction_policies': ['', ''], 2026-02-21T18:01:38.7736327Z 'loop_orders': [[0, 1]], 2026-02-21T18:01:38.7736620Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:01:38.7736913Z 'num_stages': 1, 2026-02-21T18:01:38.7737144Z 'num_warps': 4, 2026-02-21T18:01:38.7737381Z 'pid_type': 'flat', 2026-02-21T18:01:38.7737648Z 'range_flattens': [None, None], 2026-02-21T18:01:38.7737959Z 'range_multi_buffers': [None, None], 2026-02-21T18:01:38.7738293Z 'range_num_stages': [0, 0], 2026-02-21T18:01:38.7738553Z 'range_unroll_factors': [0, 0], 2026-02-21T18:01:38.7738841Z 'range_warp_specializes': [], 2026-02-21T18:01:38.7739101Z 'waves_per_eu': 2} 2026-02-21T18:01:38.8573702Z [93s] Fitting surrogate: 438 points, 438 targets 2026-02-21T18:01:39.6395723Z [94s] Generation 5 starting: 60 neighbors, 4 active search path(s) 2026-02-21T18:01:44.5525173Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60/60 13.4 configs/s 2026-02-21T18:01:48.0375393Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 60/60 17.4 configs/s 2026-02-21T18:01:52.4651745Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 214.9 2026-02-21T18:01:52.4652359Z configs/s 2026-02-21T18:01:53.1595862Z [108s] Generation 5 complete: 2026-02-21T18:01:53.1596212Z error=3 2026-02-21T18:01:53.1596409Z ok=61 2026-02-21T18:01:53.1596603Z min=0.0538 2026-02-21T18:01:53.1596806Z mid=0.0706 2026-02-21T18:01:53.1596995Z max=1.9640 2026-02-21T18:01:53.1597208Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:01:53.1597567Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T18:01:53.1597912Z 'l2_groupings': [4], 2026-02-21T18:01:53.1598171Z 'load_eviction_policies': ['', ''], 2026-02-21T18:01:53.1598463Z 'loop_orders': [[0, 1]], 2026-02-21T18:01:53.1598727Z 'matrix_instr_nonkdim': 16, 
2026-02-21T18:01:53.1598998Z 'num_stages': 1, 2026-02-21T18:01:53.1599221Z 'num_warps': 4, 2026-02-21T18:01:53.1599443Z 'pid_type': 'flat', 2026-02-21T18:01:53.1599695Z 'range_flattens': [None, None], 2026-02-21T18:01:53.1599988Z 'range_multi_buffers': [None, True], 2026-02-21T18:01:53.1600503Z 'range_num_stages': [0, 0], 2026-02-21T18:01:53.1600776Z 'range_unroll_factors': [0, 0], 2026-02-21T18:01:53.1601054Z 'range_warp_specializes': [], 2026-02-21T18:01:53.1601322Z 'waves_per_eu': 1} 2026-02-21T18:01:53.2443769Z [108s] Fitting surrogate: 502 points, 502 targets 2026-02-21T18:01:53.8376838Z [108s] Generation 6 starting: 41 neighbors, 3 active search path(s) 2026-02-21T18:01:58.8857238Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 41/41 12.4 configs/s 2026-02-21T18:02:01.4034421Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 41/41 16.8 configs/s 2026-02-21T18:02:04.4591990Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 304.0 2026-02-21T18:02:04.4592633Z configs/s 2026-02-21T18:02:05.0133002Z [119s] Generation 6 complete: 2026-02-21T18:02:05.0133330Z ok=44 2026-02-21T18:02:05.0133545Z min=0.0561 2026-02-21T18:02:05.0133785Z mid=0.0736 2026-02-21T18:02:05.0134006Z max=0.7474 2026-02-21T18:02:05.0134530Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:02:05.0134908Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T18:02:05.0135270Z 'l2_groupings': [4], 2026-02-21T18:02:05.0135544Z 'load_eviction_policies': ['', ''], 2026-02-21T18:02:05.0135853Z 'loop_orders': [[0, 1]], 2026-02-21T18:02:05.0136132Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:02:05.0136403Z 'num_stages': 1, 2026-02-21T18:02:05.0136628Z 'num_warps': 4, 2026-02-21T18:02:05.0136872Z 'pid_type': 'flat', 2026-02-21T18:02:05.0137132Z 'range_flattens': [None, None], 2026-02-21T18:02:05.0137440Z 'range_multi_buffers': [None, True], 2026-02-21T18:02:05.0137747Z 'range_num_stages': [0, 0], 2026-02-21T18:02:05.0138037Z 'range_unroll_factors': [0, 0], 
2026-02-21T18:02:05.0138331Z 'range_warp_specializes': [], 2026-02-21T18:02:05.0138611Z 'waves_per_eu': 1} 2026-02-21T18:02:05.0704313Z [119s] Fitting surrogate: 546 points, 546 targets 2026-02-21T18:02:05.5031469Z [120s] Generation 7 starting: 26 neighbors, 2 active search path(s) 2026-02-21T18:02:08.8149073Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 7.3 configs/s 2026-02-21T18:02:10.3744490Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 26/26 18.5 configs/s 2026-02-21T18:02:12.7366622Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 535.2 2026-02-21T18:02:12.7367200Z configs/s 2026-02-21T18:02:13.2608194Z [128s] Generation 7 complete: 2026-02-21T18:02:13.2608467Z error=2 2026-02-21T18:02:13.2608659Z ok=26 2026-02-21T18:02:13.2608828Z min=0.0557 2026-02-21T18:02:13.2608998Z mid=0.0727 2026-02-21T18:02:13.2609165Z max=0.1151 2026-02-21T18:02:13.2609626Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:02:13.2609938Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T18:02:13.2610236Z 'l2_groupings': [4], 2026-02-21T18:02:13.2610580Z 'load_eviction_policies': ['', ''], 2026-02-21T18:02:13.2610843Z 'loop_orders': [[0, 1]], 2026-02-21T18:02:13.2611095Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:02:13.2611313Z 'num_stages': 1, 2026-02-21T18:02:13.2611504Z 'num_warps': 4, 2026-02-21T18:02:13.2611697Z 'pid_type': 'flat', 2026-02-21T18:02:13.2611912Z 'range_flattens': [None, None], 2026-02-21T18:02:13.2612162Z 'range_multi_buffers': [None, True], 2026-02-21T18:02:13.2612414Z 'range_num_stages': [0, 0], 2026-02-21T18:02:13.2612640Z 'range_unroll_factors': [0, 0], 2026-02-21T18:02:13.2612905Z 'range_warp_specializes': [], 2026-02-21T18:02:13.2613145Z 'waves_per_eu': 1} 2026-02-21T18:02:13.2897751Z [128s] Fitting surrogate: 574 points, 574 targets 2026-02-21T18:02:13.4242046Z [128s] Autotuning complete in 128.3s after searching 551 configs. 
2026-02-21T18:02:13.4242602Z One can hardcode the best config and skip autotuning with: 2026-02-21T18:02:13.4244689Z @helion.kernel(config=helion.Config(block_sizes=[128, 128, 64], indexing=['block_ptr', 'block_ptr', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=4, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T18:02:13.4246391Z 2026-02-21T18:02:13.4246849Z [128s] Code of selected kernel: /tmp/torchinductor_root/rd/crdvntgcxnvj3jxqld4s5yn2yxtbwufzthtiaznjzvvjyebtia2m.py 2026-02-21T18:02:36.1552511Z WARNING:tritonbench.utils.triton_op:Completed input ID 8: 2026-02-21T18:02:36.1552982Z (M, N, K) 2026-02-21T18:02:36.1553208Z ------------------- 2026-02-21T18:02:36.1553456Z (12288, 1024, 1024) 2026-02-21T18:02:36.1553624Z 2026-02-21T18:02:36.1559426Z 75%|███████▌ | 6/8 [21:36<07:00, 210.15s/it]WARNING:tritonbench.utils.triton_op:Running input ID 9: 2026-02-21T18:02:36.1559962Z (M, N, K) 2026-02-21T18:02:36.1560182Z ------------------- 2026-02-21T18:02:36.1560428Z (1024, 12288, 1024) 2026-02-21T18:02:36.1560892Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T18:03:10.7621430Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T18:03:48.1089762Z Autotune Choices Stats: 2026-02-21T18:03:48.1091437Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_251", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.056919001042842865, "best_triton_pos": 0} 2026-02-21T18:03:48.1101040Z AUTOTUNE mm(1024x1024, 1024x12288) 
2026-02-21T18:03:48.1101515Z strides: [1024, 1], [1, 1024] 2026-02-21T18:03:48.1101877Z dtypes: torch.float16, torch.float16 2026-02-21T18:03:48.1102791Z triton_mm_251 0.0569 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:03:48.1104702Z triton_mm_243 0.0661 ms 86.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:03:48.1105860Z triton_mm_244 0.0708 ms 80.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:03:48.1107018Z triton_mm_242 0.0712 ms 80.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T18:03:48.1108811Z triton_mm_237 0.0725 ms 78.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T18:03:48.1109945Z triton_mm_248 0.0753 ms 75.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T18:03:48.1111080Z triton_mm_250 0.0762 ms 74.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:03:48.1112211Z triton_mm_249 0.0781 
ms 72.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:03:48.1113442Z triton_mm_246 0.0794 ms 71.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:03:48.1114579Z triton_mm_247 0.0804 ms 70.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:03:48.1115259Z SingleProcess AUTOTUNE benchmarking takes 0.6750 seconds and 0.2517 seconds precompiling for 36 choices 2026-02-21T18:03:48.3290970Z INFO:tritonbench.utils.triton_op:Took 1334.82ms to get benchmark function for pt2_triton_matmul 2026-02-21T18:04:21.4914503Z WARNING:__main__:Input tensor metadata: 2026-02-21T18:04:21.4914987Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T18:04:21.4915294Z 'dtype': 'torch.float16', 2026-02-21T18:04:21.4915608Z 'shape': (1024, 1024), 2026-02-21T18:04:21.4915907Z 'stride': (1024, 1)}, 2026-02-21T18:04:21.4916193Z { 'device': 'cuda:0', 2026-02-21T18:04:21.4916902Z 'dtype': 'torch.float16', 2026-02-21T18:04:21.4917186Z 'shape': (1024, 12288), 2026-02-21T18:04:21.4917466Z 'stride': (1, 1024)}, 2026-02-21T18:04:21.4917721Z None), 2026-02-21T18:04:21.4917942Z 'kwargs': {}} 2026-02-21T18:04:21.4930265Z INFO:tritonbench.utils.triton_op:Took 2.06ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T18:04:21.5692127Z [0s] Autotune random seed: 2169133161 2026-02-21T18:04:21.6364111Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T18:04:29.6529881Z Initial population precompiling 100% 
━━━━━━━━━━━━━━━━━━━━ 100/100 16.5 configs/s 2026-02-21T18:04:38.9466224Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T18:04:38.9468054Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:04:38.9469320Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:04:38.9470036Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T18:04:38.9470969Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T18:04:38.9472064Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T18:04:38.9473665Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T18:04:38.9475023Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:04:38.9476693Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:04:38.9478060Z %cst = arith.constant dense<0.000000e+00> : tensor<256x16xf16, #blocked> 2026-02-21T18:04:38.9478758Z %cst_0 = arith.constant dense<12288> : tensor<1x16xi64, #blocked> 2026-02-21T18:04:38.9479398Z %cst_1 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T18:04:38.9480031Z %cst_2 = arith.constant dense<1024> : tensor<256x1xi64, #blocked1> 
2026-02-21T18:04:38.9480483Z %cst_3 = arith.constant dense<0> : tensor<256x1xi64, #blocked1> 2026-02-21T18:04:38.9480762Z %cst_4 = arith.constant dense<1024> : tensor<1x16xi64, #blocked> 2026-02-21T18:04:38.9481005Z %c512_i32 = arith.constant 512 : i32 2026-02-21T18:04:38.9481280Z %c256_i32 = arith.constant 256 : i32 2026-02-21T18:04:38.9481469Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T18:04:38.9481659Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:04:38.9481835Z %c304_i32 = arith.constant 304 : i32 2026-02-21T18:04:38.9482089Z %cst_5 = arith.constant dense<12288> : tensor<4x1xi32, #blocked1> 2026-02-21T18:04:38.9482373Z %cst_6 = arith.constant dense<1024> : tensor<4x1xi32, #blocked1> 2026-02-21T18:04:38.9482668Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<4x16xf32, #blocked> 2026-02-21T18:04:38.9482927Z %c4_i32 = arith.constant 4 : i32 2026-02-21T18:04:38.9483104Z %c16_i32 = arith.constant 16 : i32 2026-02-21T18:04:38.9483287Z %c2_i32 = arith.constant 2 : i32 2026-02-21T18:04:38.9483465Z %c768_i32 = arith.constant 768 : i32 2026-02-21T18:04:38.9483659Z %c196608_i32 = arith.constant 196608 : i32 2026-02-21T18:04:38.9483855Z %0 = tt.get_program_id x : i32 2026-02-21T18:04:38.9484107Z %1 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked2> 2026-02-21T18:04:38.9484496Z %2 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked2> 2026-02-21T18:04:38.9484813Z %3 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked2> 2026-02-21T18:04:38.9485144Z %4 = tt.splat %arg0 : !tt.ptr -> tensor<4x256x!tt.ptr, #blocked3> 2026-02-21T18:04:38.9485454Z %5 = tt.splat %arg1 : !tt.ptr -> tensor<256x16x!tt.ptr, #blocked> 2026-02-21T18:04:38.9485772Z %6 = arith.extsi %3 : tensor<256xi32, #blocked2> to tensor<256xi64, #blocked2> 2026-02-21T18:04:38.9486092Z %7 = arith.extsi %1 : tensor<16xi32, #blocked2> to tensor<16xi64, #blocked2> 2026-02-21T18:04:38.9486398Z %8 = tt.splat %arg2 : !tt.ptr -> 
tensor<4x16x!tt.ptr, #blocked> 2026-02-21T18:04:38.9486692Z scf.for %arg3 = %0 to %c196608_i32 step %c304_i32 : i32 { 2026-02-21T18:04:38.9486931Z %9 = arith.divsi %arg3, %c512_i32 : i32 2026-02-21T18:04:38.9487145Z %10 = arith.muli %9, %c2_i32 : i32 2026-02-21T18:04:38.9487435Z %11 = arith.subi %c768_i32, %10 : i32 2026-02-21T18:04:38.9487718Z %12 = arith.minsi %11, %c2_i32 : i32 2026-02-21T18:04:38.9487911Z %13 = arith.remsi %arg3, %c512_i32 : i32 2026-02-21T18:04:38.9488098Z %14 = arith.remsi %13, %12 : i32 2026-02-21T18:04:38.9488279Z %15 = arith.addi %10, %14 : i32 2026-02-21T18:04:38.9488451Z %16 = arith.divsi %13, %12 : i32 2026-02-21T18:04:38.9488630Z %17 = arith.muli %15, %c16_i32 : i32 2026-02-21T18:04:38.9488841Z %18 = tt.splat %17 : i32 -> tensor<16xi32, #blocked2> 2026-02-21T18:04:38.9489082Z %19 = arith.addi %18, %1 : tensor<16xi32, #blocked2> 2026-02-21T18:04:38.9489292Z %20 = arith.muli %16, %c4_i32 : i32 2026-02-21T18:04:38.9489530Z %21 = tt.splat %20 : i32 -> tensor<4xi32, #blocked2> 2026-02-21T18:04:38.9489762Z %22 = arith.addi %21, %2 : tensor<4xi32, #blocked2> 2026-02-21T18:04:38.9490221Z %23 = ttg.convert_layout %22 : tensor<4xi32, #blocked2> -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:04:38.9490873Z %24 = tt.expand_dims %23 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi32, #blocked4> 2026-02-21T18:04:38.9491268Z %25 = ttg.convert_layout %24 : tensor<4x1xi32, #blocked4> -> tensor<4x1xi32, #blocked1> 2026-02-21T18:04:38.9491515Z %26 = arith.muli %25, %cst_6 : tensor<4x1xi32, #blocked1> 2026-02-21T18:04:38.9491746Z %27 = tt.broadcast %26 : tensor<4x1xi32, #blocked1> -> tensor<4x256xi32, #blocked1> 2026-02-21T18:04:38.9492030Z %28 = ttg.convert_layout %27 : tensor<4x256xi32, #blocked1> -> tensor<4x256xi32, #blocked3> 2026-02-21T18:04:38.9492257Z %29 = arith.extsi %17 : i32 to i64 2026-02-21T18:04:38.9492417Z %30 = tt.splat %29 : i64 -> tensor<16xi64, #blocked2> 
2026-02-21T18:04:38.9492597Z %31 = arith.addi %30, %7 : tensor<16xi64, #blocked2> 2026-02-21T18:04:38.9492895Z %32 = ttg.convert_layout %31 : tensor<16xi64, #blocked2> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:04:38.9493289Z %33 = tt.expand_dims %32 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi64, #blocked5> 2026-02-21T18:04:38.9493640Z %34 = ttg.convert_layout %33 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #blocked> 2026-02-21T18:04:38.9493879Z %35 = arith.muli %34, %cst_4 : tensor<1x16xi64, #blocked> 2026-02-21T18:04:38.9494104Z %36 = tt.broadcast %35 : tensor<1x16xi64, #blocked> -> tensor<256x16xi64, #blocked> 2026-02-21T18:04:38.9494342Z %37 = arith.cmpi sge, %34, %cst_1 : tensor<1x16xi64, #blocked> 2026-02-21T18:04:38.9494546Z %38 = arith.cmpi slt, %34, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T18:04:38.9494742Z %39 = arith.andi %37, %38 : tensor<1x16xi1, #blocked> 2026-02-21T18:04:38.9494958Z %40 = tt.broadcast %39 : tensor<1x16xi1, #blocked> -> tensor<256x16xi1, #blocked> 2026-02-21T18:04:38.9495291Z %41 = scf.for %arg4 = %c0_i32 to %c1024_i32 step %c256_i32 iter_args(%arg5 = %cst_7) -> (tensor<4x16xf32, #blocked>) : i32 { 2026-02-21T18:04:38.9495605Z %55 = tt.splat %arg4 : i32 -> tensor<256xi32, #blocked2> 2026-02-21T18:04:38.9495797Z %56 = arith.addi %55, %3 : tensor<256xi32, #blocked2> 2026-02-21T18:04:38.9496090Z %57 = ttg.convert_layout %56 : tensor<256xi32, #blocked2> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:04:38.9496500Z %58 = tt.expand_dims %57 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T18:04:38.9496861Z %59 = ttg.convert_layout %58 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked3> 2026-02-21T18:04:38.9497148Z %60 = tt.broadcast %59 : tensor<1x256xi32, #blocked3> -> tensor<4x256xi32, #blocked3> 2026-02-21T18:04:38.9497389Z %61 = 
arith.addi %28, %60 : tensor<4x256xi32, #blocked3> 2026-02-21T18:04:38.9497633Z %62 = tt.addptr %4, %61 : tensor<4x256x!tt.ptr, #blocked3>, tensor<4x256xi32, #blocked3> 2026-02-21T18:04:38.9497881Z %63 = tt.load %62 : tensor<4x256x!tt.ptr, #blocked3> 2026-02-21T18:04:38.9498055Z %64 = arith.extsi %arg4 : i32 to i64 2026-02-21T18:04:38.9498220Z %65 = tt.splat %64 : i64 -> tensor<256xi64, #blocked2> 2026-02-21T18:04:38.9498406Z %66 = arith.addi %65, %6 : tensor<256xi64, #blocked2> 2026-02-21T18:04:38.9498695Z %67 = ttg.convert_layout %66 : tensor<256xi64, #blocked2> -> tensor<256xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:04:38.9499100Z %68 = tt.expand_dims %67 {axis = 1 : i32} : tensor<256xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<256x1xi64, #blocked4> 2026-02-21T18:04:38.9499475Z %69 = ttg.convert_layout %68 : tensor<256x1xi64, #blocked4> -> tensor<256x1xi64, #blocked1> 2026-02-21T18:04:38.9499783Z %70 = tt.broadcast %69 : tensor<256x1xi64, #blocked1> -> tensor<256x16xi64, #blocked1> 2026-02-21T18:04:38.9500078Z %71 = ttg.convert_layout %70 : tensor<256x16xi64, #blocked1> -> tensor<256x16xi64, #blocked> 2026-02-21T18:04:38.9500329Z %72 = arith.addi %71, %36 : tensor<256x16xi64, #blocked> 2026-02-21T18:04:38.9500571Z %73 = tt.addptr %5, %72 : tensor<256x16x!tt.ptr, #blocked>, tensor<256x16xi64, #blocked> 2026-02-21T18:04:38.9500825Z %74 = arith.cmpi sge, %69, %cst_3 : tensor<256x1xi64, #blocked1> 2026-02-21T18:04:38.9501036Z %75 = arith.cmpi slt, %69, %cst_2 : tensor<256x1xi64, #blocked1> 2026-02-21T18:04:38.9501218Z %76 = arith.andi %74, %75 : tensor<256x1xi1, #blocked1> 2026-02-21T18:04:38.9501407Z %77 = tt.broadcast %76 : tensor<256x1xi1, #blocked1> -> tensor<256x16xi1, #blocked1> 2026-02-21T18:04:38.9501643Z %78 = ttg.convert_layout %77 : tensor<256x16xi1, #blocked1> -> tensor<256x16xi1, #blocked> 2026-02-21T18:04:38.9501870Z %79 = arith.andi %78, %40 : tensor<256x16xi1, #blocked> 2026-02-21T18:04:38.9502049Z %80 = tt.load %73, 
%79, %cst : tensor<256x16x!tt.ptr, #blocked> 2026-02-21T18:04:38.9502313Z %81 = ttg.convert_layout %63 : tensor<4x256xf16, #blocked3> -> tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T18:04:38.9502661Z %82 = ttg.convert_layout %80 : tensor<256x16xf16, #blocked> -> tensor<256x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T18:04:38.9503048Z %83 = ttg.convert_layout %arg5 : tensor<4x16xf32, #blocked> -> tensor<4x16xf32, #blocked> 2026-02-21T18:04:38.9503487Z %84 = tt.dot %81, %82, %83, inputPrecision = tf32 : tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * tensor<256x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<4x16xf32, #blocked> 2026-02-21T18:04:38.9503851Z scf.yield %84 : tensor<4x16xf32, #blocked> 2026-02-21T18:04:38.9504103Z } {tt.disallow_acc_multi_buffer} 2026-02-21T18:04:38.9511278Z %42 = arith.truncf %41 : tensor<4x16xf32, #blocked> to tensor<4x16xf16, #blocked> 2026-02-21T18:04:38.9514385Z %43 = ttg.convert_layout %22 : tensor<4xi32, #blocked2> -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:04:38.9514712Z %44 = tt.expand_dims %43 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi32, #blocked4> 2026-02-21T18:04:38.9514995Z %45 = ttg.convert_layout %44 : tensor<4x1xi32, #blocked4> -> tensor<4x1xi32, #blocked1> 2026-02-21T18:04:38.9515195Z %46 = arith.muli %45, %cst_5 : tensor<4x1xi32, #blocked1> 2026-02-21T18:04:38.9515431Z %47 = ttg.convert_layout %19 : tensor<16xi32, #blocked2> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:04:38.9515752Z %48 = tt.expand_dims %47 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi32, #blocked5> 2026-02-21T18:04:38.9516038Z %49 = ttg.convert_layout %48 : tensor<1x16xi32, #blocked5> -> tensor<1x16xi32, #blocked> 2026-02-21T18:04:38.9516264Z %50 = tt.broadcast %46 : tensor<4x1xi32, #blocked1> -> tensor<4x16xi32, 
#blocked1> 2026-02-21T18:04:38.9516493Z %51 = ttg.convert_layout %50 : tensor<4x16xi32, #blocked1> -> tensor<4x16xi32, #blocked> 2026-02-21T18:04:38.9516715Z %52 = tt.broadcast %49 : tensor<1x16xi32, #blocked> -> tensor<4x16xi32, #blocked> 2026-02-21T18:04:38.9516900Z %53 = arith.addi %51, %52 : tensor<4x16xi32, #blocked> 2026-02-21T18:04:38.9517092Z %54 = tt.addptr %8, %53 : tensor<4x16x!tt.ptr, #blocked>, tensor<4x16xi32, #blocked> 2026-02-21T18:04:38.9517288Z tt.store %54, %42 : tensor<4x16x!tt.ptr, #blocked> 2026-02-21T18:04:38.9517438Z } {tt.disallow_acc_multi_buffer, tt.flatten} 2026-02-21T18:04:38.9517560Z tt.return 2026-02-21T18:04:38.9517670Z } 2026-02-21T18:04:38.9517749Z } 2026-02-21T18:04:38.9517793Z 2026-02-21T18:04:38.9517823Z {-# 2026-02-21T18:04:38.9517907Z external_resources: { 2026-02-21T18:04:38.9518028Z mlir_reproducer: { 2026-02-21T18:04:38.9520293Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, 
tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:04:38.9522614Z disable_threading: false, 2026-02-21T18:04:38.9522723Z verify_each: true 2026-02-21T18:04:38.9522814Z } 2026-02-21T18:04:38.9522889Z } 2026-02-21T18:04:38.9522958Z #-} 2026-02-21T18:04:38.9523238Z /tmp/torchinductor_root/cq/ccqe7yecbnowhdzz7ovbdvo7g4yqqnnjdukp3euajvhjkks25glh.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:04:38.9523933Z /tmp/torchinductor_root/cq/ccqe7yecbnowhdzz7ovbdvo7g4yqqnnjdukp3euajvhjkks25glh.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:04:38.9524504Z [17s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:04:38.9525296Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 16, 256], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_sm_multiplier=1, num_stages=3, num_warps=2, pid_type='persistent_interleaved', range_flattens=[True, False], range_multi_buffers=[False, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T18:04:38.9526015Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:04:38.9526183Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T18:04:41.2171569Z Initial population exploring neighbors 95% ━━━━━━━━━━━━━ 95/100 8.2 configs/s 2026-02-21T18:04:41.2175027Z WARNING:tritonbench.utils.triton_op:Completed input ID 9: 2026-02-21T18:04:41.2175681Z (M, N, K) 2026-02-21T18:04:41.2175924Z ------------------- 2026-02-21T18:04:41.2176158Z (1024, 12288, 1024) 2026-02-21T18:04:41.2176307Z 2026-02-21T18:04:41.2186773Z 88%|████████▊ | 7/8 [23:41<03:02, 182.33s/it]WARNING:tritonbench.utils.triton_op:Running input ID 11: 2026-02-21T18:04:41.2187204Z (M, N, K) 2026-02-21T18:04:41.2187370Z ------------------- 2026-02-21T18:04:41.2187545Z (2048, 12288, 2048) 2026-02-21T18:04:41.2187882Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T18:05:17.1169137Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T18:05:54.7644878Z Autotune Choices Stats: 2026-02-21T18:05:54.7646388Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_278", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.19371600449085236, "best_triton_pos": 0} 2026-02-21T18:05:54.7657412Z AUTOTUNE mm(2048x2048, 2048x12288) 2026-02-21T18:05:54.7657665Z strides: [2048, 1], [1, 2048] 2026-02-21T18:05:54.7657850Z dtypes: torch.float16, torch.float16 2026-02-21T18:05:54.7658334Z triton_mm_278 0.1937 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2026-02-21T18:05:54.7659127Z triton_mm_287 0.2132 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=8 2026-02-21T18:05:54.7659990Z triton_mm_280 0.2171 ms 89.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:05:54.7660754Z triton_mm_279 0.2201 ms 88.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:05:54.7661512Z triton_mm_282 0.2219 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=256, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:05:54.7662269Z triton_mm_277 0.2256 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=1, num_stages=2, num_warps=8 2026-02-21T18:05:54.7663034Z triton_mm_286 0.2266 ms 85.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=256, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:05:54.7663850Z triton_mm_276 0.2280 ms 85.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2026-02-21T18:05:54.7664616Z triton_mm_275 0.2375 ms 81.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T18:05:54.7665387Z triton_mm_284 0.2390 ms 81.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=256, BLOCK_N=128, 
EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2026-02-21T18:05:54.7665980Z SingleProcess AUTOTUNE benchmarking takes 1.2417 seconds and 0.2491 seconds precompiling for 36 choices 2026-02-21T18:05:54.9250767Z INFO:tritonbench.utils.triton_op:Took 1841.09ms to get benchmark function for pt2_triton_matmul 2026-02-21T18:06:26.5819109Z WARNING:__main__:Input tensor metadata: 2026-02-21T18:06:26.5819497Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T18:06:26.5819769Z 'dtype': 'torch.float16', 2026-02-21T18:06:26.5820046Z 'shape': (2048, 2048), 2026-02-21T18:06:26.5820295Z 'stride': (2048, 1)}, 2026-02-21T18:06:26.5820547Z { 'device': 'cuda:0', 2026-02-21T18:06:26.5820792Z 'dtype': 'torch.float16', 2026-02-21T18:06:26.5821052Z 'shape': (2048, 12288), 2026-02-21T18:06:26.5821303Z 'stride': (1, 2048)}, 2026-02-21T18:06:26.5821537Z None), 2026-02-21T18:06:26.5822096Z 'kwargs': {}} 2026-02-21T18:06:26.5829435Z INFO:tritonbench.utils.triton_op:Took 1.49ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T18:06:26.6557674Z [0s] Autotune random seed: 2169133161 2026-02-21T18:06:26.7239316Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T18:06:38.1059620Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 7.0 configs/s 2026-02-21T18:07:06.2192347Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 
2026-02-21T18:07:06.2193935Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:07:06.2194792Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T18:07:06.2195639Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T18:07:06.2196793Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T18:07:06.2197622Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:07:06.2198428Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T18:07:06.2199257Z #blocked6 = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T18:07:06.2200155Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:07:06.2201412Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:07:06.2202331Z %c96_i32 = arith.constant 96 : i32 2026-02-21T18:07:06.2202664Z %c32_i32 = arith.constant 32 : i32 2026-02-21T18:07:06.2203100Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T18:07:06.2203421Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:07:06.2203746Z %cst = arith.constant dense<12288> : tensor<8x1xi32, #blocked> 2026-02-21T18:07:06.2204086Z %cst_0 = arith.constant dense<2048> : tensor<1x512xi32, #blocked1> 2026-02-21T18:07:06.2204394Z %cst_1 = arith.constant dense<2048> : tensor<8x1xi32, #blocked> 2026-02-21T18:07:06.2204714Z %cst_2 = arith.constant dense<0.000000e+00> 
: tensor<8x512xf32, #blocked1> 2026-02-21T18:07:06.2204996Z %c512_i32 = arith.constant 512 : i32 2026-02-21T18:07:06.2205198Z %c8_i32 = arith.constant 8 : i32 2026-02-21T18:07:06.2205383Z %c4_i32 = arith.constant 4 : i32 2026-02-21T18:07:06.2205577Z %c256_i32 = arith.constant 256 : i32 2026-02-21T18:07:06.2205773Z %0 = tt.get_program_id x : i32 2026-02-21T18:07:06.2205967Z %1 = arith.divsi %0, %c96_i32 : i32 2026-02-21T18:07:06.2206158Z %2 = arith.muli %1, %c4_i32 : i32 2026-02-21T18:07:06.2206356Z %3 = arith.subi %c256_i32, %2 : i32 2026-02-21T18:07:06.2206551Z %4 = arith.minsi %3, %c4_i32 : i32 2026-02-21T18:07:06.2206736Z %5 = arith.remsi %0, %c96_i32 : i32 2026-02-21T18:07:06.2206921Z %6 = arith.remsi %5, %4 : i32 2026-02-21T18:07:06.2207105Z %7 = arith.addi %2, %6 : i32 2026-02-21T18:07:06.2207281Z %8 = arith.divsi %5, %4 : i32 2026-02-21T18:07:06.2207462Z %9 = arith.muli %7, %c8_i32 : i32 2026-02-21T18:07:06.2207729Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked2> 2026-02-21T18:07:06.2208043Z %11 = tt.splat %9 : i32 -> tensor<8xi32, #blocked2> 2026-02-21T18:07:06.2208295Z %12 = arith.addi %11, %10 : tensor<8xi32, #blocked2> 2026-02-21T18:07:06.2208602Z %13 = arith.muli %8, %c512_i32 : i32 2026-02-21T18:07:06.2208869Z %14 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked2> 2026-02-21T18:07:06.2209260Z %15 = tt.splat %13 : i32 -> tensor<512xi32, #blocked2> 2026-02-21T18:07:06.2209514Z %16 = arith.addi %15, %14 : tensor<512xi32, #blocked2> 2026-02-21T18:07:06.2209809Z %17 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked2> 2026-02-21T18:07:06.2210250Z %18 = ttg.convert_layout %12 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:07:06.2210806Z %19 = tt.expand_dims %18 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<8x1xi32, #blocked3> 2026-02-21T18:07:06.2211282Z %20 = ttg.convert_layout 
%19 : tensor<8x1xi32, #blocked3> -> tensor<8x1xi32, #blocked> 2026-02-21T18:07:06.2211618Z %21 = arith.muli %20, %cst_1 : tensor<8x1xi32, #blocked> 2026-02-21T18:07:06.2211927Z %22 = tt.broadcast %21 : tensor<8x1xi32, #blocked> -> tensor<8x32xi32, #blocked> 2026-02-21T18:07:06.2212300Z %23 = ttg.convert_layout %22 : tensor<8x32xi32, #blocked> -> tensor<8x32xi32, #blocked4> 2026-02-21T18:07:06.2212708Z %24 = tt.splat %arg0 : !tt.ptr -> tensor<8x32x!tt.ptr, #blocked4> 2026-02-21T18:07:06.2213159Z %25 = ttg.convert_layout %16 : tensor<512xi32, #blocked2> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:07:06.2213724Z %26 = tt.expand_dims %25 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x512xi32, #blocked5> 2026-02-21T18:07:06.2214229Z %27 = ttg.convert_layout %26 : tensor<1x512xi32, #blocked5> -> tensor<1x512xi32, #blocked1> 2026-02-21T18:07:06.2214546Z %28 = arith.muli %27, %cst_0 : tensor<1x512xi32, #blocked1> 2026-02-21T18:07:06.2214816Z %29 = tt.broadcast %28 : tensor<1x512xi32, #blocked1> -> tensor<32x512xi32, #blocked1> 2026-02-21T18:07:06.2215089Z %30 = tt.splat %arg1 : !tt.ptr -> tensor<32x512x!tt.ptr, #blocked1> 2026-02-21T18:07:06.2215427Z %31 = scf.for %arg3 = %c0_i32 to %c2048_i32 step %c32_i32 iter_args(%arg4 = %cst_2) -> (tensor<8x512xf32, #blocked1>) : i32 { 2026-02-21T18:07:06.2215735Z %46 = tt.splat %arg3 : i32 -> tensor<32xi32, #blocked2> 2026-02-21T18:07:06.2215949Z %47 = arith.addi %46, %17 : tensor<32xi32, #blocked2> 2026-02-21T18:07:06.2216247Z %48 = ttg.convert_layout %47 : tensor<32xi32, #blocked2> -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:07:06.2216664Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x32xi32, #blocked5> 2026-02-21T18:07:06.2217050Z %50 = ttg.convert_layout %49 : tensor<1x32xi32, #blocked5> -> tensor<1x32xi32, #blocked4> 2026-02-21T18:07:06.2217338Z %51 = 
tt.broadcast %50 : tensor<1x32xi32, #blocked4> -> tensor<8x32xi32, #blocked4> 2026-02-21T18:07:06.2217582Z %52 = arith.addi %23, %51 : tensor<8x32xi32, #blocked4> 2026-02-21T18:07:06.2217829Z %53 = tt.addptr %24, %52 : tensor<8x32x!tt.ptr, #blocked4>, tensor<8x32xi32, #blocked4> 2026-02-21T18:07:06.2218087Z %54 = tt.load %53 : tensor<8x32x!tt.ptr, #blocked4> 2026-02-21T18:07:06.2218387Z %55 = ttg.convert_layout %47 : tensor<32xi32, #blocked2> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:07:06.2218799Z %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<32x1xi32, #blocked3> 2026-02-21T18:07:06.2219164Z %57 = ttg.convert_layout %56 : tensor<32x1xi32, #blocked3> -> tensor<32x1xi32, #blocked> 2026-02-21T18:07:06.2219449Z %58 = tt.broadcast %57 : tensor<32x1xi32, #blocked> -> tensor<32x512xi32, #blocked> 2026-02-21T18:07:06.2219746Z %59 = ttg.convert_layout %58 : tensor<32x512xi32, #blocked> -> tensor<32x512xi32, #blocked1> 2026-02-21T18:07:06.2220004Z %60 = arith.addi %59, %29 : tensor<32x512xi32, #blocked1> 2026-02-21T18:07:06.2220274Z %61 = tt.addptr %30, %60 : tensor<32x512x!tt.ptr, #blocked1>, tensor<32x512xi32, #blocked1> 2026-02-21T18:07:06.2220555Z %62 = tt.load %61 : tensor<32x512x!tt.ptr, #blocked1> 2026-02-21T18:07:06.2220888Z %63 = ttg.convert_layout %54 : tensor<8x32xf16, #blocked4> -> tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> 2026-02-21T18:07:06.2221323Z %64 = ttg.convert_layout %62 : tensor<32x512xf16, #blocked1> -> tensor<32x512xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> 2026-02-21T18:07:06.2221702Z %65 = ttg.convert_layout %arg4 : tensor<8x512xf32, #blocked1> -> tensor<8x512xf32, #blocked6> 2026-02-21T18:07:06.2222218Z %66 = tt.dot %63, %64, %65, inputPrecision = tf32 : tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * tensor<32x512xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<8x512xf32, #blocked6> 
2026-02-21T18:07:06.2222718Z %67 = ttg.convert_layout %66 : tensor<8x512xf32, #blocked6> -> tensor<8x512xf32, #blocked1> 2026-02-21T18:07:06.2222965Z scf.yield %67 : tensor<8x512xf32, #blocked1> 2026-02-21T18:07:06.2223141Z } {tt.disallow_acc_multi_buffer, tt.flatten} 2026-02-21T18:07:06.2223387Z %32 = arith.truncf %31 : tensor<8x512xf32, #blocked1> to tensor<8x512xf16, #blocked1> 2026-02-21T18:07:06.2223728Z %33 = ttg.convert_layout %12 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:07:06.2224134Z %34 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<8x1xi32, #blocked3> 2026-02-21T18:07:06.2224481Z %35 = ttg.convert_layout %34 : tensor<8x1xi32, #blocked3> -> tensor<8x1xi32, #blocked> 2026-02-21T18:07:06.2224720Z %36 = arith.muli %35, %cst : tensor<8x1xi32, #blocked> 2026-02-21T18:07:06.2225014Z %37 = ttg.convert_layout %16 : tensor<512xi32, #blocked2> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:07:06.2225432Z %38 = tt.expand_dims %37 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x512xi32, #blocked5> 2026-02-21T18:07:06.2225765Z %39 = ttg.convert_layout %38 : tensor<1x512xi32, #blocked5> -> tensor<1x512xi32, #blocked1> 2026-02-21T18:07:06.2226002Z %40 = tt.broadcast %36 : tensor<8x1xi32, #blocked> -> tensor<8x512xi32, #blocked> 2026-02-21T18:07:06.2226253Z %41 = ttg.convert_layout %40 : tensor<8x512xi32, #blocked> -> tensor<8x512xi32, #blocked1> 2026-02-21T18:07:06.2226486Z %42 = tt.broadcast %39 : tensor<1x512xi32, #blocked1> -> tensor<8x512xi32, #blocked1> 2026-02-21T18:07:06.2226676Z %43 = arith.addi %41, %42 : tensor<8x512xi32, #blocked1> 2026-02-21T18:07:06.2226857Z %44 = tt.splat %arg2 : !tt.ptr -> tensor<8x512x!tt.ptr, #blocked1> 2026-02-21T18:07:06.2227085Z %45 = tt.addptr %44, %43 : tensor<8x512x!tt.ptr, #blocked1>, tensor<8x512xi32, #blocked1> 2026-02-21T18:07:06.2227291Z 
tt.store %45, %32 : tensor<8x512x!tt.ptr, #blocked1> 2026-02-21T18:07:06.2227431Z tt.return 2026-02-21T18:07:06.2227512Z } 2026-02-21T18:07:06.2227594Z } 2026-02-21T18:07:06.2227638Z 2026-02-21T18:07:06.2227670Z {-# 2026-02-21T18:07:06.2227754Z external_resources: { 2026-02-21T18:07:06.2227857Z mlir_reproducer: { 2026-02-21T18:07:06.2230286Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:07:06.2232681Z disable_threading: false, 2026-02-21T18:07:06.2232788Z verify_each: true 2026-02-21T18:07:06.2232882Z } 2026-02-21T18:07:06.2232956Z } 2026-02-21T18:07:06.2233028Z #-} 2026-02-21T18:07:06.2233313Z 
/tmp/torchinductor_root/rp/crpau6isktrl3t5bwjgebx6hv7gfrdnmq4oxmbz5djlw5ip5k22n.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:07:06.2234042Z /tmp/torchinductor_root/rp/crpau6isktrl3t5bwjgebx6hv7gfrdnmq4oxmbz5djlw5ip5k22n.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:07:06.2234608Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:07:06.2235328Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 512, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T18:07:06.2235983Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:07:06.2236153Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:07:06.2575766Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 
2026-02-21T18:07:06.2577633Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [4, 16], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:07:06.2578496Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:07:06.2579315Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T18:07:06.2580124Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T18:07:06.2580948Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T18:07:06.2581765Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T18:07:06.2582695Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:07:06.2583980Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:07:06.2585033Z %cst = arith.constant dense<0.000000e+00> : tensor<512x16xf16, #blocked> 2026-02-21T18:07:06.2585563Z %cst_0 = arith.constant dense<12288> : tensor<1x16xi64, #blocked> 2026-02-21T18:07:06.2585854Z %cst_1 = arith.constant dense<0> : tensor<1x16xi64, #blocked> 2026-02-21T18:07:06.2586067Z %cst_2 = arith.constant dense<2048> : tensor<512x1xi64, #blocked1> 2026-02-21T18:07:06.2586284Z %cst_3 = arith.constant dense<0> : tensor<512x1xi64, #blocked1> 2026-02-21T18:07:06.2586459Z %cst_4 = arith.constant dense<2048> : tensor<1x16xi64, #blocked> 2026-02-21T18:07:06.2586630Z %c512_i32 = arith.constant 512 : i32 2026-02-21T18:07:06.2586753Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T18:07:06.2586874Z %c0_i32 = 
arith.constant 0 : i32 2026-02-21T18:07:06.2586985Z %c1_i32 = arith.constant 1 : i32 2026-02-21T18:07:06.2587128Z %cst_5 = arith.constant dense<12288> : tensor<8x1xi32, #blocked1> 2026-02-21T18:07:06.2587303Z %cst_6 = arith.constant dense<2048> : tensor<8x1xi32, #blocked1> 2026-02-21T18:07:06.2587485Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<8x16xf32, #blocked> 2026-02-21T18:07:06.2587668Z %c16_i32 = arith.constant 16 : i32 2026-02-21T18:07:06.2587783Z %c8_i32 = arith.constant 8 : i32 2026-02-21T18:07:06.2587894Z %c256_i32 = arith.constant 256 : i32 2026-02-21T18:07:06.2588019Z %c196608_i32 = arith.constant 196608 : i32 2026-02-21T18:07:06.2588218Z %c41_i32 = arith.constant 41 : i32 2026-02-21T18:07:06.2588334Z %0 = tt.get_program_id x : i32 2026-02-21T18:07:06.2588445Z %1 = arith.muli %0, %c41_i32 : i32 2026-02-21T18:07:06.2588583Z %2 = arith.addi %1, %c41_i32 : i32 2026-02-21T18:07:06.2588700Z %3 = arith.minsi %2, %c196608_i32 : i32 2026-02-21T18:07:06.2588859Z %4 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked2> 2026-02-21T18:07:06.2589062Z %5 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #blocked2> 2026-02-21T18:07:06.2589266Z %6 = tt.make_range {end = 512 : i32, start = 0 : i32} : tensor<512xi32, #blocked2> 2026-02-21T18:07:06.2589473Z %7 = tt.splat %arg0 : !tt.ptr -> tensor<8x512x!tt.ptr, #blocked3> 2026-02-21T18:07:06.2589668Z %8 = tt.splat %arg1 : !tt.ptr -> tensor<512x16x!tt.ptr, #blocked> 2026-02-21T18:07:06.2589872Z %9 = arith.extsi %6 : tensor<512xi32, #blocked2> to tensor<512xi64, #blocked2> 2026-02-21T18:07:06.2590075Z %10 = arith.extsi %5 : tensor<16xi32, #blocked2> to tensor<16xi64, #blocked2> 2026-02-21T18:07:06.2590277Z %11 = tt.splat %arg2 : !tt.ptr -> tensor<8x16x!tt.ptr, #blocked> 2026-02-21T18:07:06.2590448Z scf.for %arg3 = %1 to %3 step %c1_i32 : i32 { 2026-02-21T18:07:06.2590604Z %12 = arith.remsi %arg3, %c256_i32 : i32 2026-02-21T18:07:06.2590735Z %13 = arith.divsi %arg3, %c256_i32 
: i32 2026-02-21T18:07:06.2590855Z %14 = arith.muli %12, %c8_i32 : i32 2026-02-21T18:07:06.2590989Z %15 = tt.splat %14 : i32 -> tensor<8xi32, #blocked2> 2026-02-21T18:07:06.2591135Z %16 = arith.addi %15, %4 : tensor<8xi32, #blocked2> 2026-02-21T18:07:06.2591271Z %17 = arith.muli %13, %c16_i32 : i32 2026-02-21T18:07:06.2591405Z %18 = tt.splat %17 : i32 -> tensor<16xi32, #blocked2> 2026-02-21T18:07:06.2591551Z %19 = arith.addi %18, %5 : tensor<16xi32, #blocked2> 2026-02-21T18:07:06.2591783Z %20 = ttg.convert_layout %16 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:07:06.2592110Z %21 = tt.expand_dims %20 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<8x1xi32, #blocked4> 2026-02-21T18:07:06.2592397Z %22 = ttg.convert_layout %21 : tensor<8x1xi32, #blocked4> -> tensor<8x1xi32, #blocked1> 2026-02-21T18:07:06.2592599Z %23 = arith.muli %22, %cst_6 : tensor<8x1xi32, #blocked1> 2026-02-21T18:07:06.2592785Z %24 = tt.broadcast %23 : tensor<8x1xi32, #blocked1> -> tensor<8x512xi32, #blocked1> 2026-02-21T18:07:06.2593023Z %25 = ttg.convert_layout %24 : tensor<8x512xi32, #blocked1> -> tensor<8x512xi32, #blocked3> 2026-02-21T18:07:06.2593204Z %26 = arith.extsi %17 : i32 to i64 2026-02-21T18:07:06.2593337Z %27 = tt.splat %26 : i64 -> tensor<16xi64, #blocked2> 2026-02-21T18:07:06.2593482Z %28 = arith.addi %27, %10 : tensor<16xi64, #blocked2> 2026-02-21T18:07:06.2593712Z %29 = ttg.convert_layout %28 : tensor<16xi64, #blocked2> -> tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:07:06.2594081Z %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<16xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi64, #blocked5> 2026-02-21T18:07:06.2594365Z %31 = ttg.convert_layout %30 : tensor<1x16xi64, #blocked5> -> tensor<1x16xi64, #blocked> 2026-02-21T18:07:06.2594567Z %32 = arith.muli %31, %cst_4 : tensor<1x16xi64, #blocked> 2026-02-21T18:07:06.2594750Z %33 = tt.broadcast 
%32 : tensor<1x16xi64, #blocked> -> tensor<512x16xi64, #blocked> 2026-02-21T18:07:06.2594946Z %34 = arith.cmpi sge, %31, %cst_1 : tensor<1x16xi64, #blocked> 2026-02-21T18:07:06.2595114Z %35 = arith.cmpi slt, %31, %cst_0 : tensor<1x16xi64, #blocked> 2026-02-21T18:07:06.2595269Z %36 = arith.andi %34, %35 : tensor<1x16xi1, #blocked> 2026-02-21T18:07:06.2595447Z %37 = tt.broadcast %36 : tensor<1x16xi1, #blocked> -> tensor<512x16xi1, #blocked> 2026-02-21T18:07:06.2595724Z %38 = scf.for %arg4 = %c0_i32 to %c2048_i32 step %c512_i32 iter_args(%arg5 = %cst_7) -> (tensor<8x16xf32, #blocked>) : i32 { 2026-02-21T18:07:06.2595969Z %52 = tt.splat %arg4 : i32 -> tensor<512xi32, #blocked2> 2026-02-21T18:07:06.2596141Z %53 = arith.addi %52, %6 : tensor<512xi32, #blocked2> 2026-02-21T18:07:06.2596379Z %54 = ttg.convert_layout %53 : tensor<512xi32, #blocked2> -> tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:07:06.2596715Z %55 = tt.expand_dims %54 {axis = 0 : i32} : tensor<512xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x512xi32, #blocked5> 2026-02-21T18:07:06.2597006Z %56 = ttg.convert_layout %55 : tensor<1x512xi32, #blocked5> -> tensor<1x512xi32, #blocked3> 2026-02-21T18:07:06.2597244Z %57 = tt.broadcast %56 : tensor<1x512xi32, #blocked3> -> tensor<8x512xi32, #blocked3> 2026-02-21T18:07:06.2597438Z %58 = arith.addi %25, %57 : tensor<8x512xi32, #blocked3> 2026-02-21T18:07:06.2597635Z %59 = tt.addptr %7, %58 : tensor<8x512x!tt.ptr, #blocked3>, tensor<8x512xi32, #blocked3> 2026-02-21T18:07:06.2597840Z %60 = tt.load %59 : tensor<8x512x!tt.ptr, #blocked3> 2026-02-21T18:07:06.2597982Z %61 = arith.extsi %arg4 : i32 to i64 2026-02-21T18:07:06.2598137Z %62 = tt.splat %61 : i64 -> tensor<512xi64, #blocked2> 2026-02-21T18:07:06.2598287Z %63 = arith.addi %62, %9 : tensor<512xi64, #blocked2> 2026-02-21T18:07:06.2598523Z %64 = ttg.convert_layout %63 : tensor<512xi64, #blocked2> -> tensor<512xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 
2026-02-21T18:07:06.2598859Z %65 = tt.expand_dims %64 {axis = 1 : i32} : tensor<512xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<512x1xi64, #blocked4> 2026-02-21T18:07:06.2599149Z %66 = ttg.convert_layout %65 : tensor<512x1xi64, #blocked4> -> tensor<512x1xi64, #blocked1> 2026-02-21T18:07:06.2599389Z %67 = tt.broadcast %66 : tensor<512x1xi64, #blocked1> -> tensor<512x16xi64, #blocked1> 2026-02-21T18:07:06.2599624Z %68 = ttg.convert_layout %67 : tensor<512x16xi64, #blocked1> -> tensor<512x16xi64, #blocked> 2026-02-21T18:07:06.2599829Z %69 = arith.addi %68, %33 : tensor<512x16xi64, #blocked> 2026-02-21T18:07:06.2600026Z %70 = tt.addptr %8, %69 : tensor<512x16x!tt.ptr, #blocked>, tensor<512x16xi64, #blocked> 2026-02-21T18:07:06.2600234Z %71 = arith.cmpi sge, %66, %cst_3 : tensor<512x1xi64, #blocked1> 2026-02-21T18:07:06.2600407Z %72 = arith.cmpi slt, %66, %cst_2 : tensor<512x1xi64, #blocked1> 2026-02-21T18:07:06.2600571Z %73 = arith.andi %71, %72 : tensor<512x1xi1, #blocked1> 2026-02-21T18:07:06.2600761Z %74 = tt.broadcast %73 : tensor<512x1xi1, #blocked1> -> tensor<512x16xi1, #blocked1> 2026-02-21T18:07:06.2600997Z %75 = ttg.convert_layout %74 : tensor<512x16xi1, #blocked1> -> tensor<512x16xi1, #blocked> 2026-02-21T18:07:06.2601192Z %76 = arith.andi %75, %37 : tensor<512x16xi1, #blocked> 2026-02-21T18:07:06.2601374Z %77 = tt.load %70, %76, %cst : tensor<512x16x!tt.ptr, #blocked> 2026-02-21T18:07:06.2601649Z %78 = ttg.convert_layout %60 : tensor<8x512xf16, #blocked3> -> tensor<8x512xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> 2026-02-21T18:07:06.2601992Z %79 = ttg.convert_layout %77 : tensor<512x16xf16, #blocked> -> tensor<512x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> 2026-02-21T18:07:06.2602288Z %80 = ttg.convert_layout %arg5 : tensor<8x16xf32, #blocked> -> tensor<8x16xf32, #blocked> 2026-02-21T18:07:06.2602686Z %81 = tt.dot %78, %79, %80, inputPrecision = tf32 : tensor<8x512xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> * 
tensor<512x16xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked}>> -> tensor<8x16xf32, #blocked> 2026-02-21T18:07:06.2603024Z scf.yield %81 : tensor<8x16xf32, #blocked> 2026-02-21T18:07:06.2603153Z } {tt.disallow_acc_multi_buffer} 2026-02-21T18:07:06.2603321Z %39 = arith.truncf %38 : tensor<8x16xf32, #blocked> to tensor<8x16xf16, #blocked> 2026-02-21T18:07:06.2603592Z %40 = ttg.convert_layout %16 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:07:06.2603927Z %41 = tt.expand_dims %40 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<8x1xi32, #blocked4> 2026-02-21T18:07:06.2604208Z %42 = ttg.convert_layout %41 : tensor<8x1xi32, #blocked4> -> tensor<8x1xi32, #blocked1> 2026-02-21T18:07:06.2604403Z %43 = arith.muli %42, %cst_5 : tensor<8x1xi32, #blocked1> 2026-02-21T18:07:06.2604637Z %44 = ttg.convert_layout %19 : tensor<16xi32, #blocked2> -> tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:07:06.2604957Z %45 = tt.expand_dims %44 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x16xi32, #blocked5> 2026-02-21T18:07:06.2605236Z %46 = ttg.convert_layout %45 : tensor<1x16xi32, #blocked5> -> tensor<1x16xi32, #blocked> 2026-02-21T18:07:06.2605462Z %47 = tt.broadcast %43 : tensor<8x1xi32, #blocked1> -> tensor<8x16xi32, #blocked1> 2026-02-21T18:07:06.2605686Z %48 = ttg.convert_layout %47 : tensor<8x16xi32, #blocked1> -> tensor<8x16xi32, #blocked> 2026-02-21T18:07:06.2605909Z %49 = tt.broadcast %46 : tensor<1x16xi32, #blocked> -> tensor<8x16xi32, #blocked> 2026-02-21T18:07:06.2606123Z %50 = arith.addi %48, %49 : tensor<8x16xi32, #blocked> 2026-02-21T18:07:06.2606312Z %51 = tt.addptr %11, %50 : tensor<8x16x!tt.ptr, #blocked>, tensor<8x16xi32, #blocked> 2026-02-21T18:07:06.2606508Z tt.store %51, %39 : tensor<8x16x!tt.ptr, #blocked> 2026-02-21T18:07:06.2606634Z } 2026-02-21T18:07:06.2606714Z tt.return 2026-02-21T18:07:06.2606792Z } 
2026-02-21T18:07:06.2606867Z }
2026-02-21T18:07:06.2606909Z
2026-02-21T18:07:06.2606942Z {-#
2026-02-21T18:07:06.2607021Z external_resources: {
2026-02-21T18:07:06.2607122Z mlir_reproducer: {
2026-02-21T18:07:06.2609419Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)",
2026-02-21T18:07:06.2621027Z disable_threading: false,
2026-02-21T18:07:06.2621135Z verify_each: true
2026-02-21T18:07:06.2621229Z }
2026-02-21T18:07:06.2621301Z }
2026-02-21T18:07:06.2621374Z #-}
2026-02-21T18:07:06.2621653Z /tmp/torchinductor_root/o5/co5w4prkvquhkfcreahrotogwlvqqv4q4qblvz4vzc6rsg7olooy.py:14:0: error: Failures have been detected while processing an MLIR pass pipeline
2026-02-21T18:07:06.2622338Z /tmp/torchinductor_root/o5/co5w4prkvquhkfcreahrotogwlvqqv4q4qblvz4vzc6rsg7olooy.py:14:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
2026-02-21T18:07:06.2622892Z [39s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config.
2026-02-21T18:07:06.2623727Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 16, 512], indexing=['pointer', 'block_ptr', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_sm_multiplier=16, num_stages=3, num_warps=2, pid_type='persistent_blocked', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True)
2026-02-21T18:07:06.2624447Z Error: RuntimeError: PassManager::run failed
2026-02-21T18:07:06.2624617Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code.
2026-02-21T18:07:14.5356080Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 2.9 configs/s 2026-02-21T18:07:14.5363690Z [47s] Adaptive compile timeout: 30s (90% percentile=9.5s, bounds=[30.0s, 30s]) 2026-02-21T18:07:14.5899511Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 371/371 - configs/s 2026-02-21T18:07:15.5471842Z [48s] Initial random population of 100, 5 starting points: 2026-02-21T18:07:15.5472341Z error=11 2026-02-21T18:07:15.5472556Z ok=89 2026-02-21T18:07:15.5473110Z min=0.4779 2026-02-21T18:07:15.5473314Z mid=5.5911 2026-02-21T18:07:15.5473516Z max=872.5618 2026-02-21T18:07:15.5473768Z best={'block_sizes': [128, 16, 64], 2026-02-21T18:07:15.5474124Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:07:15.5474484Z 'l2_groupings': [1], 2026-02-21T18:07:15.5474761Z 'load_eviction_policies': ['', ''], 2026-02-21T18:07:15.5475073Z 'loop_orders': [[0, 1]], 2026-02-21T18:07:15.5475343Z 'matrix_instr_nonkdim': 0, 2026-02-21T18:07:15.5475630Z 'num_sm_multiplier': 16, 2026-02-21T18:07:15.5475891Z 'num_stages': 1, 2026-02-21T18:07:15.5476128Z 'num_warps': 4, 2026-02-21T18:07:15.5476387Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:07:15.5476727Z 'range_flattens': [False, None], 2026-02-21T18:07:15.5477036Z 'range_multi_buffers': [False, True], 2026-02-21T18:07:15.5477347Z 'range_num_stages': [0, 0], 2026-02-21T18:07:15.5477633Z 'range_unroll_factors': [0, 0], 2026-02-21T18:07:15.5477937Z 'range_warp_specializes': [], 2026-02-21T18:07:15.5478216Z 'waves_per_eu': 4} 2026-02-21T18:07:15.5494944Z [48s] Fitting surrogate: 100 points, 100 targets 2026-02-21T18:07:16.5257427Z [49s] Generation 1 starting: 87 neighbors, 5 active search path(s) 2026-02-21T18:07:27.0693433Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 9.6 configs/s 2026-02-21T18:07:34.0566781Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 13.3 configs/s 2026-02-21T18:07:36.5850045Z Generation 1: verifying top 
configs 100% ━━━━━━━━━━━━━━━ 706/706 189.7 configs/s 2026-02-21T18:07:37.9802503Z [71s] Generation 1 complete: 2026-02-21T18:07:37.9802891Z error=5 2026-02-21T18:07:37.9803096Z ok=88 2026-02-21T18:07:37.9803306Z min=0.2633 2026-02-21T18:07:37.9803860Z mid=0.7025 2026-02-21T18:07:37.9804065Z max=78.9533 2026-02-21T18:07:37.9804311Z best={'block_sizes': [128, 64, 64], 2026-02-21T18:07:37.9804689Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:07:37.9805197Z 'l2_groupings': [1], 2026-02-21T18:07:37.9805488Z 'load_eviction_policies': ['', ''], 2026-02-21T18:07:37.9805820Z 'loop_orders': [[1, 0]], 2026-02-21T18:07:37.9806095Z 'matrix_instr_nonkdim': 0, 2026-02-21T18:07:37.9806375Z 'num_sm_multiplier': 16, 2026-02-21T18:07:37.9806634Z 'num_stages': 2, 2026-02-21T18:07:37.9806867Z 'num_warps': 8, 2026-02-21T18:07:37.9807123Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:07:37.9807458Z 'range_flattens': [False, None], 2026-02-21T18:07:37.9807764Z 'range_multi_buffers': [False, True], 2026-02-21T18:07:37.9808083Z 'range_num_stages': [0, 0], 2026-02-21T18:07:37.9808369Z 'range_unroll_factors': [0, 0], 2026-02-21T18:07:37.9808665Z 'range_warp_specializes': [], 2026-02-21T18:07:37.9808944Z 'waves_per_eu': 4} 2026-02-21T18:07:37.9930135Z [71s] Fitting surrogate: 193 points, 193 targets 2026-02-21T18:07:39.8008176Z [73s] Generation 2 starting: 93 neighbors, 5 active search path(s) 2026-02-21T18:07:51.2019296Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 5.8 configs/s 2026-02-21T18:07:56.0294206Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 20.8 configs/s 2026-02-21T18:08:02.2509534Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━━ 824/824 114.8 configs/s 2026-02-21T18:08:03.7912701Z [97s] Generation 2 complete: 2026-02-21T18:08:03.7913128Z error=21 2026-02-21T18:08:03.7913354Z ok=78 2026-02-21T18:08:03.7913565Z min=0.2246 2026-02-21T18:08:03.7913767Z mid=0.4311 2026-02-21T18:08:03.7913965Z max=7.0046 
2026-02-21T18:08:03.7914193Z best={'block_sizes': [128, 128, 32], 2026-02-21T18:08:03.7914575Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T18:08:03.7914923Z 'l2_groupings': [4], 2026-02-21T18:08:03.7915195Z 'load_eviction_policies': ['', ''], 2026-02-21T18:08:03.7915551Z 'loop_orders': [[1, 0]], 2026-02-21T18:08:03.7915828Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:08:03.7916104Z 'num_sm_multiplier': 8, 2026-02-21T18:08:03.7916368Z 'num_stages': 3, 2026-02-21T18:08:03.7916611Z 'num_warps': 4, 2026-02-21T18:08:03.7916882Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:08:03.7917701Z 'range_flattens': [None, False], 2026-02-21T18:08:03.7917947Z 'range_multi_buffers': [False, False], 2026-02-21T18:08:03.7918208Z 'range_num_stages': [0, 0], 2026-02-21T18:08:03.7918440Z 'range_unroll_factors': [0, 0], 2026-02-21T18:08:03.7918686Z 'range_warp_specializes': [], 2026-02-21T18:08:03.7918906Z 'waves_per_eu': 1} 2026-02-21T18:08:03.8222361Z [97s] Fitting surrogate: 292 points, 292 targets 2026-02-21T18:08:04.9044587Z [98s] Generation 3 starting: 97 neighbors, 5 active search path(s) 2026-02-21T18:08:15.3023764Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98/98 4.3 configs/s 2026-02-21T18:08:20.3570893Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 98/98 20.0 configs/s 2026-02-21T18:08:30.0417648Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━━━ 890/890 84.2 configs/s 2026-02-21T18:08:31.4305987Z [124s] Generation 3 complete: 2026-02-21T18:08:31.4306711Z error=16 2026-02-21T18:08:31.4306965Z ok=86 2026-02-21T18:08:31.4307191Z min=0.2245 2026-02-21T18:08:31.4307419Z mid=0.3596 2026-02-21T18:08:31.4307616Z max=6.7837 2026-02-21T18:08:31.4307847Z best={'block_sizes': [128, 128, 32], 2026-02-21T18:08:31.4308341Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T18:08:31.4308696Z 'l2_groupings': [4], 2026-02-21T18:08:31.4308973Z 'load_eviction_policies': ['', ''], 2026-02-21T18:08:31.4309278Z 'loop_orders': [[1, 
0]], 2026-02-21T18:08:31.4309558Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:08:31.4309836Z 'num_sm_multiplier': 8, 2026-02-21T18:08:31.4310097Z 'num_stages': 3, 2026-02-21T18:08:31.4310321Z 'num_warps': 4, 2026-02-21T18:08:31.4310582Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:08:31.4310913Z 'range_flattens': [None, False], 2026-02-21T18:08:31.4311237Z 'range_multi_buffers': [False, False], 2026-02-21T18:08:31.4311554Z 'range_num_stages': [0, 0], 2026-02-21T18:08:31.4311838Z 'range_unroll_factors': [0, 0], 2026-02-21T18:08:31.4312143Z 'range_warp_specializes': [], 2026-02-21T18:08:31.4312424Z 'waves_per_eu': 1} 2026-02-21T18:08:31.4868500Z [124s] Fitting surrogate: 394 points, 394 targets 2026-02-21T18:08:32.4826618Z [125s] Generation 4 starting: 83 neighbors, 5 active search path(s) 2026-02-21T18:08:40.9159805Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 16.2 configs/s 2026-02-21T18:08:45.7154766Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 18.1 configs/s 2026-02-21T18:08:56.7356735Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━━━ 890/890 74.9 configs/s 2026-02-21T18:08:58.1260421Z [151s] Generation 4 complete: 2026-02-21T18:08:58.1260801Z error=9 2026-02-21T18:08:58.1261014Z ok=79 2026-02-21T18:08:58.1261218Z min=0.2336 2026-02-21T18:08:58.1261469Z mid=0.3415 2026-02-21T18:08:58.1261671Z max=12.1327 2026-02-21T18:08:58.1261910Z best={'block_sizes': [128, 128, 32], 2026-02-21T18:08:58.1262281Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T18:08:58.1262693Z 'l2_groupings': [4], 2026-02-21T18:08:58.1263283Z 'load_eviction_policies': ['', ''], 2026-02-21T18:08:58.1263731Z 'loop_orders': [[1, 0]], 2026-02-21T18:08:58.1264008Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:08:58.1264301Z 'num_sm_multiplier': 8, 2026-02-21T18:08:58.1264568Z 'num_stages': 3, 2026-02-21T18:08:58.1264803Z 'num_warps': 4, 2026-02-21T18:08:58.1265067Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:08:58.1265402Z 
'range_flattens': [None, False], 2026-02-21T18:08:58.1265717Z 'range_multi_buffers': [False, False], 2026-02-21T18:08:58.1266033Z 'range_num_stages': [0, 0], 2026-02-21T18:08:58.1266317Z 'range_unroll_factors': [0, 0], 2026-02-21T18:08:58.1266617Z 'range_warp_specializes': [], 2026-02-21T18:08:58.1266902Z 'waves_per_eu': 1} 2026-02-21T18:08:58.1944690Z [151s] Fitting surrogate: 482 points, 482 targets 2026-02-21T18:08:58.9865036Z [152s] Generation 5 starting: 66 neighbors, 4 active search path(s) 2026-02-21T18:09:05.1258147Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 68/68 25.2 configs/s 2026-02-21T18:09:09.0774150Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 68/68 17.6 configs/s 2026-02-21T18:09:15.1620055Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━━ 951/951 136.3 configs/s 2026-02-21T18:09:16.3283635Z [169s] Generation 5 complete: 2026-02-21T18:09:16.3283995Z error=6 2026-02-21T18:09:16.3284199Z ok=65 2026-02-21T18:09:16.3284410Z min=0.2067 2026-02-21T18:09:16.3284618Z mid=0.4048 2026-02-21T18:09:16.3284822Z max=12.2308 2026-02-21T18:09:16.3285056Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:09:16.3285460Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:09:16.3285820Z 'l2_groupings': [1], 2026-02-21T18:09:16.3286096Z 'load_eviction_policies': ['', ''], 2026-02-21T18:09:16.3286850Z 'loop_orders': [[1, 0]], 2026-02-21T18:09:16.3287132Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:09:16.3287412Z 'num_sm_multiplier': 16, 2026-02-21T18:09:16.3287670Z 'num_stages': 2, 2026-02-21T18:09:16.3288054Z 'num_warps': 8, 2026-02-21T18:09:16.3288328Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:09:16.3288679Z 'range_flattens': [False, False], 2026-02-21T18:09:16.3288989Z 'range_multi_buffers': [False, True], 2026-02-21T18:09:16.3289300Z 'range_num_stages': [0, 0], 2026-02-21T18:09:16.3289579Z 'range_unroll_factors': [0, 0], 2026-02-21T18:09:16.3289877Z 'range_warp_specializes': [], 
2026-02-21T18:09:16.3290149Z 'waves_per_eu': 4} 2026-02-21T18:09:16.3637685Z [169s] Fitting surrogate: 553 points, 553 targets 2026-02-21T18:09:17.0672904Z [170s] Generation 6 starting: 57 neighbors, 3 active search path(s) 2026-02-21T18:09:22.6516572Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57/57 4.9 configs/s 2026-02-21T18:09:25.8645381Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 57/57 18.6 configs/s 2026-02-21T18:09:34.0031038Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━━ 967/967 107.8 configs/s 2026-02-21T18:09:35.4179536Z [188s] Generation 6 complete: 2026-02-21T18:09:35.4179788Z error=9 2026-02-21T18:09:35.4179874Z ok=51 2026-02-21T18:09:35.4180350Z min=0.2123 2026-02-21T18:09:35.4181330Z mid=0.2896 2026-02-21T18:09:35.4181929Z max=16.5142 2026-02-21T18:09:35.4182106Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:09:35.4182343Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:09:35.4182552Z 'l2_groupings': [1], 2026-02-21T18:09:35.4182705Z 'load_eviction_policies': ['', ''], 2026-02-21T18:09:35.4182873Z 'loop_orders': [[1, 0]], 2026-02-21T18:09:35.4183026Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:09:35.4183186Z 'num_sm_multiplier': 16, 2026-02-21T18:09:35.4183334Z 'num_stages': 2, 2026-02-21T18:09:35.4183474Z 'num_warps': 8, 2026-02-21T18:09:35.4183625Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:09:35.4183840Z 'range_flattens': [False, False], 2026-02-21T18:09:35.4184012Z 'range_multi_buffers': [False, True], 2026-02-21T18:09:35.4184187Z 'range_num_stages': [0, 0], 2026-02-21T18:09:35.4184349Z 'range_unroll_factors': [0, 0], 2026-02-21T18:09:35.4184510Z 'range_warp_specializes': [], 2026-02-21T18:09:35.4184851Z 'waves_per_eu': 4} 2026-02-21T18:09:35.4695438Z [188s] Fitting surrogate: 613 points, 613 targets 2026-02-21T18:09:36.8646168Z [190s] Generation 7 starting: 48 neighbors, 3 active search path(s) 2026-02-21T18:09:41.0449785Z Generation 7: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 13.2 configs/s 2026-02-21T18:09:43.4415670Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 48/48 21.5 configs/s 2026-02-21T18:09:49.2555425Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━━ 967/967 145.2 configs/s 2026-02-21T18:09:50.3825205Z [203s] Generation 7 complete: 2026-02-21T18:09:50.3825694Z error=10 2026-02-21T18:09:50.3825933Z ok=41 2026-02-21T18:09:50.3826134Z min=0.2063 2026-02-21T18:09:50.3826375Z mid=0.3015 2026-02-21T18:09:50.3826573Z max=2.9533 2026-02-21T18:09:50.3826804Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:09:50.3827172Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:09:50.3827682Z 'l2_groupings': [1], 2026-02-21T18:09:50.3828076Z 'load_eviction_policies': ['', ''], 2026-02-21T18:09:50.3828520Z 'loop_orders': [[1, 0]], 2026-02-21T18:09:50.3828800Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:09:50.3829217Z 'num_sm_multiplier': 16, 2026-02-21T18:09:50.3829640Z 'num_stages': 2, 2026-02-21T18:09:50.3829983Z 'num_warps': 8, 2026-02-21T18:09:50.3830371Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:09:50.3830830Z 'range_flattens': [None, False], 2026-02-21T18:09:50.3831135Z 'range_multi_buffers': [False, True], 2026-02-21T18:09:50.3831447Z 'range_num_stages': [0, 0], 2026-02-21T18:09:50.3831728Z 'range_unroll_factors': [0, 0], 2026-02-21T18:09:50.3832026Z 'range_warp_specializes': [], 2026-02-21T18:09:50.3832296Z 'waves_per_eu': 4} 2026-02-21T18:09:50.4188415Z [203s] Fitting surrogate: 664 points, 664 targets 2026-02-21T18:09:51.0659889Z [204s] Generation 8 starting: 50 neighbors, 3 active search path(s) 2026-02-21T18:09:55.8577142Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 9.2 configs/s 2026-02-21T18:09:58.5011588Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 20.5 configs/s 2026-02-21T18:10:06.1334189Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━━ 996/996 118.0 configs/s 2026-02-21T18:10:07.2705018Z [220s] 
Generation 8 complete: 2026-02-21T18:10:07.2705424Z error=9 2026-02-21T18:10:07.2705633Z ok=44 2026-02-21T18:10:07.2705840Z min=0.2014 2026-02-21T18:10:07.2706046Z mid=0.2682 2026-02-21T18:10:07.2706245Z max=2.3842 2026-02-21T18:10:07.2706471Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:10:07.2706847Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:10:07.2707205Z 'l2_groupings': [1], 2026-02-21T18:10:07.2707481Z 'load_eviction_policies': ['', ''], 2026-02-21T18:10:07.2707821Z 'loop_orders': [[1, 0]], 2026-02-21T18:10:07.2708098Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:10:07.2708487Z 'num_sm_multiplier': 16, 2026-02-21T18:10:07.2708747Z 'num_stages': 2, 2026-02-21T18:10:07.2708996Z 'num_warps': 8, 2026-02-21T18:10:07.2709622Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:10:07.2709970Z 'range_flattens': [None, False], 2026-02-21T18:10:07.2710275Z 'range_multi_buffers': [None, True], 2026-02-21T18:10:07.2710586Z 'range_num_stages': [0, 0], 2026-02-21T18:10:07.2710866Z 'range_unroll_factors': [0, 0], 2026-02-21T18:10:07.2711171Z 'range_warp_specializes': [], 2026-02-21T18:10:07.2711445Z 'waves_per_eu': 4} 2026-02-21T18:10:07.3204120Z [220s] Fitting surrogate: 717 points, 717 targets 2026-02-21T18:10:07.8206080Z [221s] Generation 9 starting: 35 neighbors, 2 active search path(s) 2026-02-21T18:10:12.2105852Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 15.3 configs/s 2026-02-21T18:10:14.1480252Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 35/35 18.5 configs/s 2026-02-21T18:10:18.7608553Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━━ 998/998 186.2 configs/s 2026-02-21T18:10:19.8462912Z [233s] Generation 9 complete: 2026-02-21T18:10:19.8463288Z error=5 2026-02-21T18:10:19.8463528Z ok=32 2026-02-21T18:10:19.8464012Z min=0.1957 2026-02-21T18:10:19.8464245Z mid=0.2642 2026-02-21T18:10:19.8464442Z max=4.1546 2026-02-21T18:10:19.8464672Z best={'block_sizes': [128, 128, 64], 
2026-02-21T18:10:19.8465046Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:10:19.8465417Z 'l2_groupings': [1], 2026-02-21T18:10:19.8465697Z 'load_eviction_policies': ['', ''], 2026-02-21T18:10:19.8466001Z 'loop_orders': [[1, 0]], 2026-02-21T18:10:19.8466285Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:10:19.8466563Z 'num_sm_multiplier': 16, 2026-02-21T18:10:19.8466830Z 'num_stages': 2, 2026-02-21T18:10:19.8467063Z 'num_warps': 8, 2026-02-21T18:10:19.8467326Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:10:19.8467659Z 'range_flattens': [None, False], 2026-02-21T18:10:19.8467969Z 'range_multi_buffers': [None, True], 2026-02-21T18:10:19.8468386Z 'range_num_stages': [0, 0], 2026-02-21T18:10:19.8468677Z 'range_unroll_factors': [0, 0], 2026-02-21T18:10:19.8468978Z 'range_warp_specializes': [], 2026-02-21T18:10:19.8469257Z 'waves_per_eu': 4} 2026-02-21T18:10:19.8775904Z [233s] Fitting surrogate: 754 points, 754 targets 2026-02-21T18:10:20.1970322Z [233s] Generation 10 starting: 17 neighbors, 1 active search path(s) 2026-02-21T18:10:22.4087419Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 8.8 configs/s 2026-02-21T18:10:23.3704135Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 21.1 configs/s 2026-02-21T18:10:23.5029411Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 8121.4 2026-02-21T18:10:23.5029978Z configs/s 2026-02-21T18:10:24.5851913Z [237s] Generation 10 complete: 2026-02-21T18:10:24.5852566Z error=3 2026-02-21T18:10:24.5852644Z ok=16 2026-02-21T18:10:24.5852723Z min=0.1941 2026-02-21T18:10:24.5852799Z mid=0.3925 2026-02-21T18:10:24.5852879Z max=2.9313 2026-02-21T18:10:24.5853063Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:10:24.5853213Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:10:24.5853368Z 'l2_groupings': [1], 2026-02-21T18:10:24.5853468Z 'load_eviction_policies': ['', ''], 2026-02-21T18:10:24.5853585Z 'loop_orders': [[1, 0]], 
2026-02-21T18:10:24.5853690Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:10:24.5853795Z 'num_sm_multiplier': 16, 2026-02-21T18:10:24.5853890Z 'num_stages': 2, 2026-02-21T18:10:24.5853978Z 'num_warps': 8, 2026-02-21T18:10:24.5854074Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:10:24.5854196Z 'range_flattens': [None, False], 2026-02-21T18:10:24.5854310Z 'range_multi_buffers': [None, True], 2026-02-21T18:10:24.5854422Z 'range_num_stages': [0, 0], 2026-02-21T18:10:24.5854526Z 'range_unroll_factors': [0, 0], 2026-02-21T18:10:24.5854638Z 'range_warp_specializes': [], 2026-02-21T18:10:24.5854742Z 'waves_per_eu': 4} 2026-02-21T18:10:24.5907755Z [237s] Fitting surrogate: 773 points, 773 targets 2026-02-21T18:10:24.8587369Z [238s] Generation 11 starting: 14 neighbors, 1 active search path(s) 2026-02-21T18:10:26.6548041Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 14/14 13.2 configs/s 2026-02-21T18:10:27.5032491Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 14/14 20.2 configs/s 2026-02-21T18:10:30.2562567Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 366.3 2026-02-21T18:10:30.2563137Z configs/s 2026-02-21T18:10:30.6607478Z [243s] Generation 11 complete: 2026-02-21T18:10:30.6607832Z error=2 2026-02-21T18:10:30.6608049Z ok=14 2026-02-21T18:10:30.6608255Z min=0.2022 2026-02-21T18:10:30.6608462Z mid=0.2680 2026-02-21T18:10:30.6608666Z max=0.3178 2026-02-21T18:10:30.6608895Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:10:30.6609298Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:10:30.6609659Z 'l2_groupings': [1], 2026-02-21T18:10:30.6609936Z 'load_eviction_policies': ['', ''], 2026-02-21T18:10:30.6610257Z 'loop_orders': [[1, 0]], 2026-02-21T18:10:30.6610554Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:10:30.6611239Z 'num_sm_multiplier': 16, 2026-02-21T18:10:30.6611504Z 'num_stages': 2, 2026-02-21T18:10:30.6611735Z 'num_warps': 8, 2026-02-21T18:10:30.6611995Z 'pid_type': 'persistent_interleaved', 
2026-02-21T18:10:30.6612325Z 'range_flattens': [None, False], 2026-02-21T18:10:30.6612628Z 'range_multi_buffers': [None, True], 2026-02-21T18:10:30.6612938Z 'range_num_stages': [0, 0], 2026-02-21T18:10:30.6613218Z 'range_unroll_factors': [0, 0], 2026-02-21T18:10:30.6613519Z 'range_warp_specializes': [], 2026-02-21T18:10:30.6613795Z 'waves_per_eu': 4} 2026-02-21T18:10:30.6770137Z [243s] Fitting surrogate: 789 points, 789 targets 2026-02-21T18:10:30.8072485Z [244s] Autotuning complete in 244.1s after searching 760 configs. 2026-02-21T18:10:30.8073248Z One can hardcode the best config and skip autotuning with: 2026-02-21T18:10:30.8075322Z @helion.kernel(config=helion.Config(block_sizes=[128, 128, 64], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=2, num_warps=8, pid_type='persistent_interleaved', range_flattens=[None, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T18:10:30.8077192Z 2026-02-21T18:10:30.8077652Z [244s] Code of selected kernel: /tmp/torchinductor_root/nf/cnf2errvfkgmm4qon3bsi7zbpexxo56hleg2huelljzh2dbzea7e.py 2026-02-21T18:11:00.4104633Z WARNING:tritonbench.utils.triton_op:Completed input ID 11: 2026-02-21T18:11:00.4105128Z (M, N, K) 2026-02-21T18:11:00.4105356Z ------------------- 2026-02-21T18:11:00.4105643Z (2048, 12288, 2048) 2026-02-21T18:11:00.4105814Z 2026-02-21T18:11:00.4106394Z 100%|██████████| 8/8 [30:01<00:00, 245.00s/it] 2026-02-21T18:11:00.4106809Z 100%|██████████| 8/8 [30:01<00:00, 225.14s/it] 2026-02-21T18:11:00.4135807Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmpto3wlmji.csv 2026-02-21T18:11:00.4161150Z (M, N, K) triton_tutorial_matmul-speedup triton_tutorial_matmul-accuracy pt2_triton_matmul-speedup pt2_triton_matmul-accuracy 
helion_matmul_tritonbench-speedup helion_matmul_tritonbench-accuracy 2026-02-21T18:11:00.4163930Z ------------------- -------------------------------- --------------------------------- --------------------------- ---------------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------------------------------ 2026-02-21T18:11:00.4166320Z (4096, 1024, 1024) 0.653885 1 0.796155 1 0.8199462915709959 1 2026-02-21T18:11:00.4168089Z (4096, 2048, 2048) 0.6217 1 0.93042 1 Error from Triton code: 2026-02-21T18:11:00.4169677Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:11:00.4170583Z 2026-02-21T18:11:00.4171193Z Error running generated Triton program: 2026-02-21T18:11:00.4172418Z SystemError: returned a result with an exception set 2026-02-21T18:11:00.4174191Z @helion.kernel(config=helion.Config(block_sizes=[2, 1, 128], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T18:11:00.4175957Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 
2026-02-21T18:11:00.4177246Z (2048, 4096, 2048) 0.638452 1 0.94446 1 0.9620478578550535 1 2026-02-21T18:11:00.4178459Z (1024, 8192, 1024) 0.7364 1 0.966899 1 Error from Triton code: 2026-02-21T18:11:00.4179686Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:11:00.4180291Z 2026-02-21T18:11:00.4180704Z Error running generated Triton program: 2026-02-21T18:11:00.4181619Z SystemError: returned a result with an exception set 2026-02-21T18:11:00.4182949Z @helion.kernel(config=helion.Config(block_sizes=[1, 1, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T18:11:00.4184255Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 2026-02-21T18:11:00.4185183Z (8192, 2048, 2048) 0.600071 1 0.893361 1 Error from Triton code: 2026-02-21T18:11:00.4186083Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T18:11:00.4186561Z 2026-02-21T18:11:00.4186953Z Error running generated Triton program: 2026-02-21T18:11:00.4187862Z SystemError: returned a result with an exception set 2026-02-21T18:11:00.4189221Z @helion.kernel(config=helion.Config(block_sizes=[2, 1, 128], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=1, num_warps=16, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T18:11:00.4190562Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 2026-02-21T18:11:00.4191507Z (12288, 1024, 1024) 0.636476 1 0.875682 1 0.8800398842877998 1 2026-02-21T18:11:00.4192236Z (1024, 12288, 1024) 0.60107 1 0.805678 1 Error from Triton code: 2026-02-21T18:11:00.4192990Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:11:00.4193370Z 2026-02-21T18:11:00.4193692Z Error running generated Triton program: 2026-02-21T18:11:00.4194428Z SystemError: returned a result with an exception set 2026-02-21T18:11:00.4195487Z @helion.kernel(config=helion.Config(block_sizes=[1, 1, 256], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[16], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_stages=3, num_warps=8, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T18:11:00.4196535Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 
2026-02-21T18:11:00.4197304Z (2048, 12288, 2048) 0.606168 1 0.806661 1 0.7922404289675028 1 2026-02-21T18:11:00.4197761Z ERROR:__main__:failed to process results 2026-02-21T18:11:00.4197896Z Traceback (most recent call last): 2026-02-21T18:11:00.4198173Z File "/__w/helion/helion/benchmarks/run.py", line 1312, in run_kernel_variants 2026-02-21T18:11:00.4198363Z process_result( 2026-02-21T18:11:00.4198519Z File "/__w/helion/helion/benchmarks/run.py", line 1380, in process_result 2026-02-21T18:11:00.4198739Z metrics[active_metrics[kernel_name][name]].append(float(item)) 2026-02-21T18:11:00.4198911Z ^^^^^^^^^^^ 2026-02-21T18:11:00.4199102Z ValueError: could not convert string to float: '"Error from Triton code:' 2026-02-21T18:11:02.5188618Z average 0.636778 1 0.877414 1 0.431784307835169 0.5 2026-02-21T18:13:17.8109606Z Applying custom args for gemm: {'num_inputs': 8, 'non_square': '', 'rep': '3000'} 2026-02-21T18:13:17.8511349Z Running gemm benchmark with Helion implementation... 2026-02-21T18:13:17.8511664Z 2026-02-21T18:13:18.0150432Z Equally-spaced-k mode: Selected 8 equally spaced inputs (total available: 12) 2026-02-21T18:13:18.0151137Z WARNING:tritonbench.utils.triton_op:Input IDs to run: [0, 2, 3, 5, 6, 8, 9, 11] 2026-02-21T18:13:18.0157456Z 2026-02-21T18:13:18.0165641Z 0%| | 0/8 [00:00(Val) && "cast() argument of incompatible type!"' failed. 
2026-02-21T18:15:37.9336678Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:15:37.9337535Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:15:37.9338329Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T18:15:37.9339107Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:15:37.9340309Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:15:37.9341233Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:15:37.9341978Z #blocked6 = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:15:37.9342671Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:15:37.9343625Z tt.func public @_helion_matmul(%arg0: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg1: !tt.ptr<f16> {tt.divisibility = 16 : i32}, %arg2: !tt.ptr<f16> {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:15:37.9344323Z %c32_i32 = arith.constant 32 : i32 2026-02-21T18:15:37.9344584Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T18:15:37.9344838Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:15:37.9345153Z %cst = arith.constant dense<1024> : tensor<1x256xi32, #blocked> 2026-02-21T18:15:37.9345525Z %cst_0 = arith.constant dense<1024> : tensor<8x1xi32, #blocked1> 2026-02-21T18:15:37.9345920Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x256xf32, #blocked> 2026-02-21T18:15:37.9346260Z %c256_i32 = arith.constant 256 : i32 2026-02-21T18:15:37.9346502Z %c8_i32 = arith.constant 8 : i32 
2026-02-21T18:15:37.9346734Z %c512_i32 = arith.constant 512 : i32 2026-02-21T18:15:37.9346971Z %0 = tt.get_program_id x : i32 2026-02-21T18:15:37.9347208Z %1 = arith.divsi %0, %c32_i32 : i32 2026-02-21T18:15:37.9347445Z %2 = arith.muli %1, %c8_i32 : i32 2026-02-21T18:15:37.9347678Z %3 = arith.subi %c512_i32, %2 : i32 2026-02-21T18:15:37.9347911Z %4 = arith.minsi %3, %c8_i32 : i32 2026-02-21T18:15:37.9348253Z %5 = arith.remsi %0, %c32_i32 : i32 2026-02-21T18:15:37.9348483Z %6 = arith.remsi %5, %4 : i32 2026-02-21T18:15:37.9348711Z %7 = arith.addi %2, %6 : i32 2026-02-21T18:15:37.9348924Z %8 = arith.divsi %5, %4 : i32 2026-02-21T18:15:37.9349146Z %9 = arith.muli %7, %c8_i32 : i32 2026-02-21T18:15:37.9349475Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked2> 2026-02-21T18:15:37.9349978Z %11 = tt.splat %9 : i32 -> tensor<8xi32, #blocked2> 2026-02-21T18:15:37.9350354Z %12 = arith.addi %11, %10 : tensor<8xi32, #blocked2> 2026-02-21T18:15:37.9350624Z %13 = arith.muli %8, %c256_i32 : i32 2026-02-21T18:15:37.9350954Z %14 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked2> 2026-02-21T18:15:37.9351343Z %15 = tt.splat %13 : i32 -> tensor<256xi32, #blocked2> 2026-02-21T18:15:37.9351651Z %16 = arith.addi %15, %14 : tensor<256xi32, #blocked2> 2026-02-21T18:15:37.9351971Z %17 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked2> 2026-02-21T18:15:37.9352361Z %18 = ttg.convert_layout %12 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:15:37.9352847Z %19 = tt.expand_dims %18 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<8x1xi32, #blocked3> 2026-02-21T18:15:37.9353264Z %20 = ttg.convert_layout %19 : tensor<8x1xi32, #blocked3> -> tensor<8x1xi32, #blocked1> 2026-02-21T18:15:37.9353563Z %21 = arith.muli %20, %cst_0 : tensor<8x1xi32, #blocked1> 2026-02-21T18:15:37.9353838Z %22 = tt.broadcast %21 : tensor<8x1xi32, 
#blocked1> -> tensor<8x32xi32, #blocked1> 2026-02-21T18:15:37.9354170Z %23 = ttg.convert_layout %22 : tensor<8x32xi32, #blocked1> -> tensor<8x32xi32, #blocked4> 2026-02-21T18:15:37.9354498Z %24 = tt.splat %arg0 : !tt.ptr<f16> -> tensor<8x32x!tt.ptr<f16>, #blocked4> 2026-02-21T18:15:37.9354880Z %25 = ttg.convert_layout %16 : tensor<256xi32, #blocked2> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:15:37.9355367Z %26 = tt.expand_dims %25 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T18:15:37.9355851Z %27 = ttg.convert_layout %26 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked> 2026-02-21T18:15:37.9356176Z %28 = arith.muli %27, %cst : tensor<1x256xi32, #blocked> 2026-02-21T18:15:37.9356453Z %29 = tt.broadcast %28 : tensor<1x256xi32, #blocked> -> tensor<32x256xi32, #blocked> 2026-02-21T18:15:37.9356763Z %30 = tt.splat %arg1 : !tt.ptr<f16> -> tensor<32x256x!tt.ptr<f16>, #blocked> 2026-02-21T18:15:37.9357151Z %31 = scf.for %arg3 = %c0_i32 to %c1024_i32 step %c32_i32 iter_args(%arg4 = %cst_1) -> (tensor<8x256xf32, #blocked>) : i32 { 2026-02-21T18:15:37.9357505Z %46 = tt.splat %arg3 : i32 -> tensor<32xi32, #blocked2> 2026-02-21T18:15:37.9357728Z %47 = arith.addi %46, %17 : tensor<32xi32, #blocked2> 2026-02-21T18:15:37.9358071Z %48 = ttg.convert_layout %47 : tensor<32xi32, #blocked2> -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:15:37.9358554Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x32xi32, #blocked5> 2026-02-21T18:15:37.9358973Z %50 = ttg.convert_layout %49 : tensor<1x32xi32, #blocked5> -> tensor<1x32xi32, #blocked4> 2026-02-21T18:15:37.9359314Z %51 = tt.broadcast %50 : tensor<1x32xi32, #blocked4> -> tensor<8x32xi32, #blocked4> 2026-02-21T18:15:37.9359591Z %52 = arith.addi %23, %51 : tensor<8x32xi32, #blocked4> 2026-02-21T18:15:37.9359876Z %53 = tt.addptr %24, %52 : 
tensor<8x32x!tt.ptr<f16>, #blocked4>, tensor<8x32xi32, #blocked4> 2026-02-21T18:15:37.9360167Z %54 = tt.load %53 : tensor<8x32x!tt.ptr<f16>, #blocked4> 2026-02-21T18:15:37.9360512Z %55 = ttg.convert_layout %47 : tensor<32xi32, #blocked2> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:15:37.9360987Z %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<32x1xi32, #blocked3> 2026-02-21T18:15:37.9361413Z %57 = ttg.convert_layout %56 : tensor<32x1xi32, #blocked3> -> tensor<32x1xi32, #blocked1> 2026-02-21T18:15:37.9361756Z %58 = tt.broadcast %57 : tensor<32x1xi32, #blocked1> -> tensor<32x256xi32, #blocked1> 2026-02-21T18:15:37.9362065Z %59 = ttg.convert_layout %58 : tensor<32x256xi32, #blocked1> -> tensor<32x256xi32, #blocked> 2026-02-21T18:15:37.9362315Z %60 = arith.addi %59, %29 : tensor<32x256xi32, #blocked> 2026-02-21T18:15:37.9362540Z %61 = tt.addptr %30, %60 : tensor<32x256x!tt.ptr<f16>, #blocked>, tensor<32x256xi32, #blocked> 2026-02-21T18:15:37.9362770Z %62 = tt.load %61 : tensor<32x256x!tt.ptr<f16>, #blocked> 2026-02-21T18:15:37.9363053Z %63 = ttg.convert_layout %54 : tensor<8x32xf16, #blocked4> -> tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> 2026-02-21T18:15:37.9363446Z %64 = ttg.convert_layout %62 : tensor<32x256xf16, #blocked> -> tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> 2026-02-21T18:15:37.9363787Z %65 = ttg.convert_layout %arg4 : tensor<8x256xf32, #blocked> -> tensor<8x256xf32, #blocked6> 2026-02-21T18:15:37.9364251Z %66 = tt.dot %63, %64, %65, inputPrecision = tf32 : tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<8x256xf32, #blocked6> 2026-02-21T18:15:37.9364699Z %67 = ttg.convert_layout %66 : tensor<8x256xf32, #blocked6> -> tensor<8x256xf32, #blocked> 2026-02-21T18:15:37.9364920Z scf.yield %67 : tensor<8x256xf32, #blocked> 2026-02-21T18:15:37.9365068Z } 
{tt.disallow_acc_multi_buffer} 2026-02-21T18:15:37.9365254Z %32 = arith.truncf %31 : tensor<8x256xf32, #blocked> to tensor<8x256xf16, #blocked> 2026-02-21T18:15:37.9365559Z %33 = ttg.convert_layout %12 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:15:37.9365921Z %34 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<8x1xi32, #blocked3> 2026-02-21T18:15:37.9366264Z %35 = ttg.convert_layout %34 : tensor<8x1xi32, #blocked3> -> tensor<8x1xi32, #blocked1> 2026-02-21T18:15:37.9366509Z %36 = arith.muli %35, %cst_0 : tensor<8x1xi32, #blocked1> 2026-02-21T18:15:37.9366777Z %37 = ttg.convert_layout %16 : tensor<256xi32, #blocked2> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:15:37.9367151Z %38 = tt.expand_dims %37 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T18:15:37.9367478Z %39 = ttg.convert_layout %38 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked> 2026-02-21T18:15:37.9367736Z %40 = tt.broadcast %36 : tensor<8x1xi32, #blocked1> -> tensor<8x256xi32, #blocked1> 2026-02-21T18:15:37.9367995Z %41 = ttg.convert_layout %40 : tensor<8x256xi32, #blocked1> -> tensor<8x256xi32, #blocked> 2026-02-21T18:15:37.9368249Z %42 = tt.broadcast %39 : tensor<1x256xi32, #blocked> -> tensor<8x256xi32, #blocked> 2026-02-21T18:15:37.9368467Z %43 = arith.addi %41, %42 : tensor<8x256xi32, #blocked> 2026-02-21T18:15:37.9368663Z %44 = tt.splat %arg2 : !tt.ptr<f16> -> tensor<8x256x!tt.ptr<f16>, #blocked> 2026-02-21T18:15:37.9368914Z %45 = tt.addptr %44, %43 : tensor<8x256x!tt.ptr<f16>, #blocked>, tensor<8x256xi32, #blocked> 2026-02-21T18:15:37.9369143Z tt.store %45, %32 : tensor<8x256x!tt.ptr<f16>, #blocked> 2026-02-21T18:15:37.9369295Z tt.return 2026-02-21T18:15:37.9369390Z } 2026-02-21T18:15:37.9369478Z } 2026-02-21T18:15:37.9369528Z 2026-02-21T18:15:37.9369566Z {-# 
2026-02-21T18:15:37.9369657Z external_resources: { 2026-02-21T18:15:37.9369773Z mlir_reproducer: { 2026-02-21T18:15:37.9372348Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=2 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=2}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:15:37.9374674Z disable_threading: false, 2026-02-21T18:15:37.9374782Z verify_each: true 2026-02-21T18:15:37.9374870Z } 2026-02-21T18:15:37.9374946Z } 2026-02-21T18:15:37.9375016Z #-} 2026-02-21T18:15:37.9375295Z /tmp/torchinductor_root/db/cdb52usfphr4hmepkni4ahg2zly53yrvhn3zq7dwm33fko73ctxd.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:15:37.9375973Z /tmp/torchinductor_root/db/cdb52usfphr4hmepkni4ahg2zly53yrvhn3zq7dwm33fko73ctxd.py:13:0: note: 
Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:15:37.9376529Z [10s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:15:37.9377277Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 256, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T18:15:37.9377959Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:15:37.9378127Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:15:40.0918876Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━ 100/100 13.9 configs/s 2026-02-21T18:15:40.0925803Z [12s] Adaptive compile timeout: 30s (90% percentile=2.6s, bounds=[30.0s, 60s]) 2026-02-21T18:15:40.7251041Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 1142.0 configs/s 2026-02-21T18:15:41.1220048Z [13s] Initial random population of 100, 5 starting points: 2026-02-21T18:15:41.1220487Z error=12 2026-02-21T18:15:41.1220704Z ok=88 2026-02-21T18:15:41.1220927Z min=0.0521 2026-02-21T18:15:41.1221136Z mid=0.4802 2026-02-21T18:15:41.1221336Z max=75.2547 2026-02-21T18:15:41.1221574Z best={'block_sizes': [16, 128, 64], 2026-02-21T18:15:41.1221945Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:15:41.1222312Z 'l2_groupings': [16], 2026-02-21T18:15:41.1222600Z 'load_eviction_policies': ['', ''], 2026-02-21T18:15:41.1222912Z 'loop_orders': [[1, 0]], 2026-02-21T18:15:41.1223189Z 'matrix_instr_nonkdim': 0, 2026-02-21T18:15:41.1223455Z 'num_stages': 1, 
2026-02-21T18:15:41.1223687Z 'num_warps': 2, 2026-02-21T18:15:41.1223919Z 'pid_type': 'flat', 2026-02-21T18:15:41.1224183Z 'range_flattens': [None, True], 2026-02-21T18:15:41.1224488Z 'range_multi_buffers': [None, False], 2026-02-21T18:15:41.1224806Z 'range_num_stages': [0, 0], 2026-02-21T18:15:41.1225088Z 'range_unroll_factors': [0, 0], 2026-02-21T18:15:41.1225387Z 'range_warp_specializes': [], 2026-02-21T18:15:41.1225668Z 'waves_per_eu': 4} 2026-02-21T18:15:41.1232434Z [13s] Fitting surrogate: 100 points, 100 targets 2026-02-21T18:15:41.9941842Z [14s] Generation 1 starting: 85 neighbors, 5 active search path(s) 2026-02-21T18:15:45.7665637Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 26.6 configs/s 2026-02-21T18:15:49.9525255Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 21.6 configs/s 2026-02-21T18:15:50.9898593Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 832.7 2026-02-21T18:15:50.9899156Z configs/s 2026-02-21T18:15:51.3620094Z [24s] Generation 1 complete: 2026-02-21T18:15:51.3620471Z error=20 2026-02-21T18:15:51.3620660Z ok=71 2026-02-21T18:15:51.3620848Z min=0.0330 2026-02-21T18:15:51.3621034Z mid=0.0632 2026-02-21T18:15:51.3621221Z max=1.3727 2026-02-21T18:15:51.3621423Z best={'block_sizes': [256, 64, 64], 2026-02-21T18:15:51.3621762Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:15:51.3622089Z 'l2_groupings': [2], 2026-02-21T18:15:51.3622356Z 'load_eviction_policies': ['', ''], 2026-02-21T18:15:51.3622632Z 'loop_orders': [[0, 1]], 2026-02-21T18:15:51.3622882Z 'matrix_instr_nonkdim': 32, 2026-02-21T18:15:51.3623125Z 'num_stages': 2, 2026-02-21T18:15:51.3623341Z 'num_warps': 16, 2026-02-21T18:15:51.3623564Z 'pid_type': 'flat', 2026-02-21T18:15:51.3623814Z 'range_flattens': [None, None], 2026-02-21T18:15:51.3624087Z 'range_multi_buffers': [None, None], 2026-02-21T18:15:51.3624366Z 'range_num_stages': [0, 0], 2026-02-21T18:15:51.3624611Z 'range_unroll_factors': [0, 0], 
2026-02-21T18:15:51.3624878Z 'range_warp_specializes': [], 2026-02-21T18:15:51.3625122Z 'waves_per_eu': 1} 2026-02-21T18:15:51.3644044Z [24s] Fitting surrogate: 191 points, 191 targets 2026-02-21T18:15:52.4719419Z [25s] Generation 2 starting: 83 neighbors, 5 active search path(s) 2026-02-21T18:15:57.1884288Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 29.0 configs/s 2026-02-21T18:16:01.5712011Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 19.9 configs/s 2026-02-21T18:16:04.5092452Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 326.7 2026-02-21T18:16:04.5093680Z configs/s 2026-02-21T18:16:05.0731341Z [37s] Generation 2 complete: 2026-02-21T18:16:05.0731585Z error=14 2026-02-21T18:16:05.0731751Z ok=74 2026-02-21T18:16:05.0731887Z min=0.0322 2026-02-21T18:16:05.0732020Z mid=0.0488 2026-02-21T18:16:05.0732148Z max=0.8380 2026-02-21T18:16:05.0732297Z best={'block_sizes': [128, 64, 64], 2026-02-21T18:16:05.0732537Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:16:05.0732775Z 'l2_groupings': [2], 2026-02-21T18:16:05.0732946Z 'load_eviction_policies': ['', ''], 2026-02-21T18:16:05.0733139Z 'loop_orders': [[0, 1]], 2026-02-21T18:16:05.0733318Z 'matrix_instr_nonkdim': 32, 2026-02-21T18:16:05.0733495Z 'num_sm_multiplier': 16, 2026-02-21T18:16:05.0733662Z 'num_stages': 2, 2026-02-21T18:16:05.0733814Z 'num_warps': 4, 2026-02-21T18:16:05.0733981Z 'pid_type': 'persistent_blocked', 2026-02-21T18:16:05.0734177Z 'range_flattens': [None, None], 2026-02-21T18:16:05.0734368Z 'range_multi_buffers': [False, None], 2026-02-21T18:16:05.0734572Z 'range_num_stages': [0, 0], 2026-02-21T18:16:05.0734982Z 'range_unroll_factors': [0, 0], 2026-02-21T18:16:05.0735179Z 'range_warp_specializes': [], 2026-02-21T18:16:05.0735349Z 'waves_per_eu': 1} 2026-02-21T18:16:05.0795763Z [37s] Fitting surrogate: 279 points, 279 targets 2026-02-21T18:16:06.4723987Z [39s] Generation 3 starting: 90 neighbors, 5 active search path(s) 
2026-02-21T18:16:12.4592023Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 92/92 8.3 configs/s 2026-02-21T18:16:17.9883836Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 92/92 17.1 configs/s 2026-02-21T18:16:21.3689863Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 286.6 2026-02-21T18:16:21.3690485Z configs/s 2026-02-21T18:16:21.9184094Z [54s] Generation 3 complete: 2026-02-21T18:16:21.9184453Z error=4 2026-02-21T18:16:21.9184666Z ok=92 2026-02-21T18:16:21.9184876Z min=0.0272 2026-02-21T18:16:21.9185105Z mid=0.0435 2026-02-21T18:16:21.9185312Z max=0.6143 2026-02-21T18:16:21.9185835Z best={'block_sizes': [64, 128, 64], 2026-02-21T18:16:21.9186215Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:16:21.9186591Z 'l2_groupings': [16], 2026-02-21T18:16:21.9186799Z 'load_eviction_policies': ['', ''], 2026-02-21T18:16:21.9187113Z 'loop_orders': [[0, 1]], 2026-02-21T18:16:21.9187394Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:16:21.9187682Z 'num_sm_multiplier': 16, 2026-02-21T18:16:21.9187943Z 'num_stages': 2, 2026-02-21T18:16:21.9188284Z 'num_warps': 8, 2026-02-21T18:16:21.9188547Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:16:21.9188883Z 'range_flattens': [False, None], 2026-02-21T18:16:21.9189194Z 'range_multi_buffers': [None, False], 2026-02-21T18:16:21.9189512Z 'range_num_stages': [0, 0], 2026-02-21T18:16:21.9189803Z 'range_unroll_factors': [0, 0], 2026-02-21T18:16:21.9190103Z 'range_warp_specializes': [], 2026-02-21T18:16:21.9190382Z 'waves_per_eu': 2} 2026-02-21T18:16:22.0205216Z [54s] Fitting surrogate: 375 points, 375 targets 2026-02-21T18:16:23.4020611Z [56s] Generation 4 starting: 84 neighbors, 5 active search path(s) 2026-02-21T18:16:28.6528339Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 18.9 configs/s 2026-02-21T18:16:33.1822574Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 19.7 configs/s 2026-02-21T18:16:36.9408587Z Generation 4: verifying top 
configs 100% ━━━━━━━━━━━━━━ 1000/1000 259.8 2026-02-21T18:16:36.9409020Z configs/s 2026-02-21T18:16:37.5349957Z [70s] Generation 4 complete: 2026-02-21T18:16:37.5350198Z error=15 2026-02-21T18:16:37.5350379Z ok=74 2026-02-21T18:16:37.5350519Z min=0.0267 2026-02-21T18:16:37.5350934Z mid=0.0379 2026-02-21T18:16:37.5351086Z max=0.2618 2026-02-21T18:16:37.5351250Z best={'block_sizes': [64, 128, 64], 2026-02-21T18:16:37.5351514Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:16:37.5351886Z 'l2_groupings': [16], 2026-02-21T18:16:37.5352090Z 'load_eviction_policies': ['', ''], 2026-02-21T18:16:37.5352320Z 'loop_orders': [[0, 1]], 2026-02-21T18:16:37.5352515Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:16:37.5352711Z 'num_sm_multiplier': 16, 2026-02-21T18:16:37.5352889Z 'num_stages': 2, 2026-02-21T18:16:37.5353045Z 'num_warps': 4, 2026-02-21T18:16:37.5353228Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:16:37.5353464Z 'range_flattens': [False, None], 2026-02-21T18:16:37.5353682Z 'range_multi_buffers': [None, False], 2026-02-21T18:16:37.5353897Z 'range_num_stages': [0, 0], 2026-02-21T18:16:37.5354091Z 'range_unroll_factors': [0, 0], 2026-02-21T18:16:37.5354296Z 'range_warp_specializes': [], 2026-02-21T18:16:37.5354484Z 'waves_per_eu': 2} 2026-02-21T18:16:37.6554197Z [70s] Fitting surrogate: 464 points, 464 targets 2026-02-21T18:16:39.0368243Z [71s] Generation 5 starting: 80 neighbors, 5 active search path(s) 2026-02-21T18:16:44.0040734Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81/81 22.3 configs/s 2026-02-21T18:16:48.4604944Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 81/81 18.6 configs/s 2026-02-21T18:16:53.3213250Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 221.7 2026-02-21T18:16:53.3213779Z configs/s 2026-02-21T18:16:53.9435952Z [86s] Generation 5 complete: 2026-02-21T18:16:53.9436299Z error=11 2026-02-21T18:16:53.9436496Z ok=74 2026-02-21T18:16:53.9436695Z min=0.0261 
2026-02-21T18:16:53.9436880Z mid=0.0325 2026-02-21T18:16:53.9437068Z max=0.1178 2026-02-21T18:16:53.9437280Z best={'block_sizes': [128, 64, 64], 2026-02-21T18:16:53.9437628Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:16:53.9437996Z 'l2_groupings': [1], 2026-02-21T18:16:53.9438250Z 'load_eviction_policies': ['', ''], 2026-02-21T18:16:53.9438542Z 'loop_orders': [[0, 1]], 2026-02-21T18:16:53.9438801Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:16:53.9439078Z 'num_sm_multiplier': 32, 2026-02-21T18:16:53.9439334Z 'num_stages': 2, 2026-02-21T18:16:53.9439841Z 'num_warps': 4, 2026-02-21T18:16:53.9440086Z 'pid_type': 'persistent_blocked', 2026-02-21T18:16:53.9440385Z 'range_flattens': [None, None], 2026-02-21T18:16:53.9440666Z 'range_multi_buffers': [False, False], 2026-02-21T18:16:53.9440963Z 'range_num_stages': [0, 0], 2026-02-21T18:16:53.9441228Z 'range_unroll_factors': [0, 0], 2026-02-21T18:16:53.9441507Z 'range_warp_specializes': [], 2026-02-21T18:16:53.9441766Z 'waves_per_eu': 3} 2026-02-21T18:16:54.0720029Z [86s] Fitting surrogate: 549 points, 549 targets 2026-02-21T18:16:55.0859636Z [87s] Generation 6 starting: 87 neighbors, 5 active search path(s) 2026-02-21T18:17:00.2681291Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88/88 16.4 configs/s 2026-02-21T18:17:05.3981715Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 88/88 17.3 configs/s 2026-02-21T18:17:10.0788781Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 230.6 2026-02-21T18:17:10.0789453Z configs/s 2026-02-21T18:17:10.6632343Z [103s] Generation 6 complete: 2026-02-21T18:17:10.6632737Z error=6 2026-02-21T18:17:10.6632934Z ok=86 2026-02-21T18:17:10.6633133Z min=0.0260 2026-02-21T18:17:10.6633376Z mid=0.0329 2026-02-21T18:17:10.6633556Z max=0.3156 2026-02-21T18:17:10.6633766Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:17:10.6634112Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:17:10.6634450Z 'l2_groupings': [16], 
2026-02-21T18:17:10.6634704Z 'load_eviction_policies': ['', ''], 2026-02-21T18:17:10.6634982Z 'loop_orders': [[0, 1]], 2026-02-21T18:17:10.6635239Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:17:10.6635904Z 'num_sm_multiplier': 16, 2026-02-21T18:17:10.6636149Z 'num_stages': 2, 2026-02-21T18:17:10.6636361Z 'num_warps': 16, 2026-02-21T18:17:10.6636588Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:17:10.6637038Z 'range_flattens': [False, None], 2026-02-21T18:17:10.6637332Z 'range_multi_buffers': [None, False], 2026-02-21T18:17:10.6637638Z 'range_num_stages': [0, 0], 2026-02-21T18:17:10.6637892Z 'range_unroll_factors': [0, 0], 2026-02-21T18:17:10.6638157Z 'range_warp_specializes': [], 2026-02-21T18:17:10.6638407Z 'waves_per_eu': 3} 2026-02-21T18:17:10.7845417Z [103s] Fitting surrogate: 641 points, 641 targets 2026-02-21T18:17:11.8272773Z [104s] Generation 7 starting: 90 neighbors, 5 active search path(s) 2026-02-21T18:17:17.1050062Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 21.0 configs/s 2026-02-21T18:17:22.3676290Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 17.4 configs/s 2026-02-21T18:17:26.4759565Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 239.6 2026-02-21T18:17:26.4760237Z configs/s 2026-02-21T18:17:27.0722453Z [119s] Generation 7 complete: 2026-02-21T18:17:27.0722820Z error=7 2026-02-21T18:17:27.0723050Z ok=88 2026-02-21T18:17:27.0723654Z min=0.0254 2026-02-21T18:17:27.0723983Z mid=0.0327 2026-02-21T18:17:27.0724185Z max=0.2299 2026-02-21T18:17:27.0724408Z best={'block_sizes': [128, 64, 64], 2026-02-21T18:17:27.0724785Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:17:27.0725148Z 'l2_groupings': [1], 2026-02-21T18:17:27.0725423Z 'load_eviction_policies': ['', ''], 2026-02-21T18:17:27.0725732Z 'loop_orders': [[0, 1]], 2026-02-21T18:17:27.0726005Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:17:27.0726287Z 'num_sm_multiplier': 32, 2026-02-21T18:17:27.0726548Z 'num_stages': 2, 
2026-02-21T18:17:27.0726778Z 'num_warps': 4, 2026-02-21T18:17:27.0727033Z 'pid_type': 'persistent_blocked', 2026-02-21T18:17:27.0727346Z 'range_flattens': [None, False], 2026-02-21T18:17:27.0727660Z 'range_multi_buffers': [False, False], 2026-02-21T18:17:27.0727976Z 'range_num_stages': [0, 0], 2026-02-21T18:17:27.0728257Z 'range_unroll_factors': [0, 0], 2026-02-21T18:17:27.0728552Z 'range_warp_specializes': [], 2026-02-21T18:17:27.0728832Z 'waves_per_eu': 3} 2026-02-21T18:17:27.2014933Z [120s] Fitting surrogate: 736 points, 736 targets 2026-02-21T18:17:28.5597110Z [121s] Generation 8 starting: 78 neighbors, 4 active search path(s) 2026-02-21T18:17:33.1442655Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79/79 17.5 configs/s 2026-02-21T18:17:37.8781101Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 79/79 17.2 configs/s 2026-02-21T18:17:41.9573135Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 241.0 2026-02-21T18:17:41.9573729Z configs/s 2026-02-21T18:17:42.5723829Z [135s] Generation 8 complete: 2026-02-21T18:17:42.5724492Z error=5 2026-02-21T18:17:42.5724703Z ok=77 2026-02-21T18:17:42.5724905Z min=0.0246 2026-02-21T18:17:42.5725118Z mid=0.0336 2026-02-21T18:17:42.5725319Z max=1.0010 2026-02-21T18:17:42.5725549Z best={'block_sizes': [64, 128, 64], 2026-02-21T18:17:42.5726082Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:17:42.5726467Z 'l2_groupings': [32], 2026-02-21T18:17:42.5726748Z 'load_eviction_policies': ['', ''], 2026-02-21T18:17:42.5727056Z 'loop_orders': [[0, 1]], 2026-02-21T18:17:42.5727338Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:17:42.5727628Z 'num_sm_multiplier': 16, 2026-02-21T18:17:42.5727894Z 'num_stages': 2, 2026-02-21T18:17:42.5728122Z 'num_warps': 4, 2026-02-21T18:17:42.5728380Z 'pid_type': 'persistent_blocked', 2026-02-21T18:17:42.5728695Z 'range_flattens': [None, None], 2026-02-21T18:17:42.5728999Z 'range_multi_buffers': [False, False], 2026-02-21T18:17:42.5729295Z 
'range_num_stages': [0, 0], 2026-02-21T18:17:42.5729546Z 'range_unroll_factors': [0, 0], 2026-02-21T18:17:42.5729811Z 'range_warp_specializes': [], 2026-02-21T18:17:42.5730056Z 'waves_per_eu': 1} 2026-02-21T18:17:42.7041970Z [135s] Fitting surrogate: 818 points, 818 targets 2026-02-21T18:17:43.9856623Z [136s] Generation 9 starting: 68 neighbors, 4 active search path(s) 2026-02-21T18:17:48.1743636Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69/69 39.0 configs/s 2026-02-21T18:17:52.2594889Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 69/69 17.1 configs/s 2026-02-21T18:17:55.6448515Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 289.1 2026-02-21T18:17:55.6449140Z configs/s 2026-02-21T18:17:56.1804977Z [148s] Generation 9 complete: 2026-02-21T18:17:56.1805386Z error=4 2026-02-21T18:17:56.1805596Z ok=68 2026-02-21T18:17:56.1805807Z min=0.0248 2026-02-21T18:17:56.1806014Z mid=0.0303 2026-02-21T18:17:56.1806216Z max=0.1048 2026-02-21T18:17:56.1806442Z best={'block_sizes': [64, 128, 64], 2026-02-21T18:17:56.1806850Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:17:56.1807214Z 'l2_groupings': [32], 2026-02-21T18:17:56.1807498Z 'load_eviction_policies': ['', ''], 2026-02-21T18:17:56.1807821Z 'loop_orders': [[0, 1]], 2026-02-21T18:17:56.1808473Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:17:56.1808993Z 'num_sm_multiplier': 16, 2026-02-21T18:17:56.1809256Z 'num_stages': 2, 2026-02-21T18:17:56.1809489Z 'num_warps': 4, 2026-02-21T18:17:56.1809748Z 'pid_type': 'persistent_blocked', 2026-02-21T18:17:56.1810064Z 'range_flattens': [None, None], 2026-02-21T18:17:56.1810362Z 'range_multi_buffers': [False, False], 2026-02-21T18:17:56.1810679Z 'range_num_stages': [0, 0], 2026-02-21T18:17:56.1810953Z 'range_unroll_factors': [0, 0], 2026-02-21T18:17:56.1811249Z 'range_warp_specializes': [], 2026-02-21T18:17:56.1811519Z 'waves_per_eu': 2} 2026-02-21T18:17:56.2849163Z [149s] Fitting surrogate: 890 points, 890 targets 
2026-02-21T18:17:56.9437470Z [149s] Generation 10 starting: 51 neighbors, 3 active search path(s) 2026-02-21T18:18:00.1969288Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 14.2 configs/s 2026-02-21T18:18:03.4375633Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 51/51 16.5 configs/s 2026-02-21T18:18:05.8860496Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 394.4 2026-02-21T18:18:05.8861053Z configs/s 2026-02-21T18:18:06.3524397Z [159s] Generation 10 complete: 2026-02-21T18:18:06.3524739Z error=8 2026-02-21T18:18:06.3524958Z ok=46 2026-02-21T18:18:06.3525170Z min=0.0247 2026-02-21T18:18:06.3525385Z mid=0.0295 2026-02-21T18:18:06.3525585Z max=0.0928 2026-02-21T18:18:06.3525833Z best={'block_sizes': [64, 128, 64], 2026-02-21T18:18:06.3526207Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:18:06.3526570Z 'l2_groupings': [32], 2026-02-21T18:18:06.3526849Z 'load_eviction_policies': ['', ''], 2026-02-21T18:18:06.3527157Z 'loop_orders': [[0, 1]], 2026-02-21T18:18:06.3527680Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:18:06.3527964Z 'num_sm_multiplier': 16, 2026-02-21T18:18:06.3528233Z 'num_stages': 2, 2026-02-21T18:18:06.3528462Z 'num_warps': 4, 2026-02-21T18:18:06.3528843Z 'pid_type': 'persistent_blocked', 2026-02-21T18:18:06.3529176Z 'range_flattens': [None, None], 2026-02-21T18:18:06.3529483Z 'range_multi_buffers': [False, False], 2026-02-21T18:18:06.3529801Z 'range_num_stages': [0, 0], 2026-02-21T18:18:06.3530080Z 'range_unroll_factors': [0, 0], 2026-02-21T18:18:06.3530379Z 'range_warp_specializes': [], 2026-02-21T18:18:06.3530653Z 'waves_per_eu': 2} 2026-02-21T18:18:06.4230931Z [159s] Fitting surrogate: 944 points, 944 targets 2026-02-21T18:18:07.0414345Z [159s] Generation 11 starting: 45 neighbors, 3 active search path(s) 2026-02-21T18:18:09.8632673Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 45/45 24.5 configs/s 2026-02-21T18:18:12.5501554Z Generation 11: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━ 45/45 17.7 configs/s 2026-02-21T18:18:14.8515055Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 419.5 2026-02-21T18:18:14.8515692Z configs/s 2026-02-21T18:18:15.2999416Z [168s] Generation 11 complete: 2026-02-21T18:18:15.2999784Z error=4 2026-02-21T18:18:15.2999992Z ok=44 2026-02-21T18:18:15.3000196Z min=0.0247 2026-02-21T18:18:15.3000406Z mid=0.0285 2026-02-21T18:18:15.3000607Z max=0.0530 2026-02-21T18:18:15.3000839Z best={'block_sizes': [64, 128, 64], 2026-02-21T18:18:15.3001210Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:18:15.3001574Z 'l2_groupings': [32], 2026-02-21T18:18:15.3001854Z 'load_eviction_policies': ['', ''], 2026-02-21T18:18:15.3002166Z 'loop_orders': [[0, 1]], 2026-02-21T18:18:15.3002468Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:18:15.3002753Z 'num_sm_multiplier': 16, 2026-02-21T18:18:15.3003015Z 'num_stages': 2, 2026-02-21T18:18:15.3003250Z 'num_warps': 4, 2026-02-21T18:18:15.3003520Z 'pid_type': 'persistent_blocked', 2026-02-21T18:18:15.3003831Z 'range_flattens': [None, None], 2026-02-21T18:18:15.3004142Z 'range_multi_buffers': [False, False], 2026-02-21T18:18:15.3004476Z 'range_num_stages': [0, 0], 2026-02-21T18:18:15.3004756Z 'range_unroll_factors': [0, 0], 2026-02-21T18:18:15.3005150Z 'range_warp_specializes': [], 2026-02-21T18:18:15.3005420Z 'waves_per_eu': 2} 2026-02-21T18:18:15.3667119Z [168s] Fitting surrogate: 992 points, 992 targets 2026-02-21T18:18:15.8231157Z [168s] Generation 12 starting: 30 neighbors, 2 active search path(s) 2026-02-21T18:18:18.4726512Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 14.1 configs/s 2026-02-21T18:18:20.0066153Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 30/30 20.3 configs/s 2026-02-21T18:18:21.3345461Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 704.9 2026-02-21T18:18:21.3346033Z configs/s 2026-02-21T18:18:21.6984582Z [174s] Generation 12 complete: 2026-02-21T18:18:21.6984939Z error=7 
2026-02-21T18:18:21.6985192Z ok=26 2026-02-21T18:18:21.6985398Z min=0.0242 2026-02-21T18:18:21.6985618Z mid=0.0299 2026-02-21T18:18:21.6985838Z max=0.0487 2026-02-21T18:18:21.6986087Z best={'block_sizes': [128, 128, 128], 2026-02-21T18:18:21.6986482Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T18:18:21.6986836Z 'l2_groupings': [32], 2026-02-21T18:18:21.6987125Z 'load_eviction_policies': ['', ''], 2026-02-21T18:18:21.6987437Z 'loop_orders': [[0, 1]], 2026-02-21T18:18:21.6987718Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:18:21.6988008Z 'num_sm_multiplier': 16, 2026-02-21T18:18:21.6988365Z 'num_stages': 2, 2026-02-21T18:18:21.6988604Z 'num_warps': 16, 2026-02-21T18:18:21.6988862Z 'pid_type': 'persistent_blocked', 2026-02-21T18:18:21.6989178Z 'range_flattens': [None, None], 2026-02-21T18:18:21.6989487Z 'range_multi_buffers': [True, True], 2026-02-21T18:18:21.6989799Z 'range_num_stages': [0, 0], 2026-02-21T18:18:21.6990380Z 'range_unroll_factors': [0, 0], 2026-02-21T18:18:21.6990673Z 'range_warp_specializes': [], 2026-02-21T18:18:21.6990964Z 'waves_per_eu': 1} 2026-02-21T18:18:21.7325426Z [174s] Fitting surrogate: 1025 points, 1025 targets 2026-02-21T18:18:22.2312117Z [175s] Generation 13 starting: 36 neighbors, 2 active search path(s) 2026-02-21T18:18:24.6796068Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 35.5 configs/s 2026-02-21T18:18:26.6894167Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 36/36 19.4 configs/s 2026-02-21T18:18:28.7498233Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 466.0 2026-02-21T18:18:28.7498813Z configs/s 2026-02-21T18:18:29.1698531Z [181s] Generation 13 complete: 2026-02-21T18:18:29.1698872Z error=6 2026-02-21T18:18:29.1699100Z ok=32 2026-02-21T18:18:29.1699300Z min=0.0238 2026-02-21T18:18:29.1699513Z mid=0.0259 2026-02-21T18:18:29.1699741Z max=0.0392 2026-02-21T18:18:29.1699973Z best={'block_sizes': [128, 128, 128], 2026-02-21T18:18:29.1700349Z 'indexing': ['pointer', 
'pointer', 'pointer'], 2026-02-21T18:18:29.1700697Z 'l2_groupings': [32], 2026-02-21T18:18:29.1700989Z 'load_eviction_policies': ['', ''], 2026-02-21T18:18:29.1701609Z 'loop_orders': [[0, 1]], 2026-02-21T18:18:29.1701892Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:18:29.1702169Z 'num_sm_multiplier': 16, 2026-02-21T18:18:29.1702430Z 'num_stages': 2, 2026-02-21T18:18:29.1702657Z 'num_warps': 8, 2026-02-21T18:18:29.1702915Z 'pid_type': 'persistent_blocked', 2026-02-21T18:18:29.1703228Z 'range_flattens': [None, None], 2026-02-21T18:18:29.1703522Z 'range_multi_buffers': [True, True], 2026-02-21T18:18:29.1703827Z 'range_num_stages': [0, 0], 2026-02-21T18:18:29.1704099Z 'range_unroll_factors': [0, 0], 2026-02-21T18:18:29.1704473Z 'range_warp_specializes': [], 2026-02-21T18:18:29.1704720Z 'waves_per_eu': 1} 2026-02-21T18:18:29.2306253Z [182s] Fitting surrogate: 1063 points, 1063 targets 2026-02-21T18:18:29.7114194Z [182s] Generation 14 starting: 34 neighbors, 2 active search path(s) 2026-02-21T18:18:32.1395505Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34/34 7.7 configs/s 2026-02-21T18:18:33.5200726Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 34/34 25.8 configs/s 2026-02-21T18:18:35.2341439Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 749.4 2026-02-21T18:18:35.2342016Z configs/s 2026-02-21T18:18:35.5913631Z [188s] Generation 14 complete: 2026-02-21T18:18:35.5914008Z error=13 2026-02-21T18:18:35.5914218Z ok=23 2026-02-21T18:18:35.5914420Z min=0.0236 2026-02-21T18:18:35.5914630Z mid=0.0262 2026-02-21T18:18:35.5914827Z max=0.0446 2026-02-21T18:18:35.5915053Z best={'block_sizes': [128, 128, 128], 2026-02-21T18:18:35.5915427Z 'indexing': ['pointer', 'pointer', 'pointer'], 2026-02-21T18:18:35.5915782Z 'l2_groupings': [32], 2026-02-21T18:18:35.5916103Z 'load_eviction_policies': ['', ''], 2026-02-21T18:18:35.5916415Z 'loop_orders': [[0, 1]], 2026-02-21T18:18:35.5916692Z 'matrix_instr_nonkdim': 16, 
2026-02-21T18:18:35.5916998Z 'num_sm_multiplier': 16, 2026-02-21T18:18:35.5917268Z 'num_stages': 2, 2026-02-21T18:18:35.5917507Z 'num_warps': 8, 2026-02-21T18:18:35.5917783Z 'pid_type': 'persistent_blocked', 2026-02-21T18:18:35.5918094Z 'range_flattens': [None, None], 2026-02-21T18:18:35.5918399Z 'range_multi_buffers': [True, True], 2026-02-21T18:18:35.5918704Z 'range_num_stages': [0, 0], 2026-02-21T18:18:35.5918987Z 'range_unroll_factors': [0, 0], 2026-02-21T18:18:35.5919278Z 'range_warp_specializes': [], 2026-02-21T18:18:35.5919556Z 'waves_per_eu': 1} 2026-02-21T18:18:35.6251886Z [188s] Fitting surrogate: 1099 points, 1099 targets 2026-02-21T18:18:35.7538170Z [188s] Autotuning complete in 188.6s after searching 1056 configs. 2026-02-21T18:18:35.7538396Z One can hardcode the best config and skip autotuning with: 2026-02-21T18:18:35.7539519Z @helion.kernel(config=helion.Config(block_sizes=[128, 128, 128], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_sm_multiplier=16, num_stages=2, num_warps=8, pid_type='persistent_blocked', range_flattens=[None, None], range_multi_buffers=[True, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=1), static_shapes=True) 2026-02-21T18:18:35.7540256Z 2026-02-21T18:18:35.7540441Z [188s] Code of selected kernel: /tmp/torchinductor_root/y3/cy3wp6l56oqx3bgyxlzetbluxjatg5qqdo6mb2yje5tbpwhq7vcs.py 2026-02-21T18:18:35.7639494Z from __future__ import annotations 2026-02-21T18:18:35.7639776Z 2026-02-21T18:18:35.7639917Z import torch 2026-02-21T18:18:35.7640081Z import helion 2026-02-21T18:18:35.7640207Z import triton 2026-02-21T18:18:35.7640374Z import triton.language as tl 2026-02-21T18:18:35.7640595Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T18:18:35.7640793Z 2026-02-21T18:18:35.7640856Z _BLOCK_SIZE_0 = tl.constexpr(128) 2026-02-21T18:18:35.7641018Z 
_BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T18:18:35.7641183Z _BLOCK_SIZE_2 = tl.constexpr(128) 2026-02-21T18:18:35.7641285Z 2026-02-21T18:18:35.7641565Z @triton.jit 2026-02-21T18:18:35.7641723Z def _helion_matmul(x, y, out, _NUM_SM: tl.constexpr): 2026-02-21T18:18:35.7641980Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:18:35.7642261Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:18:35.7642530Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:18:35.7642716Z # src[matmul.py:63-67]: ... 2026-02-21T18:18:35.7642931Z total_pids = tl.cdiv(4096, _BLOCK_SIZE_0) * tl.cdiv(1024, _BLOCK_SIZE_1) 2026-02-21T18:18:35.7643177Z block_size = tl.cdiv(total_pids, _NUM_SM * 16) 2026-02-21T18:18:35.7643375Z start_pid = tl.program_id(0) * block_size 2026-02-21T18:18:35.7643602Z end_pid = tl.minimum(start_pid + block_size, total_pids) 2026-02-21T18:18:35.7643881Z for virtual_pid in tl.range(start_pid, end_pid, disallow_acc_multi_buffer=False): 2026-02-21T18:18:35.7644190Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:18:35.7644421Z num_pid_m = tl.cdiv(4096, _BLOCK_SIZE_0) 2026-02-21T18:18:35.7644680Z num_pid_n = tl.cdiv(1024, _BLOCK_SIZE_1) 2026-02-21T18:18:35.7644853Z inner_2d_pid = virtual_pid 2026-02-21T18:18:35.7645029Z num_pid_in_group = 32 * num_pid_n 2026-02-21T18:18:35.7645224Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T18:18:35.7645413Z first_pid_m = group_id * 32 2026-02-21T18:18:35.7645610Z group_size_m = min(num_pid_m - first_pid_m, 32) 2026-02-21T18:18:35.7645863Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T18:18:35.7646130Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T18:18:35.7646344Z offset_0 = pid_0 * _BLOCK_SIZE_0 2026-02-21T18:18:35.7646574Z indices_0 = (offset_0 + tl.arange(0, _BLOCK_SIZE_0)).to(tl.int32) 2026-02-21T18:18:35.7646799Z offset_1 = pid_1 * _BLOCK_SIZE_1 
2026-02-21T18:18:35.7647018Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T18:18:35.7647308Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:18:35.7647591Z acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 0.0, tl.float32) 2026-02-21T18:18:35.7647820Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:18:35.7648087Z # src[matmul.py:66]: acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n]) 2026-02-21T18:18:35.7648426Z for offset_2 in tl.range(0, 1024, _BLOCK_SIZE_2, disallow_acc_multi_buffer=False): 2026-02-21T18:18:35.7648735Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_2).to(tl.int32) 2026-02-21T18:18:35.7648952Z acc_copy = acc 2026-02-21T18:18:35.7649138Z acc_copy_0 = acc_copy 2026-02-21T18:18:35.7649369Z # src[matmul.py:66]: acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n]) 2026-02-21T18:18:35.7649727Z load = tl.load(x + (indices_0[:, None] * 1024 + indices_2[None, :] * 1), None) 2026-02-21T18:18:35.7650041Z load_1 = tl.load(y + (indices_2[:, None] * 1 + indices_1[None, :] * 1024), None) 2026-02-21T18:18:35.7650477Z acc = tl.dot(tl.cast(load, tl.float16), tl.cast(load_1, tl.float16), acc=acc_copy_0, input_precision='tf32', out_dtype=tl.float32) 2026-02-21T18:18:35.7650891Z # src[matmul.py:67]: out[tile_m, tile_n] = epilogue(acc, (tile_m, tile_n)) 2026-02-21T18:18:35.7651136Z v_0 = tl.cast(acc, tl.float16) 2026-02-21T18:18:35.7651369Z tl.store(out + (indices_0[:, None] * 1024 + indices_1[None, :] * 1), v_0, None) 2026-02-21T18:18:35.7651559Z 2026-02-21T18:18:35.7651823Z def matmul(x: Tensor, y: Tensor, epilogue: Callable[[Tensor, tuple[Tensor, ...]], Tensor]=lambda acc, tile: acc, *, _launcher=_default_launcher): 2026-02-21T18:18:35.7652196Z """ 2026-02-21T18:18:35.7652412Z Performs matrix multiplication of x and y with an optional epilogue function. 2026-02-21T18:18:35.7652659Z Args: 2026-02-21T18:18:35.7652829Z x (Tensor): Left matrix of shape [m, k]. 
2026-02-21T18:18:35.7653031Z y (Tensor): Right matrix of shape [k, n]. 2026-02-21T18:18:35.7653314Z epilogue (Callable, optional): Function applied to the accumulator and tile indices 2026-02-21T18:18:35.7653624Z after the matmul. Defaults to identity (no change). 2026-02-21T18:18:35.7653816Z Returns: 2026-02-21T18:18:35.7653958Z Tensor: Resulting matrix of shape [m, n]. 2026-02-21T18:18:35.7654089Z """ 2026-02-21T18:18:35.7654186Z # src[matmul.py:57]: m, k = x.size() 2026-02-21T18:18:35.7654311Z m, k = x.size() 2026-02-21T18:18:35.7654417Z # src[matmul.py:58]: k2, n = y.size() 2026-02-21T18:18:35.7654540Z k2, n = y.size() 2026-02-21T18:18:35.7654689Z # src[matmul.py:59]: assert k == k2, f"size mismatch {k} != {k2}" 2026-02-21T18:18:35.7654868Z assert k == k2, f'size mismatch {k} != {k2}' 2026-02-21T18:18:35.7655013Z # src[matmul.py:60]: out = torch.empty( 2026-02-21T18:18:35.7655226Z # src[matmul.py:61]: [m, n], dtype=torch.promote_types(x.dtype, y.dtype), device=x.device 2026-02-21T18:18:35.7655463Z # src[matmul.py:62]: ) 2026-02-21T18:18:35.7655652Z out = torch.empty([m, n], dtype=torch.promote_types(x.dtype, y.dtype), device=x.device) 2026-02-21T18:18:35.7655885Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:18:35.7656062Z _NUM_SM = helion.runtime.get_num_sm(x.device) 2026-02-21T18:18:35.7656233Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:18:35.7656438Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:18:35.7656641Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:18:35.7656788Z # src[matmul.py:63-67]: ... 
2026-02-21T18:18:35.7657042Z _launcher(_helion_matmul, (_NUM_SM * 16,), x, y, out, _NUM_SM, num_warps=8, num_stages=2, waves_per_eu=1, matrix_instr_nonkdim=16) 2026-02-21T18:18:35.7657307Z # src[matmul.py:68]: return out 2026-02-21T18:18:35.7657430Z return out 2026-02-21T18:18:58.1015983Z WARNING:tritonbench.utils.triton_op:Completed input ID 0: 2026-02-21T18:18:58.1016472Z (M, N, K) 2026-02-21T18:18:58.1016690Z ------------------ 2026-02-21T18:18:58.1016936Z (4096, 1024, 1024) 2026-02-21T18:18:58.1017081Z 2026-02-21T18:18:58.1024551Z 12%|█▎ | 1/8 [05:40<39:40, 340.09s/it]WARNING:tritonbench.utils.triton_op:Running input ID 2: 2026-02-21T18:18:58.1024995Z (M, N, K) 2026-02-21T18:18:58.1025175Z ------------------ 2026-02-21T18:18:58.1025463Z (4096, 2048, 2048) 2026-02-21T18:18:58.1025868Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T18:19:34.0251557Z INFO:tritonbench.utils.triton_op:Took 0.02ms to get benchmark function for triton_tutorial_matmul 2026-02-21T18:20:11.0533481Z INFO:tritonbench.utils.triton_op:Took 99.57ms to get benchmark function for pt2_triton_matmul 2026-02-21T18:20:50.8621376Z WARNING:__main__:Input tensor metadata: 2026-02-21T18:20:50.8621849Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T18:20:50.8622201Z 'dtype': 'torch.float16', 2026-02-21T18:20:50.8622524Z 'shape': (4096, 2048), 2026-02-21T18:20:50.8622824Z 'stride': (2048, 1)}, 2026-02-21T18:20:50.8623116Z { 'device': 'cuda:0', 2026-02-21T18:20:50.8623404Z 'dtype': 'torch.float16', 2026-02-21T18:20:50.8623696Z 'shape': (2048, 2048), 2026-02-21T18:20:50.8623978Z 'stride': (1, 2048)}, 2026-02-21T18:20:50.8624250Z None), 2026-02-21T18:20:50.8624477Z 'kwargs': {}} 2026-02-21T18:20:50.8643170Z INFO:tritonbench.utils.triton_op:Took 2.74ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T18:20:50.9384386Z [0s] Autotune random seed: 2171071898 2026-02-21T18:20:50.9533158Z [0s] Starting LFBOPatternSearch with 
initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T18:20:57.3624231Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━ 100/100 13.6 configs/s 2026-02-21T18:21:08.6842470Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T18:21:08.6844185Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T18:21:08.6845039Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:21:08.6845872Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:21:08.6846687Z #blocked3 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T18:21:08.6847499Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T18:21:08.6848564Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T18:21:08.6849211Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:21:08.6850101Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:21:08.6850831Z %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf16, #blocked> 2026-02-21T18:21:08.6851212Z %cst_0 = arith.constant dense<2048> : tensor<1x256xi64, #blocked> 2026-02-21T18:21:08.6851563Z %cst_1 = arith.constant dense<0> 
: tensor<1x256xi64, #blocked> 2026-02-21T18:21:08.6851907Z %cst_2 = arith.constant dense<4096> : tensor<4x1xi64, #blocked1> 2026-02-21T18:21:08.6852249Z %cst_3 = arith.constant dense<0> : tensor<4x1xi64, #blocked1> 2026-02-21T18:21:08.6852587Z %cst_4 = arith.constant dense<2048> : tensor<4x1xi64, #blocked1> 2026-02-21T18:21:08.6852881Z %c256_i32 = arith.constant 256 : i32 2026-02-21T18:21:08.6853117Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T18:21:08.6853349Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:21:08.6853618Z %cst_5 = arith.constant dense<2048> : tensor<4x1xi32, #blocked1> 2026-02-21T18:21:08.6853948Z %cst_6 = arith.constant dense<2048> : tensor<1x2xi32, #blocked2> 2026-02-21T18:21:08.6854305Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<4x2xf32, #blocked2> 2026-02-21T18:21:08.6854611Z %c4_i32 = arith.constant 4 : i32 2026-02-21T18:21:08.6854932Z %c2_i32 = arith.constant 2 : i32 2026-02-21T18:21:08.6855151Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T18:21:08.6855378Z %0 = tt.get_program_id x : i32 2026-02-21T18:21:08.6855682Z %1 = arith.remsi %0, %c1024_i32 : i32 2026-02-21T18:21:08.6855907Z %2 = arith.divsi %0, %c1024_i32 : i32 2026-02-21T18:21:08.6856131Z %3 = arith.muli %1, %c2_i32 : i32 2026-02-21T18:21:08.6856429Z %4 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked3> 2026-02-21T18:21:08.6856787Z %5 = tt.splat %3 : i32 -> tensor<2xi32, #blocked3> 2026-02-21T18:21:08.6857070Z %6 = arith.addi %5, %4 : tensor<2xi32, #blocked3> 2026-02-21T18:21:08.6857317Z %7 = arith.muli %2, %c4_i32 : i32 2026-02-21T18:21:08.6857612Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked3> 2026-02-21T18:21:08.6857950Z %9 = tt.splat %7 : i32 -> tensor<4xi32, #blocked3> 2026-02-21T18:21:08.6858223Z %10 = arith.addi %9, %8 : tensor<4xi32, #blocked3> 2026-02-21T18:21:08.6858562Z %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked3> 2026-02-21T18:21:08.6858810Z %12 = 
arith.extsi %7 : i32 to i64 2026-02-21T18:21:08.6859027Z %13 = tt.splat %arg0 : !tt.ptr -> tensor<4x256x!tt.ptr, #blocked> 2026-02-21T18:21:08.6859367Z %14 = tt.splat %12 : i64 -> tensor<4xi64, #blocked3> 2026-02-21T18:21:08.6859608Z %15 = arith.extsi %8 : tensor<4xi32, #blocked3> to tensor<4xi64, #blocked3> 2026-02-21T18:21:08.6859851Z %16 = arith.addi %14, %15 : tensor<4xi64, #blocked3> 2026-02-21T18:21:08.6860176Z %17 = ttg.convert_layout %16 : tensor<4xi64, #blocked3> -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:21:08.6860634Z %18 = tt.expand_dims %17 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi64, #blocked4> 2026-02-21T18:21:08.6861046Z %19 = ttg.convert_layout %18 : tensor<4x1xi64, #blocked4> -> tensor<4x1xi64, #blocked1> 2026-02-21T18:21:08.6861330Z %20 = arith.muli %19, %cst_4 : tensor<4x1xi64, #blocked1> 2026-02-21T18:21:08.6861590Z %21 = tt.broadcast %20 : tensor<4x1xi64, #blocked1> -> tensor<4x256xi64, #blocked1> 2026-02-21T18:21:08.6861918Z %22 = ttg.convert_layout %21 : tensor<4x256xi64, #blocked1> -> tensor<4x256xi64, #blocked> 2026-02-21T18:21:08.6862256Z %23 = arith.extsi %11 : tensor<256xi32, #blocked3> to tensor<256xi64, #blocked3> 2026-02-21T18:21:08.6862529Z %24 = arith.cmpi sge, %19, %cst_3 : tensor<4x1xi64, #blocked1> 2026-02-21T18:21:08.6862768Z %25 = arith.cmpi slt, %19, %cst_2 : tensor<4x1xi64, #blocked1> 2026-02-21T18:21:08.6863012Z %26 = arith.andi %24, %25 : tensor<4x1xi1, #blocked1> 2026-02-21T18:21:08.6863267Z %27 = tt.broadcast %26 : tensor<4x1xi1, #blocked1> -> tensor<4x256xi1, #blocked1> 2026-02-21T18:21:08.6863582Z %28 = ttg.convert_layout %27 : tensor<4x256xi1, #blocked1> -> tensor<4x256xi1, #blocked> 2026-02-21T18:21:08.6863963Z %29 = ttg.convert_layout %6 : tensor<2xi32, #blocked3> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:21:08.6864411Z %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent 
= #blocked5}>> -> tensor<1x2xi32, #blocked5> 2026-02-21T18:21:08.6864804Z %31 = ttg.convert_layout %30 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #blocked2> 2026-02-21T18:21:08.6865083Z %32 = arith.muli %31, %cst_6 : tensor<1x2xi32, #blocked2> 2026-02-21T18:21:08.6865340Z %33 = tt.broadcast %32 : tensor<1x2xi32, #blocked2> -> tensor<256x2xi32, #blocked2> 2026-02-21T18:21:08.6865632Z %34 = tt.splat %arg1 : !tt.ptr -> tensor<256x2x!tt.ptr, #blocked2> 2026-02-21T18:21:08.6866001Z %35 = scf.for %arg3 = %c0_i32 to %c2048_i32 step %c256_i32 iter_args(%arg4 = %cst_7) -> (tensor<4x2xf32, #blocked2>) : i32 { 2026-02-21T18:21:08.6866336Z %50 = tt.splat %arg3 : i32 -> tensor<256xi32, #blocked3> 2026-02-21T18:21:08.6866553Z %51 = arith.addi %50, %11 : tensor<256xi32, #blocked3> 2026-02-21T18:21:08.6866768Z %52 = arith.extsi %arg3 : i32 to i64 2026-02-21T18:21:08.6866953Z %53 = tt.splat %52 : i64 -> tensor<256xi64, #blocked3> 2026-02-21T18:21:08.6867182Z %54 = arith.addi %53, %23 : tensor<256xi64, #blocked3> 2026-02-21T18:21:08.6867509Z %55 = ttg.convert_layout %54 : tensor<256xi64, #blocked3> -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:21:08.6867985Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi64, #blocked5> 2026-02-21T18:21:08.6868477Z %57 = ttg.convert_layout %56 : tensor<1x256xi64, #blocked5> -> tensor<1x256xi64, #blocked> 2026-02-21T18:21:08.6868767Z %58 = tt.broadcast %57 : tensor<1x256xi64, #blocked> -> tensor<4x256xi64, #blocked> 2026-02-21T18:21:08.6868979Z %59 = arith.addi %22, %58 : tensor<4x256xi64, #blocked> 2026-02-21T18:21:08.6869193Z %60 = tt.addptr %13, %59 : tensor<4x256x!tt.ptr, #blocked>, tensor<4x256xi64, #blocked> 2026-02-21T18:21:08.6869423Z %61 = arith.cmpi sge, %57, %cst_1 : tensor<1x256xi64, #blocked> 2026-02-21T18:21:08.6869612Z %62 = arith.cmpi slt, %57, %cst_0 : tensor<1x256xi64, #blocked> 2026-02-21T18:21:08.6869791Z %63 = arith.andi 
%61, %62 : tensor<1x256xi1, #blocked> 2026-02-21T18:21:08.6870013Z %64 = tt.broadcast %63 : tensor<1x256xi1, #blocked> -> tensor<4x256xi1, #blocked> 2026-02-21T18:21:08.6870218Z %65 = arith.andi %28, %64 : tensor<4x256xi1, #blocked> 2026-02-21T18:21:08.6870392Z %66 = tt.load %60, %65, %cst : tensor<4x256x!tt.ptr, #blocked> 2026-02-21T18:21:08.6870664Z %67 = ttg.convert_layout %51 : tensor<256xi32, #blocked3> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:21:08.6871032Z %68 = tt.expand_dims %67 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<256x1xi32, #blocked4> 2026-02-21T18:21:08.6871355Z %69 = ttg.convert_layout %68 : tensor<256x1xi32, #blocked4> -> tensor<256x1xi32, #blocked1> 2026-02-21T18:21:08.6871612Z %70 = tt.broadcast %69 : tensor<256x1xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T18:21:08.6871872Z %71 = ttg.convert_layout %70 : tensor<256x2xi32, #blocked1> -> tensor<256x2xi32, #blocked2> 2026-02-21T18:21:08.6872094Z %72 = arith.addi %71, %33 : tensor<256x2xi32, #blocked2> 2026-02-21T18:21:08.6872335Z %73 = tt.addptr %34, %72 : tensor<256x2x!tt.ptr, #blocked2>, tensor<256x2xi32, #blocked2> 2026-02-21T18:21:08.6872562Z %74 = tt.load %73 : tensor<256x2x!tt.ptr, #blocked2> 2026-02-21T18:21:08.6872839Z %75 = ttg.convert_layout %66 : tensor<4x256xf16, #blocked> -> tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked2}>> 2026-02-21T18:21:08.6873219Z %76 = ttg.convert_layout %74 : tensor<256x2xf16, #blocked2> -> tensor<256x2xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked2}>> 2026-02-21T18:21:08.6873542Z %77 = ttg.convert_layout %arg4 : tensor<4x2xf32, #blocked2> -> tensor<4x2xf32, #blocked2> 2026-02-21T18:21:08.6873988Z %78 = tt.dot %75, %76, %77, inputPrecision = tf32 : tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked2}>> * tensor<256x2xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked2}>> -> tensor<4x2xf32, #blocked2> 2026-02-21T18:21:08.6874366Z scf.yield %78 : 
tensor<4x2xf32, #blocked2> 2026-02-21T18:21:08.6874501Z } {tt.flatten} 2026-02-21T18:21:08.6874660Z %36 = arith.truncf %35 : tensor<4x2xf32, #blocked2> to tensor<4x2xf16, #blocked2> 2026-02-21T18:21:08.6874950Z %37 = ttg.convert_layout %10 : tensor<4xi32, #blocked3> -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:21:08.6875301Z %38 = tt.expand_dims %37 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi32, #blocked4> 2026-02-21T18:21:08.6875612Z %39 = ttg.convert_layout %38 : tensor<4x1xi32, #blocked4> -> tensor<4x1xi32, #blocked1> 2026-02-21T18:21:08.6875827Z %40 = arith.muli %39, %cst_5 : tensor<4x1xi32, #blocked1> 2026-02-21T18:21:08.6876101Z %41 = ttg.convert_layout %6 : tensor<2xi32, #blocked3> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:21:08.6876467Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x2xi32, #blocked5> 2026-02-21T18:21:08.6876776Z %43 = ttg.convert_layout %42 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #blocked2> 2026-02-21T18:21:08.6877022Z %44 = tt.broadcast %40 : tensor<4x1xi32, #blocked1> -> tensor<4x2xi32, #blocked1> 2026-02-21T18:21:08.6877262Z %45 = ttg.convert_layout %44 : tensor<4x2xi32, #blocked1> -> tensor<4x2xi32, #blocked2> 2026-02-21T18:21:08.6877504Z %46 = tt.broadcast %43 : tensor<1x2xi32, #blocked2> -> tensor<4x2xi32, #blocked2> 2026-02-21T18:21:08.6877702Z %47 = arith.addi %45, %46 : tensor<4x2xi32, #blocked2> 2026-02-21T18:21:08.6877893Z %48 = tt.splat %arg2 : !tt.ptr -> tensor<4x2x!tt.ptr, #blocked2> 2026-02-21T18:21:08.6878126Z %49 = tt.addptr %48, %47 : tensor<4x2x!tt.ptr, #blocked2>, tensor<4x2xi32, #blocked2> 2026-02-21T18:21:08.6878342Z tt.store %49, %36 : tensor<4x2x!tt.ptr, #blocked2> 2026-02-21T18:21:08.6878485Z tt.return 2026-02-21T18:21:08.6878573Z } 2026-02-21T18:21:08.6878657Z } 2026-02-21T18:21:08.6878722Z 2026-02-21T18:21:08.6878758Z {-# 
2026-02-21T18:21:08.6878847Z external_resources: { 2026-02-21T18:21:08.6878955Z mlir_reproducer: { 2026-02-21T18:21:08.6881253Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:21:08.6883586Z disable_threading: false, 2026-02-21T18:21:08.6883693Z verify_each: true 2026-02-21T18:21:08.6883784Z } 2026-02-21T18:21:08.6883854Z } 2026-02-21T18:21:08.6883927Z #-} 2026-02-21T18:21:08.6884211Z /tmp/torchinductor_root/tc/ctcs6hnh5lgh2gay7owxsora5z4gsayh7mggmcxsmnalrgsowplo.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:21:08.6884912Z /tmp/torchinductor_root/tc/ctcs6hnh5lgh2gay7owxsora5z4gsayh7mggmcxsmnalrgsowplo.py:13:0: note: 
Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:21:08.6885468Z [17s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:21:08.6886190Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 2, 256], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T18:21:08.6886858Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:21:08.6887055Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:21:16.0149405Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 5.4 configs/s 2026-02-21T18:21:16.0158031Z [25s] Adaptive compile timeout: 30s (90% percentile=4.8s, bounds=[30.0s, 30s]) 2026-02-21T18:21:16.0837152Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 - configs/s 2026-02-21T18:21:16.4254945Z [25s] Initial random population of 100, 5 starting points: 2026-02-21T18:21:16.4255258Z error=9 2026-02-21T18:21:16.4255401Z ok=91 2026-02-21T18:21:16.4255543Z min=0.1341 2026-02-21T18:21:16.4255675Z mid=2.0903 2026-02-21T18:21:16.4255811Z max=367.7411 2026-02-21T18:21:16.4255974Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:21:16.4256224Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T18:21:16.4256491Z 'l2_groupings': [16], 2026-02-21T18:21:16.4256723Z 'load_eviction_policies': ['', ''], 2026-02-21T18:21:16.4256926Z 'loop_orders': [[1, 0]], 2026-02-21T18:21:16.4257106Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:21:16.4257312Z 'num_sm_multiplier': 128, 
2026-02-21T18:21:16.4257766Z 'num_stages': 1, 2026-02-21T18:21:16.4257931Z 'num_warps': 16, 2026-02-21T18:21:16.4258108Z 'pid_type': 'persistent_blocked', 2026-02-21T18:21:16.4258317Z 'range_flattens': [False, True], 2026-02-21T18:21:16.4258525Z 'range_multi_buffers': [False, False], 2026-02-21T18:21:16.4258731Z 'range_num_stages': [0, 0], 2026-02-21T18:21:16.4258924Z 'range_unroll_factors': [0, 0], 2026-02-21T18:21:16.4259117Z 'range_warp_specializes': [], 2026-02-21T18:21:16.4259301Z 'waves_per_eu': 2} 2026-02-21T18:21:16.4279807Z [25s] Fitting surrogate: 100 points, 100 targets 2026-02-21T18:21:17.3988877Z [26s] Generation 1 starting: 87 neighbors, 5 active search path(s) 2026-02-21T18:21:26.1150113Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 14.4 configs/s 2026-02-21T18:21:31.5361613Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 17.1 configs/s 2026-02-21T18:21:33.1871432Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 503.4 2026-02-21T18:21:33.1872018Z configs/s 2026-02-21T18:21:33.8032936Z [42s] Generation 1 complete: 2026-02-21T18:21:33.8033354Z error=2 2026-02-21T18:21:33.8033569Z ok=90 2026-02-21T18:21:33.8033773Z min=0.0801 2026-02-21T18:21:33.8033984Z mid=0.1759 2026-02-21T18:21:33.8034202Z max=2.8923 2026-02-21T18:21:33.8034442Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:21:33.8034817Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:21:33.8035180Z 'l2_groupings': [2], 2026-02-21T18:21:33.8035453Z 'load_eviction_policies': ['', ''], 2026-02-21T18:21:33.8035756Z 'loop_orders': [[0, 1]], 2026-02-21T18:21:33.8036032Z 'matrix_instr_nonkdim': 0, 2026-02-21T18:21:33.8036849Z 'num_stages': 2, 2026-02-21T18:21:33.8037080Z 'num_warps': 4, 2026-02-21T18:21:33.8037312Z 'pid_type': 'flat', 2026-02-21T18:21:33.8037572Z 'range_flattens': [None, None], 2026-02-21T18:21:33.8038010Z 'range_multi_buffers': [None, None], 2026-02-21T18:21:33.8038336Z 'range_num_stages': [0, 0], 
2026-02-21T18:21:33.8038637Z 'range_unroll_factors': [0, 0], 2026-02-21T18:21:33.8038934Z 'range_warp_specializes': [], 2026-02-21T18:21:33.8039214Z 'waves_per_eu': 1} 2026-02-21T18:21:33.8214974Z [42s] Fitting surrogate: 192 points, 192 targets 2026-02-21T18:21:35.2890829Z [44s] Generation 2 starting: 88 neighbors, 5 active search path(s) 2026-02-21T18:21:44.1978940Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91/91 10.6 configs/s 2026-02-21T18:21:49.5344369Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 91/91 17.2 configs/s 2026-02-21T18:21:50.5850046Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 762.7 2026-02-21T18:21:50.5850573Z configs/s 2026-02-21T18:21:51.1345165Z [60s] Generation 2 complete: 2026-02-21T18:21:51.1345498Z error=2 2026-02-21T18:21:51.1345699Z ok=92 2026-02-21T18:21:51.1345920Z min=0.0618 2026-02-21T18:21:51.1346126Z mid=0.1403 2026-02-21T18:21:51.1346341Z max=2.1418 2026-02-21T18:21:51.1346592Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:21:51.1346969Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:21:51.1347336Z 'l2_groupings': [2], 2026-02-21T18:21:51.1347610Z 'load_eviction_policies': ['', ''], 2026-02-21T18:21:51.1347918Z 'loop_orders': [[0, 1]], 2026-02-21T18:21:51.1348296Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:21:51.1348573Z 'num_stages': 2, 2026-02-21T18:21:51.1348794Z 'num_warps': 4, 2026-02-21T18:21:51.1349035Z 'pid_type': 'flat', 2026-02-21T18:21:51.1349293Z 'range_flattens': [None, None], 2026-02-21T18:21:51.1349597Z 'range_multi_buffers': [None, None], 2026-02-21T18:21:51.1349903Z 'range_num_stages': [0, 0], 2026-02-21T18:21:51.1350178Z 'range_unroll_factors': [0, 0], 2026-02-21T18:21:51.1350477Z 'range_warp_specializes': [], 2026-02-21T18:21:51.1350748Z 'waves_per_eu': 1} 2026-02-21T18:21:51.1495781Z [60s] Fitting surrogate: 286 points, 286 targets 2026-02-21T18:21:52.1301855Z [61s] Generation 3 starting: 83 neighbors, 5 active search path(s) 
2026-02-21T18:22:06.3428769Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 1.0 configs/s 2026-02-21T18:22:11.1213943Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 18.6 configs/s 2026-02-21T18:22:13.6029721Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 363.7 2026-02-21T18:22:13.6030305Z configs/s 2026-02-21T18:22:14.2491622Z [83s] Generation 3 complete: 2026-02-21T18:22:14.2491998Z error=9 2026-02-21T18:22:14.2492200Z ok=79 2026-02-21T18:22:14.2492413Z min=0.0613 2026-02-21T18:22:14.2492617Z mid=0.1182 2026-02-21T18:22:14.2492819Z max=4.5685 2026-02-21T18:22:14.2493072Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:22:14.2493464Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:22:14.2493828Z 'l2_groupings': [2], 2026-02-21T18:22:14.2494127Z 'load_eviction_policies': ['', ''], 2026-02-21T18:22:14.2494506Z 'loop_orders': [[0, 1]], 2026-02-21T18:22:14.2494803Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:22:14.2495073Z 'num_stages': 2, 2026-02-21T18:22:14.2495309Z 'num_warps': 4, 2026-02-21T18:22:14.2495548Z 'pid_type': 'flat', 2026-02-21T18:22:14.2495815Z 'range_flattens': [None, None], 2026-02-21T18:22:14.2496119Z 'range_multi_buffers': [None, None], 2026-02-21T18:22:14.2496434Z 'range_num_stages': [0, 0], 2026-02-21T18:22:14.2496716Z 'range_unroll_factors': [0, 0], 2026-02-21T18:22:14.2497019Z 'range_warp_specializes': [], 2026-02-21T18:22:14.2497305Z 'waves_per_eu': 1} 2026-02-21T18:22:14.2861939Z [83s] Fitting surrogate: 374 points, 374 targets 2026-02-21T18:22:15.7151720Z [84s] Generation 4 starting: 82 neighbors, 5 active search path(s) 2026-02-21T18:22:23.8535255Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83/83 17.2 configs/s 2026-02-21T18:22:27.7773258Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 83/83 21.2 configs/s 2026-02-21T18:22:29.7260065Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 450.4 2026-02-21T18:22:29.7260695Z 
configs/s 2026-02-21T18:22:30.3293443Z [99s] Generation 4 complete: 2026-02-21T18:22:30.3293812Z error=19 2026-02-21T18:22:30.3294024Z ok=68 2026-02-21T18:22:30.3294225Z min=0.0619 2026-02-21T18:22:30.3294435Z mid=0.1125 2026-02-21T18:22:30.3294631Z max=2.2822 2026-02-21T18:22:30.3294855Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:22:30.3295240Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:22:30.3295611Z 'l2_groupings': [2], 2026-02-21T18:22:30.3295880Z 'load_eviction_policies': ['', ''], 2026-02-21T18:22:30.3296188Z 'loop_orders': [[0, 1]], 2026-02-21T18:22:30.3296488Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:22:30.3296757Z 'num_stages': 2, 2026-02-21T18:22:30.3296990Z 'num_warps': 4, 2026-02-21T18:22:30.3297219Z 'pid_type': 'flat', 2026-02-21T18:22:30.3297492Z 'range_flattens': [None, None], 2026-02-21T18:22:30.3298090Z 'range_multi_buffers': [None, None], 2026-02-21T18:22:30.3298405Z 'range_num_stages': [0, 0], 2026-02-21T18:22:30.3298682Z 'range_unroll_factors': [0, 0], 2026-02-21T18:22:30.3298980Z 'range_warp_specializes': [], 2026-02-21T18:22:30.3299264Z 'waves_per_eu': 1} 2026-02-21T18:22:30.3589368Z [99s] Fitting surrogate: 461 points, 461 targets 2026-02-21T18:22:31.1394918Z [100s] Generation 5 starting: 65 neighbors, 4 active search path(s) 2026-02-21T18:22:37.7252732Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 11.5 configs/s 2026-02-21T18:22:41.2147216Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 18.9 configs/s 2026-02-21T18:22:44.2493809Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 302.6 2026-02-21T18:22:44.2494448Z configs/s 2026-02-21T18:22:44.8094648Z [113s] Generation 5 complete: 2026-02-21T18:22:44.8095013Z error=7 2026-02-21T18:22:44.8095232Z ok=63 2026-02-21T18:22:44.8095813Z min=0.0624 2026-02-21T18:22:44.8096018Z mid=0.1024 2026-02-21T18:22:44.8096202Z max=1.9587 2026-02-21T18:22:44.8096425Z best={'block_sizes': [128, 128, 64], 
2026-02-21T18:22:44.8096791Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:22:44.8097144Z 'l2_groupings': [2], 2026-02-21T18:22:44.8097410Z 'load_eviction_policies': ['', ''], 2026-02-21T18:22:44.8097701Z 'loop_orders': [[0, 1]], 2026-02-21T18:22:44.8097974Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:22:44.8098225Z 'num_stages': 2, 2026-02-21T18:22:44.8098448Z 'num_warps': 4, 2026-02-21T18:22:44.8098662Z 'pid_type': 'flat', 2026-02-21T18:22:44.8098919Z 'range_flattens': [None, None], 2026-02-21T18:22:44.8099222Z 'range_multi_buffers': [None, None], 2026-02-21T18:22:44.8099648Z 'range_num_stages': [0, 0], 2026-02-21T18:22:44.8100044Z 'range_unroll_factors': [0, 0], 2026-02-21T18:22:44.8100497Z 'range_warp_specializes': [], 2026-02-21T18:22:44.8100929Z 'waves_per_eu': 1} 2026-02-21T18:22:44.8592221Z [113s] Fitting surrogate: 531 points, 531 targets 2026-02-21T18:22:46.1572923Z [115s] Generation 6 starting: 65 neighbors, 4 active search path(s) 2026-02-21T18:22:52.3013773Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 35.1 configs/s 2026-02-21T18:22:55.6416779Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 65/65 19.8 configs/s 2026-02-21T18:22:58.5075000Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 318.9 2026-02-21T18:22:58.5075616Z configs/s 2026-02-21T18:22:59.0791544Z [128s] Generation 6 complete: 2026-02-21T18:22:59.0792058Z error=11 2026-02-21T18:22:59.0792629Z ok=59 2026-02-21T18:22:59.0792843Z min=0.0606 2026-02-21T18:22:59.0793052Z mid=0.0981 2026-02-21T18:22:59.0793247Z max=2.5717 2026-02-21T18:22:59.0793475Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:22:59.0794015Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:22:59.0794395Z 'l2_groupings': [2], 2026-02-21T18:22:59.0794727Z 'load_eviction_policies': ['', ''], 2026-02-21T18:22:59.0795039Z 'loop_orders': [[0, 1]], 2026-02-21T18:22:59.0795315Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:22:59.0795586Z 
'num_stages': 2, 2026-02-21T18:22:59.0795814Z 'num_warps': 4, 2026-02-21T18:22:59.0796047Z 'pid_type': 'flat', 2026-02-21T18:22:59.0796311Z 'range_flattens': [None, None], 2026-02-21T18:22:59.0796608Z 'range_multi_buffers': [None, None], 2026-02-21T18:22:59.0796872Z 'range_num_stages': [0, 0], 2026-02-21T18:22:59.0797094Z 'range_unroll_factors': [0, 0], 2026-02-21T18:22:59.0797334Z 'range_warp_specializes': [], 2026-02-21T18:22:59.0797555Z 'waves_per_eu': 1} 2026-02-21T18:22:59.1230910Z [128s] Fitting surrogate: 601 points, 601 targets 2026-02-21T18:22:59.7416537Z [128s] Generation 7 starting: 50 neighbors, 3 active search path(s) 2026-02-21T18:23:05.3655691Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50/50 2.8 configs/s 2026-02-21T18:23:07.7957455Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 50/50 22.0 configs/s 2026-02-21T18:23:09.6725960Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 467.6 2026-02-21T18:23:09.6726591Z configs/s 2026-02-21T18:23:10.2591021Z [139s] Generation 7 complete: 2026-02-21T18:23:10.2591411Z error=11 2026-02-21T18:23:10.2591623Z ok=43 2026-02-21T18:23:10.2591819Z min=0.0618 2026-02-21T18:23:10.2592010Z mid=0.1006 2026-02-21T18:23:10.2592194Z max=1.1419 2026-02-21T18:23:10.2592405Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:23:10.2592764Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:23:10.2593131Z 'l2_groupings': [2], 2026-02-21T18:23:10.2593387Z 'load_eviction_policies': ['', ''], 2026-02-21T18:23:10.2593671Z 'loop_orders': [[0, 1]], 2026-02-21T18:23:10.2593933Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:23:10.2594191Z 'num_stages': 2, 2026-02-21T18:23:10.2594421Z 'num_warps': 4, 2026-02-21T18:23:10.2594961Z 'pid_type': 'flat', 2026-02-21T18:23:10.2595238Z 'range_flattens': [None, None], 2026-02-21T18:23:10.2595547Z 'range_multi_buffers': [None, None], 2026-02-21T18:23:10.2595822Z 'range_num_stages': [0, 0], 2026-02-21T18:23:10.2596072Z 'range_unroll_factors': 
[0, 0], 2026-02-21T18:23:10.2596337Z 'range_warp_specializes': [], 2026-02-21T18:23:10.2596592Z 'waves_per_eu': 1} 2026-02-21T18:23:10.2854791Z [139s] Fitting surrogate: 655 points, 655 targets 2026-02-21T18:23:10.7664182Z [139s] Generation 8 starting: 37 neighbors, 2 active search path(s) 2026-02-21T18:23:14.0971362Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 9.6 configs/s 2026-02-21T18:23:16.6234240Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 15.5 configs/s 2026-02-21T18:23:19.9910705Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 275.0 2026-02-21T18:23:19.9911325Z configs/s 2026-02-21T18:23:20.5766738Z [149s] Generation 8 complete: 2026-02-21T18:23:20.5767037Z error=4 2026-02-21T18:23:20.5767509Z ok=36 2026-02-21T18:23:20.5767795Z min=0.0608 2026-02-21T18:23:20.5768010Z mid=0.0789 2026-02-21T18:23:20.5768212Z max=0.4894 2026-02-21T18:23:20.5768436Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:23:20.5768843Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:23:20.5769206Z 'l2_groupings': [2], 2026-02-21T18:23:20.5769485Z 'load_eviction_policies': ['', ''], 2026-02-21T18:23:20.5769789Z 'loop_orders': [[0, 1]], 2026-02-21T18:23:20.5770057Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:23:20.5770321Z 'num_stages': 2, 2026-02-21T18:23:20.5770547Z 'num_warps': 4, 2026-02-21T18:23:20.5771188Z 'pid_type': 'flat', 2026-02-21T18:23:20.5771449Z 'range_flattens': [None, True], 2026-02-21T18:23:20.5771753Z 'range_multi_buffers': [None, False], 2026-02-21T18:23:20.5772060Z 'range_num_stages': [0, 0], 2026-02-21T18:23:20.5772473Z 'range_unroll_factors': [0, 0], 2026-02-21T18:23:20.5772780Z 'range_warp_specializes': [], 2026-02-21T18:23:20.5773074Z 'waves_per_eu': 1} 2026-02-21T18:23:20.6272942Z [149s] Fitting surrogate: 695 points, 695 targets 2026-02-21T18:23:21.0817006Z [150s] Generation 9 starting: 36 neighbors, 2 active search path(s) 2026-02-21T18:23:24.5130838Z Generation 9: precompiling 
100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 4.4 configs/s 2026-02-21T18:23:26.5278851Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 18.8 configs/s 2026-02-21T18:23:29.5622109Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 301.8 2026-02-21T18:23:29.5622725Z configs/s 2026-02-21T18:23:30.1132177Z [159s] Generation 9 complete: 2026-02-21T18:23:30.1132640Z error=4 2026-02-21T18:23:30.1132846Z ok=34 2026-02-21T18:23:30.1133048Z min=0.0613 2026-02-21T18:23:30.1133257Z mid=0.0824 2026-02-21T18:23:30.1133491Z max=0.2339 2026-02-21T18:23:30.1134233Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:23:30.1134759Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:23:30.1135121Z 'l2_groupings': [2], 2026-02-21T18:23:30.1135393Z 'load_eviction_policies': ['', ''], 2026-02-21T18:23:30.1135700Z 'loop_orders': [[0, 1]], 2026-02-21T18:23:30.1135980Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:23:30.1136252Z 'num_stages': 2, 2026-02-21T18:23:30.1136479Z 'num_warps': 4, 2026-02-21T18:23:30.1136713Z 'pid_type': 'flat', 2026-02-21T18:23:30.1136971Z 'range_flattens': [None, True], 2026-02-21T18:23:30.1137271Z 'range_multi_buffers': [None, None], 2026-02-21T18:23:30.1137574Z 'range_num_stages': [0, 0], 2026-02-21T18:23:30.1137850Z 'range_unroll_factors': [0, 0], 2026-02-21T18:23:30.1138152Z 'range_warp_specializes': [], 2026-02-21T18:23:30.1138428Z 'waves_per_eu': 1} 2026-02-21T18:23:30.1590933Z [159s] Fitting surrogate: 733 points, 733 targets 2026-02-21T18:23:30.5701449Z [159s] Generation 10 starting: 30 neighbors, 2 active search path(s) 2026-02-21T18:23:33.5817680Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/30 4.9 configs/s 2026-02-21T18:23:35.1792613Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 30/30 20.8 configs/s 2026-02-21T18:23:36.7090497Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 555.9 2026-02-21T18:23:36.7091113Z configs/s 2026-02-21T18:23:37.2627235Z [166s] 
Generation 10 complete: 2026-02-21T18:23:37.2627531Z error=5 2026-02-21T18:23:37.2627695Z ok=27 2026-02-21T18:23:37.2627858Z min=0.0616 2026-02-21T18:23:37.2628032Z mid=0.0922 2026-02-21T18:23:37.2628296Z max=0.3368 2026-02-21T18:23:37.2628478Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:23:37.2629167Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:23:37.2629457Z 'l2_groupings': [2], 2026-02-21T18:23:37.2629680Z 'load_eviction_policies': ['', ''], 2026-02-21T18:23:37.2630058Z 'loop_orders': [[0, 1]], 2026-02-21T18:23:37.2630295Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:23:37.2630527Z 'num_stages': 2, 2026-02-21T18:23:37.2630717Z 'num_warps': 4, 2026-02-21T18:23:37.2630902Z 'pid_type': 'flat', 2026-02-21T18:23:37.2631117Z 'range_flattens': [None, True], 2026-02-21T18:23:37.2631361Z 'range_multi_buffers': [None, None], 2026-02-21T18:23:37.2631607Z 'range_num_stages': [0, 0], 2026-02-21T18:23:37.2631838Z 'range_unroll_factors': [0, 0], 2026-02-21T18:23:37.2632074Z 'range_warp_specializes': [], 2026-02-21T18:23:37.2632301Z 'waves_per_eu': 1} 2026-02-21T18:23:37.2859086Z [166s] Fitting surrogate: 765 points, 765 targets 2026-02-21T18:23:37.6023014Z [166s] Generation 11 starting: 20 neighbors, 1 active search path(s) 2026-02-21T18:23:39.4895190Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 9.8 configs/s 2026-02-21T18:23:40.7817013Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 17.5 configs/s 2026-02-21T18:23:43.0063146Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 488.2 2026-02-21T18:23:43.0063594Z configs/s 2026-02-21T18:23:43.5143658Z [172s] Generation 11 complete: 2026-02-21T18:23:43.5144100Z ok=22 2026-02-21T18:23:43.5144316Z min=0.0626 2026-02-21T18:23:43.5144532Z mid=0.0790 2026-02-21T18:23:43.5144737Z max=0.1015 2026-02-21T18:23:43.5144978Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:23:43.5145354Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 
2026-02-21T18:23:43.5145726Z 'l2_groupings': [2], 2026-02-21T18:23:43.5146000Z 'load_eviction_policies': ['', ''], 2026-02-21T18:23:43.5146312Z 'loop_orders': [[1, 0]], 2026-02-21T18:23:43.5146590Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:23:43.5146893Z 'num_sm_multiplier': 2, 2026-02-21T18:23:43.5147153Z 'num_stages': 2, 2026-02-21T18:23:43.5147385Z 'num_warps': 4, 2026-02-21T18:23:43.5147640Z 'pid_type': 'persistent_blocked', 2026-02-21T18:23:43.5147965Z 'range_flattens': [None, False], 2026-02-21T18:23:43.5148368Z 'range_multi_buffers': [False, False], 2026-02-21T18:23:43.5149134Z 'range_num_stages': [0, 0], 2026-02-21T18:23:43.5149418Z 'range_unroll_factors': [0, 0], 2026-02-21T18:23:43.5149713Z 'range_warp_specializes': [], 2026-02-21T18:23:43.5150000Z 'waves_per_eu': 1} 2026-02-21T18:23:43.5405988Z [172s] Fitting surrogate: 787 points, 787 targets 2026-02-21T18:23:43.8594062Z [172s] Generation 12 starting: 21 neighbors, 1 active search path(s) 2026-02-21T18:23:46.1060005Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21/21 8.9 configs/s 2026-02-21T18:23:47.4179334Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 21/21 18.1 configs/s 2026-02-21T18:23:48.6583067Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 665.9 2026-02-21T18:23:48.6583549Z configs/s 2026-02-21T18:23:49.1835651Z [178s] Generation 12 complete: 2026-02-21T18:23:49.1836008Z error=1 2026-02-21T18:23:49.1836213Z ok=21 2026-02-21T18:23:49.1836433Z min=0.0579 2026-02-21T18:23:49.1836659Z mid=0.0868 2026-02-21T18:23:49.1841872Z max=0.1310 2026-02-21T18:23:49.1842145Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:23:49.1842591Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:23:49.1842965Z 'l2_groupings': [2], 2026-02-21T18:23:49.1843236Z 'load_eviction_policies': ['', ''], 2026-02-21T18:23:49.1853637Z 'loop_orders': [[1, 0]], 2026-02-21T18:23:49.1853841Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:23:49.1854033Z 'num_sm_multiplier': 
2, 2026-02-21T18:23:49.1854209Z 'num_stages': 2, 2026-02-21T18:23:49.1854367Z 'num_warps': 4, 2026-02-21T18:23:49.1854535Z 'pid_type': 'persistent_blocked', 2026-02-21T18:23:49.1854748Z 'range_flattens': [None, None], 2026-02-21T18:23:49.1855224Z 'range_multi_buffers': [False, False], 2026-02-21T18:23:49.1855433Z 'range_num_stages': [0, 0], 2026-02-21T18:23:49.1855613Z 'range_unroll_factors': [0, 0], 2026-02-21T18:23:49.1855915Z 'range_warp_specializes': [], 2026-02-21T18:23:49.1856102Z 'waves_per_eu': 1} 2026-02-21T18:23:49.2006887Z [178s] Fitting surrogate: 809 points, 809 targets 2026-02-21T18:23:49.4939175Z [178s] Generation 13 starting: 18 neighbors, 1 active search path(s) 2026-02-21T18:23:51.2696686Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18/18 9.8 configs/s 2026-02-21T18:23:52.3873046Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 18/18 18.6 configs/s 2026-02-21T18:23:53.7569472Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 610.5 2026-02-21T18:23:53.7570106Z configs/s 2026-02-21T18:23:54.2309195Z [183s] Generation 13 complete: 2026-02-21T18:23:54.2309587Z error=1 2026-02-21T18:23:54.2309793Z ok=18 2026-02-21T18:23:54.2309999Z min=0.0622 2026-02-21T18:23:54.2310193Z mid=0.0768 2026-02-21T18:23:54.2310424Z max=0.1608 2026-02-21T18:23:54.2310671Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:23:54.2311396Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:23:54.2311787Z 'l2_groupings': [2], 2026-02-21T18:23:54.2312065Z 'load_eviction_policies': ['', ''], 2026-02-21T18:23:54.2312369Z 'loop_orders': [[1, 0]], 2026-02-21T18:23:54.2312642Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:23:54.2312927Z 'num_sm_multiplier': 2, 2026-02-21T18:23:54.2313189Z 'num_stages': 2, 2026-02-21T18:23:54.2313409Z 'num_warps': 4, 2026-02-21T18:23:54.2313668Z 'pid_type': 'persistent_blocked', 2026-02-21T18:23:54.2313981Z 'range_flattens': [False, None], 2026-02-21T18:23:54.2314302Z 'range_multi_buffers': 
[False, False], 2026-02-21T18:23:54.2314615Z 'range_num_stages': [0, 0], 2026-02-21T18:23:54.2314890Z 'range_unroll_factors': [0, 0], 2026-02-21T18:23:54.2315193Z 'range_warp_specializes': [], 2026-02-21T18:23:54.2315461Z 'waves_per_eu': 1} 2026-02-21T18:23:54.2514486Z [183s] Fitting surrogate: 828 points, 828 targets 2026-02-21T18:23:54.5604577Z [183s] Generation 14 starting: 19 neighbors, 1 active search path(s) 2026-02-21T18:23:56.4227803Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 19/19 20.0 configs/s 2026-02-21T18:23:57.5526803Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 19/19 19.4 configs/s 2026-02-21T18:23:58.6211826Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 762.8 2026-02-21T18:23:58.6212368Z configs/s 2026-02-21T18:23:59.1238390Z [188s] Generation 14 complete: 2026-02-21T18:23:59.1238759Z error=2 2026-02-21T18:23:59.1238975Z ok=18 2026-02-21T18:23:59.1239180Z min=0.0613 2026-02-21T18:23:59.1239394Z mid=0.0757 2026-02-21T18:23:59.1239591Z max=0.1427 2026-02-21T18:23:59.1239825Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:23:59.1240232Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:23:59.1240607Z 'l2_groupings': [2], 2026-02-21T18:23:59.1240878Z 'load_eviction_policies': ['', ''], 2026-02-21T18:23:59.1241199Z 'loop_orders': [[1, 0]], 2026-02-21T18:23:59.1241496Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:23:59.1241793Z 'num_sm_multiplier': 2, 2026-02-21T18:23:59.1242057Z 'num_stages': 2, 2026-02-21T18:23:59.1242290Z 'num_warps': 4, 2026-02-21T18:23:59.1242549Z 'pid_type': 'persistent_blocked', 2026-02-21T18:23:59.1242869Z 'range_flattens': [False, None], 2026-02-21T18:23:59.1243185Z 'range_multi_buffers': [False, False], 2026-02-21T18:23:59.1243498Z 'range_num_stages': [0, 0], 2026-02-21T18:23:59.1243786Z 'range_unroll_factors': [0, 0], 2026-02-21T18:23:59.1244084Z 'range_warp_specializes': [], 2026-02-21T18:23:59.1244365Z 'waves_per_eu': 2} 2026-02-21T18:23:59.1396166Z [188s] 
Fitting surrogate: 848 points, 848 targets 2026-02-21T18:23:59.4537994Z [188s] Generation 15 starting: 20 neighbors, 1 active search path(s) 2026-02-21T18:24:04.3253198Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 1.3 configs/s 2026-02-21T18:24:05.3372509Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 23.5 configs/s 2026-02-21T18:24:06.0638056Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1029.2 2026-02-21T18:24:06.0638686Z configs/s 2026-02-21T18:24:06.5135650Z [195s] Generation 15 complete: 2026-02-21T18:24:06.5136015Z error=5 2026-02-21T18:24:06.5136227Z ok=16 2026-02-21T18:24:06.5136431Z min=0.0580 2026-02-21T18:24:06.5136647Z mid=0.0845 2026-02-21T18:24:06.5136841Z max=0.1598 2026-02-21T18:24:06.5137075Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:24:06.5137453Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:24:06.5137820Z 'l2_groupings': [2], 2026-02-21T18:24:06.5138096Z 'load_eviction_policies': ['', ''], 2026-02-21T18:24:06.5138425Z 'loop_orders': [[1, 0]], 2026-02-21T18:24:06.5138711Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:24:06.5138987Z 'num_sm_multiplier': 2, 2026-02-21T18:24:06.5139249Z 'num_stages': 2, 2026-02-21T18:24:06.5139490Z 'num_warps': 4, 2026-02-21T18:24:06.5140047Z 'pid_type': 'persistent_blocked', 2026-02-21T18:24:06.5140255Z 'range_flattens': [False, False], 2026-02-21T18:24:06.5140411Z 'range_multi_buffers': [False, False], 2026-02-21T18:24:06.5140569Z 'range_num_stages': [0, 0], 2026-02-21T18:24:06.5140714Z 'range_unroll_factors': [0, 0], 2026-02-21T18:24:06.5140867Z 'range_warp_specializes': [], 2026-02-21T18:24:06.5141006Z 'waves_per_eu': 2} 2026-02-21T18:24:06.5252217Z [195s] Fitting surrogate: 869 points, 869 targets 2026-02-21T18:24:06.8268258Z [195s] Generation 16 starting: 20 neighbors, 1 active search path(s) 2026-02-21T18:24:08.7357979Z Generation 16: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 8.4 configs/s 
2026-02-21T18:24:09.6878093Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 20/20 22.4 configs/s 2026-02-21T18:24:10.7058662Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1298.3 2026-02-21T18:24:10.7059246Z configs/s 2026-02-21T18:24:11.1461348Z [200s] Generation 16 complete: 2026-02-21T18:24:11.1461842Z error=6 2026-02-21T18:24:11.1461923Z ok=15 2026-02-21T18:24:11.1462006Z min=0.0620 2026-02-21T18:24:11.1462086Z mid=0.0768 2026-02-21T18:24:11.1462163Z max=1.0315 2026-02-21T18:24:11.1462253Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:24:11.1462399Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:24:11.1462535Z 'l2_groupings': [2], 2026-02-21T18:24:11.1462639Z 'load_eviction_policies': ['', ''], 2026-02-21T18:24:11.1462751Z 'loop_orders': [[1, 0]], 2026-02-21T18:24:11.1462859Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:24:11.1462962Z 'num_sm_multiplier': 2, 2026-02-21T18:24:11.1463061Z 'num_stages': 2, 2026-02-21T18:24:11.1463145Z 'num_warps': 4, 2026-02-21T18:24:11.1463250Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:24:11.1463372Z 'range_flattens': [False, False], 2026-02-21T18:24:11.1463491Z 'range_multi_buffers': [False, False], 2026-02-21T18:24:11.1463614Z 'range_num_stages': [0, 0], 2026-02-21T18:24:11.1463720Z 'range_unroll_factors': [0, 0], 2026-02-21T18:24:11.1463834Z 'range_warp_specializes': [], 2026-02-21T18:24:11.1463935Z 'waves_per_eu': 2} 2026-02-21T18:24:11.1552036Z [200s] Fitting surrogate: 890 points, 890 targets 2026-02-21T18:24:11.2785009Z [200s] Autotuning complete in 200.3s after searching 852 configs. 
2026-02-21T18:24:11.2785249Z One can hardcode the best config and skip autotuning with: 2026-02-21T18:24:11.2787475Z @helion.kernel(config=helion.Config(block_sizes=[128, 128, 64], indexing=['block_ptr', 'pointer', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=2, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, False], range_multi_buffers=[False, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T18:24:11.2789661Z 2026-02-21T18:24:11.2790210Z [200s] Code of selected kernel: /tmp/torchinductor_root/7n/c7naror3g26lmmowtivnwzkgkv32qi7t63dkpohovfgo6zqsr3rl.py 2026-02-21T18:24:11.2871921Z from __future__ import annotations 2026-02-21T18:24:11.2872102Z 2026-02-21T18:24:11.2872156Z import torch 2026-02-21T18:24:11.2872271Z import helion 2026-02-21T18:24:11.2872377Z import triton 2026-02-21T18:24:11.2872494Z import triton.language as tl 2026-02-21T18:24:11.2872693Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T18:24:11.2872850Z 2026-02-21T18:24:11.2872908Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T18:24:11.2873058Z _BLOCK_SIZE_0 = tl.constexpr(128) 2026-02-21T18:24:11.2873201Z _BLOCK_SIZE_2 = tl.constexpr(64) 2026-02-21T18:24:11.2873294Z 2026-02-21T18:24:11.2873334Z @triton.jit 2026-02-21T18:24:11.2873474Z def _helion_matmul(x, y, out, _NUM_SM: tl.constexpr): 2026-02-21T18:24:11.2873704Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:24:11.2873955Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:24:11.2874317Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:24:11.2874492Z # src[matmul.py:63-67]: ... 
2026-02-21T18:24:11.2874689Z total_pids = tl.cdiv(2048, _BLOCK_SIZE_1) * tl.cdiv(4096, _BLOCK_SIZE_0) 2026-02-21T18:24:11.2875047Z for virtual_pid in tl.range(tl.program_id(0), total_pids, _NUM_SM * 2, disallow_acc_multi_buffer=True, flatten=False): 2026-02-21T18:24:11.2875382Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:24:11.2875590Z num_pid_m = tl.cdiv(2048, _BLOCK_SIZE_1) 2026-02-21T18:24:11.2875765Z num_pid_n = tl.cdiv(4096, _BLOCK_SIZE_0) 2026-02-21T18:24:11.2875923Z inner_2d_pid = virtual_pid 2026-02-21T18:24:11.2876080Z num_pid_in_group = 2 * num_pid_n 2026-02-21T18:24:11.2876246Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T18:24:11.2876425Z first_pid_m = group_id * 2 2026-02-21T18:24:11.2876588Z group_size_m = min(num_pid_m - first_pid_m, 2) 2026-02-21T18:24:11.2876810Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T18:24:11.2877096Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T18:24:11.2877290Z offset_1 = pid_0 * _BLOCK_SIZE_1 2026-02-21T18:24:11.2877507Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T18:24:11.2877708Z offset_0 = pid_1 * _BLOCK_SIZE_0 2026-02-21T18:24:11.2877914Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:24:11.2878216Z acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 0.0, tl.float32) 2026-02-21T18:24:11.2878410Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:24:11.2878636Z # src[matmul.py:66]: acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n]) 2026-02-21T18:24:11.2878934Z for offset_2 in tl.range(0, 2048, _BLOCK_SIZE_2, disallow_acc_multi_buffer=True, flatten=False): 2026-02-21T18:24:11.2879212Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_2).to(tl.int32) 2026-02-21T18:24:11.2879400Z acc_copy = acc 2026-02-21T18:24:11.2879523Z acc_copy_0 = acc_copy 2026-02-21T18:24:11.2879720Z # src[matmul.py:66]: acc = torch.addmm(acc, x[tile_m, tile_k], 
y[tile_k, tile_n]) 2026-02-21T18:24:11.2880129Z load = tl.load(tl.make_block_ptr(x, [4096, 2048], [2048, 1], [offset_0, offset_2], [_BLOCK_SIZE_0, _BLOCK_SIZE_2], [1, 0]), boundary_check=[0, 1], padding_option='zero') 2026-02-21T18:24:11.2880531Z load_1 = tl.load(y + (indices_2[:, None] * 1 + indices_1[None, :] * 2048), None) 2026-02-21T18:24:11.2880885Z acc = tl.dot(tl.cast(load, tl.float16), tl.cast(load_1, tl.float16), acc=acc_copy_0, input_precision='tf32', out_dtype=tl.float32) 2026-02-21T18:24:11.2881255Z # src[matmul.py:67]: out[tile_m, tile_n] = epilogue(acc, (tile_m, tile_n)) 2026-02-21T18:24:11.2881497Z v_0 = tl.cast(acc, tl.float16) 2026-02-21T18:24:11.2881798Z tl.store(tl.make_block_ptr(out, [4096, 2048], [2048, 1], [offset_0, offset_1], [_BLOCK_SIZE_0, _BLOCK_SIZE_1], [1, 0]), v_0, boundary_check=[0, 1]) 2026-02-21T18:24:11.2882059Z 2026-02-21T18:24:11.2882275Z def matmul(x: Tensor, y: Tensor, epilogue: Callable[[Tensor, tuple[Tensor, ...]], Tensor]=lambda acc, tile: acc, *, _launcher=_default_launcher): 2026-02-21T18:24:11.2882575Z """ 2026-02-21T18:24:11.2882747Z Performs matrix multiplication of x and y with an optional epilogue function. 2026-02-21T18:24:11.2882954Z Args: 2026-02-21T18:24:11.2883067Z x (Tensor): Left matrix of shape [m, k]. 2026-02-21T18:24:11.2883236Z y (Tensor): Right matrix of shape [k, n]. 2026-02-21T18:24:11.2883461Z epilogue (Callable, optional): Function applied to the accumulator and tile indices 2026-02-21T18:24:11.2883714Z after the matmul. Defaults to identity (no change). 2026-02-21T18:24:11.2883875Z Returns: 2026-02-21T18:24:11.2884021Z Tensor: Resulting matrix of shape [m, n]. 
2026-02-21T18:24:11.2884165Z """ 2026-02-21T18:24:11.2884268Z # src[matmul.py:57]: m, k = x.size() 2026-02-21T18:24:11.2884399Z m, k = x.size() 2026-02-21T18:24:11.2884514Z # src[matmul.py:58]: k2, n = y.size() 2026-02-21T18:24:11.2884646Z k2, n = y.size() 2026-02-21T18:24:11.2884801Z # src[matmul.py:59]: assert k == k2, f"size mismatch {k} != {k2}" 2026-02-21T18:24:11.2884994Z assert k == k2, f'size mismatch {k} != {k2}' 2026-02-21T18:24:11.2885151Z # src[matmul.py:60]: out = torch.empty( 2026-02-21T18:24:11.2885379Z # src[matmul.py:61]: [m, n], dtype=torch.promote_types(x.dtype, y.dtype), device=x.device 2026-02-21T18:24:11.2885598Z # src[matmul.py:62]: ) 2026-02-21T18:24:11.2885802Z out = torch.empty([m, n], dtype=torch.promote_types(x.dtype, y.dtype), device=x.device) 2026-02-21T18:24:11.2886054Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:24:11.2886238Z _NUM_SM = helion.runtime.get_num_sm(x.device) 2026-02-21T18:24:11.2886451Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:24:11.2886669Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:24:11.2886879Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:24:11.2887037Z # src[matmul.py:63-67]: ... 
2026-02-21T18:24:11.2887305Z _launcher(_helion_matmul, (_NUM_SM * 2,), x, y, out, _NUM_SM, num_warps=4, num_stages=2, waves_per_eu=2, matrix_instr_nonkdim=16) 2026-02-21T18:24:11.2887590Z # src[matmul.py:68]: return out 2026-02-21T18:24:11.2887716Z return out 2026-02-21T18:24:40.1272495Z WARNING:tritonbench.utils.triton_op:Completed input ID 2: 2026-02-21T18:24:40.1273054Z (M, N, K) 2026-02-21T18:24:40.1273278Z ------------------ 2026-02-21T18:24:40.1273529Z (4096, 2048, 2048) 2026-02-21T18:24:40.1273677Z 2026-02-21T18:24:40.1276881Z 25%|██▌ | 2/8 [11:22<34:07, 341.23s/it]WARNING:tritonbench.utils.triton_op:Running input ID 3: 2026-02-21T18:24:40.1277398Z (M, N, K) 2026-02-21T18:24:40.1277626Z ------------------ 2026-02-21T18:24:40.1277839Z (2048, 4096, 2048) 2026-02-21T18:24:40.1278245Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T18:25:21.3522670Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T18:25:56.0161757Z INFO:tritonbench.utils.triton_op:Took 86.16ms to get benchmark function for pt2_triton_matmul 2026-02-21T18:26:33.5182666Z WARNING:__main__:Input tensor metadata: 2026-02-21T18:26:33.5184642Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T18:26:33.5184918Z 'dtype': 'torch.float16', 2026-02-21T18:26:33.5185107Z 'shape': (2048, 2048), 2026-02-21T18:26:33.5185706Z 'stride': (2048, 1)}, 2026-02-21T18:26:33.5185889Z { 'device': 'cuda:0', 2026-02-21T18:26:33.5186060Z 'dtype': 'torch.float16', 2026-02-21T18:26:33.5186334Z 'shape': (2048, 4096), 2026-02-21T18:26:33.5186513Z 'stride': (1, 2048)}, 2026-02-21T18:26:33.5186685Z None), 2026-02-21T18:26:33.5186819Z 'kwargs': {}} 2026-02-21T18:26:33.5203203Z INFO:tritonbench.utils.triton_op:Took 2.57ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T18:26:33.5852131Z [0s] Autotune random seed: 2171071898 2026-02-21T18:26:33.5985769Z [0s] Starting LFBOPatternSearch with 
initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T18:26:38.7970717Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 7.3 configs/s 2026-02-21T18:26:50.1367586Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T18:26:50.1369840Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T18:26:50.1370782Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:26:50.1371619Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:26:50.1372414Z #blocked3 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T18:26:50.1373009Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T18:26:50.1373489Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T18:26:50.1374040Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:26:50.1374787Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:26:50.1375506Z %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf16, #blocked> 2026-02-21T18:26:50.1375828Z %cst_0 = arith.constant dense<2048> : tensor<1x256xi64, #blocked> 2026-02-21T18:26:50.1376123Z %cst_1 = arith.constant dense<0> 
: tensor<1x256xi64, #blocked> 2026-02-21T18:26:50.1376407Z %cst_2 = arith.constant dense<0> : tensor<4x1xi64, #blocked1> 2026-02-21T18:26:50.1376692Z %cst_3 = arith.constant dense<2048> : tensor<4x1xi64, #blocked1> 2026-02-21T18:26:50.1376938Z %c256_i32 = arith.constant 256 : i32 2026-02-21T18:26:50.1377137Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T18:26:50.1377332Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:26:50.1377564Z %cst_4 = arith.constant dense<4096> : tensor<4x1xi32, #blocked1> 2026-02-21T18:26:50.1377847Z %cst_5 = arith.constant dense<2048> : tensor<1x2xi32, #blocked2> 2026-02-21T18:26:50.1378156Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<4x2xf32, #blocked2> 2026-02-21T18:26:50.1378413Z %c4_i32 = arith.constant 4 : i32 2026-02-21T18:26:50.1378593Z %c2_i32 = arith.constant 2 : i32 2026-02-21T18:26:50.1378773Z %0 = tt.get_program_id x : i32 2026-02-21T18:26:50.1378963Z %1 = arith.remsi %0, %c2048_i32 : i32 2026-02-21T18:26:50.1379154Z %2 = arith.divsi %0, %c2048_i32 : i32 2026-02-21T18:26:50.1379337Z %3 = arith.muli %1, %c2_i32 : i32 2026-02-21T18:26:50.1379590Z %4 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked3> 2026-02-21T18:26:50.1379889Z %5 = tt.splat %3 : i32 -> tensor<2xi32, #blocked3> 2026-02-21T18:26:50.1380220Z %6 = arith.addi %5, %4 : tensor<2xi32, #blocked3> 2026-02-21T18:26:50.1380424Z %7 = arith.muli %2, %c4_i32 : i32 2026-02-21T18:26:50.1380742Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked3> 2026-02-21T18:26:50.1381028Z %9 = tt.splat %7 : i32 -> tensor<4xi32, #blocked3> 2026-02-21T18:26:50.1381257Z %10 = arith.addi %9, %8 : tensor<4xi32, #blocked3> 2026-02-21T18:26:50.1381541Z %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked3> 2026-02-21T18:26:50.1381817Z %12 = arith.extsi %7 : i32 to i64 2026-02-21T18:26:50.1382072Z %13 = tt.splat %arg0 : !tt.ptr -> tensor<4x256x!tt.ptr, #blocked> 2026-02-21T18:26:50.1382357Z %14 = tt.splat %12 
: i64 -> tensor<4xi64, #blocked3> 2026-02-21T18:26:50.1382635Z %15 = arith.extsi %8 : tensor<4xi32, #blocked3> to tensor<4xi64, #blocked3> 2026-02-21T18:26:50.1382867Z %16 = arith.addi %14, %15 : tensor<4xi64, #blocked3> 2026-02-21T18:26:50.1383149Z %17 = ttg.convert_layout %16 : tensor<4xi64, #blocked3> -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:26:50.1383556Z %18 = tt.expand_dims %17 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi64, #blocked4> 2026-02-21T18:26:50.1383923Z %19 = ttg.convert_layout %18 : tensor<4x1xi64, #blocked4> -> tensor<4x1xi64, #blocked1> 2026-02-21T18:26:50.1384171Z %20 = arith.muli %19, %cst_3 : tensor<4x1xi64, #blocked1> 2026-02-21T18:26:50.1384402Z %21 = tt.broadcast %20 : tensor<4x1xi64, #blocked1> -> tensor<4x256xi64, #blocked1> 2026-02-21T18:26:50.1384685Z %22 = ttg.convert_layout %21 : tensor<4x256xi64, #blocked1> -> tensor<4x256xi64, #blocked> 2026-02-21T18:26:50.1384964Z %23 = arith.extsi %11 : tensor<256xi32, #blocked3> to tensor<256xi64, #blocked3> 2026-02-21T18:26:50.1385201Z %24 = arith.cmpi sge, %19, %cst_2 : tensor<4x1xi64, #blocked1> 2026-02-21T18:26:50.1385411Z %25 = arith.cmpi slt, %19, %cst_3 : tensor<4x1xi64, #blocked1> 2026-02-21T18:26:50.1385609Z %26 = arith.andi %24, %25 : tensor<4x1xi1, #blocked1> 2026-02-21T18:26:50.1385831Z %27 = tt.broadcast %26 : tensor<4x1xi1, #blocked1> -> tensor<4x256xi1, #blocked1> 2026-02-21T18:26:50.1386111Z %28 = ttg.convert_layout %27 : tensor<4x256xi1, #blocked1> -> tensor<4x256xi1, #blocked> 2026-02-21T18:26:50.1386468Z %29 = ttg.convert_layout %6 : tensor<2xi32, #blocked3> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:26:50.1386860Z %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x2xi32, #blocked5> 2026-02-21T18:26:50.1387205Z %31 = ttg.convert_layout %30 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #blocked2> 
2026-02-21T18:26:50.1387448Z %32 = arith.muli %31, %cst_5 : tensor<1x2xi32, #blocked2> 2026-02-21T18:26:50.1387676Z %33 = tt.broadcast %32 : tensor<1x2xi32, #blocked2> -> tensor<256x2xi32, #blocked2> 2026-02-21T18:26:50.1387933Z %34 = tt.splat %arg1 : !tt.ptr -> tensor<256x2x!tt.ptr, #blocked2> 2026-02-21T18:26:50.1388343Z %35 = scf.for %arg3 = %c0_i32 to %c2048_i32 step %c256_i32 iter_args(%arg4 = %cst_6) -> (tensor<4x2xf32, #blocked2>) : i32 { 2026-02-21T18:26:50.1388642Z %50 = tt.splat %arg3 : i32 -> tensor<256xi32, #blocked3> 2026-02-21T18:26:50.1388838Z %51 = arith.addi %50, %11 : tensor<256xi32, #blocked3> 2026-02-21T18:26:50.1389006Z %52 = arith.extsi %arg3 : i32 to i64 2026-02-21T18:26:50.1389171Z %53 = tt.splat %52 : i64 -> tensor<256xi64, #blocked3> 2026-02-21T18:26:50.1389354Z %54 = arith.addi %53, %23 : tensor<256xi64, #blocked3> 2026-02-21T18:26:50.1389639Z %55 = ttg.convert_layout %54 : tensor<256xi64, #blocked3> -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:26:50.1390052Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi64, #blocked5> 2026-02-21T18:26:50.1390413Z %57 = ttg.convert_layout %56 : tensor<1x256xi64, #blocked5> -> tensor<1x256xi64, #blocked> 2026-02-21T18:26:50.1390722Z %58 = tt.broadcast %57 : tensor<1x256xi64, #blocked> -> tensor<4x256xi64, #blocked> 2026-02-21T18:26:50.1390979Z %59 = arith.addi %22, %58 : tensor<4x256xi64, #blocked> 2026-02-21T18:26:50.1391218Z %60 = tt.addptr %13, %59 : tensor<4x256x!tt.ptr, #blocked>, tensor<4x256xi64, #blocked> 2026-02-21T18:26:50.1391474Z %61 = arith.cmpi sge, %57, %cst_1 : tensor<1x256xi64, #blocked> 2026-02-21T18:26:50.1391702Z %62 = arith.cmpi slt, %57, %cst_0 : tensor<1x256xi64, #blocked> 2026-02-21T18:26:50.1391896Z %63 = arith.andi %61, %62 : tensor<1x256xi1, #blocked> 2026-02-21T18:26:50.1392119Z %64 = tt.broadcast %63 : tensor<1x256xi1, #blocked> -> tensor<4x256xi1, #blocked> 
2026-02-21T18:26:50.1392342Z %65 = arith.andi %28, %64 : tensor<4x256xi1, #blocked> 2026-02-21T18:26:50.1392536Z %66 = tt.load %60, %65, %cst : tensor<4x256x!tt.ptr, #blocked> 2026-02-21T18:26:50.1392841Z %67 = ttg.convert_layout %51 : tensor<256xi32, #blocked3> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:26:50.1393276Z %68 = tt.expand_dims %67 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<256x1xi32, #blocked4> 2026-02-21T18:26:50.1393617Z %69 = ttg.convert_layout %68 : tensor<256x1xi32, #blocked4> -> tensor<256x1xi32, #blocked1> 2026-02-21T18:26:50.1393851Z %70 = tt.broadcast %69 : tensor<256x1xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T18:26:50.1394083Z %71 = ttg.convert_layout %70 : tensor<256x2xi32, #blocked1> -> tensor<256x2xi32, #blocked2> 2026-02-21T18:26:50.1394284Z %72 = arith.addi %71, %33 : tensor<256x2xi32, #blocked2> 2026-02-21T18:26:50.1394476Z %73 = tt.addptr %34, %72 : tensor<256x2x!tt.ptr, #blocked2>, tensor<256x2xi32, #blocked2> 2026-02-21T18:26:50.1394679Z %74 = tt.load %73 : tensor<256x2x!tt.ptr, #blocked2> 2026-02-21T18:26:50.1394930Z %75 = ttg.convert_layout %66 : tensor<4x256xf16, #blocked> -> tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked2}>> 2026-02-21T18:26:50.1395276Z %76 = ttg.convert_layout %74 : tensor<256x2xf16, #blocked2> -> tensor<256x2xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked2}>> 2026-02-21T18:26:50.1395595Z %77 = ttg.convert_layout %arg4 : tensor<4x2xf32, #blocked2> -> tensor<4x2xf32, #blocked2> 2026-02-21T18:26:50.1396014Z %78 = tt.dot %75, %76, %77, inputPrecision = tf32 : tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked2}>> * tensor<256x2xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked2}>> -> tensor<4x2xf32, #blocked2> 2026-02-21T18:26:50.1396349Z scf.yield %78 : tensor<4x2xf32, #blocked2> 2026-02-21T18:26:50.1396475Z } {tt.flatten} 2026-02-21T18:26:50.1396613Z %36 = arith.truncf %35 : tensor<4x2xf32, 
#blocked2> to tensor<4x2xf16, #blocked2> 2026-02-21T18:26:50.1396881Z %37 = ttg.convert_layout %10 : tensor<4xi32, #blocked3> -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:26:50.1397199Z %38 = tt.expand_dims %37 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi32, #blocked4> 2026-02-21T18:26:50.1397481Z %39 = ttg.convert_layout %38 : tensor<4x1xi32, #blocked4> -> tensor<4x1xi32, #blocked1> 2026-02-21T18:26:50.1397679Z %40 = arith.muli %39, %cst_4 : tensor<4x1xi32, #blocked1> 2026-02-21T18:26:50.1397908Z %41 = ttg.convert_layout %6 : tensor<2xi32, #blocked3> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:26:50.1398224Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x2xi32, #blocked5> 2026-02-21T18:26:50.1398503Z %43 = ttg.convert_layout %42 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #blocked2> 2026-02-21T18:26:50.1398726Z %44 = tt.broadcast %40 : tensor<4x1xi32, #blocked1> -> tensor<4x2xi32, #blocked1> 2026-02-21T18:26:50.1398948Z %45 = ttg.convert_layout %44 : tensor<4x2xi32, #blocked1> -> tensor<4x2xi32, #blocked2> 2026-02-21T18:26:50.1399183Z %46 = tt.broadcast %43 : tensor<1x2xi32, #blocked2> -> tensor<4x2xi32, #blocked2> 2026-02-21T18:26:50.1399388Z %47 = arith.addi %45, %46 : tensor<4x2xi32, #blocked2> 2026-02-21T18:26:50.1399559Z %48 = tt.splat %arg2 : !tt.ptr -> tensor<4x2x!tt.ptr, #blocked2> 2026-02-21T18:26:50.1399774Z %49 = tt.addptr %48, %47 : tensor<4x2x!tt.ptr, #blocked2>, tensor<4x2xi32, #blocked2> 2026-02-21T18:26:50.1399966Z tt.store %49, %36 : tensor<4x2x!tt.ptr, #blocked2> 2026-02-21T18:26:50.1400094Z tt.return 2026-02-21T18:26:50.1400178Z } 2026-02-21T18:26:50.1400253Z } 2026-02-21T18:26:50.1400300Z 2026-02-21T18:26:50.1400330Z {-# 2026-02-21T18:26:50.1400409Z external_resources: { 2026-02-21T18:26:50.1400511Z mlir_reproducer: { 2026-02-21T18:26:50.1402803Z pipeline: 
"builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:26:50.1405131Z disable_threading: false, 2026-02-21T18:26:50.1405241Z verify_each: true 2026-02-21T18:26:50.1405349Z } 2026-02-21T18:26:50.1405422Z } 2026-02-21T18:26:50.1405491Z #-} 2026-02-21T18:26:50.1405773Z /tmp/torchinductor_root/el/celh3fmclx5poorzh6xpzhkw2rhupxlsj2hozjvtfgei3rt6nbqo.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:26:50.1406457Z /tmp/torchinductor_root/el/celh3fmclx5poorzh6xpzhkw2rhupxlsj2hozjvtfgei3rt6nbqo.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the 
reproducer above with Triton project.` 2026-02-21T18:26:50.1407014Z [16s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:26:50.1407740Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 2, 256], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T18:26:50.1408402Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:26:50.1408569Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:26:53.0154330Z Initial population exploring neighbors 76% ━━━━━━━━━━╸ 76/100 5.3 configs/s 2026-02-21T18:26:53.0156949Z WARNING:tritonbench.utils.triton_op:Completed input ID 3: 2026-02-21T18:26:53.0157391Z (M, N, K) 2026-02-21T18:26:53.0157615Z ------------------ 2026-02-21T18:26:53.0157863Z (2048, 4096, 2048) 2026-02-21T18:26:53.0158009Z 2026-02-21T18:26:53.0163607Z 38%|███▊ | 3/8 [13:34<20:30, 246.10s/it]WARNING:tritonbench.utils.triton_op:Running input ID 5: 2026-02-21T18:26:53.0164014Z (M, N, K) 2026-02-21T18:26:53.0164185Z ------------------ 2026-02-21T18:26:53.0164440Z (1024, 8192, 1024) 2026-02-21T18:26:53.0164936Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T18:27:40.9424634Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T18:28:23.4783890Z INFO:tritonbench.utils.triton_op:Took 87.08ms to get benchmark function for pt2_triton_matmul 2026-02-21T18:29:07.2205347Z WARNING:__main__:Input tensor metadata: 2026-02-21T18:29:07.2205830Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T18:29:07.2206168Z 'dtype': 'torch.float16', 2026-02-21T18:29:07.2206490Z 
'shape': (1024, 1024), 2026-02-21T18:29:07.2206791Z 'stride': (1024, 1)}, 2026-02-21T18:29:07.2207097Z { 'device': 'cuda:0', 2026-02-21T18:29:07.2207419Z 'dtype': 'torch.float16', 2026-02-21T18:29:07.2207712Z 'shape': (1024, 8192), 2026-02-21T18:29:07.2208000Z 'stride': (1, 1024)}, 2026-02-21T18:29:07.2208287Z None), 2026-02-21T18:29:07.2208519Z 'kwargs': {}} 2026-02-21T18:29:07.2213494Z INFO:tritonbench.utils.triton_op:Took 1.32ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T18:29:07.2886019Z [0s] Autotune random seed: 2171071898 2026-02-21T18:29:07.3018646Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T18:29:15.8570920Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━ 100/100 16.3 configs/s 2026-02-21T18:29:27.7351631Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 8.5 configs/s 2026-02-21T18:29:27.7359474Z [20s] Adaptive compile timeout: 30s (90% percentile=6.9s, bounds=[30.0s, 30s]) 2026-02-21T18:29:27.8656138Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 8310.2 configs/s 2026-02-21T18:29:28.2869386Z [20s] Initial random population of 100, 5 starting points: 2026-02-21T18:29:28.2869754Z error=10 2026-02-21T18:29:28.2869895Z ok=90 2026-02-21T18:29:28.2870055Z min=0.0711 2026-02-21T18:29:28.2870198Z mid=0.8445 2026-02-21T18:29:28.2870604Z max=224.2206 2026-02-21T18:29:28.2870766Z best={'block_sizes': [64, 256, 16], 2026-02-21T18:29:28.2871015Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:29:28.2871263Z 'l2_groupings': [16], 2026-02-21T18:29:28.2871444Z 'load_eviction_policies': ['', ''], 2026-02-21T18:29:28.2871657Z 'loop_orders': [[1, 0]], 2026-02-21T18:29:28.2871841Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:29:28.2872038Z 'num_sm_multiplier': 8, 2026-02-21T18:29:28.2872214Z 'num_stages': 1, 2026-02-21T18:29:28.2872375Z 'num_warps': 8, 2026-02-21T18:29:28.2872553Z 'pid_type': 
'persistent_blocked', 2026-02-21T18:29:28.2872767Z 'range_flattens': [None, None], 2026-02-21T18:29:28.2872970Z 'range_multi_buffers': [True, None], 2026-02-21T18:29:28.2873298Z 'range_num_stages': [0, 0], 2026-02-21T18:29:28.2873486Z 'range_unroll_factors': [0, 0], 2026-02-21T18:29:28.2873686Z 'range_warp_specializes': [], 2026-02-21T18:29:28.2873885Z 'waves_per_eu': 1} 2026-02-21T18:29:28.2912538Z [20s] Fitting surrogate: 100 points, 100 targets 2026-02-21T18:29:29.2859588Z [21s] Generation 1 starting: 94 neighbors, 5 active search path(s) 2026-02-21T18:29:39.5166807Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 5.6 configs/s 2026-02-21T18:29:44.2075716Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 21.4 configs/s 2026-02-21T18:29:44.7630861Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1345.7 2026-02-21T18:29:44.7631424Z configs/s 2026-02-21T18:29:45.1496662Z [37s] Generation 1 complete: 2026-02-21T18:29:45.1497010Z error=26 2026-02-21T18:29:45.1497252Z ok=73 2026-02-21T18:29:45.1497478Z min=0.0529 2026-02-21T18:29:45.1497692Z mid=0.1058 2026-02-21T18:29:45.1497894Z max=25.9747 2026-02-21T18:29:45.1498128Z best={'block_sizes': [64, 128, 64], 2026-02-21T18:29:45.1498890Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T18:29:45.1499273Z 'l2_groupings': [64], 2026-02-21T18:29:45.1499564Z 'load_eviction_policies': ['', ''], 2026-02-21T18:29:45.1499879Z 'loop_orders': [[1, 0]], 2026-02-21T18:29:45.1500169Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:29:45.1500470Z 'num_sm_multiplier': 128, 2026-02-21T18:29:45.1500739Z 'num_stages': 1, 2026-02-21T18:29:45.1500964Z 'num_warps': 4, 2026-02-21T18:29:45.1501234Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:29:45.1501562Z 'range_flattens': [False, False], 2026-02-21T18:29:45.1501876Z 'range_multi_buffers': [False, True], 2026-02-21T18:29:45.1502184Z 'range_num_stages': [0, 0], 2026-02-21T18:29:45.1502474Z 'range_unroll_factors': [0, 0], 
2026-02-21T18:29:45.1502766Z 'range_warp_specializes': [], 2026-02-21T18:29:45.1503049Z 'waves_per_eu': 2} 2026-02-21T18:29:45.1595400Z [37s] Fitting surrogate: 199 points, 199 targets 2026-02-21T18:29:46.5642402Z [39s] Generation 2 starting: 86 neighbors, 5 active search path(s) 2026-02-21T18:29:54.3264947Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89/89 21.4 configs/s 2026-02-21T18:29:58.8231150Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 89/89 20.0 configs/s 2026-02-21T18:30:02.0607378Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 291.5 2026-02-21T18:30:02.0607967Z configs/s 2026-02-21T18:30:02.5747296Z [55s] Generation 2 complete: 2026-02-21T18:30:02.5747613Z error=16 2026-02-21T18:30:02.5747821Z ok=76 2026-02-21T18:30:02.5748015Z min=0.0439 2026-02-21T18:30:02.5748335Z mid=0.0757 2026-02-21T18:30:02.5748524Z max=2.9902 2026-02-21T18:30:02.5748745Z best={'block_sizes': [64, 256, 64], 2026-02-21T18:30:02.5749106Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:30:02.5749487Z 'l2_groupings': [16], 2026-02-21T18:30:02.5749749Z 'load_eviction_policies': ['', ''], 2026-02-21T18:30:02.5750047Z 'loop_orders': [[1, 0]], 2026-02-21T18:30:02.5750335Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:30:02.5750617Z 'num_sm_multiplier': 16, 2026-02-21T18:30:02.5750885Z 'num_stages': 1, 2026-02-21T18:30:02.5751104Z 'num_warps': 8, 2026-02-21T18:30:02.5751351Z 'pid_type': 'persistent_blocked', 2026-02-21T18:30:02.5751648Z 'range_flattens': [False, False], 2026-02-21T18:30:02.5751950Z 'range_multi_buffers': [True, None], 2026-02-21T18:30:02.5752243Z 'range_num_stages': [0, 0], 2026-02-21T18:30:02.5752514Z 'range_unroll_factors': [0, 0], 2026-02-21T18:30:02.5752796Z 'range_warp_specializes': [], 2026-02-21T18:30:02.6275301Z 'waves_per_eu': 1} 2026-02-21T18:30:02.6275642Z [55s] Fitting surrogate: 291 points, 291 targets 2026-02-21T18:30:03.6266257Z [56s] Generation 3 starting: 93 neighbors, 5 active search path(s) 
2026-02-21T18:30:14.1675280Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 95/95 2.5 configs/s 2026-02-21T18:30:18.7321306Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 95/95 21.1 configs/s 2026-02-21T18:30:23.5377938Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 221.8 2026-02-21T18:30:23.5378562Z configs/s 2026-02-21T18:30:24.0979276Z [76s] Generation 3 complete: 2026-02-21T18:30:24.0979584Z error=21 2026-02-21T18:30:24.0979747Z ok=77 2026-02-21T18:30:24.0979908Z min=0.0440 2026-02-21T18:30:24.0980060Z mid=0.0604 2026-02-21T18:30:24.0980161Z max=2.3702 2026-02-21T18:30:24.0980343Z best={'block_sizes': [64, 256, 64], 2026-02-21T18:30:24.0980643Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:30:24.0980936Z 'l2_groupings': [16], 2026-02-21T18:30:24.0981155Z 'load_eviction_policies': ['', ''], 2026-02-21T18:30:24.0981396Z 'loop_orders': [[1, 0]], 2026-02-21T18:30:24.0981610Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:30:24.0981855Z 'num_sm_multiplier': 16, 2026-02-21T18:30:24.0982058Z 'num_stages': 1, 2026-02-21T18:30:24.0982237Z 'num_warps': 8, 2026-02-21T18:30:24.0982437Z 'pid_type': 'persistent_blocked', 2026-02-21T18:30:24.0982686Z 'range_flattens': [False, False], 2026-02-21T18:30:24.0995295Z 'range_multi_buffers': [True, None], 2026-02-21T18:30:24.0995496Z 'range_num_stages': [0, 0], 2026-02-21T18:30:24.0995664Z 'range_unroll_factors': [0, 0], 2026-02-21T18:30:24.0995839Z 'range_warp_specializes': [], 2026-02-21T18:30:24.0996000Z 'waves_per_eu': 1} 2026-02-21T18:30:24.1753442Z [76s] Fitting surrogate: 389 points, 389 targets 2026-02-21T18:30:25.0802074Z [77s] Generation 4 starting: 85 neighbors, 5 active search path(s) 2026-02-21T18:30:34.7634645Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 1.8 configs/s 2026-02-21T18:30:38.9649018Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 21.3 configs/s 2026-02-21T18:30:43.8644476Z Generation 4: verifying top 
configs 100% ━━━━━━━━━━━━━━ 1000/1000 197.4 2026-02-21T18:30:43.8645134Z configs/s 2026-02-21T18:30:44.4624752Z [97s] Generation 4 complete: 2026-02-21T18:30:44.4625101Z error=19 2026-02-21T18:30:44.4625326Z ok=71 2026-02-21T18:30:44.4625855Z min=0.0428 2026-02-21T18:30:44.4626065Z mid=0.0578 2026-02-21T18:30:44.4626259Z max=1.8088 2026-02-21T18:30:44.4626490Z best={'block_sizes': [128, 128, 32], 2026-02-21T18:30:44.4626863Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:30:44.4627227Z 'l2_groupings': [64], 2026-02-21T18:30:44.4627512Z 'load_eviction_policies': ['', ''], 2026-02-21T18:30:44.4627829Z 'loop_orders': [[1, 0]], 2026-02-21T18:30:44.4628107Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:30:44.4628484Z 'num_sm_multiplier': 4, 2026-02-21T18:30:44.4628714Z 'num_stages': 2, 2026-02-21T18:30:44.4628919Z 'num_warps': 8, 2026-02-21T18:30:44.4629145Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:30:44.4629446Z 'range_flattens': [False, False], 2026-02-21T18:30:44.4629715Z 'range_multi_buffers': [True, False], 2026-02-21T18:30:44.4629987Z 'range_num_stages': [0, 0], 2026-02-21T18:30:44.4630236Z 'range_unroll_factors': [0, 0], 2026-02-21T18:30:44.4630502Z 'range_warp_specializes': [], 2026-02-21T18:30:44.4630748Z 'waves_per_eu': 3} 2026-02-21T18:30:44.5494374Z [97s] Fitting surrogate: 479 points, 479 targets 2026-02-21T18:30:45.1128700Z [97s] Generation 5 starting: 51 neighbors, 3 active search path(s) 2026-02-21T18:30:54.1929170Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 51/51 0.7 configs/s 2026-02-21T18:30:57.0368840Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 51/51 18.9 configs/s 2026-02-21T18:31:00.0477451Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 313.7 2026-02-21T18:31:00.0478048Z configs/s 2026-02-21T18:31:00.5504543Z [113s] Generation 5 complete: 2026-02-21T18:31:00.5505534Z error=6 2026-02-21T18:31:00.5505744Z ok=48 2026-02-21T18:31:00.5505952Z min=0.0402 
2026-02-21T18:31:00.5506166Z mid=0.0554 2026-02-21T18:31:00.5506364Z max=0.4379 2026-02-21T18:31:00.5506751Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:31:00.5507148Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:31:00.5507538Z 'l2_groupings': [64], 2026-02-21T18:31:00.5507817Z 'load_eviction_policies': ['', ''], 2026-02-21T18:31:00.5508125Z 'loop_orders': [[1, 0]], 2026-02-21T18:31:00.5508502Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:31:00.5508790Z 'num_sm_multiplier': 4, 2026-02-21T18:31:00.5509049Z 'num_stages': 2, 2026-02-21T18:31:00.5509282Z 'num_warps': 8, 2026-02-21T18:31:00.5509552Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:31:00.5509878Z 'range_flattens': [False, False], 2026-02-21T18:31:00.5510193Z 'range_multi_buffers': [True, False], 2026-02-21T18:31:00.5510501Z 'range_num_stages': [0, 0], 2026-02-21T18:31:00.5510788Z 'range_unroll_factors': [0, 0], 2026-02-21T18:31:00.5511093Z 'range_warp_specializes': [], 2026-02-21T18:31:00.5511370Z 'waves_per_eu': 3} 2026-02-21T18:31:00.6051609Z [113s] Fitting surrogate: 533 points, 533 targets 2026-02-21T18:31:01.0775175Z [113s] Generation 6 starting: 37 neighbors, 2 active search path(s) 2026-02-21T18:31:04.6425915Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37/37 15.2 configs/s 2026-02-21T18:31:06.5286269Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 37/37 20.8 configs/s 2026-02-21T18:31:09.0951623Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 364.5 2026-02-21T18:31:09.0952235Z configs/s 2026-02-21T18:31:09.5935849Z [122s] Generation 6 complete: 2026-02-21T18:31:09.5936205Z error=8 2026-02-21T18:31:09.5936433Z ok=31 2026-02-21T18:31:09.5936638Z min=0.0419 2026-02-21T18:31:09.5936852Z mid=0.0519 2026-02-21T18:31:09.5937053Z max=0.4225 2026-02-21T18:31:09.5937286Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:31:09.5937689Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:31:09.5938060Z 'l2_groupings': [64], 
2026-02-21T18:31:09.5938351Z 'load_eviction_policies': ['', ''], 2026-02-21T18:31:09.5938662Z 'loop_orders': [[1, 0]], 2026-02-21T18:31:09.5939229Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:31:09.5939511Z 'num_sm_multiplier': 4, 2026-02-21T18:31:09.5939780Z 'num_stages': 2, 2026-02-21T18:31:09.5940009Z 'num_warps': 8, 2026-02-21T18:31:09.5940273Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:31:09.5940600Z 'range_flattens': [False, None], 2026-02-21T18:31:09.5940910Z 'range_multi_buffers': [True, False], 2026-02-21T18:31:09.5941216Z 'range_num_stages': [0, 0], 2026-02-21T18:31:09.5941501Z 'range_unroll_factors': [0, 0], 2026-02-21T18:31:09.5941797Z 'range_warp_specializes': [], 2026-02-21T18:31:09.5942082Z 'waves_per_eu': 3} 2026-02-21T18:31:09.6374062Z [122s] Fitting surrogate: 572 points, 572 targets 2026-02-21T18:31:10.1276845Z [122s] Generation 7 starting: 38 neighbors, 2 active search path(s) 2026-02-21T18:31:13.8749840Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38/38 13.9 configs/s 2026-02-21T18:31:16.1923501Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 38/38 17.5 configs/s 2026-02-21T18:31:17.9136323Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 530.1 2026-02-21T18:31:17.9136965Z configs/s 2026-02-21T18:31:18.3451838Z [131s] Generation 7 complete: 2026-02-21T18:31:18.3452056Z error=1 2026-02-21T18:31:18.3452209Z ok=39 2026-02-21T18:31:18.3452596Z min=0.0352 2026-02-21T18:31:18.3452861Z mid=0.0548 2026-02-21T18:31:18.3453065Z max=0.7244 2026-02-21T18:31:18.3453309Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:31:18.3453706Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:31:18.3454076Z 'l2_groupings': [64], 2026-02-21T18:31:18.3454354Z 'load_eviction_policies': ['', ''], 2026-02-21T18:31:18.3455010Z 'loop_orders': [[1, 0]], 2026-02-21T18:31:18.3455284Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:31:18.3455564Z 'num_sm_multiplier': 4, 2026-02-21T18:31:18.3455830Z 'num_stages': 2, 
2026-02-21T18:31:18.3456182Z 'num_warps': 4, 2026-02-21T18:31:18.3456455Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:31:18.3467310Z 'range_flattens': [False, None], 2026-02-21T18:31:18.3467540Z 'range_multi_buffers': [True, False], 2026-02-21T18:31:18.3467761Z 'range_num_stages': [0, 0], 2026-02-21T18:31:18.3467957Z 'range_unroll_factors': [0, 0], 2026-02-21T18:31:18.3468240Z 'range_warp_specializes': [], 2026-02-21T18:31:18.3468428Z 'waves_per_eu': 2} 2026-02-21T18:31:18.3735851Z [131s] Fitting surrogate: 612 points, 612 targets 2026-02-21T18:31:19.4213656Z [132s] Generation 8 starting: 43 neighbors, 2 active search path(s) 2026-02-21T18:31:22.1049563Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43/43 12.5 configs/s 2026-02-21T18:31:24.8380443Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 43/43 16.6 configs/s 2026-02-21T18:31:27.0289097Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 423.7 2026-02-21T18:31:27.0289721Z configs/s 2026-02-21T18:31:27.5415504Z [140s] Generation 8 complete: 2026-02-21T18:31:27.5415933Z ok=45 2026-02-21T18:31:27.5416126Z min=0.0412 2026-02-21T18:31:27.5416314Z mid=0.0556 2026-02-21T18:31:27.5416486Z max=1.0328 2026-02-21T18:31:27.5416683Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:31:27.5417006Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:31:27.5417322Z 'l2_groupings': [64], 2026-02-21T18:31:27.5417562Z 'load_eviction_policies': ['', ''], 2026-02-21T18:31:27.5417823Z 'loop_orders': [[1, 0]], 2026-02-21T18:31:27.5418066Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:31:27.5418311Z 'num_sm_multiplier': 4, 2026-02-21T18:31:27.5418534Z 'num_stages': 2, 2026-02-21T18:31:27.5418733Z 'num_warps': 4, 2026-02-21T18:31:27.5418973Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:31:27.5419252Z 'range_flattens': [False, None], 2026-02-21T18:31:27.5419519Z 'range_multi_buffers': [True, False], 2026-02-21T18:31:27.5419796Z 'range_num_stages': [0, 0], 
2026-02-21T18:31:27.5420037Z 'range_unroll_factors': [0, 0], 2026-02-21T18:31:27.5420301Z 'range_warp_specializes': [], 2026-02-21T18:31:27.5420531Z 'waves_per_eu': 2} 2026-02-21T18:31:27.5844520Z [140s] Fitting surrogate: 657 points, 657 targets 2026-02-21T18:31:28.0531919Z [140s] Generation 9 starting: 36 neighbors, 2 active search path(s) 2026-02-21T18:31:30.3799324Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 21.8 configs/s 2026-02-21T18:31:32.4450505Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 17.3 configs/s 2026-02-21T18:31:34.1473681Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 541.0 2026-02-21T18:31:34.1474100Z configs/s 2026-02-21T18:31:34.5644254Z [147s] Generation 9 complete: 2026-02-21T18:31:34.5644585Z error=3 2026-02-21T18:31:34.5644790Z ok=35 2026-02-21T18:31:34.5644999Z min=0.0353 2026-02-21T18:31:34.5645210Z mid=0.0446 2026-02-21T18:31:34.5645641Z max=0.4166 2026-02-21T18:31:34.5645890Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:31:34.5646278Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:31:34.5646659Z 'l2_groupings': [64], 2026-02-21T18:31:34.5646936Z 'load_eviction_policies': ['', ''], 2026-02-21T18:31:34.5647243Z 'loop_orders': [[1, 0]], 2026-02-21T18:31:34.5647517Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:31:34.5647799Z 'num_sm_multiplier': 4, 2026-02-21T18:31:34.5648059Z 'num_stages': 2, 2026-02-21T18:31:34.5648287Z 'num_warps': 4, 2026-02-21T18:31:34.5648544Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:31:34.5648866Z 'range_flattens': [False, None], 2026-02-21T18:31:34.5649172Z 'range_multi_buffers': [True, False], 2026-02-21T18:31:34.5649476Z 'range_num_stages': [0, 0], 2026-02-21T18:31:34.5649767Z 'range_unroll_factors': [0, 0], 2026-02-21T18:31:34.5650062Z 'range_warp_specializes': [], 2026-02-21T18:31:34.5650333Z 'waves_per_eu': 2} 2026-02-21T18:31:34.5973909Z [147s] Fitting surrogate: 695 points, 695 targets 2026-02-21T18:31:34.7175289Z 
[147s] Autotuning complete in 147.4s after searching 672 configs. 2026-02-21T18:31:34.7175836Z One can hardcode the best config and skip autotuning with: 2026-02-21T18:31:34.7177893Z @helion.kernel(config=helion.Config(block_sizes=[128, 128, 64], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[64], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=4, num_stages=2, num_warps=4, pid_type='persistent_interleaved', range_flattens=[False, None], range_multi_buffers=[True, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T18:31:34.7179896Z 2026-02-21T18:31:34.7180364Z [147s] Code of selected kernel: /tmp/torchinductor_root/oc/coccj64qfolnsmqfrdasgoihgjtijmohnkwjz6yqj737n5vmcmwv.py 2026-02-21T18:31:34.7264529Z from __future__ import annotations 2026-02-21T18:31:34.7264636Z 2026-02-21T18:31:34.7264680Z import torch 2026-02-21T18:31:34.7264921Z import helion 2026-02-21T18:31:34.7265059Z import triton 2026-02-21T18:31:34.7265147Z import triton.language as tl 2026-02-21T18:31:34.7265297Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T18:31:34.7265416Z 2026-02-21T18:31:34.7265463Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T18:31:34.7265575Z _BLOCK_SIZE_0 = tl.constexpr(128) 2026-02-21T18:31:34.7265686Z _BLOCK_SIZE_2 = tl.constexpr(64) 2026-02-21T18:31:34.7265755Z 2026-02-21T18:31:34.7265787Z @triton.jit 2026-02-21T18:31:34.7265895Z def _helion_matmul(x, y, out, _NUM_SM: tl.constexpr): 2026-02-21T18:31:34.7266061Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:31:34.7266254Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:31:34.7266433Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:31:34.7266565Z # src[matmul.py:63-67]: ... 
2026-02-21T18:31:34.7266719Z total_pids = tl.cdiv(8192, _BLOCK_SIZE_1) * tl.cdiv(1024, _BLOCK_SIZE_0) 2026-02-21T18:31:34.7266989Z for virtual_pid in tl.range(tl.program_id(0), total_pids, _NUM_SM * 4, disallow_acc_multi_buffer=False, flatten=False): 2026-02-21T18:31:34.7267243Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:31:34.7267393Z num_pid_m = tl.cdiv(8192, _BLOCK_SIZE_1) 2026-02-21T18:31:34.7267538Z num_pid_n = tl.cdiv(1024, _BLOCK_SIZE_0) 2026-02-21T18:31:34.7267664Z inner_2d_pid = virtual_pid 2026-02-21T18:31:34.7267780Z num_pid_in_group = 64 * num_pid_n 2026-02-21T18:31:34.7267910Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T18:31:34.7268039Z first_pid_m = group_id * 64 2026-02-21T18:31:34.7268247Z group_size_m = min(num_pid_m - first_pid_m, 64) 2026-02-21T18:31:34.7268473Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T18:31:34.7268685Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T18:31:34.7268830Z offset_1 = pid_0 * _BLOCK_SIZE_1 2026-02-21T18:31:34.7268985Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T18:31:34.7269140Z offset_0 = pid_1 * _BLOCK_SIZE_0 2026-02-21T18:31:34.7269284Z indices_0 = (offset_0 + tl.arange(0, _BLOCK_SIZE_0)).to(tl.int32) 2026-02-21T18:31:34.7269478Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:31:34.7269668Z acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 0.0, tl.float32) 2026-02-21T18:31:34.7269829Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:31:34.7270011Z # src[matmul.py:66]: acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n]) 2026-02-21T18:31:34.7270234Z for offset_2 in tl.range(0, 1024, _BLOCK_SIZE_2, disallow_acc_multi_buffer=True): 2026-02-21T18:31:34.7270443Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_2).to(tl.int32) 2026-02-21T18:31:34.7270593Z acc_copy = acc 2026-02-21T18:31:34.7270699Z acc_copy_0 = acc_copy 
2026-02-21T18:31:34.7270863Z # src[matmul.py:66]: acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n]) 2026-02-21T18:31:34.7271080Z load = tl.load(x + (indices_0[:, None] * 1024 + indices_2[None, :] * 1), None) 2026-02-21T18:31:34.7271292Z load_1 = tl.load(y + (indices_2[:, None] * 1 + indices_1[None, :] * 1024), None) 2026-02-21T18:31:34.7271580Z acc = tl.dot(tl.cast(load, tl.float16), tl.cast(load_1, tl.float16), acc=acc_copy_0, input_precision='tf32', out_dtype=tl.float32) 2026-02-21T18:31:34.7271863Z # src[matmul.py:67]: out[tile_m, tile_n] = epilogue(acc, (tile_m, tile_n)) 2026-02-21T18:31:34.7272024Z v_0 = tl.cast(acc, tl.float16) 2026-02-21T18:31:34.7272280Z tl.store(tl.make_block_ptr(out, [1024, 8192], [8192, 1], [offset_0, offset_1], [_BLOCK_SIZE_0, _BLOCK_SIZE_1], [1, 0]), v_0, boundary_check=[0, 1]) 2026-02-21T18:31:34.7272495Z 2026-02-21T18:31:34.7272697Z def matmul(x: Tensor, y: Tensor, epilogue: Callable[[Tensor, tuple[Tensor, ...]], Tensor]=lambda acc, tile: acc, *, _launcher=_default_launcher): 2026-02-21T18:31:34.7272963Z """ 2026-02-21T18:31:34.7273104Z Performs matrix multiplication of x and y with an optional epilogue function. 2026-02-21T18:31:34.7273271Z Args: 2026-02-21T18:31:34.7273363Z x (Tensor): Left matrix of shape [m, k]. 2026-02-21T18:31:34.7273499Z y (Tensor): Right matrix of shape [k, n]. 2026-02-21T18:31:34.7273686Z epilogue (Callable, optional): Function applied to the accumulator and tile indices 2026-02-21T18:31:34.7273898Z after the matmul. Defaults to identity (no change). 2026-02-21T18:31:34.7274031Z Returns: 2026-02-21T18:31:34.7274128Z Tensor: Resulting matrix of shape [m, n]. 
2026-02-21T18:31:34.7274241Z """ 2026-02-21T18:31:34.7274325Z # src[matmul.py:57]: m, k = x.size() 2026-02-21T18:31:34.7274434Z m, k = x.size() 2026-02-21T18:31:34.7274529Z # src[matmul.py:58]: k2, n = y.size() 2026-02-21T18:31:34.7274645Z k2, n = y.size() 2026-02-21T18:31:34.7274771Z # src[matmul.py:59]: assert k == k2, f"size mismatch {k} != {k2}" 2026-02-21T18:31:34.7274927Z assert k == k2, f'size mismatch {k} != {k2}' 2026-02-21T18:31:34.7275054Z # src[matmul.py:60]: out = torch.empty( 2026-02-21T18:31:34.7275237Z # src[matmul.py:61]: [m, n], dtype=torch.promote_types(x.dtype, y.dtype), device=x.device 2026-02-21T18:31:34.7275417Z # src[matmul.py:62]: ) 2026-02-21T18:31:34.7275581Z out = torch.empty([m, n], dtype=torch.promote_types(x.dtype, y.dtype), device=x.device) 2026-02-21T18:31:34.7275788Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:31:34.7275942Z _NUM_SM = helion.runtime.get_num_sm(x.device) 2026-02-21T18:31:34.7276108Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:31:34.7276304Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:31:34.7276480Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:31:34.7276610Z # src[matmul.py:63-67]: ... 
2026-02-21T18:31:34.7276832Z _launcher(_helion_matmul, (_NUM_SM * 4,), x, y, out, _NUM_SM, num_warps=4, num_stages=2, waves_per_eu=2, matrix_instr_nonkdim=16) 2026-02-21T18:31:34.7277060Z # src[matmul.py:68]: return out 2026-02-21T18:31:34.7277166Z return out 2026-02-21T18:32:02.3742330Z WARNING:tritonbench.utils.triton_op:Completed input ID 5: 2026-02-21T18:32:02.3742830Z (M, N, K) 2026-02-21T18:32:02.3743066Z ------------------ 2026-02-21T18:32:02.3743313Z (1024, 8192, 1024) 2026-02-21T18:32:02.3743453Z 2026-02-21T18:32:02.3749358Z 50%|█████ | 4/8 [18:44<18:04, 271.07s/it]WARNING:tritonbench.utils.triton_op:Running input ID 6: 2026-02-21T18:32:02.3749946Z (M, N, K) 2026-02-21T18:32:02.3750193Z ------------------ 2026-02-21T18:32:02.3750428Z (8192, 2048, 2048) 2026-02-21T18:32:02.3750799Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T18:32:38.5049331Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T18:33:16.3287283Z INFO:tritonbench.utils.triton_op:Took 92.82ms to get benchmark function for pt2_triton_matmul 2026-02-21T18:33:56.3366107Z WARNING:__main__:Input tensor metadata: 2026-02-21T18:33:56.3366538Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T18:33:56.3366864Z 'dtype': 'torch.float16', 2026-02-21T18:33:56.3367190Z 'shape': (8192, 2048), 2026-02-21T18:33:56.3367481Z 'stride': (2048, 1)}, 2026-02-21T18:33:56.3367770Z { 'device': 'cuda:0', 2026-02-21T18:33:56.3368061Z 'dtype': 'torch.float16', 2026-02-21T18:33:56.3368291Z 'shape': (2048, 2048), 2026-02-21T18:33:56.3368592Z 'stride': (1, 2048)}, 2026-02-21T18:33:56.3368863Z None), 2026-02-21T18:33:56.3369096Z 'kwargs': {}} 2026-02-21T18:33:56.3385989Z INFO:tritonbench.utils.triton_op:Took 2.36ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T18:33:56.4096839Z [0s] Autotune random seed: 2171071898 2026-02-21T18:33:56.4234210Z [0s] Starting LFBOPatternSearch with 
initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T18:34:29.4103905Z [32s] Timeout after 30s compiling Config(block_sizes=[4096, 4, 16], indexing=['block_ptr', 'block_ptr', 'block_ptr'], l2_groupings=[2], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=32, num_sm_multiplier=8, num_stages=3, num_warps=16, pid_type='persistent_blocked', range_flattens=[True, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3) 2026-02-21T18:34:29.4116563Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 100/100 0.2 configs/s 2026-02-21T18:34:43.0403604Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T18:34:43.0410961Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T18:34:43.0411867Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:34:43.0412666Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:34:43.0413410Z #blocked3 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T18:34:43.0414141Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T18:34:43.0415286Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T18:34:43.0416001Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} 
{ 2026-02-21T18:34:43.0416972Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:34:43.0417767Z %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf16, #blocked> 2026-02-21T18:34:43.0418187Z %cst_0 = arith.constant dense<2048> : tensor<1x256xi64, #blocked> 2026-02-21T18:34:43.0418570Z %cst_1 = arith.constant dense<0> : tensor<1x256xi64, #blocked> 2026-02-21T18:34:43.0418942Z %cst_2 = arith.constant dense<8192> : tensor<4x1xi64, #blocked1> 2026-02-21T18:34:43.0419313Z %cst_3 = arith.constant dense<0> : tensor<4x1xi64, #blocked1> 2026-02-21T18:34:43.0419676Z %cst_4 = arith.constant dense<2048> : tensor<4x1xi64, #blocked1> 2026-02-21T18:34:43.0419997Z %c256_i32 = arith.constant 256 : i32 2026-02-21T18:34:43.0420258Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T18:34:43.0420512Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:34:43.0420809Z %cst_5 = arith.constant dense<2048> : tensor<4x1xi32, #blocked1> 2026-02-21T18:34:43.0421173Z %cst_6 = arith.constant dense<2048> : tensor<1x2xi32, #blocked2> 2026-02-21T18:34:43.0421564Z %cst_7 = arith.constant dense<0.000000e+00> : tensor<4x2xf32, #blocked2> 2026-02-21T18:34:43.0421899Z %c4_i32 = arith.constant 4 : i32 2026-02-21T18:34:43.0422134Z %c2_i32 = arith.constant 2 : i32 2026-02-21T18:34:43.0422373Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T18:34:43.0422625Z %0 = tt.get_program_id x : i32 2026-02-21T18:34:43.0422866Z %1 = arith.remsi %0, %c1024_i32 : i32 2026-02-21T18:34:43.0423112Z %2 = arith.divsi %0, %c1024_i32 : i32 2026-02-21T18:34:43.0423355Z %3 = arith.muli %1, %c2_i32 : i32 2026-02-21T18:34:43.0423782Z %4 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked3> 2026-02-21T18:34:43.0424197Z %5 = tt.splat %3 : i32 -> tensor<2xi32, #blocked3> 2026-02-21T18:34:43.0424415Z %6 = arith.addi %5, %4 : tensor<2xi32, #blocked3> 
2026-02-21T18:34:43.0424608Z %7 = arith.muli %2, %c4_i32 : i32 2026-02-21T18:34:43.0424830Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked3> 2026-02-21T18:34:43.0425091Z %9 = tt.splat %7 : i32 -> tensor<4xi32, #blocked3> 2026-02-21T18:34:43.0425302Z %10 = arith.addi %9, %8 : tensor<4xi32, #blocked3> 2026-02-21T18:34:43.0425559Z %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked3> 2026-02-21T18:34:43.0425817Z %12 = arith.extsi %7 : i32 to i64 2026-02-21T18:34:43.0426047Z %13 = tt.splat %arg0 : !tt.ptr -> tensor<4x256x!tt.ptr, #blocked> 2026-02-21T18:34:43.0426311Z %14 = tt.splat %12 : i64 -> tensor<4xi64, #blocked3> 2026-02-21T18:34:43.0426565Z %15 = arith.extsi %8 : tensor<4xi32, #blocked3> to tensor<4xi64, #blocked3> 2026-02-21T18:34:43.0426829Z %16 = arith.addi %14, %15 : tensor<4xi64, #blocked3> 2026-02-21T18:34:43.0427179Z %17 = ttg.convert_layout %16 : tensor<4xi64, #blocked3> -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:34:43.0427669Z %18 = tt.expand_dims %17 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi64, #blocked4> 2026-02-21T18:34:43.0428093Z %19 = ttg.convert_layout %18 : tensor<4x1xi64, #blocked4> -> tensor<4x1xi64, #blocked1> 2026-02-21T18:34:43.0428506Z %20 = arith.muli %19, %cst_4 : tensor<4x1xi64, #blocked1> 2026-02-21T18:34:43.0428786Z %21 = tt.broadcast %20 : tensor<4x1xi64, #blocked1> -> tensor<4x256xi64, #blocked1> 2026-02-21T18:34:43.0429170Z %22 = ttg.convert_layout %21 : tensor<4x256xi64, #blocked1> -> tensor<4x256xi64, #blocked> 2026-02-21T18:34:43.0429505Z %23 = arith.extsi %11 : tensor<256xi32, #blocked3> to tensor<256xi64, #blocked3> 2026-02-21T18:34:43.0429829Z %24 = arith.cmpi sge, %19, %cst_3 : tensor<4x1xi64, #blocked1> 2026-02-21T18:34:43.0430084Z %25 = arith.cmpi slt, %19, %cst_2 : tensor<4x1xi64, #blocked1> 2026-02-21T18:34:43.0430327Z %26 = arith.andi %24, %25 : tensor<4x1xi1, #blocked1> 
2026-02-21T18:34:43.0430594Z %27 = tt.broadcast %26 : tensor<4x1xi1, #blocked1> -> tensor<4x256xi1, #blocked1> 2026-02-21T18:34:43.0430927Z %28 = ttg.convert_layout %27 : tensor<4x256xi1, #blocked1> -> tensor<4x256xi1, #blocked> 2026-02-21T18:34:43.0431338Z %29 = ttg.convert_layout %6 : tensor<2xi32, #blocked3> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:34:43.0431811Z %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x2xi32, #blocked5> 2026-02-21T18:34:43.0432233Z %31 = ttg.convert_layout %30 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #blocked2> 2026-02-21T18:34:43.0432530Z %32 = arith.muli %31, %cst_6 : tensor<1x2xi32, #blocked2> 2026-02-21T18:34:43.0432806Z %33 = tt.broadcast %32 : tensor<1x2xi32, #blocked2> -> tensor<256x2xi32, #blocked2> 2026-02-21T18:34:43.0433123Z %34 = tt.splat %arg1 : !tt.ptr -> tensor<256x2x!tt.ptr, #blocked2> 2026-02-21T18:34:43.0433519Z %35 = scf.for %arg3 = %c0_i32 to %c2048_i32 step %c256_i32 iter_args(%arg4 = %cst_7) -> (tensor<4x2xf32, #blocked2>) : i32 { 2026-02-21T18:34:43.0433880Z %50 = tt.splat %arg3 : i32 -> tensor<256xi32, #blocked3> 2026-02-21T18:34:43.0434091Z %51 = arith.addi %50, %11 : tensor<256xi32, #blocked3> 2026-02-21T18:34:43.0434250Z %52 = arith.extsi %arg3 : i32 to i64 2026-02-21T18:34:43.0434406Z %53 = tt.splat %52 : i64 -> tensor<256xi64, #blocked3> 2026-02-21T18:34:43.0434575Z %54 = arith.addi %53, %23 : tensor<256xi64, #blocked3> 2026-02-21T18:34:43.0434847Z %55 = ttg.convert_layout %54 : tensor<256xi64, #blocked3> -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:34:43.0435261Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi64, #blocked5> 2026-02-21T18:34:43.0435623Z %57 = ttg.convert_layout %56 : tensor<1x256xi64, #blocked5> -> tensor<1x256xi64, #blocked> 2026-02-21T18:34:43.0435888Z %58 = tt.broadcast %57 : 
tensor<1x256xi64, #blocked> -> tensor<4x256xi64, #blocked> 2026-02-21T18:34:43.0436103Z %59 = arith.addi %22, %58 : tensor<4x256xi64, #blocked> 2026-02-21T18:34:43.0436329Z %60 = tt.addptr %13, %59 : tensor<4x256x!tt.ptr, #blocked>, tensor<4x256xi64, #blocked> 2026-02-21T18:34:43.0436564Z %61 = arith.cmpi sge, %57, %cst_1 : tensor<1x256xi64, #blocked> 2026-02-21T18:34:43.0436762Z %62 = arith.cmpi slt, %57, %cst_0 : tensor<1x256xi64, #blocked> 2026-02-21T18:34:43.0436945Z %63 = arith.andi %61, %62 : tensor<1x256xi1, #blocked> 2026-02-21T18:34:43.0437150Z %64 = tt.broadcast %63 : tensor<1x256xi1, #blocked> -> tensor<4x256xi1, #blocked> 2026-02-21T18:34:43.0437362Z %65 = arith.andi %28, %64 : tensor<4x256xi1, #blocked> 2026-02-21T18:34:43.0437546Z %66 = tt.load %60, %65, %cst : tensor<4x256x!tt.ptr, #blocked> 2026-02-21T18:34:43.0437833Z %67 = ttg.convert_layout %51 : tensor<256xi32, #blocked3> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:34:43.0438214Z %68 = tt.expand_dims %67 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<256x1xi32, #blocked4> 2026-02-21T18:34:43.0438553Z %69 = ttg.convert_layout %68 : tensor<256x1xi32, #blocked4> -> tensor<256x1xi32, #blocked1> 2026-02-21T18:34:43.0438824Z %70 = tt.broadcast %69 : tensor<256x1xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T18:34:43.0439091Z %71 = ttg.convert_layout %70 : tensor<256x2xi32, #blocked1> -> tensor<256x2xi32, #blocked2> 2026-02-21T18:34:43.0439348Z %72 = arith.addi %71, %33 : tensor<256x2xi32, #blocked2> 2026-02-21T18:34:43.0439591Z %73 = tt.addptr %34, %72 : tensor<256x2x!tt.ptr, #blocked2>, tensor<256x2xi32, #blocked2> 2026-02-21T18:34:43.0439830Z %74 = tt.load %73 : tensor<256x2x!tt.ptr, #blocked2> 2026-02-21T18:34:43.0440122Z %75 = ttg.convert_layout %66 : tensor<4x256xf16, #blocked> -> tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked2}>> 2026-02-21T18:34:43.0440521Z %76 = ttg.convert_layout %74 : 
tensor<256x2xf16, #blocked2> -> tensor<256x2xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked2}>> 2026-02-21T18:34:43.0440864Z %77 = ttg.convert_layout %arg4 : tensor<4x2xf32, #blocked2> -> tensor<4x2xf32, #blocked2> 2026-02-21T18:34:43.0441324Z %78 = tt.dot %75, %76, %77, inputPrecision = tf32 : tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked2}>> * tensor<256x2xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked2}>> -> tensor<4x2xf32, #blocked2> 2026-02-21T18:34:43.0441723Z scf.yield %78 : tensor<4x2xf32, #blocked2> 2026-02-21T18:34:43.0441866Z } {tt.flatten} 2026-02-21T18:34:43.0442028Z %36 = arith.truncf %35 : tensor<4x2xf32, #blocked2> to tensor<4x2xf16, #blocked2> 2026-02-21T18:34:43.0442337Z %37 = ttg.convert_layout %10 : tensor<4xi32, #blocked3> -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:34:43.0442702Z %38 = tt.expand_dims %37 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi32, #blocked4> 2026-02-21T18:34:43.0443025Z %39 = ttg.convert_layout %38 : tensor<4x1xi32, #blocked4> -> tensor<4x1xi32, #blocked1> 2026-02-21T18:34:43.0443254Z %40 = arith.muli %39, %cst_5 : tensor<4x1xi32, #blocked1> 2026-02-21T18:34:43.0443521Z %41 = ttg.convert_layout %6 : tensor<2xi32, #blocked3> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:34:43.0443885Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x2xi32, #blocked5> 2026-02-21T18:34:43.0444185Z %43 = ttg.convert_layout %42 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #blocked2> 2026-02-21T18:34:43.0444428Z %44 = tt.broadcast %40 : tensor<4x1xi32, #blocked1> -> tensor<4x2xi32, #blocked1> 2026-02-21T18:34:43.0444667Z %45 = ttg.convert_layout %44 : tensor<4x2xi32, #blocked1> -> tensor<4x2xi32, #blocked2> 2026-02-21T18:34:43.0444882Z %46 = tt.broadcast %43 : tensor<1x2xi32, #blocked2> -> tensor<4x2xi32, #blocked2> 2026-02-21T18:34:43.0445064Z %47 = arith.addi 
%45, %46 : tensor<4x2xi32, #blocked2> 2026-02-21T18:34:43.0445235Z %48 = tt.splat %arg2 : !tt.ptr -> tensor<4x2x!tt.ptr, #blocked2> 2026-02-21T18:34:43.0445447Z %49 = tt.addptr %48, %47 : tensor<4x2x!tt.ptr, #blocked2>, tensor<4x2xi32, #blocked2> 2026-02-21T18:34:43.0445641Z tt.store %49, %36 : tensor<4x2x!tt.ptr, #blocked2> 2026-02-21T18:34:43.0445770Z tt.return 2026-02-21T18:34:43.0445853Z } 2026-02-21T18:34:43.0445929Z } 2026-02-21T18:34:43.0445973Z 2026-02-21T18:34:43.0446008Z {-# 2026-02-21T18:34:43.0446087Z external_resources: { 2026-02-21T18:34:43.0446188Z mlir_reproducer: { 2026-02-21T18:34:43.0448466Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:34:43.0450799Z 
disable_threading: false, 2026-02-21T18:34:43.0450909Z verify_each: true 2026-02-21T18:34:43.0450998Z } 2026-02-21T18:34:43.0451074Z } 2026-02-21T18:34:43.0451145Z #-} 2026-02-21T18:34:43.0451425Z /tmp/torchinductor_root/ha/cha6kdjzug2ivegtts5jysf6jdnyfjjtnc25mezx7ee66tgvwarw.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:34:43.0452116Z /tmp/torchinductor_root/ha/cha6kdjzug2ivegtts5jysf6jdnyfjjtnc25mezx7ee66tgvwarw.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:34:43.0452670Z [46s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:34:43.0453393Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 2, 256], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T18:34:43.0454042Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:34:43.0454209Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T18:34:45.7990087Z Initial population exploring neighbors 64% ━━━━━━━━╸ 64/100 3.8 configs/s 2026-02-21T18:34:45.7992909Z WARNING:tritonbench.utils.triton_op:Completed input ID 6: 2026-02-21T18:34:45.7993328Z (M, N, K) 2026-02-21T18:34:45.7994002Z ------------------ 2026-02-21T18:34:45.7994334Z (8192, 2048, 2048) 2026-02-21T18:34:45.7994477Z 2026-02-21T18:34:45.7999520Z 62%|██████▎ | 5/8 [21:27<11:36, 232.25s/it]WARNING:tritonbench.utils.triton_op:Running input ID 8: 2026-02-21T18:34:45.7999982Z (M, N, K) 2026-02-21T18:34:45.8000179Z ------------------- 2026-02-21T18:34:45.8000386Z (12288, 1024, 1024) 2026-02-21T18:34:45.8000968Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T18:35:29.3284702Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T18:36:04.9043597Z INFO:tritonbench.utils.triton_op:Took 90.49ms to get benchmark function for pt2_triton_matmul 2026-02-21T18:36:47.2863166Z WARNING:__main__:Input tensor metadata: 2026-02-21T18:36:47.2863615Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T18:36:47.2863946Z 'dtype': 'torch.float16', 2026-02-21T18:36:47.2864307Z 'shape': (12288, 1024), 2026-02-21T18:36:47.2864618Z 'stride': (1024, 1)}, 2026-02-21T18:36:47.2864934Z { 'device': 'cuda:0', 2026-02-21T18:36:47.2865233Z 'dtype': 'torch.float16', 2026-02-21T18:36:47.2865539Z 'shape': (1024, 1024), 2026-02-21T18:36:47.2865833Z 'stride': (1, 1024)}, 2026-02-21T18:36:47.2866111Z None), 2026-02-21T18:36:47.2866343Z 'kwargs': {}} 2026-02-21T18:36:47.2872083Z INFO:tritonbench.utils.triton_op:Took 1.36ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T18:36:47.3606742Z [0s] Autotune random seed: 2171071898 2026-02-21T18:36:47.3740864Z [0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T18:36:56.3757906Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━━ 
100/100 2.4 configs/s 2026-02-21T18:37:03.5695028Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T18:37:03.5697056Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:37:03.5698325Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:37:03.5698958Z #blocked2 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [1], order = [0]}> 2026-02-21T18:37:03.5699259Z #blocked3 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:37:03.5699573Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [2, 32], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:37:03.5699881Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [0, 1]}> 2026-02-21T18:37:03.5700270Z #blocked6 = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [1, 64], warpsPerCTA = [1, 1], order = [1, 0]}> 2026-02-21T18:37:03.5700612Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 1 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:37:03.5701083Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:37:03.5701430Z %c32_i32 = arith.constant 32 : i32 2026-02-21T18:37:03.5701559Z %c1024_i32 = arith.constant 1024 : i32 2026-02-21T18:37:03.5701687Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:37:03.5701860Z %cst = arith.constant dense<1024> : tensor<1x256xi32, #blocked> 
2026-02-21T18:37:03.5702045Z %cst_0 = arith.constant dense<1024> : tensor<8x1xi32, #blocked1> 2026-02-21T18:37:03.5702242Z %cst_1 = arith.constant dense<0.000000e+00> : tensor<8x256xf32, #blocked> 2026-02-21T18:37:03.5702463Z %c256_i32 = arith.constant 256 : i32 2026-02-21T18:37:03.5702585Z %c8_i32 = arith.constant 8 : i32 2026-02-21T18:37:03.5702705Z %c1536_i32 = arith.constant 1536 : i32 2026-02-21T18:37:03.5702826Z %0 = tt.get_program_id x : i32 2026-02-21T18:37:03.5702943Z %1 = arith.divsi %0, %c32_i32 : i32 2026-02-21T18:37:03.5703063Z %2 = arith.muli %1, %c8_i32 : i32 2026-02-21T18:37:03.5703181Z %3 = arith.subi %c1536_i32, %2 : i32 2026-02-21T18:37:03.5703296Z %4 = arith.minsi %3, %c8_i32 : i32 2026-02-21T18:37:03.5703413Z %5 = arith.remsi %0, %c32_i32 : i32 2026-02-21T18:37:03.5703527Z %6 = arith.remsi %5, %4 : i32 2026-02-21T18:37:03.5703642Z %7 = arith.addi %2, %6 : i32 2026-02-21T18:37:03.5703749Z %8 = arith.divsi %5, %4 : i32 2026-02-21T18:37:03.5703860Z %9 = arith.muli %7, %c8_i32 : i32 2026-02-21T18:37:03.5704023Z %10 = tt.make_range {end = 8 : i32, start = 0 : i32} : tensor<8xi32, #blocked2> 2026-02-21T18:37:03.5704219Z %11 = tt.splat %9 : i32 -> tensor<8xi32, #blocked2> 2026-02-21T18:37:03.5704378Z %12 = arith.addi %11, %10 : tensor<8xi32, #blocked2> 2026-02-21T18:37:03.5704512Z %13 = arith.muli %8, %c256_i32 : i32 2026-02-21T18:37:03.5704677Z %14 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked2> 2026-02-21T18:37:03.5704866Z %15 = tt.splat %13 : i32 -> tensor<256xi32, #blocked2> 2026-02-21T18:37:03.5705021Z %16 = arith.addi %15, %14 : tensor<256xi32, #blocked2> 2026-02-21T18:37:03.5705201Z %17 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #blocked2> 2026-02-21T18:37:03.5705468Z %18 = ttg.convert_layout %12 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:37:03.5705883Z %19 = tt.expand_dims %18 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, 
parent = #blocked3}>> -> tensor<8x1xi32, #blocked3> 2026-02-21T18:37:03.5706192Z %20 = ttg.convert_layout %19 : tensor<8x1xi32, #blocked3> -> tensor<8x1xi32, #blocked1> 2026-02-21T18:37:03.5706400Z %21 = arith.muli %20, %cst_0 : tensor<8x1xi32, #blocked1> 2026-02-21T18:37:03.5706591Z %22 = tt.broadcast %21 : tensor<8x1xi32, #blocked1> -> tensor<8x32xi32, #blocked1> 2026-02-21T18:37:03.5706822Z %23 = ttg.convert_layout %22 : tensor<8x32xi32, #blocked1> -> tensor<8x32xi32, #blocked4> 2026-02-21T18:37:03.5707049Z %24 = tt.splat %arg0 : !tt.ptr -> tensor<8x32x!tt.ptr, #blocked4> 2026-02-21T18:37:03.5707315Z %25 = ttg.convert_layout %16 : tensor<256xi32, #blocked2> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:37:03.5707653Z %26 = tt.expand_dims %25 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T18:37:03.5707950Z %27 = ttg.convert_layout %26 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked> 2026-02-21T18:37:03.5708237Z %28 = arith.muli %27, %cst : tensor<1x256xi32, #blocked> 2026-02-21T18:37:03.5708453Z %29 = tt.broadcast %28 : tensor<1x256xi32, #blocked> -> tensor<32x256xi32, #blocked> 2026-02-21T18:37:03.5708669Z %30 = tt.splat %arg1 : !tt.ptr -> tensor<32x256x!tt.ptr, #blocked> 2026-02-21T18:37:03.5708933Z %31 = scf.for %arg3 = %c0_i32 to %c1024_i32 step %c32_i32 iter_args(%arg4 = %cst_1) -> (tensor<8x256xf32, #blocked>) : i32 { 2026-02-21T18:37:03.5709174Z %46 = tt.splat %arg3 : i32 -> tensor<32xi32, #blocked2> 2026-02-21T18:37:03.5709325Z %47 = arith.addi %46, %17 : tensor<32xi32, #blocked2> 2026-02-21T18:37:03.5709557Z %48 = ttg.convert_layout %47 : tensor<32xi32, #blocked2> -> tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:37:03.5709880Z %49 = tt.expand_dims %48 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x32xi32, #blocked5> 2026-02-21T18:37:03.5710172Z %50 = ttg.convert_layout %49 
: tensor<1x32xi32, #blocked5> -> tensor<1x32xi32, #blocked4> 2026-02-21T18:37:03.5710402Z %51 = tt.broadcast %50 : tensor<1x32xi32, #blocked4> -> tensor<8x32xi32, #blocked4> 2026-02-21T18:37:03.5710609Z %52 = arith.addi %23, %51 : tensor<8x32xi32, #blocked4> 2026-02-21T18:37:03.5710803Z %53 = tt.addptr %24, %52 : tensor<8x32x!tt.ptr, #blocked4>, tensor<8x32xi32, #blocked4> 2026-02-21T18:37:03.5710999Z %54 = tt.load %53 : tensor<8x32x!tt.ptr, #blocked4> 2026-02-21T18:37:03.5711232Z %55 = ttg.convert_layout %47 : tensor<32xi32, #blocked2> -> tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:37:03.5711555Z %56 = tt.expand_dims %55 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<32x1xi32, #blocked3> 2026-02-21T18:37:03.5711837Z %57 = ttg.convert_layout %56 : tensor<32x1xi32, #blocked3> -> tensor<32x1xi32, #blocked1> 2026-02-21T18:37:03.5712072Z %58 = tt.broadcast %57 : tensor<32x1xi32, #blocked1> -> tensor<32x256xi32, #blocked1> 2026-02-21T18:37:03.5712309Z %59 = ttg.convert_layout %58 : tensor<32x256xi32, #blocked1> -> tensor<32x256xi32, #blocked> 2026-02-21T18:37:03.5712517Z %60 = arith.addi %59, %29 : tensor<32x256xi32, #blocked> 2026-02-21T18:37:03.5712713Z %61 = tt.addptr %30, %60 : tensor<32x256x!tt.ptr, #blocked>, tensor<32x256xi32, #blocked> 2026-02-21T18:37:03.5712914Z %62 = tt.load %61 : tensor<32x256x!tt.ptr, #blocked> 2026-02-21T18:37:03.5713162Z %63 = ttg.convert_layout %54 : tensor<8x32xf16, #blocked4> -> tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> 2026-02-21T18:37:03.5713506Z %64 = ttg.convert_layout %62 : tensor<32x256xf16, #blocked> -> tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> 2026-02-21T18:37:03.5713820Z %65 = ttg.convert_layout %arg4 : tensor<8x256xf32, #blocked> -> tensor<8x256xf32, #blocked6> 2026-02-21T18:37:03.5714241Z %66 = tt.dot %63, %64, %65, inputPrecision = tf32 : tensor<8x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked6}>> * 
tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked6}>> -> tensor<8x256xf32, #blocked6> 2026-02-21T18:37:03.5714629Z %67 = ttg.convert_layout %66 : tensor<8x256xf32, #blocked6> -> tensor<8x256xf32, #blocked> 2026-02-21T18:37:03.5714822Z scf.yield %67 : tensor<8x256xf32, #blocked> 2026-02-21T18:37:03.5714951Z } {tt.disallow_acc_multi_buffer} 2026-02-21T18:37:03.5715116Z %32 = arith.truncf %31 : tensor<8x256xf32, #blocked> to tensor<8x256xf16, #blocked> 2026-02-21T18:37:03.5715379Z %33 = ttg.convert_layout %12 : tensor<8xi32, #blocked2> -> tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> 2026-02-21T18:37:03.5715694Z %34 = tt.expand_dims %33 {axis = 1 : i32} : tensor<8xi32, #ttg.slice<{dim = 1, parent = #blocked3}>> -> tensor<8x1xi32, #blocked3> 2026-02-21T18:37:03.5715972Z %35 = ttg.convert_layout %34 : tensor<8x1xi32, #blocked3> -> tensor<8x1xi32, #blocked1> 2026-02-21T18:37:03.5716168Z %36 = arith.muli %35, %cst_0 : tensor<8x1xi32, #blocked1> 2026-02-21T18:37:03.5716424Z %37 = ttg.convert_layout %16 : tensor<256xi32, #blocked2> -> tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:37:03.5716749Z %38 = tt.expand_dims %37 {axis = 0 : i32} : tensor<256xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi32, #blocked5> 2026-02-21T18:37:03.5717034Z %39 = ttg.convert_layout %38 : tensor<1x256xi32, #blocked5> -> tensor<1x256xi32, #blocked> 2026-02-21T18:37:03.5717260Z %40 = tt.broadcast %36 : tensor<8x1xi32, #blocked1> -> tensor<8x256xi32, #blocked1> 2026-02-21T18:37:03.5717482Z %41 = ttg.convert_layout %40 : tensor<8x256xi32, #blocked1> -> tensor<8x256xi32, #blocked> 2026-02-21T18:37:03.5717704Z %42 = tt.broadcast %39 : tensor<1x256xi32, #blocked> -> tensor<8x256xi32, #blocked> 2026-02-21T18:37:03.5717893Z %43 = arith.addi %41, %42 : tensor<8x256xi32, #blocked> 2026-02-21T18:37:03.5718065Z %44 = tt.splat %arg2 : !tt.ptr -> tensor<8x256x!tt.ptr, #blocked> 2026-02-21T18:37:03.5718287Z %45 = tt.addptr %44, %43 : 
tensor<8x256x!tt.ptr, #blocked>, tensor<8x256xi32, #blocked> 2026-02-21T18:37:03.5718501Z tt.store %45, %32 : tensor<8x256x!tt.ptr, #blocked> 2026-02-21T18:37:03.5718634Z tt.return 2026-02-21T18:37:03.5718713Z } 2026-02-21T18:37:03.5718792Z } 2026-02-21T18:37:03.5718837Z 2026-02-21T18:37:03.5718868Z {-# 2026-02-21T18:37:03.5718950Z external_resources: { 2026-02-21T18:37:03.5719052Z mlir_reproducer: { 2026-02-21T18:37:03.5721334Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=2 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=2}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:37:03.5723686Z disable_threading: false, 2026-02-21T18:37:03.5723794Z verify_each: true 2026-02-21T18:37:03.5723884Z } 2026-02-21T18:37:03.5723974Z } 2026-02-21T18:37:03.5724044Z #-} 
2026-02-21T18:37:03.5724326Z /tmp/torchinductor_root/vl/cvljk7ql6d3ugd7nri3pmbmfsrt76kwayaq7hlcfr33r7nxvzmi6.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:37:03.5725016Z /tmp/torchinductor_root/vl/cvljk7ql6d3ugd7nri3pmbmfsrt76kwayaq7hlcfr33r7nxvzmi6.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:37:03.5725574Z [16s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:37:03.5726320Z Config: @helion.kernel(config=helion.Config(block_sizes=[8, 256, 32], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[8], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=0, num_stages=2, num_warps=1, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=4), static_shapes=True) 2026-02-21T18:37:03.5726982Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:37:03.5727150Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T18:37:06.7646619Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 9.6 configs/s 2026-02-21T18:37:06.7654330Z [19s] Adaptive compile timeout: 30s (90% percentile=4.9s, bounds=[30.0s, 30s]) 2026-02-21T18:37:06.7655819Z [19s] Initial random population of 100, 5 starting points: 2026-02-21T18:37:06.7656279Z error=10 2026-02-21T18:37:06.7656470Z ok=90 2026-02-21T18:37:06.7656641Z min=0.0862 2026-02-21T18:37:06.7656830Z mid=1.3565 2026-02-21T18:37:06.7657001Z max=170.9371 2026-02-21T18:37:06.7657244Z best={'block_sizes': [256, 256, 16], 2026-02-21T18:37:06.7657559Z 'indexing': ['block_ptr', 'block_ptr', 'pointer'], 2026-02-21T18:37:06.7657879Z 'l2_groupings': [32], 2026-02-21T18:37:06.7658142Z 'load_eviction_policies': ['', ''], 2026-02-21T18:37:06.7658424Z 'loop_orders': [[0, 1]], 2026-02-21T18:37:06.7658909Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:37:06.7659115Z 'num_sm_multiplier': 1, 2026-02-21T18:37:06.7659311Z 'num_stages': 4, 2026-02-21T18:37:06.7659477Z 'num_warps': 8, 2026-02-21T18:37:06.7659675Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:37:06.7659917Z 'range_flattens': [None, False], 2026-02-21T18:37:06.7660139Z 'range_multi_buffers': [None, False], 2026-02-21T18:37:06.7660371Z 'range_num_stages': [0, 0], 2026-02-21T18:37:06.7660578Z 'range_unroll_factors': [0, 0], 2026-02-21T18:37:06.7660796Z 'range_warp_specializes': [], 2026-02-21T18:37:06.7660997Z 'waves_per_eu': 1} 2026-02-21T18:37:06.7667936Z [19s] Fitting surrogate: 100 points, 100 targets 2026-02-21T18:37:08.1661337Z [20s] Generation 1 starting: 82 neighbors, 5 active search path(s) 2026-02-21T18:37:21.5901721Z Generation 1: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 0.7 configs/s 2026-02-21T18:37:25.9993387Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 85/85 19.5 configs/s 2026-02-21T18:37:27.8030285Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 480.4 2026-02-21T18:37:27.8030857Z configs/s 
2026-02-21T18:37:28.3528633Z [40s] Generation 1 complete: 2026-02-21T18:37:28.3528996Z error=14 2026-02-21T18:37:28.3529203Z ok=74 2026-02-21T18:37:28.3529409Z min=0.0667 2026-02-21T18:37:28.3529616Z mid=0.1604 2026-02-21T18:37:28.3529814Z max=4.1047 2026-02-21T18:37:28.3530043Z best={'block_sizes': [64, 64, 64], 2026-02-21T18:37:28.3530431Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:37:28.3530803Z 'l2_groupings': [16], 2026-02-21T18:37:28.3531085Z 'load_eviction_policies': ['', ''], 2026-02-21T18:37:28.3531724Z 'loop_orders': [[0, 1]], 2026-02-21T18:37:28.3532005Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:37:28.3532294Z 'num_sm_multiplier': 16, 2026-02-21T18:37:28.3532700Z 'num_stages': 2, 2026-02-21T18:37:28.3532935Z 'num_warps': 4, 2026-02-21T18:37:28.3533202Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:37:28.3533545Z 'range_flattens': [False, None], 2026-02-21T18:37:28.3533866Z 'range_multi_buffers': [None, False], 2026-02-21T18:37:28.3534186Z 'range_num_stages': [0, 0], 2026-02-21T18:37:28.3534470Z 'range_unroll_factors': [0, 0], 2026-02-21T18:37:28.3534769Z 'range_warp_specializes': [], 2026-02-21T18:37:28.3535052Z 'waves_per_eu': 2} 2026-02-21T18:37:28.3793053Z [41s] Fitting surrogate: 188 points, 188 targets 2026-02-21T18:37:29.2615755Z [41s] Generation 2 starting: 83 neighbors, 5 active search path(s) 2026-02-21T18:37:39.0471713Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84/84 0.9 configs/s 2026-02-21T18:37:43.5036584Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 84/84 19.5 configs/s 2026-02-21T18:37:47.4024902Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 241.1 2026-02-21T18:37:47.4025944Z configs/s 2026-02-21T18:37:48.0662995Z [60s] Generation 2 complete: 2026-02-21T18:37:48.0663358Z error=13 2026-02-21T18:37:48.0663568Z ok=75 2026-02-21T18:37:48.0663796Z min=0.0624 2026-02-21T18:37:48.0664005Z mid=0.0976 2026-02-21T18:37:48.0664200Z max=3.3233 
2026-02-21T18:37:48.0664433Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:37:48.0664810Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:37:48.0665180Z 'l2_groupings': [16], 2026-02-21T18:37:48.0665457Z 'load_eviction_policies': ['', ''], 2026-02-21T18:37:48.0665766Z 'loop_orders': [[0, 1]], 2026-02-21T18:37:48.0666044Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:37:48.0666326Z 'num_sm_multiplier': 16, 2026-02-21T18:37:48.0666589Z 'num_stages': 2, 2026-02-21T18:37:48.0666841Z 'num_warps': 4, 2026-02-21T18:37:48.0667102Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:37:48.0667428Z 'range_flattens': [False, None], 2026-02-21T18:37:48.0667741Z 'range_multi_buffers': [None, False], 2026-02-21T18:37:48.0668055Z 'range_num_stages': [0, 0], 2026-02-21T18:37:48.0668674Z 'range_unroll_factors': [0, 0], 2026-02-21T18:37:48.0668970Z 'range_warp_specializes': [], 2026-02-21T18:37:48.0669254Z 'waves_per_eu': 2} 2026-02-21T18:37:48.1317074Z [60s] Fitting surrogate: 276 points, 276 targets 2026-02-21T18:37:49.5006002Z [62s] Generation 3 starting: 72 neighbors, 5 active search path(s) 2026-02-21T18:37:57.5579557Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 75/75 3.1 configs/s 2026-02-21T18:38:01.1241285Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 75/75 22.0 configs/s 2026-02-21T18:38:04.5489965Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 272.5 2026-02-21T18:38:04.5490605Z configs/s 2026-02-21T18:38:05.1533715Z [77s] Generation 3 complete: 2026-02-21T18:38:05.1534199Z error=18 2026-02-21T18:38:05.1534518Z ok=60 2026-02-21T18:38:05.1534850Z min=0.0588 2026-02-21T18:38:05.1535106Z mid=0.0996 2026-02-21T18:38:05.1535319Z max=3.2915 2026-02-21T18:38:05.1535570Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:38:05.1535962Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:38:05.1536328Z 'l2_groupings': [16], 2026-02-21T18:38:05.1536605Z 'load_eviction_policies': ['', ''], 
2026-02-21T18:38:05.1536906Z 'loop_orders': [[0, 1]], 2026-02-21T18:38:05.1537188Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:38:05.1537467Z 'num_sm_multiplier': 16, 2026-02-21T18:38:05.1537732Z 'num_stages': 2, 2026-02-21T18:38:05.1537961Z 'num_warps': 8, 2026-02-21T18:38:05.1538221Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:38:05.1538545Z 'range_flattens': [False, None], 2026-02-21T18:38:05.1538852Z 'range_multi_buffers': [True, False], 2026-02-21T18:38:05.1539466Z 'range_num_stages': [0, 0], 2026-02-21T18:38:05.1539755Z 'range_unroll_factors': [0, 0], 2026-02-21T18:38:05.1540050Z 'range_warp_specializes': [], 2026-02-21T18:38:05.1540457Z 'waves_per_eu': 2} 2026-02-21T18:38:05.2056572Z [77s] Fitting surrogate: 354 points, 354 targets 2026-02-21T18:38:06.0433488Z [78s] Generation 4 starting: 76 neighbors, 5 active search path(s) 2026-02-21T18:38:14.4963547Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78/78 5.1 configs/s 2026-02-21T18:38:18.0163498Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 78/78 22.5 configs/s 2026-02-21T18:38:21.1746687Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 296.1 2026-02-21T18:38:21.1747276Z configs/s 2026-02-21T18:38:21.7229229Z [94s] Generation 4 complete: 2026-02-21T18:38:21.7229603Z error=22 2026-02-21T18:38:21.7229815Z ok=59 2026-02-21T18:38:21.7230022Z min=0.0592 2026-02-21T18:38:21.7230258Z mid=0.0898 2026-02-21T18:38:21.7230459Z max=3.3057 2026-02-21T18:38:21.7230686Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:38:21.7231080Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:38:21.7231444Z 'l2_groupings': [16], 2026-02-21T18:38:21.7232020Z 'load_eviction_policies': ['', ''], 2026-02-21T18:38:21.7232349Z 'loop_orders': [[0, 1]], 2026-02-21T18:38:21.7232630Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:38:21.7232908Z 'num_sm_multiplier': 16, 2026-02-21T18:38:21.7233170Z 'num_stages': 2, 2026-02-21T18:38:21.7233394Z 'num_warps': 8, 
2026-02-21T18:38:21.7233655Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:38:21.7233979Z 'range_flattens': [False, None], 2026-02-21T18:38:21.7234286Z 'range_multi_buffers': [True, False], 2026-02-21T18:38:21.7234592Z 'range_num_stages': [0, 0], 2026-02-21T18:38:21.7234872Z 'range_unroll_factors': [0, 0], 2026-02-21T18:38:21.7235180Z 'range_warp_specializes': [], 2026-02-21T18:38:21.7235456Z 'waves_per_eu': 2} 2026-02-21T18:38:21.7759994Z [94s] Fitting surrogate: 435 points, 435 targets 2026-02-21T18:38:22.3253210Z [94s] Generation 5 starting: 45 neighbors, 3 active search path(s) 2026-02-21T18:38:27.5789698Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46/46 6.8 configs/s 2026-02-21T18:38:29.7136646Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 46/46 23.3 configs/s 2026-02-21T18:38:32.2634395Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 360.7 2026-02-21T18:38:32.2634992Z configs/s 2026-02-21T18:38:32.7960430Z [105s] Generation 5 complete: 2026-02-21T18:38:32.7960916Z error=13 2026-02-21T18:38:32.7961168Z ok=35 2026-02-21T18:38:32.7961374Z min=0.0590 2026-02-21T18:38:32.7961585Z mid=0.0765 2026-02-21T18:38:32.7961788Z max=3.4922 2026-02-21T18:38:32.7962039Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:38:32.7962415Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:38:32.7962809Z 'l2_groupings': [16], 2026-02-21T18:38:32.7963083Z 'load_eviction_policies': ['', ''], 2026-02-21T18:38:32.7963389Z 'loop_orders': [[0, 1]], 2026-02-21T18:38:32.7963689Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:38:32.7963975Z 'num_sm_multiplier': 16, 2026-02-21T18:38:32.7964265Z 'num_stages': 2, 2026-02-21T18:38:32.7964504Z 'num_warps': 8, 2026-02-21T18:38:32.7964768Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:38:32.7965093Z 'range_flattens': [False, None], 2026-02-21T18:38:32.7965408Z 'range_multi_buffers': [True, False], 2026-02-21T18:38:32.7965723Z 'range_num_stages': [0, 0], 
2026-02-21T18:38:32.7966002Z 'range_unroll_factors': [0, 0], 2026-02-21T18:38:32.7966304Z 'range_warp_specializes': [], 2026-02-21T18:38:32.7966583Z 'waves_per_eu': 2} 2026-02-21T18:38:32.8401128Z [105s] Fitting surrogate: 483 points, 483 targets 2026-02-21T18:38:33.2841048Z [105s] Generation 6 starting: 36 neighbors, 2 active search path(s) 2026-02-21T18:38:39.0939890Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36/36 2.1 configs/s 2026-02-21T18:38:41.1800234Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 36/36 18.1 configs/s 2026-02-21T18:38:43.4978110Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 395.5 2026-02-21T18:38:43.4978351Z configs/s 2026-02-21T18:38:44.0590843Z [116s] Generation 6 complete: 2026-02-21T18:38:44.0591183Z error=5 2026-02-21T18:38:44.0591389Z ok=34 2026-02-21T18:38:44.0591599Z min=0.0592 2026-02-21T18:38:44.0601596Z mid=0.0730 2026-02-21T18:38:44.0601731Z max=5.4590 2026-02-21T18:38:44.0601872Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:38:44.0602114Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:38:44.0602331Z 'l2_groupings': [16], 2026-02-21T18:38:44.0602505Z 'load_eviction_policies': ['', ''], 2026-02-21T18:38:44.0602685Z 'loop_orders': [[0, 1]], 2026-02-21T18:38:44.0602857Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:38:44.0603043Z 'num_sm_multiplier': 16, 2026-02-21T18:38:44.0603208Z 'num_stages': 2, 2026-02-21T18:38:44.0603351Z 'num_warps': 8, 2026-02-21T18:38:44.0603506Z 'pid_type': 'persistent_interleaved', 2026-02-21T18:38:44.0603711Z 'range_flattens': [False, None], 2026-02-21T18:38:44.0604204Z 'range_multi_buffers': [True, False], 2026-02-21T18:38:44.0604393Z 'range_num_stages': [0, 0], 2026-02-21T18:38:44.0604575Z 'range_unroll_factors': [0, 0], 2026-02-21T18:38:44.0604760Z 'range_warp_specializes': [], 2026-02-21T18:38:44.0604928Z 'waves_per_eu': 2} 2026-02-21T18:38:44.1003798Z [116s] Fitting surrogate: 522 points, 522 targets 2026-02-21T18:38:44.3862075Z 
[117s] Generation 7 starting: 17 neighbors, 1 active search path(s) 2026-02-21T18:38:46.4280778Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 7.4 configs/s 2026-02-21T18:38:47.5446138Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 17.5 configs/s 2026-02-21T18:38:48.6029730Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 784.4 2026-02-21T18:38:48.6030390Z configs/s 2026-02-21T18:38:49.0487378Z [121s] Generation 7 complete: 2026-02-21T18:38:49.0487858Z ok=19 2026-02-21T18:38:49.0488167Z min=0.0586 2026-02-21T18:38:49.0488486Z mid=0.0756 2026-02-21T18:38:49.0488767Z max=1.1470 2026-02-21T18:38:49.0489085Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:38:49.0489552Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:38:49.0489924Z 'l2_groupings': [32], 2026-02-21T18:38:49.0490205Z 'load_eviction_policies': ['', ''], 2026-02-21T18:38:49.0490524Z 'loop_orders': [[0, 1]], 2026-02-21T18:38:49.0490807Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:38:49.0491088Z 'num_stages': 2, 2026-02-21T18:38:49.0491326Z 'num_warps': 8, 2026-02-21T18:38:49.0491565Z 'pid_type': 'flat', 2026-02-21T18:38:49.0491836Z 'range_flattens': [None, True], 2026-02-21T18:38:49.0492147Z 'range_multi_buffers': [None, False], 2026-02-21T18:38:49.0492866Z 'range_num_stages': [0, 0], 2026-02-21T18:38:49.0493157Z 'range_unroll_factors': [0, 0], 2026-02-21T18:38:49.0493466Z 'range_warp_specializes': [], 2026-02-21T18:38:49.0493887Z 'waves_per_eu': 2} 2026-02-21T18:38:49.0718436Z [121s] Fitting surrogate: 541 points, 541 targets 2026-02-21T18:38:49.3438934Z [121s] Generation 8 starting: 17 neighbors, 1 active search path(s) 2026-02-21T18:38:51.3498087Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 8.2 configs/s 2026-02-21T18:38:52.1906258Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 17/17 23.6 configs/s 2026-02-21T18:38:53.2167905Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 
811.7 2026-02-21T18:38:53.2168474Z configs/s 2026-02-21T18:38:53.6547740Z [126s] Generation 8 complete: 2026-02-21T18:38:53.6548096Z error=5 2026-02-21T18:38:53.6548392Z ok=13 2026-02-21T18:38:53.6548601Z min=0.0588 2026-02-21T18:38:53.6548834Z mid=0.0628 2026-02-21T18:38:53.6549035Z max=0.1293 2026-02-21T18:38:53.6549261Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:38:53.6549653Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:38:53.6550023Z 'l2_groupings': [32], 2026-02-21T18:38:53.6550590Z 'load_eviction_policies': ['', ''], 2026-02-21T18:38:53.6550917Z 'loop_orders': [[0, 1]], 2026-02-21T18:38:53.6551195Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:38:53.6551466Z 'num_stages': 2, 2026-02-21T18:38:53.6551693Z 'num_warps': 8, 2026-02-21T18:38:53.6551926Z 'pid_type': 'flat', 2026-02-21T18:38:53.6552187Z 'range_flattens': [None, True], 2026-02-21T18:38:53.6552493Z 'range_multi_buffers': [None, False], 2026-02-21T18:38:53.6552799Z 'range_num_stages': [0, 0], 2026-02-21T18:38:53.6553078Z 'range_unroll_factors': [0, 0], 2026-02-21T18:38:53.6553377Z 'range_warp_specializes': [], 2026-02-21T18:38:53.6553651Z 'waves_per_eu': 2} 2026-02-21T18:38:53.6750607Z [126s] Fitting surrogate: 559 points, 559 targets 2026-02-21T18:38:53.9347809Z [126s] Generation 9 starting: 12 neighbors, 1 active search path(s) 2026-02-21T18:38:55.1868026Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12/12 10.1 configs/s 2026-02-21T18:38:56.0191020Z Generation 9: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 12/12 17.5 configs/s 2026-02-21T18:38:56.8561925Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 946.9 2026-02-21T18:38:56.8562320Z configs/s 2026-02-21T18:38:57.2939848Z [129s] Generation 9 complete: 2026-02-21T18:38:57.2940122Z ok=13 2026-02-21T18:38:57.2940302Z min=0.0579 2026-02-21T18:38:57.2940480Z mid=0.0765 2026-02-21T18:38:57.2940652Z max=0.0902 2026-02-21T18:38:57.2940844Z best={'block_sizes': [128, 128, 64], 
2026-02-21T18:38:57.2941163Z 'indexing': ['pointer', 'pointer', 'block_ptr'], 2026-02-21T18:38:57.2941468Z 'l2_groupings': [32], 2026-02-21T18:38:57.2941709Z 'load_eviction_policies': ['', ''], 2026-02-21T18:38:57.2941994Z 'loop_orders': [[0, 1]], 2026-02-21T18:38:57.2942228Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:38:57.2942458Z 'num_stages': 2, 2026-02-21T18:38:57.2942651Z 'num_warps': 8, 2026-02-21T18:38:57.2942857Z 'pid_type': 'flat', 2026-02-21T18:38:57.2943088Z 'range_flattens': [None, True], 2026-02-21T18:38:57.2943355Z 'range_multi_buffers': [None, False], 2026-02-21T18:38:57.2943617Z 'range_num_stages': [0, 0], 2026-02-21T18:38:57.2943857Z 'range_unroll_factors': [0, 0], 2026-02-21T18:38:57.2944105Z 'range_warp_specializes': [], 2026-02-21T18:38:57.2944354Z 'waves_per_eu': 2} 2026-02-21T18:38:57.3100440Z [129s] Fitting surrogate: 572 points, 572 targets 2026-02-21T18:38:57.4272744Z [130s] Autotuning complete in 130.1s after searching 550 configs. 2026-02-21T18:38:57.4272957Z One can hardcode the best config and skip autotuning with: 2026-02-21T18:38:57.4273821Z @helion.kernel(config=helion.Config(block_sizes=[128, 128, 64], indexing=['pointer', 'pointer', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=8, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T18:38:57.4274522Z 2026-02-21T18:38:57.4274696Z [130s] Code of selected kernel: /tmp/torchinductor_root/fi/cfifiogqw4jrifvhwcceycjsfqnxvq5hgq74uocbqeldgqgvbsnh.py 2026-02-21T18:38:57.4358246Z from __future__ import annotations 2026-02-21T18:38:57.4358324Z 2026-02-21T18:38:57.4358387Z import torch 2026-02-21T18:38:57.4358736Z import triton 2026-02-21T18:38:57.4359046Z import triton.language as tl 2026-02-21T18:38:57.4359439Z from helion.runtime import 
default_launcher as _default_launcher 2026-02-21T18:38:57.4359750Z 2026-02-21T18:38:57.4359866Z _BLOCK_SIZE_0 = tl.constexpr(128) 2026-02-21T18:38:57.4360154Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T18:38:57.4360476Z _BLOCK_SIZE_2 = tl.constexpr(64) 2026-02-21T18:38:57.4360650Z 2026-02-21T18:38:57.4360737Z @triton.jit 2026-02-21T18:38:57.4360966Z def _helion_matmul(x, y, out): 2026-02-21T18:38:57.4361317Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:38:57.4361982Z num_pid_m = tl.cdiv(12288, _BLOCK_SIZE_0) 2026-02-21T18:38:57.4362310Z num_pid_n = tl.cdiv(1024, _BLOCK_SIZE_1) 2026-02-21T18:38:57.4362630Z inner_2d_pid = tl.program_id(0) 2026-02-21T18:38:57.4362931Z num_pid_in_group = 32 * num_pid_n 2026-02-21T18:38:57.4363250Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T18:38:57.4363563Z first_pid_m = group_id * 32 2026-02-21T18:38:57.4363868Z group_size_m = min(num_pid_m - first_pid_m, 32) 2026-02-21T18:38:57.4364306Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T18:38:57.4364761Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T18:38:57.4365118Z offset_0 = pid_0 * _BLOCK_SIZE_0 2026-02-21T18:38:57.4365492Z indices_0 = (offset_0 + tl.arange(0, _BLOCK_SIZE_0)).to(tl.int32) 2026-02-21T18:38:57.4365866Z offset_1 = pid_1 * _BLOCK_SIZE_1 2026-02-21T18:38:57.4366224Z indices_1 = (offset_1 + tl.arange(0, _BLOCK_SIZE_1)).to(tl.int32) 2026-02-21T18:38:57.4366700Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:38:57.4367172Z acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 0.0, tl.float32) 2026-02-21T18:38:57.4367476Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:38:57.4367820Z # src[matmul.py:66]: acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n]) 2026-02-21T18:38:57.4368293Z for offset_2 in tl.range(0, 1024, _BLOCK_SIZE_2, disallow_acc_multi_buffer=True, flatten=True): 2026-02-21T18:38:57.4368732Z indices_2 = 
offset_2 + tl.arange(0, _BLOCK_SIZE_2).to(tl.int32) 2026-02-21T18:38:57.4369016Z acc_copy = acc 2026-02-21T18:38:57.4369205Z acc_copy_0 = acc_copy 2026-02-21T18:38:57.4369507Z # src[matmul.py:66]: acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n]) 2026-02-21T18:38:57.4369918Z load = tl.load(x + (indices_0[:, None] * 1024 + indices_2[None, :] * 1), None) 2026-02-21T18:38:57.4370329Z load_1 = tl.load(y + (indices_2[:, None] * 1 + indices_1[None, :] * 1024), None) 2026-02-21T18:38:57.4370890Z acc = tl.dot(tl.cast(load, tl.float16), tl.cast(load_1, tl.float16), acc=acc_copy_0, input_precision='tf32', out_dtype=tl.float32) 2026-02-21T18:38:57.4371426Z # src[matmul.py:67]: out[tile_m, tile_n] = epilogue(acc, (tile_m, tile_n)) 2026-02-21T18:38:57.4371737Z v_0 = tl.cast(acc, tl.float16) 2026-02-21T18:38:57.4372218Z tl.store(tl.make_block_ptr(out, [12288, 1024], [1024, 1], [offset_0, offset_1], [_BLOCK_SIZE_0, _BLOCK_SIZE_1], [1, 0]), v_0, boundary_check=[0, 1]) 2026-02-21T18:38:57.4372640Z 2026-02-21T18:38:57.4372988Z def matmul(x: Tensor, y: Tensor, epilogue: Callable[[Tensor, tuple[Tensor, ...]], Tensor]=lambda acc, tile: acc, *, _launcher=_default_launcher): 2026-02-21T18:38:57.4373511Z """ 2026-02-21T18:38:57.4373784Z Performs matrix multiplication of x and y with an optional epilogue function. 2026-02-21T18:38:57.4374113Z Args: 2026-02-21T18:38:57.4374332Z x (Tensor): Left matrix of shape [m, k]. 2026-02-21T18:38:57.4374597Z y (Tensor): Right matrix of shape [k, n]. 2026-02-21T18:38:57.4374962Z epilogue (Callable, optional): Function applied to the accumulator and tile indices 2026-02-21T18:38:57.4375365Z after the matmul. Defaults to identity (no change). 2026-02-21T18:38:57.4375623Z Returns: 2026-02-21T18:38:57.4375804Z Tensor: Resulting matrix of shape [m, n]. 
2026-02-21T18:38:57.4376030Z """ 2026-02-21T18:38:57.4376190Z # src[matmul.py:57]: m, k = x.size() 2026-02-21T18:38:57.4376403Z m, k = x.size() 2026-02-21T18:38:57.4376577Z # src[matmul.py:58]: k2, n = y.size() 2026-02-21T18:38:57.4376792Z k2, n = y.size() 2026-02-21T18:38:57.4377009Z # src[matmul.py:59]: assert k == k2, f"size mismatch {k} != {k2}" 2026-02-21T18:38:57.4377227Z assert k == k2, f'size mismatch {k} != {k2}' 2026-02-21T18:38:57.4377412Z # src[matmul.py:60]: out = torch.empty( 2026-02-21T18:38:57.4377694Z # src[matmul.py:61]: [m, n], dtype=torch.promote_types(x.dtype, y.dtype), device=x.device 2026-02-21T18:38:57.4377955Z # src[matmul.py:62]: ) 2026-02-21T18:38:57.4378193Z out = torch.empty([m, n], dtype=torch.promote_types(x.dtype, y.dtype), device=x.device) 2026-02-21T18:38:57.4378482Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:38:57.4378679Z _BLOCK_SIZE_0 = 128 2026-02-21T18:38:57.4378816Z _BLOCK_SIZE_1 = 128 2026-02-21T18:38:57.4378986Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:38:57.4379246Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:38:57.4379503Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:38:57.4379688Z # src[matmul.py:63-67]: ... 
2026-02-21T18:38:57.4380104Z _launcher(_helion_matmul, (triton.cdiv(12288, _BLOCK_SIZE_0) * triton.cdiv(1024, _BLOCK_SIZE_1),), x, y, out, num_warps=8, num_stages=2, waves_per_eu=2, matrix_instr_nonkdim=16) 2026-02-21T18:38:57.4380545Z # src[matmul.py:68]: return out 2026-02-21T18:38:57.4380719Z return out 2026-02-21T18:39:25.4488098Z WARNING:tritonbench.utils.triton_op:Completed input ID 8: 2026-02-21T18:39:25.4488620Z (M, N, K) 2026-02-21T18:39:25.4488843Z ------------------- 2026-02-21T18:39:25.4489096Z (12288, 1024, 1024) 2026-02-21T18:39:25.4489242Z 2026-02-21T18:39:25.4494231Z 75%|███████▌ | 6/8 [26:07<08:16, 248.37s/it]WARNING:tritonbench.utils.triton_op:Running input ID 9: 2026-02-21T18:39:25.4494769Z (M, N, K) 2026-02-21T18:39:25.4494963Z ------------------- 2026-02-21T18:39:25.4496381Z (1024, 12288, 1024) 2026-02-21T18:39:25.4496748Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T18:40:02.2202794Z INFO:tritonbench.utils.triton_op:Took 0.02ms to get benchmark function for triton_tutorial_matmul 2026-02-21T18:40:39.6947549Z INFO:tritonbench.utils.triton_op:Took 91.72ms to get benchmark function for pt2_triton_matmul 2026-02-21T18:41:22.1209489Z WARNING:__main__:Input tensor metadata: 2026-02-21T18:41:22.1209800Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T18:41:22.1209980Z 'dtype': 'torch.float16', 2026-02-21T18:41:22.1210150Z 'shape': (1024, 1024), 2026-02-21T18:41:22.1210305Z 'stride': (1024, 1)}, 2026-02-21T18:41:22.1210466Z { 'device': 'cuda:0', 2026-02-21T18:41:22.1210618Z 'dtype': 'torch.float16', 2026-02-21T18:41:22.1210781Z 'shape': (1024, 12288), 2026-02-21T18:41:22.1210938Z 'stride': (1, 1024)}, 2026-02-21T18:41:22.1211084Z None), 2026-02-21T18:41:22.1211209Z 'kwargs': {}} 2026-02-21T18:41:22.1232323Z INFO:tritonbench.utils.triton_op:Took 2.78ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T18:41:22.1965305Z [0s] Autotune random seed: 2171071898 2026-02-21T18:41:22.2103666Z 
[0s] Starting LFBOPatternSearch with initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T18:41:31.0663224Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━ 100/100 17.1 configs/s 2026-02-21T18:41:45.5537397Z Initial population exploring neighbors 100% ━━━━━━━━━━━━━━ 100/100 6.9 configs/s 2026-02-21T18:41:45.5546963Z [23s] Adaptive compile timeout: 30s (90% percentile=6.5s, bounds=[30.0s, 30s]) 2026-02-21T18:41:45.8178913Z Verifying initial results 100% ━━━━━━━━━━━━━━━━━━━━━━ 1000/1000 2470.5 configs/s 2026-02-21T18:41:46.5511343Z [24s] Initial random population of 100, 5 starting points: 2026-02-21T18:41:46.5511839Z error=11 2026-02-21T18:41:46.5512050Z ok=89 2026-02-21T18:41:46.5512294Z min=0.1126 2026-02-21T18:41:46.5512497Z mid=1.2396 2026-02-21T18:41:46.5512699Z max=337.2105 2026-02-21T18:41:46.5512950Z best={'block_sizes': [64, 256, 16], 2026-02-21T18:41:46.5513355Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:41:46.5513712Z 'l2_groupings': [16], 2026-02-21T18:41:46.5513980Z 'load_eviction_policies': ['', ''], 2026-02-21T18:41:46.5514302Z 'loop_orders': [[1, 0]], 2026-02-21T18:41:46.5515253Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:41:46.5515557Z 'num_sm_multiplier': 8, 2026-02-21T18:41:46.5515818Z 'num_stages': 1, 2026-02-21T18:41:46.5516045Z 'num_warps': 8, 2026-02-21T18:41:46.5516298Z 'pid_type': 'persistent_blocked', 2026-02-21T18:41:46.5516610Z 'range_flattens': [None, None], 2026-02-21T18:41:46.5516912Z 'range_multi_buffers': [True, None], 2026-02-21T18:41:46.5517213Z 'range_num_stages': [0, 0], 2026-02-21T18:41:46.5517492Z 'range_unroll_factors': [0, 0], 2026-02-21T18:41:46.5517782Z 'range_warp_specializes': [], 2026-02-21T18:41:46.5518062Z 'waves_per_eu': 1} 2026-02-21T18:41:46.5578351Z [24s] Fitting surrogate: 100 points, 100 targets 2026-02-21T18:41:47.6170241Z [25s] Generation 1 starting: 93 neighbors, 5 active search path(s) 2026-02-21T18:41:58.8972973Z Generation 1: 
precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97/97 12.6 configs/s 2026-02-21T18:42:03.4820540Z Generation 1: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 97/97 21.7 configs/s 2026-02-21T18:42:04.5109530Z Generation 1: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 791.8 2026-02-21T18:42:05.0474188Z [42s] Generation 1 complete: 2026-02-21T18:42:05.0474390Z error=22 2026-02-21T18:42:05.0474501Z ok=76 2026-02-21T18:42:05.0474603Z min=0.0655 2026-02-21T18:42:05.0474706Z mid=0.1450 2026-02-21T18:42:05.0474807Z max=0.8213 2026-02-21T18:42:05.0474934Z best={'block_sizes': [64, 128, 64], 2026-02-21T18:42:05.0475112Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T18:42:05.0475309Z 'l2_groupings': [32], 2026-02-21T18:42:05.0475452Z 'load_eviction_policies': ['', ''], 2026-02-21T18:42:05.0475608Z 'loop_orders': [[1, 0]], 2026-02-21T18:42:05.0475743Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:42:05.0475883Z 'num_sm_multiplier': 128, 2026-02-21T18:42:05.0476030Z 'num_stages': 1, 2026-02-21T18:42:05.0476143Z 'num_warps': 4, 2026-02-21T18:42:05.0476309Z configs/s 2026-02-21T18:42:05.0476534Z 'pid_type': 'persistent_blocked', 2026-02-21T18:42:05.0476707Z 'range_flattens': [False, None], 2026-02-21T18:42:05.0476861Z 'range_multi_buffers': [False, True], 2026-02-21T18:42:05.0477018Z 'range_num_stages': [0, 0], 2026-02-21T18:42:05.0477157Z 'range_unroll_factors': [0, 0], 2026-02-21T18:42:05.0477310Z 'range_warp_specializes': [], 2026-02-21T18:42:05.0477450Z 'waves_per_eu': 2} 2026-02-21T18:42:05.0642868Z [42s] Fitting surrogate: 198 points, 198 targets 2026-02-21T18:42:06.0958966Z [43s] Generation 2 starting: 89 neighbors, 5 active search path(s) 2026-02-21T18:42:16.5358646Z Generation 2: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 14.1 configs/s 2026-02-21T18:42:21.5875943Z Generation 2: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 18.4 configs/s 2026-02-21T18:42:26.4465586Z Generation 2: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 223.6 
2026-02-21T18:42:26.4466341Z configs/s 2026-02-21T18:42:27.0562375Z [64s] Generation 2 complete: 2026-02-21T18:42:27.0562749Z error=20 2026-02-21T18:42:27.0562973Z ok=74 2026-02-21T18:42:27.0563174Z min=0.0663 2026-02-21T18:42:27.0563384Z mid=0.1004 2026-02-21T18:42:27.0563583Z max=65.8312 2026-02-21T18:42:27.0563827Z best={'block_sizes': [64, 128, 64], 2026-02-21T18:42:27.0564208Z 'indexing': ['block_ptr', 'block_ptr', 'block_ptr'], 2026-02-21T18:42:27.0564577Z 'l2_groupings': [32], 2026-02-21T18:42:27.0564851Z 'load_eviction_policies': ['', ''], 2026-02-21T18:42:27.0565159Z 'loop_orders': [[1, 0]], 2026-02-21T18:42:27.0565283Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:42:27.0565385Z 'num_sm_multiplier': 128, 2026-02-21T18:42:27.0565486Z 'num_stages': 1, 2026-02-21T18:42:27.0565569Z 'num_warps': 4, 2026-02-21T18:42:27.0565673Z 'pid_type': 'persistent_blocked', 2026-02-21T18:42:27.0565789Z 'range_flattens': [False, None], 2026-02-21T18:42:27.0565901Z 'range_multi_buffers': [False, True], 2026-02-21T18:42:27.0566021Z 'range_num_stages': [0, 0], 2026-02-21T18:42:27.0566336Z 'range_unroll_factors': [0, 0], 2026-02-21T18:42:27.0566452Z 'range_warp_specializes': [], 2026-02-21T18:42:27.0566553Z 'waves_per_eu': 2} 2026-02-21T18:42:27.1132642Z [64s] Fitting surrogate: 292 points, 292 targets 2026-02-21T18:42:28.0812785Z [65s] Generation 3 starting: 82 neighbors, 5 active search path(s) 2026-02-21T18:42:46.4279979Z Generation 3: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86/86 0.5 configs/s 2026-02-21T18:42:50.4310061Z Generation 3: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 86/86 22.4 configs/s 2026-02-21T18:42:53.7546371Z Generation 3: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 280.2 2026-02-21T18:42:53.7546840Z configs/s 2026-02-21T18:42:54.3714180Z [92s] Generation 3 complete: 2026-02-21T18:42:54.3714541Z error=26 2026-02-21T18:42:54.3714748Z ok=61 2026-02-21T18:42:54.3714960Z min=0.0579 2026-02-21T18:42:54.3715170Z mid=0.0910 
2026-02-21T18:42:54.3715399Z max=33.9256 2026-02-21T18:42:54.3715676Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:42:54.3716409Z 'indexing': ['block_ptr', 'pointer', 'block_ptr'], 2026-02-21T18:42:54.3716779Z 'l2_groupings': [32], 2026-02-21T18:42:54.3717046Z 'load_eviction_policies': ['', ''], 2026-02-21T18:42:54.3717351Z 'loop_orders': [[1, 0]], 2026-02-21T18:42:54.3717621Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:42:54.3717901Z 'num_sm_multiplier': 8, 2026-02-21T18:42:54.3718161Z 'num_stages': 1, 2026-02-21T18:42:54.3718398Z 'num_warps': 4, 2026-02-21T18:42:54.3718678Z 'pid_type': 'persistent_blocked', 2026-02-21T18:42:54.3718991Z 'range_flattens': [False, None], 2026-02-21T18:42:54.3719297Z 'range_multi_buffers': [True, None], 2026-02-21T18:42:54.3719603Z 'range_num_stages': [0, 0], 2026-02-21T18:42:54.3719891Z 'range_unroll_factors': [0, 0], 2026-02-21T18:42:54.3720181Z 'range_warp_specializes': [], 2026-02-21T18:42:54.3720461Z 'waves_per_eu': 3} 2026-02-21T18:42:54.4224038Z [92s] Fitting surrogate: 379 points, 379 targets 2026-02-21T18:42:55.3715250Z [93s] Generation 4 starting: 88 neighbors, 5 active search path(s) 2026-02-21T18:43:04.4907680Z Generation 4: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 12.8 configs/s 2026-02-21T18:43:08.8127046Z Generation 4: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 21.6 configs/s 2026-02-21T18:43:12.4821647Z Generation 4: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 257.2 2026-02-21T18:43:12.4821904Z configs/s 2026-02-21T18:43:13.1097773Z [110s] Generation 4 complete: 2026-02-21T18:43:13.1098226Z error=22 2026-02-21T18:43:13.1098432Z ok=71 2026-02-21T18:43:13.1098638Z min=0.0490 2026-02-21T18:43:13.1098850Z mid=0.0868 2026-02-21T18:43:13.1099501Z max=1.1006 2026-02-21T18:43:13.1099732Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:43:13.1100105Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:43:13.1100585Z 'l2_groupings': [32], 2026-02-21T18:43:13.1100742Z 
'load_eviction_policies': ['', ''], 2026-02-21T18:43:13.1100888Z 'loop_orders': [[1, 0]], 2026-02-21T18:43:13.1101001Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:43:13.1101118Z 'num_sm_multiplier': 128, 2026-02-21T18:43:13.1101226Z 'num_stages': 1, 2026-02-21T18:43:13.1101321Z 'num_warps': 4, 2026-02-21T18:43:13.1101426Z 'pid_type': 'persistent_blocked', 2026-02-21T18:43:13.1101554Z 'range_flattens': [False, None], 2026-02-21T18:43:13.1101680Z 'range_multi_buffers': [False, True], 2026-02-21T18:43:13.1101807Z 'range_num_stages': [0, 0], 2026-02-21T18:43:13.1101922Z 'range_unroll_factors': [0, 0], 2026-02-21T18:43:13.1102038Z 'range_warp_specializes': [], 2026-02-21T18:43:13.1102152Z 'waves_per_eu': 2} 2026-02-21T18:43:13.1180793Z [110s] Fitting surrogate: 472 points, 472 targets 2026-02-21T18:43:14.7356952Z [112s] Generation 5 starting: 85 neighbors, 5 active search path(s) 2026-02-21T18:43:22.9684657Z Generation 5: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87/87 22.8 configs/s 2026-02-21T18:43:27.1693916Z Generation 5: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 87/87 21.0 configs/s 2026-02-21T18:43:30.4574353Z Generation 5: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 285.3 2026-02-21T18:43:30.4574977Z configs/s 2026-02-21T18:43:31.0316650Z [128s] Generation 5 complete: 2026-02-21T18:43:31.0317023Z error=19 2026-02-21T18:43:31.0317304Z ok=71 2026-02-21T18:43:31.0317554Z min=0.0514 2026-02-21T18:43:31.0317765Z mid=0.0880 2026-02-21T18:43:31.0317961Z max=1.6570 2026-02-21T18:43:31.0318188Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:43:31.0318574Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:43:31.0318940Z 'l2_groupings': [32], 2026-02-21T18:43:31.0319247Z 'load_eviction_policies': ['', ''], 2026-02-21T18:43:31.0319559Z 'loop_orders': [[1, 0]], 2026-02-21T18:43:31.0319834Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:43:31.0320131Z 'num_sm_multiplier': 128, 2026-02-21T18:43:31.0320396Z 'num_stages': 1, 
2026-02-21T18:43:31.0320690Z 'num_warps': 4, 2026-02-21T18:43:31.0320952Z 'pid_type': 'persistent_blocked', 2026-02-21T18:43:31.0321334Z 'range_flattens': [False, None], 2026-02-21T18:43:31.0321637Z 'range_multi_buffers': [None, True], 2026-02-21T18:43:31.0321948Z 'range_num_stages': [0, 0], 2026-02-21T18:43:31.0322222Z 'range_unroll_factors': [0, 0], 2026-02-21T18:43:31.0322513Z 'range_warp_specializes': [], 2026-02-21T18:43:31.0322787Z 'waves_per_eu': 2} 2026-02-21T18:43:31.0859453Z [128s] Fitting surrogate: 562 points, 562 targets 2026-02-21T18:43:32.1115064Z [129s] Generation 6 starting: 88 neighbors, 5 active search path(s) 2026-02-21T18:43:41.0169437Z Generation 6: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 8.6 configs/s 2026-02-21T18:43:45.0849712Z Generation 6: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 23.0 configs/s 2026-02-21T18:43:48.2989010Z Generation 6: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 289.8 2026-02-21T18:43:48.2989433Z configs/s 2026-02-21T18:43:48.8951137Z [146s] Generation 6 complete: 2026-02-21T18:43:48.8951481Z error=25 2026-02-21T18:43:48.8951751Z ok=68 2026-02-21T18:43:48.8951955Z min=0.0515 2026-02-21T18:43:48.8952193Z mid=0.0831 2026-02-21T18:43:48.8952395Z max=1.2922 2026-02-21T18:43:48.8952620Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:43:48.8952999Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:43:48.8953364Z 'l2_groupings': [32], 2026-02-21T18:43:48.8953632Z 'load_eviction_policies': ['', ''], 2026-02-21T18:43:48.8953933Z 'loop_orders': [[1, 0]], 2026-02-21T18:43:48.8954208Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:43:48.8954489Z 'num_sm_multiplier': 128, 2026-02-21T18:43:48.8954773Z 'num_stages': 1, 2026-02-21T18:43:48.8955005Z 'num_warps': 4, 2026-02-21T18:43:48.8955256Z 'pid_type': 'persistent_blocked', 2026-02-21T18:43:48.8955567Z 'range_flattens': [False, None], 2026-02-21T18:43:48.8955879Z 'range_multi_buffers': [True, True], 2026-02-21T18:43:48.8956194Z 
'range_num_stages': [0, 0], 2026-02-21T18:43:48.8956475Z 'range_unroll_factors': [0, 0], 2026-02-21T18:43:48.8956771Z 'range_warp_specializes': [], 2026-02-21T18:43:48.8957059Z 'waves_per_eu': 2} 2026-02-21T18:43:48.9035658Z [146s] Fitting surrogate: 655 points, 655 targets 2026-02-21T18:43:50.4804384Z [148s] Generation 7 starting: 87 neighbors, 5 active search path(s) 2026-02-21T18:43:59.4756616Z Generation 7: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90/90 11.1 configs/s 2026-02-21T18:44:03.7682109Z Generation 7: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 90/90 21.8 configs/s 2026-02-21T18:44:05.8970090Z Generation 7: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 424.5 2026-02-21T18:44:05.8970350Z configs/s 2026-02-21T18:44:06.3908453Z [164s] Generation 7 complete: 2026-02-21T18:44:06.3908635Z error=21 2026-02-21T18:44:06.3908762Z ok=71 2026-02-21T18:44:06.3908847Z min=0.0521 2026-02-21T18:44:06.3909177Z mid=0.0958 2026-02-21T18:44:06.3909265Z max=2.0321 2026-02-21T18:44:06.3909354Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:44:06.3909511Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:44:06.3909665Z 'l2_groupings': [32], 2026-02-21T18:44:06.3909778Z 'load_eviction_policies': ['', ''], 2026-02-21T18:44:06.3909904Z 'loop_orders': [[1, 0]], 2026-02-21T18:44:06.3910013Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:44:06.3910132Z 'num_sm_multiplier': 128, 2026-02-21T18:44:06.3910243Z 'num_stages': 1, 2026-02-21T18:44:06.3910333Z 'num_warps': 4, 2026-02-21T18:44:06.3910437Z 'pid_type': 'persistent_blocked', 2026-02-21T18:44:06.3910568Z 'range_flattens': [False, False], 2026-02-21T18:44:06.3910703Z 'range_multi_buffers': [True, True], 2026-02-21T18:44:06.3910829Z 'range_num_stages': [0, 0], 2026-02-21T18:44:06.3910944Z 'range_unroll_factors': [0, 0], 2026-02-21T18:44:06.3911068Z 'range_warp_specializes': [], 2026-02-21T18:44:06.3911184Z 'waves_per_eu': 2} 2026-02-21T18:44:06.4247834Z [164s] Fitting surrogate: 747 points, 747 targets 
2026-02-21T18:44:07.3555521Z [165s] Generation 8 starting: 91 neighbors, 5 active search path(s) 2026-02-21T18:44:23.3955692Z Generation 8: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94/94 1.0 configs/s 2026-02-21T18:44:28.3060539Z Generation 8: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━━ 94/94 19.8 configs/s 2026-02-21T18:44:30.3901996Z Generation 8: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 431.5 2026-02-21T18:44:30.3902627Z configs/s 2026-02-21T18:44:30.9324187Z [188s] Generation 8 complete: 2026-02-21T18:44:30.9324538Z error=15 2026-02-21T18:44:30.9325133Z ok=81 2026-02-21T18:44:30.9325341Z min=0.0516 2026-02-21T18:44:30.9325562Z mid=0.0926 2026-02-21T18:44:30.9325768Z max=3.9485 2026-02-21T18:44:30.9325994Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:44:30.9326560Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:44:30.9326965Z 'l2_groupings': [32], 2026-02-21T18:44:30.9327252Z 'load_eviction_policies': ['', ''], 2026-02-21T18:44:30.9327537Z 'loop_orders': [[1, 0]], 2026-02-21T18:44:30.9327783Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:44:30.9328032Z 'num_sm_multiplier': 128, 2026-02-21T18:44:30.9328268Z 'num_stages': 1, 2026-02-21T18:44:30.9328471Z 'num_warps': 4, 2026-02-21T18:44:30.9328732Z 'pid_type': 'persistent_blocked', 2026-02-21T18:44:30.9329010Z 'range_flattens': [False, False], 2026-02-21T18:44:30.9329274Z 'range_multi_buffers': [None, True], 2026-02-21T18:44:30.9329547Z 'range_num_stages': [0, 0], 2026-02-21T18:44:30.9329790Z 'range_unroll_factors': [0, 0], 2026-02-21T18:44:30.9330053Z 'range_warp_specializes': [], 2026-02-21T18:44:30.9330302Z 'waves_per_eu': 2} 2026-02-21T18:44:30.9645253Z [188s] Fitting surrogate: 843 points, 843 targets 2026-02-21T18:44:31.8461160Z [189s] Generation 9 starting: 82 neighbors, 5 active search path(s) 2026-02-21T18:44:40.5377846Z Generation 9: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 85/85 31.1 configs/s 2026-02-21T18:44:45.3010149Z Generation 9: exploring neighbors 100% 
━━━━━━━━━━━━━━━━━━━━ 85/85 18.2 configs/s 2026-02-21T18:44:47.7336971Z Generation 9: verifying top configs 100% ━━━━━━━━━━━━━━ 1000/1000 502.0 2026-02-21T18:44:47.7337343Z configs/s 2026-02-21T18:44:48.2164898Z [206s] Generation 9 complete: 2026-02-21T18:44:48.2165229Z error=13 2026-02-21T18:44:48.2165349Z ok=74 2026-02-21T18:44:48.2165476Z min=0.0509 2026-02-21T18:44:48.2165605Z mid=0.1038 2026-02-21T18:44:48.2165729Z max=1.2979 2026-02-21T18:44:48.2165865Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:44:48.2166134Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:44:48.2166356Z 'l2_groupings': [32], 2026-02-21T18:44:48.2166516Z 'load_eviction_policies': ['', ''], 2026-02-21T18:44:48.2166714Z 'loop_orders': [[1, 0]], 2026-02-21T18:44:48.2166878Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:44:48.2167626Z 'num_sm_multiplier': 128, 2026-02-21T18:44:48.2167783Z 'num_stages': 1, 2026-02-21T18:44:48.2167924Z 'num_warps': 4, 2026-02-21T18:44:48.2168075Z 'pid_type': 'persistent_blocked', 2026-02-21T18:44:48.2168260Z 'range_flattens': [False, False], 2026-02-21T18:44:48.2168444Z 'range_multi_buffers': [None, True], 2026-02-21T18:44:48.2168633Z 'range_num_stages': [0, 0], 2026-02-21T18:44:48.2168803Z 'range_unroll_factors': [0, 0], 2026-02-21T18:44:48.2168991Z 'range_warp_specializes': [], 2026-02-21T18:44:48.2169158Z 'waves_per_eu': 2} 2026-02-21T18:44:48.2421812Z [206s] Fitting surrogate: 930 points, 930 targets 2026-02-21T18:44:49.0191604Z [206s] Generation 10 starting: 63 neighbors, 4 active search path(s) 2026-02-21T18:44:56.8804952Z Generation 10: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 65/65 13.4 configs/s 2026-02-21T18:45:00.4010570Z Generation 10: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 65/65 19.3 configs/s 2026-02-21T18:45:01.4519513Z Generation 10: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 790.9 2026-02-21T18:45:01.4520131Z configs/s 2026-02-21T18:45:01.9705047Z [219s] Generation 10 complete: 2026-02-21T18:45:01.9705403Z 
error=9 2026-02-21T18:45:01.9705609Z ok=59 2026-02-21T18:45:01.9705811Z min=0.0526 2026-02-21T18:45:01.9706022Z mid=0.0960 2026-02-21T18:45:01.9706222Z max=0.9408 2026-02-21T18:45:01.9706462Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:45:01.9706836Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:45:01.9707207Z 'l2_groupings': [32], 2026-02-21T18:45:01.9707489Z 'load_eviction_policies': ['', ''], 2026-02-21T18:45:01.9707798Z 'loop_orders': [[1, 0]], 2026-02-21T18:45:01.9708390Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:45:01.9708678Z 'num_sm_multiplier': 128, 2026-02-21T18:45:01.9708950Z 'num_stages': 1, 2026-02-21T18:45:01.9709182Z 'num_warps': 4, 2026-02-21T18:45:01.9709549Z 'pid_type': 'persistent_blocked', 2026-02-21T18:45:01.9709819Z 'range_flattens': [False, False], 2026-02-21T18:45:01.9710091Z 'range_multi_buffers': [None, True], 2026-02-21T18:45:01.9710357Z 'range_num_stages': [0, 0], 2026-02-21T18:45:01.9710594Z 'range_unroll_factors': [0, 0], 2026-02-21T18:45:01.9710848Z 'range_warp_specializes': [], 2026-02-21T18:45:01.9711085Z 'waves_per_eu': 2} 2026-02-21T18:45:01.9953633Z [219s] Fitting surrogate: 998 points, 998 targets 2026-02-21T18:45:02.5952299Z [220s] Generation 11 starting: 46 neighbors, 3 active search path(s) 2026-02-21T18:45:08.4561579Z Generation 11: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 14.2 configs/s 2026-02-21T18:45:10.9306304Z Generation 11: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 48/48 19.8 configs/s 2026-02-21T18:45:13.0803157Z Generation 11: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 421.2 2026-02-21T18:45:13.0803919Z configs/s 2026-02-21T18:45:13.6706471Z [231s] Generation 11 complete: 2026-02-21T18:45:13.6707101Z error=8 2026-02-21T18:45:13.6707277Z ok=42 2026-02-21T18:45:13.6707442Z min=0.0514 2026-02-21T18:45:13.6707624Z mid=0.0750 2026-02-21T18:45:13.6707787Z max=0.1774 2026-02-21T18:45:13.6707983Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:45:13.6708366Z 'indexing': 
['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:45:13.6708645Z 'l2_groupings': [32], 2026-02-21T18:45:13.6708907Z 'load_eviction_policies': ['', ''], 2026-02-21T18:45:13.6709167Z 'loop_orders': [[1, 0]], 2026-02-21T18:45:13.6709399Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:45:13.6709636Z 'num_sm_multiplier': 128, 2026-02-21T18:45:13.6709863Z 'num_stages': 1, 2026-02-21T18:45:13.6710051Z 'num_warps': 4, 2026-02-21T18:45:13.6710286Z 'pid_type': 'persistent_blocked', 2026-02-21T18:45:13.6710548Z 'range_flattens': [False, False], 2026-02-21T18:45:13.6710806Z 'range_multi_buffers': [None, True], 2026-02-21T18:45:13.6711081Z 'range_num_stages': [0, 0], 2026-02-21T18:45:13.6711326Z 'range_unroll_factors': [0, 0], 2026-02-21T18:45:13.6711686Z 'range_warp_specializes': [], 2026-02-21T18:45:13.6711919Z 'waves_per_eu': 2} 2026-02-21T18:45:13.7187021Z [231s] Fitting surrogate: 1048 points, 1048 targets 2026-02-21T18:45:14.3555177Z [232s] Generation 12 starting: 48 neighbors, 3 active search path(s) 2026-02-21T18:45:20.3175022Z Generation 12: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 48/48 8.5 configs/s 2026-02-21T18:45:22.7814014Z Generation 12: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 48/48 19.5 configs/s 2026-02-21T18:45:25.9209381Z Generation 12: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 297.5 2026-02-21T18:45:25.9209632Z configs/s 2026-02-21T18:45:26.5205530Z [244s] Generation 12 complete: 2026-02-21T18:45:26.5205906Z error=9 2026-02-21T18:45:26.5206107Z ok=43 2026-02-21T18:45:26.5206316Z min=0.0493 2026-02-21T18:45:26.5206536Z mid=0.0667 2026-02-21T18:45:26.5206760Z max=0.1450 2026-02-21T18:45:26.5207003Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:45:26.5207396Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:45:26.5207762Z 'l2_groupings': [32], 2026-02-21T18:45:26.5208033Z 'load_eviction_policies': ['', ''], 2026-02-21T18:45:26.5208347Z 'loop_orders': [[1, 0]], 2026-02-21T18:45:26.5208625Z 'matrix_instr_nonkdim': 16, 
2026-02-21T18:45:26.5208912Z 'num_sm_multiplier': 128, 2026-02-21T18:45:26.5209178Z 'num_stages': 1, 2026-02-21T18:45:26.5209409Z 'num_warps': 4, 2026-02-21T18:45:26.5209670Z 'pid_type': 'persistent_blocked', 2026-02-21T18:45:26.5209979Z 'range_flattens': [False, False], 2026-02-21T18:45:26.5210284Z 'range_multi_buffers': [None, True], 2026-02-21T18:45:26.5210587Z 'range_num_stages': [0, 0], 2026-02-21T18:45:26.5211165Z 'range_unroll_factors': [0, 0], 2026-02-21T18:45:26.5211453Z 'range_warp_specializes': [], 2026-02-21T18:45:26.5211731Z 'waves_per_eu': 2} 2026-02-21T18:45:26.5860693Z [244s] Fitting surrogate: 1100 points, 1100 targets 2026-02-21T18:45:27.1125223Z [244s] Generation 13 starting: 35 neighbors, 3 active search path(s) 2026-02-21T18:45:31.4317158Z Generation 13: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/35 7.7 configs/s 2026-02-21T18:45:33.4381782Z Generation 13: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 35/35 18.3 configs/s 2026-02-21T18:45:37.0862376Z Generation 13: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 338.0 2026-02-21T18:45:37.0862888Z configs/s 2026-02-21T18:45:37.6223062Z [255s] Generation 13 complete: 2026-02-21T18:45:37.6223333Z error=3 2026-02-21T18:45:37.6223490Z ok=36 2026-02-21T18:45:37.6223654Z min=0.0516 2026-02-21T18:45:37.6223817Z mid=0.0622 2026-02-21T18:45:37.6223987Z max=0.8930 2026-02-21T18:45:37.6224156Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:45:37.6224440Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:45:37.6224719Z 'l2_groupings': [32], 2026-02-21T18:45:37.6225236Z 'load_eviction_policies': ['', ''], 2026-02-21T18:45:37.6225577Z 'loop_orders': [[1, 0]], 2026-02-21T18:45:37.6225784Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:45:37.6225993Z 'num_sm_multiplier': 128, 2026-02-21T18:45:37.6226197Z 'num_stages': 1, 2026-02-21T18:45:37.6226388Z 'num_warps': 4, 2026-02-21T18:45:37.6226579Z 'pid_type': 'persistent_blocked', 2026-02-21T18:45:37.6226807Z 'range_flattens': [False, False], 
2026-02-21T18:45:37.6227029Z 'range_multi_buffers': [None, True], 2026-02-21T18:45:37.6227255Z 'range_num_stages': [0, 0], 2026-02-21T18:45:37.6227461Z 'range_unroll_factors': [0, 0], 2026-02-21T18:45:37.6227678Z 'range_warp_specializes': [], 2026-02-21T18:45:37.6227878Z 'waves_per_eu': 2} 2026-02-21T18:45:37.6751748Z [255s] Fitting surrogate: 1139 points, 1139 targets 2026-02-21T18:45:38.1310246Z [255s] Generation 14 starting: 26 neighbors, 2 active search path(s) 2026-02-21T18:45:41.2729086Z Generation 14: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26/26 2.5 configs/s 2026-02-21T18:45:42.8711182Z Generation 14: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 26/26 18.0 configs/s 2026-02-21T18:45:44.7523662Z Generation 14: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 459.4 2026-02-21T18:45:44.7549716Z configs/s 2026-02-21T18:45:44.9423644Z [262s] Generation 14 complete: 2026-02-21T18:45:44.9445949Z error=2 2026-02-21T18:45:44.9446092Z ok=27 2026-02-21T18:45:44.9446207Z min=0.0487 2026-02-21T18:45:44.9446320Z mid=0.0688 2026-02-21T18:45:44.9446428Z max=0.6030 2026-02-21T18:45:44.9446549Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:45:44.9446748Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:45:44.9446940Z 'l2_groupings': [32], 2026-02-21T18:45:44.9447588Z 'load_eviction_policies': ['', ''], 2026-02-21T18:45:44.9447746Z 'loop_orders': [[1, 0]], 2026-02-21T18:45:44.9447892Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:45:44.9448144Z 'num_sm_multiplier': 128, 2026-02-21T18:45:44.9448283Z 'num_stages': 1, 2026-02-21T18:45:44.9448426Z 'num_warps': 4, 2026-02-21T18:45:44.9448560Z 'pid_type': 'persistent_blocked', 2026-02-21T18:45:44.9448724Z 'range_flattens': [False, False], 2026-02-21T18:45:44.9448881Z 'range_multi_buffers': [None, True], 2026-02-21T18:45:44.9449044Z 'range_num_stages': [0, 0], 2026-02-21T18:45:44.9449186Z 'range_unroll_factors': [0, 0], 2026-02-21T18:45:44.9449339Z 'range_warp_specializes': [], 2026-02-21T18:45:44.9449480Z 
'waves_per_eu': 2} 2026-02-21T18:45:44.9686717Z [262s] Fitting surrogate: 1168 points, 1168 targets 2026-02-21T18:45:45.7385288Z [263s] Generation 15 starting: 33 neighbors, 2 active search path(s) 2026-02-21T18:45:48.7277057Z Generation 15: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 33/33 11.3 configs/s 2026-02-21T18:45:50.8003764Z Generation 15: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 33/33 17.1 configs/s 2026-02-21T18:45:53.1928031Z Generation 15: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 381.7 2026-02-21T18:45:53.1928899Z configs/s 2026-02-21T18:45:53.7173133Z [271s] Generation 15 complete: 2026-02-21T18:45:53.7173359Z error=1 2026-02-21T18:45:53.7173502Z ok=35 2026-02-21T18:45:53.7173646Z min=0.0513 2026-02-21T18:45:53.7173792Z mid=0.0707 2026-02-21T18:45:53.7173934Z max=1.0760 2026-02-21T18:45:53.7174095Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:45:53.7174358Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:45:53.7174618Z 'l2_groupings': [32], 2026-02-21T18:45:53.7174823Z 'load_eviction_policies': ['', ''], 2026-02-21T18:45:53.7175043Z 'loop_orders': [[1, 0]], 2026-02-21T18:45:53.7175237Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:45:53.7175440Z 'num_sm_multiplier': 128, 2026-02-21T18:45:53.7175640Z 'num_stages': 1, 2026-02-21T18:45:53.7175803Z 'num_warps': 4, 2026-02-21T18:45:53.7175980Z 'pid_type': 'persistent_blocked', 2026-02-21T18:45:53.7176206Z 'range_flattens': [False, False], 2026-02-21T18:45:53.7176426Z 'range_multi_buffers': [None, True], 2026-02-21T18:45:53.7176871Z 'range_num_stages': [0, 0], 2026-02-21T18:45:53.7177069Z 'range_unroll_factors': [0, 0], 2026-02-21T18:45:53.7177284Z 'range_warp_specializes': [], 2026-02-21T18:45:53.7177483Z 'waves_per_eu': 2} 2026-02-21T18:45:53.7663886Z [271s] Fitting surrogate: 1204 points, 1204 targets 2026-02-21T18:45:54.0941451Z [271s] Generation 16 starting: 17 neighbors, 1 active search path(s) 2026-02-21T18:45:56.3662396Z Generation 16: precompiling 100% 
━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17/17 9.9 configs/s 2026-02-21T18:45:57.3164834Z Generation 16: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 17/17 18.9 configs/s 2026-02-21T18:45:58.7892082Z Generation 16: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 599.7 2026-02-21T18:45:58.7892668Z configs/s 2026-02-21T18:45:59.3184521Z [277s] Generation 16 complete: 2026-02-21T18:45:59.3184900Z error=1 2026-02-21T18:45:59.3185129Z ok=18 2026-02-21T18:45:59.3185349Z min=0.0492 2026-02-21T18:45:59.3185581Z mid=0.0575 2026-02-21T18:45:59.3185775Z max=0.1012 2026-02-21T18:45:59.3186002Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:45:59.3186370Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:45:59.3186739Z 'l2_groupings': [32], 2026-02-21T18:45:59.3187023Z 'load_eviction_policies': ['', ''], 2026-02-21T18:45:59.3187329Z 'loop_orders': [[1, 0]], 2026-02-21T18:45:59.3187602Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:45:59.3187896Z 'num_sm_multiplier': 128, 2026-02-21T18:45:59.3188266Z 'num_stages': 1, 2026-02-21T18:45:59.3188503Z 'num_warps': 4, 2026-02-21T18:45:59.3188758Z 'pid_type': 'persistent_blocked', 2026-02-21T18:45:59.3189069Z 'range_flattens': [False, False], 2026-02-21T18:45:59.3189737Z 'range_multi_buffers': [None, True], 2026-02-21T18:45:59.3190047Z 'range_num_stages': [0, 0], 2026-02-21T18:45:59.3190333Z 'range_unroll_factors': [0, 0], 2026-02-21T18:45:59.3190768Z 'range_warp_specializes': [], 2026-02-21T18:45:59.3191057Z 'waves_per_eu': 2} 2026-02-21T18:45:59.3554573Z [277s] Fitting surrogate: 1223 points, 1223 targets 2026-02-21T18:45:59.6475959Z [277s] Generation 17 starting: 15 neighbors, 1 active search path(s) 2026-02-21T18:46:01.7058540Z Generation 17: precompiling 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━ 15/15 10.6 configs/s 2026-02-21T18:46:02.4669197Z Generation 17: exploring neighbors 100% ━━━━━━━━━━━━━━━━━━━ 15/15 23.3 configs/s 2026-02-21T18:46:03.1001176Z Generation 17: verifying top configs 100% ━━━━━━━━━━━━━ 1000/1000 1230.2 
2026-02-21T18:46:03.1001746Z configs/s 2026-02-21T18:46:03.6008777Z [281s] Generation 17 complete: 2026-02-21T18:46:03.6009054Z error=4 2026-02-21T18:46:03.6009226Z ok=13 2026-02-21T18:46:03.6009381Z min=0.0514 2026-02-21T18:46:03.6009535Z mid=0.0666 2026-02-21T18:46:03.6009687Z max=0.1978 2026-02-21T18:46:03.6009855Z best={'block_sizes': [128, 128, 64], 2026-02-21T18:46:03.6010451Z 'indexing': ['pointer', 'block_ptr', 'block_ptr'], 2026-02-21T18:46:03.6010740Z 'l2_groupings': [32], 2026-02-21T18:46:03.6010939Z 'load_eviction_policies': ['', ''], 2026-02-21T18:46:03.6011168Z 'loop_orders': [[1, 0]], 2026-02-21T18:46:03.6011370Z 'matrix_instr_nonkdim': 16, 2026-02-21T18:46:03.6011583Z 'num_sm_multiplier': 128, 2026-02-21T18:46:03.6011776Z 'num_stages': 1, 2026-02-21T18:46:03.6011946Z 'num_warps': 4, 2026-02-21T18:46:03.6012133Z 'pid_type': 'persistent_blocked', 2026-02-21T18:46:03.6012365Z 'range_flattens': [False, False], 2026-02-21T18:46:03.6012594Z 'range_multi_buffers': [None, True], 2026-02-21T18:46:03.6012818Z 'range_num_stages': [0, 0], 2026-02-21T18:46:03.6013025Z 'range_unroll_factors': [0, 0], 2026-02-21T18:46:03.6013243Z 'range_warp_specializes': [], 2026-02-21T18:46:03.6013447Z 'waves_per_eu': 2} 2026-02-21T18:46:03.6210853Z [281s] Fitting surrogate: 1240 points, 1240 targets 2026-02-21T18:46:03.7705189Z [281s] Autotuning complete in 281.6s after searching 1196 configs. 
2026-02-21T18:46:03.7705550Z One can hardcode the best config and skip autotuning with: 2026-02-21T18:46:03.7707038Z @helion.kernel(config=helion.Config(block_sizes=[128, 128, 64], indexing=['pointer', 'block_ptr', 'block_ptr'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=16, num_sm_multiplier=128, num_stages=1, num_warps=4, pid_type='persistent_blocked', range_flattens=[False, False], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T18:46:03.7708363Z 2026-02-21T18:46:03.7708669Z [281s] Code of selected kernel: /tmp/torchinductor_root/bs/cbs7khf6rsxfufizmkhk5egcdavu6qxcpjxwijxe3prso523q7eq.py 2026-02-21T18:46:03.7815179Z from __future__ import annotations 2026-02-21T18:46:03.7815296Z 2026-02-21T18:46:03.7815349Z import torch 2026-02-21T18:46:03.7815473Z import helion 2026-02-21T18:46:03.7815592Z import triton 2026-02-21T18:46:03.7815726Z import triton.language as tl 2026-02-21T18:46:03.7815947Z from helion.runtime import default_launcher as _default_launcher 2026-02-21T18:46:03.7816134Z 2026-02-21T18:46:03.7816199Z _BLOCK_SIZE_1 = tl.constexpr(128) 2026-02-21T18:46:03.7816367Z _BLOCK_SIZE_0 = tl.constexpr(128) 2026-02-21T18:46:03.7816526Z _BLOCK_SIZE_2 = tl.constexpr(64) 2026-02-21T18:46:03.7816636Z 2026-02-21T18:46:03.7816684Z @triton.jit 2026-02-21T18:46:03.7816852Z def _helion_matmul(x, y, out, _NUM_SM: tl.constexpr): 2026-02-21T18:46:03.7817102Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:46:03.7817389Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:46:03.7817656Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:46:03.7817981Z # src[matmul.py:63-67]: ... 
2026-02-21T18:46:03.7818201Z total_pids = tl.cdiv(12288, _BLOCK_SIZE_1) * tl.cdiv(1024, _BLOCK_SIZE_0) 2026-02-21T18:46:03.7818526Z block_size = tl.cdiv(total_pids, _NUM_SM * 128) 2026-02-21T18:46:03.7818738Z start_pid = tl.program_id(0) * block_size 2026-02-21T18:46:03.7818955Z end_pid = tl.minimum(start_pid + block_size, total_pids) 2026-02-21T18:46:03.7819217Z for virtual_pid in tl.range(start_pid, end_pid, flatten=False): 2026-02-21T18:46:03.7819483Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:46:03.7819716Z num_pid_m = tl.cdiv(12288, _BLOCK_SIZE_1) 2026-02-21T18:46:03.7819909Z num_pid_n = tl.cdiv(1024, _BLOCK_SIZE_0) 2026-02-21T18:46:03.7820095Z inner_2d_pid = virtual_pid 2026-02-21T18:46:03.7820266Z num_pid_in_group = 32 * num_pid_n 2026-02-21T18:46:03.7820454Z group_id = inner_2d_pid // num_pid_in_group 2026-02-21T18:46:03.7820647Z first_pid_m = group_id * 32 2026-02-21T18:46:03.7820837Z group_size_m = min(num_pid_m - first_pid_m, 32) 2026-02-21T18:46:03.7821092Z pid_0 = first_pid_m + inner_2d_pid % num_pid_in_group % group_size_m 2026-02-21T18:46:03.7821404Z pid_1 = inner_2d_pid % num_pid_in_group // group_size_m 2026-02-21T18:46:03.7821621Z offset_1 = pid_0 * _BLOCK_SIZE_1 2026-02-21T18:46:03.7821791Z offset_0 = pid_1 * _BLOCK_SIZE_0 2026-02-21T18:46:03.7822013Z indices_0 = (offset_0 + tl.arange(0, _BLOCK_SIZE_0)).to(tl.int32) 2026-02-21T18:46:03.7822308Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:46:03.7822589Z acc = tl.full([_BLOCK_SIZE_0, _BLOCK_SIZE_1], 0.0, tl.float32) 2026-02-21T18:46:03.7822823Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:46:03.7823089Z # src[matmul.py:66]: acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n]) 2026-02-21T18:46:03.7823465Z for offset_2 in tl.range(0, 1024, _BLOCK_SIZE_2, disallow_acc_multi_buffer=False, flatten=False): 2026-02-21T18:46:03.7823814Z indices_2 = offset_2 + tl.arange(0, _BLOCK_SIZE_2).to(tl.int32) 
2026-02-21T18:46:03.7824027Z acc_copy = acc 2026-02-21T18:46:03.7824161Z acc_copy_0 = acc_copy 2026-02-21T18:46:03.7824395Z # src[matmul.py:66]: acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n]) 2026-02-21T18:46:03.7824680Z load = tl.load(x + (indices_0[:, None] * 1024 + indices_2[None, :] * 1), None) 2026-02-21T18:46:03.7825117Z load_1 = tl.load(tl.make_block_ptr(y, [1024, 12288], [1, 1024], [offset_2, offset_1], [_BLOCK_SIZE_2, _BLOCK_SIZE_1], [0, 1]), boundary_check=[0, 1], padding_option='zero') 2026-02-21T18:46:03.7825648Z acc = tl.dot(tl.cast(load, tl.float16), tl.cast(load_1, tl.float16), acc=acc_copy_0, input_precision='tf32', out_dtype=tl.float32) 2026-02-21T18:46:03.7826014Z # src[matmul.py:67]: out[tile_m, tile_n] = epilogue(acc, (tile_m, tile_n)) 2026-02-21T18:46:03.7826230Z v_0 = tl.cast(acc, tl.float16) 2026-02-21T18:46:03.7826558Z tl.store(tl.make_block_ptr(out, [1024, 12288], [12288, 1], [offset_0, offset_1], [_BLOCK_SIZE_0, _BLOCK_SIZE_1], [1, 0]), v_0, boundary_check=[0, 1]) 2026-02-21T18:46:03.7826837Z 2026-02-21T18:46:03.7827070Z def matmul(x: Tensor, y: Tensor, epilogue: Callable[[Tensor, tuple[Tensor, ...]], Tensor]=lambda acc, tile: acc, *, _launcher=_default_launcher): 2026-02-21T18:46:03.7827385Z """ 2026-02-21T18:46:03.7827567Z Performs matrix multiplication of x and y with an optional epilogue function. 2026-02-21T18:46:03.7827786Z Args: 2026-02-21T18:46:03.7827904Z x (Tensor): Left matrix of shape [m, k]. 2026-02-21T18:46:03.7828076Z y (Tensor): Right matrix of shape [k, n]. 2026-02-21T18:46:03.7828385Z epilogue (Callable, optional): Function applied to the accumulator and tile indices 2026-02-21T18:46:03.7828656Z after the matmul. Defaults to identity (no change). 2026-02-21T18:46:03.7828853Z Returns: 2026-02-21T18:46:03.7828978Z Tensor: Resulting matrix of shape [m, n]. 
2026-02-21T18:46:03.7829126Z """ 2026-02-21T18:46:03.7829259Z # src[matmul.py:57]: m, k = x.size() 2026-02-21T18:46:03.7829401Z m, k = x.size() 2026-02-21T18:46:03.7829530Z # src[matmul.py:58]: k2, n = y.size() 2026-02-21T18:46:03.7829672Z k2, n = y.size() 2026-02-21T18:46:03.7829837Z # src[matmul.py:59]: assert k == k2, f"size mismatch {k} != {k2}" 2026-02-21T18:46:03.7830041Z assert k == k2, f'size mismatch {k} != {k2}' 2026-02-21T18:46:03.7830209Z # src[matmul.py:60]: out = torch.empty( 2026-02-21T18:46:03.7830444Z # src[matmul.py:61]: [m, n], dtype=torch.promote_types(x.dtype, y.dtype), device=x.device 2026-02-21T18:46:03.7830678Z # src[matmul.py:62]: ) 2026-02-21T18:46:03.7830892Z out = torch.empty([m, n], dtype=torch.promote_types(x.dtype, y.dtype), device=x.device) 2026-02-21T18:46:03.7831178Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:46:03.7831380Z _NUM_SM = helion.runtime.get_num_sm(x.device) 2026-02-21T18:46:03.7831578Z # src[matmul.py:63]: for tile_m, tile_n in hl.tile([m, n]): 2026-02-21T18:46:03.7831836Z # src[matmul.py:64]: acc = hl.zeros([tile_m, tile_n], dtype=torch.float32) 2026-02-21T18:46:03.7832067Z # src[matmul.py:65]: for tile_k in hl.tile(k): 2026-02-21T18:46:03.7832232Z # src[matmul.py:63-67]: ... 
2026-02-21T18:46:03.7832525Z _launcher(_helion_matmul, (_NUM_SM * 128,), x, y, out, _NUM_SM, num_warps=4, num_stages=1, waves_per_eu=2, matrix_instr_nonkdim=16) 2026-02-21T18:46:03.7832826Z # src[matmul.py:68]: return out 2026-02-21T18:46:03.7832963Z return out 2026-02-21T18:46:30.2736942Z WARNING:tritonbench.utils.triton_op:Completed input ID 9: 2026-02-21T18:46:30.2737430Z (M, N, K) 2026-02-21T18:46:30.2737653Z ------------------- 2026-02-21T18:46:30.2737905Z (1024, 12288, 1024) 2026-02-21T18:46:30.2738050Z 2026-02-21T18:46:30.2750006Z 88%|████████▊ | 7/8 [33:12<05:06, 306.06s/it]WARNING:tritonbench.utils.triton_op:Running input ID 11: 2026-02-21T18:46:30.2750469Z (M, N, K) 2026-02-21T18:46:30.2750639Z ------------------- 2026-02-21T18:46:30.2750822Z (2048, 12288, 2048) 2026-02-21T18:46:30.2752044Z INFO:tritonbench.utils.triton_op:Took 0.00ms to get benchmark function for aten_matmul 2026-02-21T18:47:05.5986206Z INFO:tritonbench.utils.triton_op:Took 0.01ms to get benchmark function for triton_tutorial_matmul 2026-02-21T18:47:42.3847699Z INFO:tritonbench.utils.triton_op:Took 104.83ms to get benchmark function for pt2_triton_matmul 2026-02-21T18:48:17.6184965Z WARNING:__main__:Input tensor metadata: 2026-02-21T18:48:17.6185449Z { 'args': ( { 'device': 'cuda:0', 2026-02-21T18:48:17.6185778Z 'dtype': 'torch.float16', 2026-02-21T18:48:17.6186115Z 'shape': (2048, 2048), 2026-02-21T18:48:17.6186424Z 'stride': (2048, 1)}, 2026-02-21T18:48:17.6186720Z { 'device': 'cuda:0', 2026-02-21T18:48:17.6187083Z 'dtype': 'torch.float16', 2026-02-21T18:48:17.6187388Z 'shape': (2048, 12288), 2026-02-21T18:48:17.6187685Z 'stride': (1, 2048)}, 2026-02-21T18:48:17.6187974Z None), 2026-02-21T18:48:17.6188362Z 'kwargs': {}} 2026-02-21T18:48:17.6271364Z INFO:tritonbench.utils.triton_op:Took 9.23ms to get benchmark function for helion_matmul_tritonbench 2026-02-21T18:48:17.6945286Z [0s] Autotune random seed: 2171071898 2026-02-21T18:48:17.7102641Z [0s] Starting LFBOPatternSearch with 
initial_population=FROM_RANDOM, copies=5, max_generations=20, similarity_penalty=1.0 2026-02-21T18:48:26.5081131Z Initial population precompiling 100% ━━━━━━━━━━━━━━━━━━━━ 100/100 12.2 configs/s 2026-02-21T18:48:55.9515043Z python: /root/.triton/llvm/llvm-7d5de303-almalinux-x64/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From &) [To = mlir::triton::gpu::AMDMfmaEncodingAttr, From = mlir::Attribute]: Assertion `isa(Val) && "cast() argument of incompatible type!"' failed. 2026-02-21T18:48:55.9521117Z #blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [1, 0]}> 2026-02-21T18:48:55.9523597Z #blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:48:55.9524592Z #blocked2 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 2], warpsPerCTA = [2, 1], order = [1, 0]}> 2026-02-21T18:48:55.9525479Z #blocked3 = #ttg.blocked<{sizePerThread = [1], threadsPerWarp = [64], warpsPerCTA = [2], order = [0]}> 2026-02-21T18:48:55.9526283Z #blocked4 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [64, 1], warpsPerCTA = [2, 1], order = [0, 1]}> 2026-02-21T18:48:55.9526856Z #blocked5 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 64], warpsPerCTA = [1, 2], order = [0, 1]}> 2026-02-21T18:48:55.9527432Z module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "hip:gfx942", "ttg.threads-per-warp" = 64 : i32} { 2026-02-21T18:48:55.9528221Z tt.func public @_helion_matmul(%arg0: !tt.ptr {tt.divisibility = 16 : i32}, %arg1: !tt.ptr {tt.divisibility = 16 : i32}, %arg2: !tt.ptr {tt.divisibility = 16 : i32}) attributes {noinline = false} { 2026-02-21T18:48:55.9528963Z %cst = arith.constant dense<0.000000e+00> : tensor<4x256xf16, #blocked> 2026-02-21T18:48:55.9529299Z %cst_0 = arith.constant dense<2048> : tensor<1x256xi64, #blocked> 2026-02-21T18:48:55.9529599Z %cst_1 = arith.constant dense<0> 
: tensor<1x256xi64, #blocked> 2026-02-21T18:48:55.9529890Z %cst_2 = arith.constant dense<0> : tensor<4x1xi64, #blocked1> 2026-02-21T18:48:55.9530180Z %cst_3 = arith.constant dense<2048> : tensor<4x1xi64, #blocked1> 2026-02-21T18:48:55.9530438Z %c256_i32 = arith.constant 256 : i32 2026-02-21T18:48:55.9530650Z %c2048_i32 = arith.constant 2048 : i32 2026-02-21T18:48:55.9530852Z %c0_i32 = arith.constant 0 : i32 2026-02-21T18:48:55.9531095Z %cst_4 = arith.constant dense<12288> : tensor<4x1xi32, #blocked1> 2026-02-21T18:48:55.9531391Z %cst_5 = arith.constant dense<2048> : tensor<1x2xi32, #blocked2> 2026-02-21T18:48:55.9531701Z %cst_6 = arith.constant dense<0.000000e+00> : tensor<4x2xf32, #blocked2> 2026-02-21T18:48:55.9531966Z %c4_i32 = arith.constant 4 : i32 2026-02-21T18:48:55.9532222Z %c2_i32 = arith.constant 2 : i32 2026-02-21T18:48:55.9532416Z %c6144_i32 = arith.constant 6144 : i32 2026-02-21T18:48:55.9532621Z %0 = tt.get_program_id x : i32 2026-02-21T18:48:55.9532809Z %1 = arith.remsi %0, %c6144_i32 : i32 2026-02-21T18:48:55.9533004Z %2 = arith.divsi %0, %c6144_i32 : i32 2026-02-21T18:48:55.9533199Z %3 = arith.muli %1, %c2_i32 : i32 2026-02-21T18:48:55.9533454Z %4 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #blocked3> 2026-02-21T18:48:55.9533762Z %5 = tt.splat %3 : i32 -> tensor<2xi32, #blocked3> 2026-02-21T18:48:55.9534012Z %6 = arith.addi %5, %4 : tensor<2xi32, #blocked3> 2026-02-21T18:48:55.9534232Z %7 = arith.muli %2, %c4_i32 : i32 2026-02-21T18:48:55.9534478Z %8 = tt.make_range {end = 4 : i32, start = 0 : i32} : tensor<4xi32, #blocked3> 2026-02-21T18:48:55.9534769Z %9 = tt.splat %7 : i32 -> tensor<4xi32, #blocked3> 2026-02-21T18:48:55.9535009Z %10 = arith.addi %9, %8 : tensor<4xi32, #blocked3> 2026-02-21T18:48:55.9535298Z %11 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32, #blocked3> 2026-02-21T18:48:55.9535586Z %12 = arith.extsi %7 : i32 to i64 2026-02-21T18:48:55.9535843Z %13 = tt.splat %arg0 : !tt.ptr -> 
tensor<4x256x!tt.ptr, #blocked> 2026-02-21T18:48:55.9536143Z %14 = tt.splat %12 : i64 -> tensor<4xi64, #blocked3> 2026-02-21T18:48:55.9536426Z %15 = arith.extsi %8 : tensor<4xi32, #blocked3> to tensor<4xi64, #blocked3> 2026-02-21T18:48:55.9536723Z %16 = arith.addi %14, %15 : tensor<4xi64, #blocked3> 2026-02-21T18:48:55.9537116Z %17 = ttg.convert_layout %16 : tensor<4xi64, #blocked3> -> tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:48:55.9537625Z %18 = tt.expand_dims %17 {axis = 1 : i32} : tensor<4xi64, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi64, #blocked4> 2026-02-21T18:48:55.9538003Z %19 = ttg.convert_layout %18 : tensor<4x1xi64, #blocked4> -> tensor<4x1xi64, #blocked1> 2026-02-21T18:48:55.9538253Z %20 = arith.muli %19, %cst_3 : tensor<4x1xi64, #blocked1> 2026-02-21T18:48:55.9538491Z %21 = tt.broadcast %20 : tensor<4x1xi64, #blocked1> -> tensor<4x256xi64, #blocked1> 2026-02-21T18:48:55.9538786Z %22 = ttg.convert_layout %21 : tensor<4x256xi64, #blocked1> -> tensor<4x256xi64, #blocked> 2026-02-21T18:48:55.9539069Z %23 = arith.extsi %11 : tensor<256xi32, #blocked3> to tensor<256xi64, #blocked3> 2026-02-21T18:48:55.9539312Z %24 = arith.cmpi sge, %19, %cst_2 : tensor<4x1xi64, #blocked1> 2026-02-21T18:48:55.9539523Z %25 = arith.cmpi slt, %19, %cst_3 : tensor<4x1xi64, #blocked1> 2026-02-21T18:48:55.9539725Z %26 = arith.andi %24, %25 : tensor<4x1xi1, #blocked1> 2026-02-21T18:48:55.9539953Z %27 = tt.broadcast %26 : tensor<4x1xi1, #blocked1> -> tensor<4x256xi1, #blocked1> 2026-02-21T18:48:55.9540238Z %28 = ttg.convert_layout %27 : tensor<4x256xi1, #blocked1> -> tensor<4x256xi1, #blocked> 2026-02-21T18:48:55.9540600Z %29 = ttg.convert_layout %6 : tensor<2xi32, #blocked3> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:48:55.9540998Z %30 = tt.expand_dims %29 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x2xi32, #blocked5> 2026-02-21T18:48:55.9541349Z %31 = 
ttg.convert_layout %30 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #blocked2> 2026-02-21T18:48:55.9541598Z %32 = arith.muli %31, %cst_5 : tensor<1x2xi32, #blocked2> 2026-02-21T18:48:55.9541848Z %33 = tt.broadcast %32 : tensor<1x2xi32, #blocked2> -> tensor<256x2xi32, #blocked2> 2026-02-21T18:48:55.9542115Z %34 = tt.splat %arg1 : !tt.ptr -> tensor<256x2x!tt.ptr, #blocked2> 2026-02-21T18:48:55.9542452Z %35 = scf.for %arg3 = %c0_i32 to %c2048_i32 step %c256_i32 iter_args(%arg4 = %cst_6) -> (tensor<4x2xf32, #blocked2>) : i32 { 2026-02-21T18:48:55.9542753Z %50 = tt.splat %arg3 : i32 -> tensor<256xi32, #blocked3> 2026-02-21T18:48:55.9542952Z %51 = arith.addi %50, %11 : tensor<256xi32, #blocked3> 2026-02-21T18:48:55.9543143Z %52 = arith.extsi %arg3 : i32 to i64 2026-02-21T18:48:55.9543312Z %53 = tt.splat %52 : i64 -> tensor<256xi64, #blocked3> 2026-02-21T18:48:55.9543494Z %54 = arith.addi %53, %23 : tensor<256xi64, #blocked3> 2026-02-21T18:48:55.9543791Z %55 = ttg.convert_layout %54 : tensor<256xi64, #blocked3> -> tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:48:55.9544212Z %56 = tt.expand_dims %55 {axis = 0 : i32} : tensor<256xi64, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x256xi64, #blocked5> 2026-02-21T18:48:55.9544581Z %57 = ttg.convert_layout %56 : tensor<1x256xi64, #blocked5> -> tensor<1x256xi64, #blocked> 2026-02-21T18:48:55.9544873Z %58 = tt.broadcast %57 : tensor<1x256xi64, #blocked> -> tensor<4x256xi64, #blocked> 2026-02-21T18:48:55.9545109Z %59 = arith.addi %22, %58 : tensor<4x256xi64, #blocked> 2026-02-21T18:48:55.9545357Z %60 = tt.addptr %13, %59 : tensor<4x256x!tt.ptr, #blocked>, tensor<4x256xi64, #blocked> 2026-02-21T18:48:55.9545620Z %61 = arith.cmpi sge, %57, %cst_1 : tensor<1x256xi64, #blocked> 2026-02-21T18:48:55.9545833Z %62 = arith.cmpi slt, %57, %cst_0 : tensor<1x256xi64, #blocked> 2026-02-21T18:48:55.9546034Z %63 = arith.andi %61, %62 : tensor<1x256xi1, #blocked> 2026-02-21T18:48:55.9546256Z %64 = 
tt.broadcast %63 : tensor<1x256xi1, #blocked> -> tensor<4x256xi1, #blocked> 2026-02-21T18:48:55.9546487Z %65 = arith.andi %28, %64 : tensor<4x256xi1, #blocked> 2026-02-21T18:48:55.9546688Z %66 = tt.load %60, %65, %cst : tensor<4x256x!tt.ptr, #blocked> 2026-02-21T18:48:55.9547001Z %67 = ttg.convert_layout %51 : tensor<256xi32, #blocked3> -> tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:48:55.9547460Z %68 = tt.expand_dims %67 {axis = 1 : i32} : tensor<256xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<256x1xi32, #blocked4> 2026-02-21T18:48:55.9547811Z %69 = ttg.convert_layout %68 : tensor<256x1xi32, #blocked4> -> tensor<256x1xi32, #blocked1> 2026-02-21T18:48:55.9548053Z %70 = tt.broadcast %69 : tensor<256x1xi32, #blocked1> -> tensor<256x2xi32, #blocked1> 2026-02-21T18:48:55.9548513Z %71 = ttg.convert_layout %70 : tensor<256x2xi32, #blocked1> -> tensor<256x2xi32, #blocked2> 2026-02-21T18:48:55.9548847Z %72 = arith.addi %71, %33 : tensor<256x2xi32, #blocked2> 2026-02-21T18:48:55.9549156Z %73 = tt.addptr %34, %72 : tensor<256x2x!tt.ptr, #blocked2>, tensor<256x2xi32, #blocked2> 2026-02-21T18:48:55.9549438Z %74 = tt.load %73 : tensor<256x2x!tt.ptr, #blocked2> 2026-02-21T18:48:55.9549874Z %75 = ttg.convert_layout %66 : tensor<4x256xf16, #blocked> -> tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked2}>> 2026-02-21T18:48:55.9550467Z %76 = ttg.convert_layout %74 : tensor<256x2xf16, #blocked2> -> tensor<256x2xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked2}>> 2026-02-21T18:48:55.9551052Z %77 = ttg.convert_layout %arg4 : tensor<4x2xf32, #blocked2> -> tensor<4x2xf32, #blocked2> 2026-02-21T18:48:55.9551726Z %78 = tt.dot %75, %76, %77, inputPrecision = tf32 : tensor<4x256xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked2}>> * tensor<256x2xf16, #ttg.dot_op<{opIdx = 1, parent = #blocked2}>> -> tensor<4x2xf32, #blocked2> 2026-02-21T18:48:55.9552179Z scf.yield %78 : tensor<4x2xf32, #blocked2> 2026-02-21T18:48:55.9552373Z } {tt.flatten} 
2026-02-21T18:48:55.9552604Z %36 = arith.truncf %35 : tensor<4x2xf32, #blocked2> to tensor<4x2xf16, #blocked2> 2026-02-21T18:48:55.9553055Z %37 = ttg.convert_layout %10 : tensor<4xi32, #blocked3> -> tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> 2026-02-21T18:48:55.9553582Z %38 = tt.expand_dims %37 {axis = 1 : i32} : tensor<4xi32, #ttg.slice<{dim = 1, parent = #blocked4}>> -> tensor<4x1xi32, #blocked4> 2026-02-21T18:48:55.9554031Z %39 = ttg.convert_layout %38 : tensor<4x1xi32, #blocked4> -> tensor<4x1xi32, #blocked1> 2026-02-21T18:48:55.9554291Z %40 = arith.muli %39, %cst_4 : tensor<4x1xi32, #blocked1> 2026-02-21T18:48:55.9554553Z %41 = ttg.convert_layout %6 : tensor<2xi32, #blocked3> -> tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> 2026-02-21T18:48:55.9554871Z %42 = tt.expand_dims %41 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #blocked5}>> -> tensor<1x2xi32, #blocked5> 2026-02-21T18:48:55.9555150Z %43 = ttg.convert_layout %42 : tensor<1x2xi32, #blocked5> -> tensor<1x2xi32, #blocked2> 2026-02-21T18:48:55.9555375Z %44 = tt.broadcast %40 : tensor<4x1xi32, #blocked1> -> tensor<4x2xi32, #blocked1> 2026-02-21T18:48:55.9555604Z %45 = ttg.convert_layout %44 : tensor<4x2xi32, #blocked1> -> tensor<4x2xi32, #blocked2> 2026-02-21T18:48:55.9555830Z %46 = tt.broadcast %43 : tensor<1x2xi32, #blocked2> -> tensor<4x2xi32, #blocked2> 2026-02-21T18:48:55.9556032Z %47 = arith.addi %45, %46 : tensor<4x2xi32, #blocked2> 2026-02-21T18:48:55.9556214Z %48 = tt.splat %arg2 : !tt.ptr -> tensor<4x2x!tt.ptr, #blocked2> 2026-02-21T18:48:55.9556438Z %49 = tt.addptr %48, %47 : tensor<4x2x!tt.ptr, #blocked2>, tensor<4x2xi32, #blocked2> 2026-02-21T18:48:55.9556638Z tt.store %49, %36 : tensor<4x2x!tt.ptr, #blocked2> 2026-02-21T18:48:55.9556769Z tt.return 2026-02-21T18:48:55.9556853Z } 2026-02-21T18:48:55.9556933Z } 2026-02-21T18:48:55.9556978Z 2026-02-21T18:48:55.9557009Z {-# 2026-02-21T18:48:55.9557092Z external_resources: { 2026-02-21T18:48:55.9557191Z 
mlir_reproducer: { 2026-02-21T18:48:55.9559464Z pipeline: "builtin.module(tritongpu-coalesce, tritongpu-remove-layout-conversions, tritongpu-optimize-thread-locality, tritonamdgpu-accelerate-matmul{arch-generation-name=gfx942 kPack=1 matrix-instruction-size=0}, tritongpu-remove-layout-conversions, tritonamdgpu-optimize-epilogue, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tt.func(tritonamdgpu-hoist-layout-conversions), tritongpu-fuse-nested-loops, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, triton-licm, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritonamdgpu-stream-pipeline{global_prefetch=0 local_prefetch=0 num_stages=3 use_async_copy=false use_pingpong=true}, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, tritongpu-optimize-dot-operands{hoist-layout-conversion=true}, tritongpu-remove-layout-conversions, tritongpu-reduce-data-duplication, tt.func(tritonamdgpu-in-thread-transpose), tritongpu-remove-layout-conversions, tritonamdgpu-reorder-instructions, tritonamdgpu-block-pingpong{num-stages=3}, tritonamdgpu-fold-true-cmpi, canonicalize{ max-iterations=10 max-num-rewrites=-1 region-simplify=normal test-convergence=false top-down=true}, cse, symbol-dce)", 2026-02-21T18:48:55.9561816Z disable_threading: false, 2026-02-21T18:48:55.9561926Z verify_each: true 2026-02-21T18:48:55.9562015Z } 2026-02-21T18:48:55.9562102Z } 2026-02-21T18:48:55.9562174Z #-} 2026-02-21T18:48:55.9562455Z /tmp/torchinductor_root/kv/ckvwe4jmmoim5yjqijr67cqqxtqfp4liup3nhptvzagtkwwqdpdf.py:13:0: error: Failures have been detected while processing an MLIR pass pipeline 2026-02-21T18:48:55.9563177Z /tmp/torchinductor_root/kv/ckvwe4jmmoim5yjqijr67cqqxtqfp4liup3nhptvzagtkwwqdpdf.py:13:0: note: Pipeline failed while executing [`TritonAMDGPUBlockPingpong` on 'builtin.module' 
operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` 2026-02-21T18:48:55.9563731Z [38s] Triton compile failed. This likely indicates a bug in Triton. Skipping failing config. 2026-02-21T18:48:55.9564454Z Config: @helion.kernel(config=helion.Config(block_sizes=[4, 2, 256], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[1], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=2, pid_type='flat', range_flattens=[None, True], range_multi_buffers=[None, True], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T18:48:55.9565138Z Error: RuntimeError: PassManager::run failed 2026-02-21T18:48:55.9565303Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:49:00.9643262Z Initial population exploring neighbors 76% ━━━━━━━━━━╸ 76/100 3.0 configs/s 2026-02-21T18:49:00.9646693Z WARNING:tritonbench.utils.triton_op:Completed input ID 11: 2026-02-21T18:49:00.9646990Z (M, N, K) 2026-02-21T18:49:00.9647973Z ------------------- 2026-02-21T18:49:00.9648377Z (2048, 12288, 2048) 2026-02-21T18:49:00.9648572Z 2026-02-21T18:49:00.9649186Z 100%|██████████| 8/8 [35:42<00:00, 256.60s/it] 2026-02-21T18:49:00.9649679Z 100%|██████████| 8/8 [35:42<00:00, 267.87s/it] 2026-02-21T18:49:00.9657383Z INFO:tritonbench.utils.run_utils:[tritonbench] Output result csv to /tmp/tmpwvdd2km5.csv 2026-02-21T18:49:00.9669095Z (M, N, K) triton_tutorial_matmul-speedup triton_tutorial_matmul-accuracy pt2_triton_matmul-speedup pt2_triton_matmul-accuracy helion_matmul_tritonbench-speedup helion_matmul_tritonbench-accuracy 2026-02-21T18:49:00.9671783Z ------------------- -------------------------------- --------------------------------- --------------------------- ---------------------------- 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------------------------------ 2026-02-21T18:49:00.9673299Z ERROR:__main__:failed to process results 2026-02-21T18:49:00.9673551Z Traceback (most recent call last): 2026-02-21T18:49:00.9674030Z File "/__w/helion/helion/benchmarks/run.py", line 1312, in run_kernel_variants 2026-02-21T18:49:00.9674394Z process_result( 2026-02-21T18:49:00.9674699Z File "/__w/helion/helion/benchmarks/run.py", line 1380, in process_result 2026-02-21T18:49:00.9675062Z metrics[active_metrics[kernel_name][name]].append(float(item)) 2026-02-21T18:49:00.9675355Z ^^^^^^^^^^^ 2026-02-21T18:49:00.9675659Z ValueError: could not convert string to float: '"Error from Triton code:' 2026-02-21T18:49:00.9676584Z (4096, 1024, 1024) 0.646506 1 0.794573 1 0.7536495570703666 1 2026-02-21T18:49:00.9677867Z (4096, 2048, 2048) 0.624257 1 0.911399 1 0.9551704715651995 1 2026-02-21T18:49:00.9679180Z (2048, 4096, 2048) 0.631705 1 0.935622 1 Error from Triton code: 2026-02-21T18:49:00.9680411Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 
2026-02-21T18:49:00.9681058Z 2026-02-21T18:49:00.9681600Z Error running generated Triton program: 2026-02-21T18:49:00.9682934Z SystemError: returned a result with an exception set 2026-02-21T18:49:00.9684810Z @helion.kernel(config=helion.Config(block_sizes=[1, 1, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T18:49:00.9686274Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 2026-02-21T18:49:00.9687243Z (1024, 8192, 1024) 0.726033 1 0.955949 1 0.9366368635301743 1 2026-02-21T18:49:00.9688191Z (8192, 2048, 2048) 0.603979 1 0.90311 1 Error from Triton code: 2026-02-21T18:49:00.9689091Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:49:00.9689569Z 2026-02-21T18:49:00.9689995Z Error running generated Triton program: 2026-02-21T18:49:00.9690944Z SystemError: returned a result with an exception set 2026-02-21T18:49:00.9692305Z @helion.kernel(config=helion.Config(block_sizes=[1, 2, 128], indexing=['block_ptr', 'pointer', 'pointer'], l2_groupings=[32], load_eviction_policies=['', ''], loop_orders=[[1, 0]], matrix_instr_nonkdim=0, num_stages=3, num_warps=16, pid_type='flat', range_flattens=[None, False], range_multi_buffers=[None, None], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=3), static_shapes=True) 2026-02-21T18:49:00.9693633Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 
2026-02-21T18:49:00.9694609Z (12288, 1024, 1024) 0.648268 1 0.896812 1 0.8547490173354304 1 2026-02-21T18:49:00.9695435Z (1024, 12288, 1024) 0.601353 1 0.80204 1 0.7828030822384998 1 2026-02-21T18:49:00.9696165Z (2048, 12288, 2048) 0.607656 1 0.804207 1 Error from Triton code: 2026-02-21T18:49:00.9696927Z Enable HELION_AUTOTUNE_LOG_LEVEL=DEBUG to log generated Triton code. 2026-02-21T18:49:00.9697309Z 2026-02-21T18:49:00.9698128Z Error running generated Triton program: 2026-02-21T18:49:00.9698872Z SystemError: returned a result with an exception set 2026-02-21T18:49:00.9699943Z @helion.kernel(config=helion.Config(block_sizes=[1, 1, 256], indexing=['pointer', 'pointer', 'pointer'], l2_groupings=[4], load_eviction_policies=['', ''], loop_orders=[[0, 1]], matrix_instr_nonkdim=16, num_stages=2, num_warps=16, pid_type='flat', range_flattens=[None, None], range_multi_buffers=[None, False], range_num_stages=[0, 0], range_unroll_factors=[0, 0], range_warp_specializes=[], waves_per_eu=2), static_shapes=True) 2026-02-21T18:49:00.9701029Z Set autotune_ignore_errors=True or HELION_AUTOTUNE_IGNORE_ERRORS=1 to ignore Triton errors in autotuning. 
2026-02-21T18:49:03.3646239Z average 0.63622 1 0.875464 1 0.5353761239674588 0.625 2026-02-21T18:49:09.4925836Z ✅ Completed benchmark for kernel: gemm 2026-02-21T18:49:09.4936246Z [ 2026-02-21T18:49:09.4936470Z { 2026-02-21T18:49:09.4936662Z "benchmark": { 2026-02-21T18:49:09.4936917Z "name": "Helion Benchmark", 2026-02-21T18:49:09.4937200Z "extra_info": { 2026-02-21T18:49:09.4937914Z "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-" 2026-02-21T18:49:09.4938237Z } 2026-02-21T18:49:09.4938411Z }, 2026-02-21T18:49:09.4938593Z "model": { 2026-02-21T18:49:09.4938950Z "name": "layer_norm" 2026-02-21T18:49:09.4939321Z }, 2026-02-21T18:49:09.4939510Z "metric": { 2026-02-21T18:49:09.4939714Z "name": "torch_compile_speedup", 2026-02-21T18:49:09.4939975Z "benchmark_values": [ 2026-02-21T18:49:09.4940194Z 1.3387864337926014, 2026-02-21T18:49:09.4940399Z 1.1565043132457022, 2026-02-21T18:49:09.4940600Z 1.0812941462464551, 2026-02-21T18:49:09.4940792Z 1.0304656923619862, 2026-02-21T18:49:09.4940988Z 1.1350360418783962, 2026-02-21T18:49:09.4941177Z 1.1261249039026817, 2026-02-21T18:49:09.4941371Z 1.0732627548955662, 2026-02-21T18:49:09.4941615Z 1.1418462867123056, 2026-02-21T18:49:09.4941820Z 1.1687360169983305, 2026-02-21T18:49:09.4942011Z 1.1732117987780817, 2026-02-21T18:49:09.4942215Z 1.1814773831285454, 2026-02-21T18:49:09.4942410Z 1.2134084465279746, 2026-02-21T18:49:09.4942605Z 1.191078089237463, 2026-02-21T18:49:09.4942805Z 1.165134638503707, 2026-02-21T18:49:09.4943118Z 1.1734558602158738, 2026-02-21T18:49:09.4943321Z 1.2058562837236402, 2026-02-21T18:49:09.4943513Z 1.1677233775359002, 2026-02-21T18:49:09.4943712Z 1.1714010793244465, 2026-02-21T18:49:09.4943902Z 1.2007288613722675, 2026-02-21T18:49:09.4944103Z 1.215060743296968 2026-02-21T18:49:09.4944294Z ] 2026-02-21T18:49:09.4944452Z }, 2026-02-21T18:49:09.4944608Z "shape": [ 2026-02-21T18:49:09.4944785Z "(4096, 1024)", 2026-02-21T18:49:09.4944973Z "(4096, 2048)", 2026-02-21T18:49:09.4945151Z "(4096, 2560)", 
2026-02-21T18:49:09.4945332Z "(4096, 3584)", 2026-02-21T18:49:09.4945506Z "(4096, 4096)", 2026-02-21T18:49:09.4945686Z "(4096, 5120)", 2026-02-21T18:49:09.4945859Z "(4096, 5632)", 2026-02-21T18:49:09.4946036Z "(4096, 6656)", 2026-02-21T18:49:09.4946209Z "(4096, 7168)", 2026-02-21T18:49:09.4946385Z "(4096, 8192)", 2026-02-21T18:49:09.4946561Z "(4096, 8704)", 2026-02-21T18:49:09.4946738Z "(4096, 9728)", 2026-02-21T18:49:09.4947017Z "(4096, 10240)", 2026-02-21T18:49:09.4947198Z "(4096, 11264)", 2026-02-21T18:49:09.4947382Z "(4096, 11776)", 2026-02-21T18:49:09.4947559Z "(4096, 12800)", 2026-02-21T18:49:09.4947739Z "(4096, 13312)", 2026-02-21T18:49:09.4947913Z "(4096, 14336)", 2026-02-21T18:49:09.4948093Z "(4096, 14848)", 2026-02-21T18:49:09.4948365Z "(4096, 15872)" 2026-02-21T18:49:09.4948543Z ] 2026-02-21T18:49:09.4948695Z }, 2026-02-21T18:49:09.4948849Z { 2026-02-21T18:49:09.4949004Z "benchmark": { 2026-02-21T18:49:09.4949207Z "name": "Helion Benchmark", 2026-02-21T18:49:09.4949439Z "extra_info": { 2026-02-21T18:49:09.4949692Z "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-" 2026-02-21T18:49:09.4949983Z } 2026-02-21T18:49:09.4950109Z }, 2026-02-21T18:49:09.4950227Z "model": { 2026-02-21T18:49:09.4950356Z "name": "layer_norm" 2026-02-21T18:49:09.4950504Z }, 2026-02-21T18:49:09.4950614Z "metric": { 2026-02-21T18:49:09.4950763Z "name": "torch_compile_accuracy", 2026-02-21T18:49:09.4950942Z "benchmark_values": [ 2026-02-21T18:49:09.4951088Z 1.0, 2026-02-21T18:49:09.4951209Z 1.0, 2026-02-21T18:49:09.4951324Z 1.0, 2026-02-21T18:49:09.4951441Z 1.0, 2026-02-21T18:49:09.4951553Z 1.0, 2026-02-21T18:49:09.4951667Z 1.0, 2026-02-21T18:49:09.4951778Z 1.0, 2026-02-21T18:49:09.4951892Z 1.0, 2026-02-21T18:49:09.4952001Z 1.0, 2026-02-21T18:49:09.4952116Z 1.0, 2026-02-21T18:49:09.4952226Z 1.0, 2026-02-21T18:49:09.4952340Z 1.0, 2026-02-21T18:49:09.4952451Z 1.0, 2026-02-21T18:49:09.4952566Z 1.0, 2026-02-21T18:49:09.4952721Z 1.0, 2026-02-21T18:49:09.4952835Z 1.0, 
2026-02-21T18:49:09.4952951Z 1.0, 2026-02-21T18:49:09.4953063Z 1.0, 2026-02-21T18:49:09.4953180Z 1.0, 2026-02-21T18:49:09.4953291Z 1.0 2026-02-21T18:49:09.4953441Z ] 2026-02-21T18:49:09.4953552Z }, 2026-02-21T18:49:09.4953669Z "shape": [ 2026-02-21T18:49:09.4953790Z "(4096, 1024)", 2026-02-21T18:49:09.4953923Z "(4096, 2048)", 2026-02-21T18:49:09.4954050Z "(4096, 2560)", 2026-02-21T18:49:09.4954179Z "(4096, 3584)", 2026-02-21T18:49:09.4954303Z "(4096, 4096)", 2026-02-21T18:49:09.4954431Z "(4096, 5120)", 2026-02-21T18:49:09.4954559Z "(4096, 5632)", 2026-02-21T18:49:09.4954681Z "(4096, 6656)", 2026-02-21T18:49:09.4954810Z "(4096, 7168)", 2026-02-21T18:49:09.4954932Z "(4096, 8192)", 2026-02-21T18:49:09.4955060Z "(4096, 8704)", 2026-02-21T18:49:09.4955183Z "(4096, 9728)", 2026-02-21T18:49:09.4955312Z "(4096, 10240)", 2026-02-21T18:49:09.4955445Z "(4096, 11264)", 2026-02-21T18:49:09.4955579Z "(4096, 11776)", 2026-02-21T18:49:09.4955736Z "(4096, 12800)", 2026-02-21T18:49:09.4955864Z "(4096, 13312)", 2026-02-21T18:49:09.4955995Z "(4096, 14336)", 2026-02-21T18:49:09.4956152Z "(4096, 14848)", 2026-02-21T18:49:09.4956287Z "(4096, 15872)" 2026-02-21T18:49:09.4956410Z ] 2026-02-21T18:49:09.4956523Z }, 2026-02-21T18:49:09.4956630Z { 2026-02-21T18:49:09.4956745Z "benchmark": { 2026-02-21T18:49:09.4956886Z "name": "Helion Benchmark", 2026-02-21T18:49:09.4957051Z "extra_info": { 2026-02-21T18:49:09.4957225Z "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-" 2026-02-21T18:49:09.4957425Z } 2026-02-21T18:49:09.4957539Z }, 2026-02-21T18:49:09.4957650Z "model": { 2026-02-21T18:49:09.4957782Z "name": "layer_norm" 2026-02-21T18:49:09.4957921Z }, 2026-02-21T18:49:09.4958035Z "metric": { 2026-02-21T18:49:09.4958169Z "name": "triton_speedup", 2026-02-21T18:49:09.4958337Z "benchmark_values": [ 2026-02-21T18:49:09.4958486Z 1.1493620224835568, 2026-02-21T18:49:09.4958632Z 1.038059917832201, 2026-02-21T18:49:09.4958772Z 1.1039626294694398, 2026-02-21T18:49:09.4958918Z 1.1027962023601108, 
2026-02-21T18:49:09.4959064Z 1.1494295470417513, 2026-02-21T18:49:09.4959232Z 1.2200084389550936, 2026-02-21T18:49:09.4959375Z 1.1587472821920906, 2026-02-21T18:49:09.4959513Z 1.238518162714649, 2026-02-21T18:49:09.4959656Z 1.2680751968640465, 2026-02-21T18:49:09.4959793Z 1.2791801948017405, 2026-02-21T18:49:09.4959932Z 1.2633860138742115, 2026-02-21T18:49:09.4960047Z 1.289546111153349, 2026-02-21T18:49:09.4960158Z 1.2487846204136483, 2026-02-21T18:49:09.4960264Z 1.2532713527273605, 2026-02-21T18:49:09.4960374Z 1.2711574179877863, 2026-02-21T18:49:09.4960483Z 1.2759654092127133, 2026-02-21T18:49:09.4960589Z 1.278374810673535, 2026-02-21T18:49:09.4960704Z 1.315048471809879, 2026-02-21T18:49:09.4960813Z 1.29441374407624, 2026-02-21T18:49:09.4960926Z 1.347140824915948 2026-02-21T18:49:09.4961029Z ] 2026-02-21T18:49:09.4961118Z }, 2026-02-21T18:49:09.4961204Z "shape": [ 2026-02-21T18:49:09.4961306Z "(4096, 1024)", 2026-02-21T18:49:09.4961408Z "(4096, 2048)", 2026-02-21T18:49:09.4961511Z "(4096, 2560)", 2026-02-21T18:49:09.4961608Z "(4096, 3584)", 2026-02-21T18:49:09.4961709Z "(4096, 4096)", 2026-02-21T18:49:09.4961811Z "(4096, 5120)", 2026-02-21T18:49:09.4961908Z "(4096, 5632)", 2026-02-21T18:49:09.4962009Z "(4096, 6656)", 2026-02-21T18:49:09.4962107Z "(4096, 7168)", 2026-02-21T18:49:09.4962208Z "(4096, 8192)", 2026-02-21T18:49:09.4962305Z "(4096, 8704)", 2026-02-21T18:49:09.4962408Z "(4096, 9728)", 2026-02-21T18:49:09.4962505Z "(4096, 10240)", 2026-02-21T18:49:09.4962611Z "(4096, 11264)", 2026-02-21T18:49:09.4962712Z "(4096, 11776)", 2026-02-21T18:49:09.4962846Z "(4096, 12800)", 2026-02-21T18:49:09.4962951Z "(4096, 13312)", 2026-02-21T18:49:09.4963050Z "(4096, 14336)", 2026-02-21T18:49:09.4963156Z "(4096, 14848)", 2026-02-21T18:49:09.4963281Z "(4096, 15872)" 2026-02-21T18:49:09.4963386Z ] 2026-02-21T18:49:09.4963474Z }, 2026-02-21T18:49:09.4963562Z { 2026-02-21T18:49:09.4963648Z "benchmark": { 2026-02-21T18:49:09.4963758Z "name": "Helion Benchmark", 
2026-02-21T18:49:09.4963881Z "extra_info": { 2026-02-21T18:49:09.4964021Z "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-" 2026-02-21T18:49:09.4964183Z } 2026-02-21T18:49:09.4964268Z }, 2026-02-21T18:49:09.4964361Z "model": { 2026-02-21T18:49:09.4964460Z "name": "layer_norm" 2026-02-21T18:49:09.4964576Z }, 2026-02-21T18:49:09.4964664Z "metric": { 2026-02-21T18:49:09.4964775Z "name": "triton_accuracy", 2026-02-21T18:49:09.4964900Z "benchmark_values": [ 2026-02-21T18:49:09.4965015Z 1.0, 2026-02-21T18:49:09.4965105Z 1.0, 2026-02-21T18:49:09.4965198Z 1.0, 2026-02-21T18:49:09.4965286Z 1.0, 2026-02-21T18:49:09.4965378Z 1.0, 2026-02-21T18:49:09.4965467Z 1.0, 2026-02-21T18:49:09.4965558Z 1.0, 2026-02-21T18:49:09.4965678Z 1.0, 2026-02-21T18:49:09.4965766Z 1.0, 2026-02-21T18:49:09.4965856Z 1.0, 2026-02-21T18:49:09.4965942Z 1.0, 2026-02-21T18:49:09.4966033Z 1.0, 2026-02-21T18:49:09.4966119Z 1.0, 2026-02-21T18:49:09.4966208Z 1.0, 2026-02-21T18:49:09.4966293Z 1.0, 2026-02-21T18:49:09.4966384Z 1.0, 2026-02-21T18:49:09.4966468Z 1.0, 2026-02-21T18:49:09.4966557Z 1.0, 2026-02-21T18:49:09.4966642Z 1.0, 2026-02-21T18:49:09.4966732Z 1.0 2026-02-21T18:49:09.4966823Z ] 2026-02-21T18:49:09.4966910Z }, 2026-02-21T18:49:09.4967002Z "shape": [ 2026-02-21T18:49:09.4967097Z "(4096, 1024)", 2026-02-21T18:49:09.4967202Z "(4096, 2048)", 2026-02-21T18:49:09.4967305Z "(4096, 2560)", 2026-02-21T18:49:09.4967407Z "(4096, 3584)", 2026-02-21T18:49:09.4967506Z "(4096, 4096)", 2026-02-21T18:49:09.4967608Z "(4096, 5120)", 2026-02-21T18:49:09.4967709Z "(4096, 5632)", 2026-02-21T18:49:09.4967815Z "(4096, 6656)", 2026-02-21T18:49:09.4967936Z "(4096, 7168)", 2026-02-21T18:49:09.4968039Z "(4096, 8192)", 2026-02-21T18:49:09.4968140Z "(4096, 8704)", 2026-02-21T18:49:09.4968236Z "(4096, 9728)", 2026-02-21T18:49:09.4968338Z "(4096, 10240)", 2026-02-21T18:49:09.4968439Z "(4096, 11264)", 2026-02-21T18:49:09.4968544Z "(4096, 11776)", 2026-02-21T18:49:09.4968645Z "(4096, 12800)", 2026-02-21T18:49:09.4968749Z 
"(4096, 13312)", 2026-02-21T18:49:09.4968847Z "(4096, 14336)", 2026-02-21T18:49:09.4968951Z "(4096, 14848)", 2026-02-21T18:49:09.4969050Z "(4096, 15872)" 2026-02-21T18:49:09.4969151Z ] 2026-02-21T18:49:09.4969236Z }, 2026-02-21T18:49:09.4969324Z { 2026-02-21T18:49:09.4969416Z "benchmark": { 2026-02-21T18:49:09.4969523Z "name": "Helion Benchmark", 2026-02-21T18:49:09.4969640Z "extra_info": { 2026-02-21T18:49:09.4969760Z "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-" 2026-02-21T18:49:09.4969897Z } 2026-02-21T18:49:09.4969975Z }, 2026-02-21T18:49:09.4970054Z "model": { 2026-02-21T18:49:09.4970139Z "name": "layer_norm" 2026-02-21T18:49:09.4970238Z }, 2026-02-21T18:49:09.4970312Z "metric": { 2026-02-21T18:49:09.4970407Z "name": "helion_speedup", 2026-02-21T18:49:09.4970519Z "benchmark_values": [ 2026-02-21T18:49:09.4970617Z 1.1411253527373746, 2026-02-21T18:49:09.4970716Z 1.12923031826141, 2026-02-21T18:49:09.4970810Z 1.167143156264196, 2026-02-21T18:49:09.4970906Z 1.1014619134698223, 2026-02-21T18:49:09.4970998Z 1.1962281161244432, 2026-02-21T18:49:09.4971093Z 1.2489787358945272, 2026-02-21T18:49:09.4971185Z 1.1135140120106222, 2026-02-21T18:49:09.4971305Z 1.2525821119163423, 2026-02-21T18:49:09.4971397Z 1.2938258302848524, 2026-02-21T18:49:09.4971492Z 1.3014600097574138, 2026-02-21T18:49:09.4971601Z 1.3102654210982405, 2026-02-21T18:49:09.4971697Z 1.3373191787038987, 2026-02-21T18:49:09.4971795Z 1.2963937550271853, 2026-02-21T18:49:09.4971888Z 1.328067608679353, 2026-02-21T18:49:09.4971986Z 1.3175725984708206, 2026-02-21T18:49:09.4972078Z 1.3397435695898392, 2026-02-21T18:49:09.4972176Z 1.3737741062343953, 2026-02-21T18:49:09.4972271Z 1.3587718886298845, 2026-02-21T18:49:09.4972367Z 1.3975587736710753, 2026-02-21T18:49:09.4972459Z 1.4028870291851567 2026-02-21T18:49:09.4972554Z ] 2026-02-21T18:49:09.4972634Z }, 2026-02-21T18:49:09.4972710Z "shape": [ 2026-02-21T18:49:09.4972796Z "(4096, 1024)", 2026-02-21T18:49:09.4972883Z "(4096, 2048)", 
2026-02-21T18:49:09.4972975Z "(4096, 2560)", 2026-02-21T18:49:09.4973063Z "(4096, 3584)", 2026-02-21T18:49:09.4973152Z "(4096, 4096)", 2026-02-21T18:49:09.4973235Z "(4096, 5120)", 2026-02-21T18:49:09.4973324Z "(4096, 5632)", 2026-02-21T18:49:09.4973411Z "(4096, 6656)", 2026-02-21T18:49:09.4973535Z "(4096, 7168)", 2026-02-21T18:49:09.4973621Z "(4096, 8192)", 2026-02-21T18:49:09.4973711Z "(4096, 8704)", 2026-02-21T18:49:09.4973798Z "(4096, 9728)", 2026-02-21T18:49:09.4973882Z "(4096, 10240)", 2026-02-21T18:49:09.4973974Z "(4096, 11264)", 2026-02-21T18:49:09.4974061Z "(4096, 11776)", 2026-02-21T18:49:09.4974152Z "(4096, 12800)", 2026-02-21T18:49:09.4974237Z "(4096, 13312)", 2026-02-21T18:49:09.4974326Z "(4096, 14336)", 2026-02-21T18:49:09.4974411Z "(4096, 14848)", 2026-02-21T18:49:09.4974501Z "(4096, 15872)" 2026-02-21T18:49:09.4974586Z ] 2026-02-21T18:49:09.4974664Z }, 2026-02-21T18:49:09.4974739Z { 2026-02-21T18:49:09.4974815Z "benchmark": { 2026-02-21T18:49:09.4974913Z "name": "Helion Benchmark", 2026-02-21T18:49:09.4975019Z "extra_info": { 2026-02-21T18:49:09.4975140Z "device": "AMD Instinct MI325X gfx942:sramecc+:xnack-" 2026-02-21T18:49:09.4975270Z } 2026-02-21T18:49:09.4975348Z }, 2026-02-21T18:49:09.4975425Z "model": { 2026-02-21T18:49:09.4975538Z "name": "layer_norm" 2026-02-21T18:49:09.4975632Z }, 2026-02-21T18:49:09.4975711Z "metric": { 2026-02-21T18:49:09.4975800Z "name": "helion_accuracy", 2026-02-21T18:49:09.4975926Z "benchmark_values": [ 2026-02-21T18:49:09.4976094Z 1.0, 2026-02-21T18:49:09.4976255Z 1.0, 2026-02-21T18:49:09.4976395Z 1.0, 2026-02-21T18:49:09.4976497Z 1.0, 2026-02-21T18:49:09.4976578Z 1.0, 2026-02-21T18:49:09.4976669Z 1.0, 2026-02-21T18:49:09.4984725Z 1.0, 2026-02-21T18:49:09.4984825Z 1.0, 2026-02-21T18:49:09.4984919Z 1.0, 2026-02-21T18:49:09.4985019Z 1.0, 2026-02-21T18:49:09.4985095Z 1.0, 2026-02-21T18:49:09.4985164Z 1.0, 2026-02-21T18:49:09.4985236Z 1.0, 2026-02-21T18:49:09.4985305Z 1.0, 2026-02-21T18:49:09.4985409Z 1.0, 
2026-02-21T18:49:09.4985477Z 1.0, 2026-02-21T18:49:09.4985552Z 1.0, 2026-02-21T18:49:09.4985627Z 1.0, 2026-02-21T18:49:09.4985704Z 1.0, 2026-02-21T18:49:09.4985773Z 1.0 2026-02-21T18:49:09.4985849Z ] 2026-02-21T18:49:09.4985919Z }, 2026-02-21T18:49:09.4985997Z "shape": [ 2026-02-21T18:49:09.4986090Z "(4096, 1024)", 2026-02-21T18:49:09.4986183Z "(4096, 2048)", 2026-02-21T18:49:09.4986267Z "(4096, 2560)", 2026-02-21T18:49:09.4986346Z "(4096, 3584)", 2026-02-21T18:49:09.4986430Z "(4096, 4096)", 2026-02-21T18:49:09.4986510Z "(4096, 5120)", 2026-02-21T18:49:09.4986591Z "(4096, 5632)", 2026-02-21T18:49:09.4986671Z "(4096, 6656)", 2026-02-21T18:49:09.4986750Z "(4096, 7168)", 2026-02-21T18:49:09.4986829Z "(4096, 8192)", 2026-02-21T18:49:09.4986964Z "(4096, 8704)", 2026-02-21T18:49:09.4987041Z "(4096, 9728)", 2026-02-21T18:49:09.4987123Z "(4096, 10240)", 2026-02-21T18:49:09.4987205Z "(4096, 11264)", 2026-02-21T18:49:09.4987307Z "(4096, 11776)", 2026-02-21T18:49:09.4987394Z "(4096, 12800)", 2026-02-21T18:49:09.4987477Z "(4096, 13312)", 2026-02-21T18:49:09.4987560Z "(4096, 14336)", 2026-02-21T18:49:09.4987640Z "(4096, 14848)", 2026-02-21T18:49:09.4987720Z "(4096, 15872)" 2026-02-21T18:49:09.4987799Z ] 2026-02-21T18:49:09.4987872Z } 2026-02-21T18:49:09.4987947Z ] 2026-02-21T18:49:09.5040818Z ##[group]Run pytorch/test-infra/.github/actions/gather-benchmark-metadata@main 2026-02-21T18:49:09.5041003Z with: 2026-02-21T18:49:09.5041502Z github-token: *** 2026-02-21T18:49:09.5041597Z venv: .venv/bin/activate 2026-02-21T18:49:09.5041699Z schema-version: v3 2026-02-21T18:49:09.5041792Z env: 2026-02-21T18:49:09.5041880Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:09.5042015Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.5042173Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:09.5042329Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.5042466Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 
2026-02-21T18:49:09.5042605Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.5042742Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T18:49:09.5042899Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:09.5043041Z ##[endgroup] 2026-02-21T18:49:09.5081974Z ##[group]Run set -eux 2026-02-21T18:49:09.5082086Z set -eux 2026-02-21T18:49:09.5082172Z  2026-02-21T18:49:09.5082270Z if [[ -z "${GITHUB_TOKEN}" ]]; then 2026-02-21T18:49:09.5082405Z  echo "Missing github-token input" 2026-02-21T18:49:09.5082526Z  exit 1 2026-02-21T18:49:09.5082605Z fi 2026-02-21T18:49:09.5083262Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T18:49:09.5083393Z env: 2026-02-21T18:49:09.5083481Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:09.5083615Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.5083774Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:09.5083976Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.5084112Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.5084251Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.5084391Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T18:49:09.5084552Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:09.5084774Z GITHUB_TOKEN: *** 2026-02-21T18:49:09.5084864Z ##[endgroup] 2026-02-21T18:49:09.6523796Z + [[ -z *** ]] 2026-02-21T18:49:09.6586822Z ##[group]Run pytorch/test-infra/.github/actions/get-workflow-job-id@main 2026-02-21T18:49:09.6587015Z with: 2026-02-21T18:49:09.6587215Z github-token: *** 2026-02-21T18:49:09.6587320Z env: 2026-02-21T18:49:09.6587410Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:09.6587554Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.6587721Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:09.6587885Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 
2026-02-21T18:49:09.6588035Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.6588259Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.6588411Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T18:49:09.6588592Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:09.6588739Z ##[endgroup] 2026-02-21T18:49:09.6596778Z ##[group]Run set -eux 2026-02-21T18:49:09.6596902Z set -eux 2026-02-21T18:49:09.6596993Z  2026-02-21T18:49:09.6597182Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2026-02-21T18:49:09.6597615Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T18:49:09.6597751Z env: 2026-02-21T18:49:09.6597866Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:09.6598002Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.6598160Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:09.6598325Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.6598462Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.6598603Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:09.6598915Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T18:49:09.6599075Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:09.6599285Z GITHUB_TOKEN: *** 2026-02-21T18:49:09.6599375Z ##[endgroup] 2026-02-21T18:49:09.7364046Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 22253280836 linux.rocm.gpu.gfx942.2-n2gvb-runner-kpc8w 2026-02-21T18:49:12.2643561Z setting job-id=64380329949 2026-02-21T18:49:12.2644220Z setting job-name=run-mi325x (layer_norm,gemm) / benchmark-rocm6.4-layer_norm,gemm-py3.12-mi325x 2026-02-21T18:49:12.2810656Z ##[group]Run set -eux 2026-02-21T18:49:12.2810797Z set -eux 2026-02-21T18:49:12.2810899Z  2026-02-21T18:49:12.2810998Z if [[ -n 
".venv/bin/activate" ]]; then 2026-02-21T18:49:12.2811132Z  source ".venv/bin/activate" 2026-02-21T18:49:12.2811242Z fi 2026-02-21T18:49:12.2811320Z  2026-02-21T18:49:12.2811477Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \ 2026-02-21T18:49:12.2811679Z  --schema-version "${SCHEMA_VERSION}" \ 2026-02-21T18:49:12.2811813Z  --repo "${REPO}" \ 2026-02-21T18:49:12.2811930Z  --head-branch "${HEAD_BRANCH}" \ 2026-02-21T18:49:12.2812050Z  --head-sha "${HEAD_SHA}" \ 2026-02-21T18:49:12.2812174Z  --workflow-id "${WORKFLOW_RUN_ID}" \ 2026-02-21T18:49:12.2812307Z  --run-attempt "${RUN_ATTEMPT}" \ 2026-02-21T18:49:12.2812426Z  --job-id "${JOB_ID}" \ 2026-02-21T18:49:12.2812535Z  --job-name "${JOB_NAME}" 2026-02-21T18:49:12.2812746Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T18:49:12.2812961Z env: 2026-02-21T18:49:12.2813044Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:12.2813170Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.2813321Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:12.2813478Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.2813613Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.2813751Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.2813887Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T18:49:12.2814046Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:12.2814189Z SCHEMA_VERSION: v3 2026-02-21T18:49:12.2814278Z REPO: pytorch/helion 2026-02-21T18:49:12.2814376Z HEAD_BRANCH: refs/heads/main 2026-02-21T18:49:12.2814496Z HEAD_SHA: 874a7d0cadab18218a84ad3579d329dc95c51820 2026-02-21T18:49:12.2814648Z WORKFLOW_RUN_ID: 22253280836 2026-02-21T18:49:12.2814744Z RUN_ATTEMPT: 1 2026-02-21T18:49:12.2814828Z JOB_ID: 64380329949 2026-02-21T18:49:12.2814990Z JOB_NAME: run-mi325x (layer_norm,gemm) / benchmark-rocm6.4-layer_norm,gemm-py3.12-mi325x 
2026-02-21T18:49:12.2815178Z ##[endgroup] 2026-02-21T18:49:12.3199095Z + [[ -n .venv/bin/activate ]] 2026-02-21T18:49:12.3199244Z + source .venv/bin/activate 2026-02-21T18:49:12.3202318Z ++ '[' -z '' ']' 2026-02-21T18:49:12.3202546Z ++ '[' -n x ']' 2026-02-21T18:49:12.3202713Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T18:49:12.3202981Z ++ '[' .venv/bin/activate = /__w/_temp/c28373bc-8643-4c0a-a4db-3987c30ce2ad.sh ']' 2026-02-21T18:49:12.3203246Z ++ deactivate nondestructive 2026-02-21T18:49:12.3203410Z ++ unset -f pydoc 2026-02-21T18:49:12.3203868Z ++ '[' -z '' ']' 2026-02-21T18:49:12.3203994Z ++ '[' -z '' ']' 2026-02-21T18:49:12.3204110Z ++ hash -r 2026-02-21T18:49:12.3204230Z ++ '[' -z '' ']' 2026-02-21T18:49:12.3204358Z ++ unset VIRTUAL_ENV 2026-02-21T18:49:12.3204502Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T18:49:12.3204674Z ++ '[' '!' nondestructive = nondestructive ']' 2026-02-21T18:49:12.3204877Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T18:49:12.3205052Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T18:49:12.3205210Z ++ '[' linux-gnu = msys ']' 2026-02-21T18:49:12.3206070Z ++ export VIRTUAL_ENV 2026-02-21T18:49:12.3206373Z ++ '[' -z '' ']' 2026-02-21T18:49:12.3206500Z ++ unset SCRIPT_PATH 2026-02-21T18:49:12.3207029Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T18:49:12.3207951Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T18:49:12.3208487Z ++ export PATH 2026-02-21T18:49:12.3208614Z ++ '[' xhelion '!=' x ']' 2026-02-21T18:49:12.3208766Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T18:49:12.3208918Z ++ export VIRTUAL_ENV_PROMPT 
2026-02-21T18:49:12.3209057Z ++ '[' -z '' ']' 2026-02-21T18:49:12.3209177Z ++ '[' -z '' ']' 2026-02-21T18:49:12.3209287Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T18:49:12.3209418Z ++ PS1='(helion) ' 2026-02-21T18:49:12.3209539Z ++ export PS1 2026-02-21T18:49:12.3209648Z ++ alias pydoc 2026-02-21T18:49:12.3209759Z ++ true 2026-02-21T18:49:12.3209862Z ++ hash -r 2026-02-21T18:49:12.3210708Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-benchmark-metadata/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/helion --head-branch refs/heads/main --head-sha 874a7d0cadab18218a84ad3579d329dc95c51820 --workflow-id 22253280836 --run-attempt 1 --job-id 64380329949 --job-name 'run-mi325x (layer_norm,gemm) / benchmark-rocm6.4-layer_norm,gemm-py3.12-mi325x' 2026-02-21T18:49:12.3506388Z ##[group]Run pytorch/test-infra/.github/actions/gather-runners-info@main 2026-02-21T18:49:12.3506670Z with: 2026-02-21T18:49:12.3506759Z venv: .venv/bin/activate 2026-02-21T18:49:12.3506865Z env: 2026-02-21T18:49:12.3506950Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:12.3507083Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.3507243Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:12.3507400Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.3507542Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.3507678Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.3507821Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T18:49:12.3507981Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:12.3508194Z ##[endgroup] 2026-02-21T18:49:12.3515572Z ##[group]Run set -eux 2026-02-21T18:49:12.3515692Z set -eux 2026-02-21T18:49:12.3515803Z  2026-02-21T18:49:12.3515894Z if command -v nvidia-smi; then 2026-02-21T18:49:12.3516024Z  DEVICE_NAME=cuda 2026-02-21T18:49:12.3516128Z  nvidia-smi 2026-02-21T18:49:12.3516238Z elif command -v 
rocm-smi; then 2026-02-21T18:49:12.3516353Z  DEVICE_NAME=rocm 2026-02-21T18:49:12.3516456Z  rocm-smi 2026-02-21T18:49:12.3516558Z elif command -v hl-smi; then 2026-02-21T18:49:12.3516674Z  DEVICE_NAME=hpu 2026-02-21T18:49:12.3516773Z  hl-smi 2026-02-21T18:49:12.3516856Z else 2026-02-21T18:49:12.3516945Z  arch=$(uname -m) 2026-02-21T18:49:12.3529682Z  2026-02-21T18:49:12.3529780Z  case "$arch" in 2026-02-21T18:49:12.3529898Z  aarch64|arm64) 2026-02-21T18:49:12.3530122Z  DEVICE_NAME=arm64-cpu 2026-02-21T18:49:12.3530244Z  ;; 2026-02-21T18:49:12.3530330Z  *) 2026-02-21T18:49:12.3530434Z  DEVICE_NAME=cpu 2026-02-21T18:49:12.3530543Z  ;; 2026-02-21T18:49:12.3530628Z  esac 2026-02-21T18:49:12.3530716Z  lscpu 2026-02-21T18:49:12.3530807Z fi 2026-02-21T18:49:12.3530917Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV 2026-02-21T18:49:12.3531181Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T18:49:12.3531315Z env: 2026-02-21T18:49:12.3531405Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:12.3531541Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.3531699Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:12.3531862Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.3532002Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.3532149Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.3532299Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T18:49:12.3532462Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:12.3532611Z ##[endgroup] 2026-02-21T18:49:12.4564094Z + command -v nvidia-smi 2026-02-21T18:49:12.4564435Z + command -v rocm-smi 2026-02-21T18:49:12.4567802Z + DEVICE_NAME=rocm 2026-02-21T18:49:12.4568100Z /usr/bin/rocm-smi 2026-02-21T18:49:12.4568353Z + rocm-smi 2026-02-21T18:49:12.5947383Z 2026-02-21T18:49:12.5947389Z 2026-02-21T18:49:12.5947616Z ============================================= ROCm System Management 
Interface ============================================= 2026-02-21T18:49:12.5947964Z ======================================================= Concise Info ======================================================= 2026-02-21T18:49:12.5948412Z Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU% 2026-02-21T18:49:12.5949001Z  (DID, GUID) (Junction) (Socket) (Mem, Compute, ID)  2026-02-21T18:49:12.5949587Z ============================================================================================================================ 2026-02-21T18:49:12.5950228Z 0 3 0x74a5, 51110 29.0°C 130.0W NPS1, SPX, 0 134Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T18:49:12.5950645Z 1 5 0x74a5, 2987 31.0°C 142.0W NPS1, SPX, 0 133Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T18:49:12.5951043Z 2 4 0x74a5, 61326 30.0°C 159.0W NPS1, SPX, 0 2105Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T18:49:12.5951441Z 3 2 0x74a5, 9091 49.0°C 186.0W NPS1, SPX, 0 2103Mhz 900Mhz 0% auto 1000.0W 1% 0% 2026-02-21T18:49:12.5951843Z 4 7 0x74a5, 26567 32.0°C 130.0W NPS1, SPX, 0 134Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T18:49:12.5952247Z 5 9 0x74a5, 43978 33.0°C 136.0W NPS1, SPX, 0 133Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T18:49:12.5952629Z 6 8 0x74a5, 20463 30.0°C 167.0W NPS1, SPX, 0 2026Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T18:49:12.5953028Z 7 6 0x74a5, 33762 33.0°C 142.0W NPS1, SPX, 0 133Mhz 900Mhz 0% auto 1000.0W 0% 0% 2026-02-21T18:49:12.5953333Z ============================================================================================================================ 2026-02-21T18:49:12.5953605Z =================================================== End of ROCm SMI Log ==================================================== 2026-02-21T18:49:12.6010623Z + echo DEVICE_NAME=rocm 2026-02-21T18:49:12.6053115Z ##[group]Run set -eux 2026-02-21T18:49:12.6053306Z set -eux 2026-02-21T18:49:12.6053417Z  2026-02-21T18:49:12.6053688Z if [[ "${DEVICE_NAME}" == "cuda" ]]; then 
2026-02-21T18:49:12.6053883Z  # Return the same device name as PyTorch 2026-02-21T18:49:12.6054164Z  DEVICE_TYPE=$(nvidia-smi -i 0 --query-gpu=name --format=csv,noheader) 2026-02-21T18:49:12.6054413Z elif [[ "${DEVICE_NAME}" == "rocm" ]]; then 2026-02-21T18:49:12.6054685Z  DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs) 2026-02-21T18:49:12.6054956Z elif [[ "${DEVICE_NAME}" == "hpu" ]]; then 2026-02-21T18:49:12.6055239Z  DEVICE_TYPE="Intel Gaudi3 "$(hl-smi -q | grep "Product Name" | head -n 1 | awk -F ':' '{print $2}' | sed 's/^ *//') 2026-02-21T18:49:12.6055528Z elif [[ "${DEVICE_NAME}" == "cpu" ]]; then 2026-02-21T18:49:12.6056090Z  DEVICE_TYPE="$(lscpu | grep "Model name" | sed -E 's/.*Model name:[[:space:]]*//; s/Intel\(R\)//g; s/\(R\)//g; s/\(TM\)//g; s/CPU//g; s/Processor//g; s/[[:space:]]+/ /g; s/^ //; s/ $//; s/ /_/g')_$(awk -F: '/Core\(s\) per socket/ {c=$2} /Socket\(s\)/ {s=$2} END {gsub(/ /,"",c); gsub(/ /,"",s); printf "%sc", c*s}' < <(lscpu))" 2026-02-21T18:49:12.6056662Z elif [[ "${DEVICE_NAME}" == "arm64-cpu" ]]; then 2026-02-21T18:49:12.6056925Z  DEVICE_TYPE=$(lscpu | grep 'Vendor ID' | cut -f 2 -d ":" | awk '{$1=$1}1' | cut -f 2 -d " ") 2026-02-21T18:49:12.6057163Z fi 2026-02-21T18:49:12.6057290Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV 2026-02-21T18:49:12.6057699Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T18:49:12.6057862Z env: 2026-02-21T18:49:12.6057981Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:12.6058145Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.6058325Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:12.6058516Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.6058659Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.6058808Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.6058957Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 
2026-02-21T18:49:12.6072204Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:12.6072374Z DEVICE_NAME: rocm 2026-02-21T18:49:12.6072545Z ##[endgroup] 2026-02-21T18:49:12.7039951Z + [[ rocm == \c\u\d\a ]] 2026-02-21T18:49:12.7040098Z + [[ rocm == \r\o\c\m ]] 2026-02-21T18:49:12.7044791Z ++ rocminfo 2026-02-21T18:49:12.7044899Z ++ grep 'Marketing Name' 2026-02-21T18:49:12.7045031Z ++ tail -n1 2026-02-21T18:49:12.7045556Z ++ awk -F: '{print $2}' 2026-02-21T18:49:12.7046720Z ++ xargs 2026-02-21T18:49:12.9057865Z + DEVICE_TYPE='AMD Instinct MI325X' 2026-02-21T18:49:12.9058166Z + echo 'DEVICE_TYPE=AMD Instinct MI325X' 2026-02-21T18:49:12.9106024Z ##[group]Run set -eux 2026-02-21T18:49:12.9106200Z set -eux 2026-02-21T18:49:12.9106320Z  2026-02-21T18:49:12.9106449Z if [[ -n ".venv/bin/activate" ]]; then 2026-02-21T18:49:12.9106637Z  source ".venv/bin/activate" 2026-02-21T18:49:12.9106794Z fi 2026-02-21T18:49:12.9106902Z  2026-02-21T18:49:12.9107096Z python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T18:49:12.9107406Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py" 2026-02-21T18:49:12.9107779Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T18:49:12.9107966Z env: 2026-02-21T18:49:12.9108084Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:12.9108398Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.9108619Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:12.9108830Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.9109015Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.9109206Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:12.9109409Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T18:49:12.9109755Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:12.9109951Z DEVICE_NAME: rocm 2026-02-21T18:49:12.9110080Z DEVICE_TYPE: AMD Instinct MI325X 
2026-02-21T18:49:12.9110260Z ##[endgroup] 2026-02-21T18:49:12.9921912Z + [[ -n .venv/bin/activate ]] 2026-02-21T18:49:12.9922310Z + source .venv/bin/activate 2026-02-21T18:49:12.9922591Z ++ '[' -z '' ']' 2026-02-21T18:49:12.9922818Z ++ '[' -n x ']' 2026-02-21T18:49:12.9923074Z ++ SCRIPT_PATH=.venv/bin/activate 2026-02-21T18:49:12.9923546Z ++ '[' .venv/bin/activate = /__w/_temp/cb3b8d8a-f7a6-4cc7-85b7-2805a49d382d.sh ']' 2026-02-21T18:49:12.9924027Z ++ deactivate nondestructive 2026-02-21T18:49:12.9924315Z ++ unset -f pydoc 2026-02-21T18:49:12.9924549Z ++ '[' -z '' ']' 2026-02-21T18:49:12.9924762Z ++ '[' -z '' ']' 2026-02-21T18:49:12.9924975Z ++ hash -r 2026-02-21T18:49:12.9925175Z ++ '[' -z '' ']' 2026-02-21T18:49:12.9925392Z ++ unset VIRTUAL_ENV 2026-02-21T18:49:12.9925645Z ++ unset VIRTUAL_ENV_PROMPT 2026-02-21T18:49:12.9925963Z ++ '[' '!' nondestructive = nondestructive ']' 2026-02-21T18:49:12.9926309Z ++ VIRTUAL_ENV=/__w/helion/helion/.venv 2026-02-21T18:49:12.9926657Z ++ '[' linux-gnu = cygwin ']' 2026-02-21T18:49:12.9926941Z ++ '[' linux-gnu = msys ']' 2026-02-21T18:49:12.9927201Z ++ export VIRTUAL_ENV 2026-02-21T18:49:12.9927448Z ++ '[' -z '' ']' 2026-02-21T18:49:12.9927665Z ++ unset SCRIPT_PATH 2026-02-21T18:49:12.9928663Z ++ _OLD_VIRTUAL_PATH=/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T18:49:12.9930402Z ++ PATH=/__w/helion/helion/.venv/bin:/github/home/.local/share/uv/python:/__w/_tool/uv/0.10.4/x86_64:/github/home/.local/bin:/__w/_tool/Python/3.12.12/x64/bin:/__w/_tool/Python/3.12.12/x64:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2026-02-21T18:49:12.9931437Z ++ export PATH 2026-02-21T18:49:12.9931678Z ++ '[' xhelion '!=' x ']' 2026-02-21T18:49:12.9931957Z ++ VIRTUAL_ENV_PROMPT=helion 2026-02-21T18:49:12.9932244Z ++ export VIRTUAL_ENV_PROMPT 
2026-02-21T18:49:12.9932766Z ++ '[' -z '' ']' 2026-02-21T18:49:12.9932987Z ++ '[' -z '' ']' 2026-02-21T18:49:12.9933203Z ++ _OLD_VIRTUAL_PS1= 2026-02-21T18:49:12.9933504Z ++ PS1='(helion) ' 2026-02-21T18:49:12.9933644Z ++ export PS1 2026-02-21T18:49:12.9933753Z ++ alias pydoc 2026-02-21T18:49:12.9933862Z ++ true 2026-02-21T18:49:12.9933957Z ++ hash -r 2026-02-21T18:49:12.9934111Z + python3 -mpip install psutil==7.0.0 nvidia-ml-py==13.580.82 2026-02-21T18:49:13.4990311Z Collecting psutil==7.0.0 2026-02-21T18:49:13.5462160Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB) 2026-02-21T18:49:13.5615088Z Collecting nvidia-ml-py==13.580.82 2026-02-21T18:49:13.5678732Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl.metadata (9.6 kB) 2026-02-21T18:49:13.5815598Z Downloading psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB) 2026-02-21T18:49:13.5996073Z Downloading nvidia_ml_py-13.580.82-py3-none-any.whl (49 kB) 2026-02-21T18:49:13.6618703Z Installing collected packages: nvidia-ml-py, psutil 2026-02-21T18:49:13.6622944Z Attempting uninstall: nvidia-ml-py 2026-02-21T18:49:13.6636648Z Found existing installation: nvidia-ml-py 13.590.48 2026-02-21T18:49:13.6646218Z Uninstalling nvidia-ml-py-13.590.48: 2026-02-21T18:49:13.7426998Z Successfully uninstalled nvidia-ml-py-13.590.48 2026-02-21T18:49:13.7689934Z Attempting uninstall: psutil 2026-02-21T18:49:13.7708835Z Found existing installation: psutil 7.2.2 2026-02-21T18:49:13.7721236Z Uninstalling psutil-7.2.2: 2026-02-21T18:49:13.7724639Z Successfully uninstalled psutil-7.2.2 2026-02-21T18:49:13.8547804Z 2026-02-21T18:49:13.8551743Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 
2026-02-21T18:49:13.8552660Z tritonbench 0.0.1 requires triton, which is not installed. 2026-02-21T18:49:13.8572513Z Successfully installed nvidia-ml-py-13.580.82 psutil-7.0.0 2026-02-21T18:49:13.9563188Z + python3 /__w/_actions/pytorch/test-infra/main/.github/actions/gather-runners-info/../../scripts/benchmarks/gather_runners_info.py 2026-02-21T18:49:22.7368424Z ##[group]Run pytorch/test-infra/.github/actions/gather-dependencies@main 2026-02-21T18:49:22.7368655Z with: 2026-02-21T18:49:22.7368796Z venv: .venv/bin/activate 2026-02-21T18:49:22.7368918Z env: 2026-02-21T18:49:22.7369009Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:22.7369143Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7369311Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:22.7369470Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7369615Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7369772Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7369916Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T18:49:22.7370081Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:22.7370221Z DEVICE_NAME: rocm 2026-02-21T18:49:22.7370335Z DEVICE_TYPE: AMD Instinct MI325X 2026-02-21T18:49:22.7370447Z ##[endgroup] 2026-02-21T18:49:22.7377658Z ##[group]Run set -eux 2026-02-21T18:49:22.7377778Z set -eux 2026-02-21T18:49:22.7377869Z  2026-02-21T18:49:22.7377971Z # TODO (huydhn): Implement this part 2026-02-21T18:49:22.7378116Z echo "dependencies={}" >> "${GITHUB_OUTPUT}" 2026-02-21T18:49:22.7378353Z shell: bash --noprofile --norc -e -o pipefail {0} 2026-02-21T18:49:22.7378479Z env: 2026-02-21T18:49:22.7378570Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:22.7378697Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7378858Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:22.7379021Z Python_ROOT_DIR: 
/__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7379159Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7379301Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7379441Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T18:49:22.7379691Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:22.7379833Z DEVICE_NAME: rocm 2026-02-21T18:49:22.7379932Z DEVICE_TYPE: AMD Instinct MI325X 2026-02-21T18:49:22.7380047Z ##[endgroup] 2026-02-21T18:49:22.7756990Z + echo 'dependencies={}' 2026-02-21T18:49:22.7806852Z ##[group]Run actions/upload-artifact@v6 2026-02-21T18:49:22.7807001Z with: 2026-02-21T18:49:22.7807114Z name: benchmark-results-mi325x-layer_norm,gemm 2026-02-21T18:49:22.7807253Z path: test/test-reports 2026-02-21T18:49:22.7807368Z if-no-files-found: warn 2026-02-21T18:49:22.7807470Z compression-level: 6 2026-02-21T18:49:22.7807588Z overwrite: false 2026-02-21T18:49:22.7807681Z include-hidden-files: false 2026-02-21T18:49:22.7807793Z env: 2026-02-21T18:49:22.7807881Z HELION_AUTOTUNE_LOG_LEVEL: INFO 2026-02-21T18:49:22.7808016Z pythonLocation: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7808177Z PKG_CONFIG_PATH: /__w/_tool/Python/3.12.12/x64/lib/pkgconfig 2026-02-21T18:49:22.7808350Z Python_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7808492Z Python2_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7808630Z Python3_ROOT_DIR: /__w/_tool/Python/3.12.12/x64 2026-02-21T18:49:22.7808792Z LD_LIBRARY_PATH: /__w/_tool/Python/3.12.12/x64/lib 2026-02-21T18:49:22.7808957Z UV_PYTHON_INSTALL_DIR: /github/home/.local/share/uv/python 2026-02-21T18:49:22.7809102Z DEVICE_NAME: rocm 2026-02-21T18:49:22.7809197Z DEVICE_TYPE: AMD Instinct MI325X 2026-02-21T18:49:22.7809314Z ##[endgroup] 2026-02-21T18:49:22.7811227Z ##[command]/usr/bin/docker exec bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b sh -c "cat /etc/*release | grep ^ID" 2026-02-21T18:49:23.0189810Z 
With the provided path, there will be 1 file uploaded 2026-02-21T18:49:23.0192333Z Artifact name is valid! 2026-02-21T18:49:23.0205337Z Root directory input is valid! 2026-02-21T18:49:28.2148030Z Beginning upload of artifact content to blob storage 2026-02-21T18:49:30.6009450Z Uploaded bytes 1092 2026-02-21T18:49:30.6309019Z Finished uploading artifact content to blob storage! 2026-02-21T18:49:30.6309677Z SHA256 digest of uploaded artifact zip is da5f5d4c771bb91f87752ffe34500a0098560e2d44352e64dd57426f1a62c685 2026-02-21T18:49:30.6310243Z Finalizing artifact upload 2026-02-21T18:49:30.7773477Z Artifact benchmark-results-mi325x-layer_norm,gemm.zip successfully finalized. Artifact ID 5602745621 2026-02-21T18:49:30.7773852Z Artifact benchmark-results-mi325x-layer_norm,gemm has been successfully uploaded! Final size is 1092 bytes. Artifact ID is 5602745621 2026-02-21T18:49:30.7774310Z Artifact download URL: https://github.com/pytorch/helion/actions/runs/22253280836/artifacts/5602745621 2026-02-21T18:49:30.7930212Z Post job cleanup. 2026-02-21T18:49:30.7933168Z ##[command]/usr/bin/docker exec bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b sh -c "cat /etc/*release | grep ^ID" 2026-02-21T18:49:30.9994843Z UV_PYTHON_INSTALL_DIR is already set to /github/home/.local/share/uv/python 2026-02-21T18:49:30.9996638Z (node:1612039) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2026-02-21T18:49:30.9997249Z (Use `node --trace-deprecation ...` to show where the warning was created) 2026-02-21T18:49:31.0104415Z Post job cleanup. 2026-02-21T18:49:31.0106663Z ##[command]/usr/bin/docker exec bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b sh -c "cat /etc/*release | grep ^ID" 2026-02-21T18:49:31.2142967Z Post job cleanup. 
2026-02-21T18:49:31.2145357Z ##[command]/usr/bin/docker exec bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b sh -c "cat /etc/*release | grep ^ID" 2026-02-21T18:49:31.3886302Z [command]/usr/bin/git version 2026-02-21T18:49:31.3912382Z git version 2.43.0 2026-02-21T18:49:31.3938175Z Temporarily overriding HOME='/__w/_temp/88c9b076-d11a-4302-892a-3fb6b359936b' before making global git config changes 2026-02-21T18:49:31.3938594Z Adding repository directory to the temporary git global config as a safe directory 2026-02-21T18:49:31.3941503Z [command]/usr/bin/git config --global --add safe.directory /__w/helion/helion 2026-02-21T18:49:31.3960944Z Removing SSH command configuration 2026-02-21T18:49:31.3963389Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2026-02-21T18:49:31.3985655Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2026-02-21T18:49:31.4165461Z Removing HTTP extra header 2026-02-21T18:49:31.4167563Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2026-02-21T18:49:31.4183423Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2026-02-21T18:49:31.4337667Z Removing includeIf entries pointing to credentials config files 2026-02-21T18:49:31.4340106Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2026-02-21T18:49:31.4359963Z includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T18:49:31.4360192Z includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T18:49:31.4360406Z includeif.gitdir:/github/workspace/.git.path 2026-02-21T18:49:31.4360609Z includeif.gitdir:/github/workspace/.git/worktrees/*.path 
2026-02-21T18:49:31.4369189Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git.path 2026-02-21T18:49:31.4381190Z /__w/_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T18:49:31.4386513Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git.path /__w/_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T18:49:31.4404291Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path 2026-02-21T18:49:31.4417842Z /__w/_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T18:49:31.4429588Z [command]/usr/bin/git config --local --unset includeif.gitdir:/__w/helion/helion/.git/worktrees/*.path /__w/_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T18:49:31.4445721Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git.path 2026-02-21T18:49:31.4457196Z /github/runner_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T18:49:31.4463792Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git.path /github/runner_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T18:49:31.4483170Z [command]/usr/bin/git config --local --get-all includeif.gitdir:/github/workspace/.git/worktrees/*.path 2026-02-21T18:49:31.4494400Z /github/runner_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T18:49:31.4498195Z [command]/usr/bin/git config --local --unset includeif.gitdir:/github/workspace/.git/worktrees/*.path /github/runner_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config 2026-02-21T18:49:31.4512650Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2026-02-21T18:49:31.4654799Z Removing credentials config 
'/__w/_temp/git-credentials-fb3d78f5-f9b2-44f5-8f6e-04c9363b1b4f.config' 2026-02-21T18:49:31.4755399Z Stop and remove container: 516247e566694a1eb38ddd3134e58d44_rocmdevubuntu2404644complete_8093c4 2026-02-21T18:49:31.4757972Z ##[command]/usr/bin/docker rm --force bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b 2026-02-21T18:49:36.0501773Z bbfd39aeded06bc62bd6cf19f7c681342c3a011ec44e4761e2edffd886cb0a1b 2026-02-21T18:49:36.0534898Z Remove container network: github_network_52abf34bb8e94476be1b321dcdd72aa0 2026-02-21T18:49:36.0537303Z ##[command]/usr/bin/docker network rm github_network_52abf34bb8e94476be1b321dcdd72aa0 2026-02-21T18:49:36.5764351Z github_network_52abf34bb8e94476be1b321dcdd72aa0 2026-02-21T18:49:36.5822252Z Evaluate and set job outputs 2026-02-21T18:49:36.5826043Z Set output 'benchmark-metadata' 2026-02-21T18:49:36.5827079Z Set output 'runners-info' 2026-02-21T18:49:36.5827447Z Set output 'dependencies' 2026-02-21T18:49:36.5827838Z Cleaning up orphan processes